12 changes: 12 additions & 0 deletions .gitignore
@@ -1 +1,13 @@
.DS_Store

# Downloaded tooling (large binaries — do not commit)
*.jar

# Downloaded ontology
data-transformation/robot/pmdco.owl

# Generated transformation outputs
data-transformation/ottr/tensile.ttl
data-transformation/ottr/tensile-blank.ttl
data-transformation/robot/result.ttl
data-transformation/yarrrml/temp.rml.ttl
103 changes: 86 additions & 17 deletions README.md
@@ -27,43 +27,112 @@ The **data-transformation** examples illustrate multiple strategies for generati
* Python + RDFLib programmatic graph construction
* YARRRML/RML mappings executed with the RML Mapper
* OTTR template expansion using Lutra
* ROBOT template expansion into OWL using the ROBOT tool


- The examples use a very simple CSV-based data as a simple educational use case.
+ The examples use a very simple CSV-based data set as an educational use case.

## Example data

All examples transform the same two tensile-strength measurements:

| obj_id | value | unit |
|--------|-------|------|
| 1 | 520 | `http://qudt.org/vocab/unit/MegaPA` |
| 2 | 550 | `http://qudt.org/vocab/unit/MegaPA` |

Each measurement is modelled with the PMD / BFO / OBI / RO pattern: a specimen *has a
quality* (tensile strength) whose value is captured by a *value specification*; that value
specification belongs to a *measurement datum*, which is the specified output of a
*measurement process*.
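
As a rough illustration of this pattern, the sketch below fills a string template for the two rows above, in the spirit of `python-template/`. It covers only four of the triples per measurement and uses only IRIs that appear elsewhere in this PR. The `obo:` and `tto:` prefix names, the `ROWS` list and the `render` helper are assumptions made for this example, not code from the repository.

```python
# Minimal sketch: string templating for part of the measurement pattern.
# Assumption: prefix names, ROWS and render() are illustrative, not repository code.
PREFIXES = """\
@prefix ex: <http://example.com/ns#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix tto: <https://w3id.org/pmd/tto/> .
"""

# One block per measurement: quality type, has_quality, has_role, numeric value.
TEMPLATE = """\
ex:qual_{i} a tto:TTO_0000053 .
ex:obj_{i} obo:RO_0000086 ex:qual_{i} .
ex:obj_{i} obo:RO_0000087 ex:role_{i} .
ex:spec_{i} obo:OBI_0001937 "{value}" .
"""

ROWS = [
    {"obj_id": "1", "value": "520"},
    {"obj_id": "2", "value": "550"},
]


def render(rows):
    """Return a Turtle document with one filled template per row."""
    body = "".join(TEMPLATE.format(i=r["obj_id"], value=r["value"]) for r in rows)
    return PREFIXES + "\n" + body


if __name__ == "__main__":
    print(render(ROWS))
```

Because the template is plain text, nothing checks that the result is valid Turtle; a missing prefix declaration would only surface when a parser reads the output.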

## Prerequisites

Depending on the method you want to run:

| Method | Requirements |
|--------|--------------|
| `python-template` | Python 3 (standard library only) |
| `python-rdflib` | Python 3 + [`rdflib`](https://rdflib.readthedocs.io/) (`pip install rdflib`) |
| `ottr` | Java (JRE 8+); `map.sh` downloads [Lutra](https://ottr.xyz/) automatically |
| `yarrrml` | Docker; `map.sh` pulls the `rmlio/yarrrml-parser` and `rmlio/rmlmapper-java` images |
| `robot` | Java (JRE 8+); `map.sh` downloads [ROBOT](http://robot.obolibrary.org/) and the PMD core ontology automatically |

The `ottr`, `yarrrml` and `robot` examples require internet access on first run to fetch
their tooling. Downloaded tools and generated outputs are git-ignored.

## Methods

* **`python-template/`** — fills a plain string template per CSV row and prints Turtle.
Simplest approach, no dependencies, but performs no validation.
* **`python-rdflib/`** — builds the graph programmatically with RDFLib and serialises it.
Guaranteed to be syntactically valid; good when you need logic or branching.
* **`ottr/`** — defines the triple pattern once as a reusable [OTTR](https://ottr.xyz/)
template and expands the data instances with Lutra. Two variants are provided: a named
one (stable IRIs for every individual) and a blank-node one (`map-blank.sh`).
* **`yarrrml/`** — declarative [YARRRML](https://rml.io/yarrrml/) mapping rules that are
compiled to RML and executed against the CSV with the RML Mapper.
* **`robot/`** — a [ROBOT](http://robot.obolibrary.org/) spreadsheet template (TSV) that is
expanded into a full OWL ontology, resolving labels to IRIs via the PMD core ontology.

## Usage

- Each folder contains a `map.sh` or `map.py` file which can be executed to run the scripts.
+ Each folder contains a `map.sh` or `map.py` file which can be executed to run the example.

```
- cd data-transformation/<folder>
- sh run.sh
+ cd data-transformation/<folder>
+ sh map.sh
```

```
- cd data-transformation/<folder>
+ cd data-transformation/<folder>
python map.py
```

- ## Comparison Table
+ ## Comparison

At a glance:

- | Criterion | Python Templates | Python + RDFLib | YARRRML/RML | OTTR |
- | ----------------------- | ---------------- | --------------- | ----------- | ----- |
- | Easy to start | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
- | RDF correctness | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
- | Reusability | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
- | Maintainability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
- | Non-programmer friendly | ⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
- | Standards-based | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
- | Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
- | Learning RDF concepts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| | python-template | python-rdflib | ottr | yarrrml | robot |
|--|-----------------|---------------|------|---------|-------|
| **Paradigm** | string templating | programmatic (RDFLib) | OTTR templates | YARRRML/RML rules | ROBOT template |
| **Runtime** | Python | Python + rdflib | Java (Lutra) | Docker | Java (ROBOT) |
| **Input** | CSV | CSV | stOTTR | CSV + YAML | TSV |
| **Output** | Turtle | Turtle | Turtle | N-Triples | OWL (Turtle) |
| **Produces** | instance data | instance data | instance data | instance data | OWL ontology |

### Comparison Table

| Criterion | Python Templates | Python + RDFLib | YARRRML/RML | OTTR | ROBOT |
| ----------------------- | ---------------- | --------------- | ----------- | ----- | ----- |
| Easy to start | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
| RDF correctness | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Reusability | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Maintainability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| Non-programmer friendly | ⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Standards-based | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
| Learning RDF concepts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |

- ## Practical Recommendation
### Strengths and weaknesses

| Method | Strengths | Weaknesses |
|--------|-----------|------------|
| python-template | trivial, no dependencies, full control | no validation → easily produces invalid RDF; scales poorly |
| python-rdflib | real graph object, guaranteed syntactically valid, good for logic/branching | imperative, pattern scattered across many `g.add()` calls |
| ottr | declarative, pattern defined once, compact data, datatype typing | Java tooling, custom stOTTR syntax |
| yarrrml / RML | W3C-aligned standard, declarative, works directly from CSV/JSON/DB, ETL-ready | heavy runtime (Docker / large jars), YAML learning curve |
| robot | produces a full OWL ontology, label→IRI resolution | tied to the OWL/ROBOT workflow, the spreadsheet gets unwieldy quickly |

### Practical Recommendation

For most real-world semantic data integration projects:

* YARRRML/RML is usually the best default choice because mappings are declarative, portable, and maintainable.
* Python + RDFLib is preferable when transformations involve substantial computation, data cleaning, external APIs, or complex business rules.
* OTTR is particularly valuable when the RDF model contains many recurring graph patterns and you want template reuse.
* ROBOT is the right choice when the goal is an OWL ontology (classes, axioms, labels) rather than just instance data, especially within the OBO / ontology-engineering ecosystem.
* Plain Python string templates are mainly useful for teaching, experimentation, and very small one-off transformations.

For a full side-by-side comparison — the RDF each one produces, the triple-level pattern,
and their differences — see
[`data-transformation/COMPARISON.md`](data-transformation/COMPARISON.md).
129 changes: 129 additions & 0 deletions data-transformation/COMPARISON.md
@@ -0,0 +1,129 @@
# Comparison of the data-transformation methods

A comparison of the five approaches in `data-transformation/` that convert the same
tabular tensile-strength measurement data into RDF — based on the PMD / BFO / OBI / RO
modelling pattern.

> **Use case:** 2 measurements (`obj_1` = 520 MPa, `obj_2` = 550 MPa, unit QUDT `MegaPA`).
> Expected pattern per measurement: *a specimen has a tensile strength, determined in a
> measurement process with a value + unit.*

---

## 1. Overview: paradigm & setup

| Method | Paradigm | Runtime / dependencies | Input | Run |
|--------|----------|------------------------|-------|-----|
| **python-template** | String templating | Python (stdlib only) | CSV inline | `python map.py` |
| **python-rdflib** | Programmatic graph construction | Python + `rdflib` | CSV inline | `python map.py` |
| **ottr** | Declarative ontology templates | Java + Lutra (≈44 MB) | `.stottr` files | `sh map.sh` |
| **yarrrml / RML** | Declarative mapping rules | Docker *(or Node + rmlmapper ≈184 MB)* | CSV + YAML | `sh map.sh` |
| **robot** | Spreadsheet → OWL template | Java + ROBOT (≈83 MB) + ontology (pmdco) | TSV | `sh map.sh` |

---

## 2. Results (after bug fixes)

| Method | Status | #Triples | #Measurements | Value datatype | Example namespace |
|--------|--------|---------:|--------------:|----------------|-------------------|
| python-template | ✅ valid | 22 | 2 | `"520"` (string) | `example.com/ns#` |
| python-rdflib | ✅ valid | 22 | 2 | `"520"` (string) | `example.org/` |
| ottr (named) | ✅ valid | 22 | 2 | `520` (**integer**) | `example.com/ns#` |
| yarrrml | ✅ valid | 22 | 2 | `"520"` (string) | `example.com/` |
| robot | ✅ valid (OWL) | 68 | 2 | `"520"` (string) | `example.org/` |

Validated with `rdflib` 7.6.0; the yarrrml output is N-Triples, the rest Turtle.

---

## 3. Instance pattern per measurement (✓ = triple present)

| (Subject – Predicate – Object) | py-template | py-rdflib | ottr | yarrrml | robot |
|---|:--:|:--:|:--:|:--:|:--:|
| `qual a tensile_strength` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `obj has_quality qual` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `obj has_role role` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `datum a measurement_datum` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `datum has_value_specification spec` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `proc has_participant obj` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `spec has_measurement_unit_label MegaPA` | ✓ | ✓ | ✓ | ✓ | ✓ |
| `proc realizes role` | ✓ | ✓ | ✓ | ✓ | **·** |
| `spec specifies_value_of qual` | ✓ | ✓ | ✓ | ✓ | **·** |
| `spec has_specified_numeric_value` (`OBI_0001937`) | ✓ | ✓ | ✓ | ✓ | **·** |
| `datum specified_output_of proc` | ✓ | **·** | ✓ | ✓ | **·** |
| `obj specified_output_of proc` | · | **✓ ⚠** | · | · | · |
| `datum specifies_value_of qual` | · | · | · | · | **✓ ⚠** |
| `spec has_specified_value` (`OBI_0002135`) | · | · | · | · | **✓ ⚠** |
| `obj a object` / `role a test_piece_role` / `proc a …process` / `spec a value_specification` | · | · | · | · | **✓** |

**Common core** (top 7 rows): identical across all methods. **python-template, ottr and
yarrrml** produce the full, consistent 11-triple pattern per measurement. **python-rdflib**
and **robot** deviate (see ⚠ and Section 5).
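
The reading of the table can be double-checked with a few set operations. The sketch below transcribes the ✓ marks into a dictionary and derives the common core, the methods that emit the full 11-triple pattern, and the per-method deviations. The names `COVERAGE`, `REFERENCE` and `deviations` are made up for this example, and the last table row (the extra `rdf:type` assertions produced only by robot) is left out for brevity.

```python
# Coverage matrix transcribed from the table above (assumption: hand-copied, not generated).
METHODS = ["python-template", "python-rdflib", "ottr", "yarrrml", "robot"]
ALL = frozenset(METHODS)

COVERAGE = {
    "qual a tensile_strength": ALL,
    "obj has_quality qual": ALL,
    "obj has_role role": ALL,
    "datum a measurement_datum": ALL,
    "datum has_value_specification spec": ALL,
    "proc has_participant obj": ALL,
    "spec has_measurement_unit_label MegaPA": ALL,
    "proc realizes role": ALL - {"robot"},
    "spec specifies_value_of qual": ALL - {"robot"},
    "spec has_specified_numeric_value": ALL - {"robot"},
    "datum specified_output_of proc": ALL - {"robot", "python-rdflib"},
    "obj specified_output_of proc": frozenset({"python-rdflib"}),
    "datum specifies_value_of qual": frozenset({"robot"}),
    "spec has_specified_value": frozenset({"robot"}),
}

# The first 11 rows form the reference pattern per measurement.
REFERENCE = list(COVERAGE)[:11]

# Triples that every method emits.
common_core = [t for t, methods in COVERAGE.items() if methods == ALL]

# Methods that emit every triple of the reference pattern.
full_pattern = [m for m in METHODS if all(m in COVERAGE[t] for t in REFERENCE)]

# For the remaining methods: triples that are missing from or extra to the reference.
deviations = {
    m: sorted(t for t in COVERAGE if (m in COVERAGE[t]) != (t in REFERENCE))
    for m in METHODS
    if m not in full_pattern
}

if __name__ == "__main__":
    print(len(common_core), "common triples")
    print("full pattern:", full_pattern)
    for method, triples in deviations.items():
        print(method, "deviates on:", triples)
```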

---

## 4. Bugs fixed

Errors found and corrected in the example files during the comparison:

### Fix 1 — `python-template/map.py` (output was invalid Turtle)
The prefix block did not declare `ex:` and `rdf:`, although the template uses them.
```diff
prefix="""
+ @prefix ex: <http://example.com/ns#> .
+ @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tensile_strength: <https://w3id.org/pmd/tto/TTO_0000053> .
```
Before: parser aborted (`Bad syntax (Prefix "ex)`). After: 22 triples, valid.

### Fix 2 — `yarrrml/tensile.csv` (only 1 instead of 2 measurements)
```diff
obj_id,value,unit
1,520,http://qudt.org/vocab/unit/MegaPA
- #2,550,http://qudt.org/vocab/unit/MegaPA
+ 2,550,http://qudt.org/vocab/unit/MegaPA
```
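
CSV itself has no comment syntax, so a line starting with `#` is easy to overlook. The following sketch, using only Python's standard `csv` module, flags rows whose `obj_id` is not a plain number. The `check_rows` helper and the rule that ids must be digits are assumptions for this example; it says nothing about how the RML Mapper itself treats such a line.

```python
import csv
import io


def check_rows(csv_text):
    """Split rows of a tensile.csv-style input into (usable, suspicious)."""
    usable, suspicious = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: a valid obj_id consists of digits only.
        # The csv module returns a '#'-prefixed line as ordinary data.
        if row["obj_id"].strip().isdigit():
            usable.append(row)
        else:
            suspicious.append(row)
    return usable, suspicious


BEFORE = (
    "obj_id,value,unit\n"
    "1,520,http://qudt.org/vocab/unit/MegaPA\n"
    "#2,550,http://qudt.org/vocab/unit/MegaPA\n"
)
AFTER = BEFORE.replace("#2,", "2,")

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        usable, suspicious = check_rows(text)
        print(label, "usable:", len(usable), "suspicious:", len(suspicious))
```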

### Fix 3 — `robot/map.sh` (download wrote a log file instead of the ontology)
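`wget -o FILE` writes wget's log messages to `FILE`; the flag that sets the download target is `-O FILE`. The fix also sends an `Accept: application/rdf+xml` header to request RDF/XML.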
```diff
- wget "https://w3id.org/pmd/co/" -o pmdco.owl
+ wget --header="Accept: application/rdf+xml" -O pmdco.owl "https://w3id.org/pmd/co/"
```

### Fix 4 — `robot/template.tsv` (only covered 1 measurement)
Added six rows for the second measurement (`obj_2 … spec_2`, value `550`), matching the
27-column layout of measurement 1. The single `tensile strength` class definition
(`TTO_0000053`) is *not* duplicated. After: 68 triples, 2 measurements.

---

## 5. Remaining inconsistencies (not plain typos)

Substantive modelling differences that were **not** changed:

- **python-rdflib:** `specified_output_of` is attached to `obj` instead of `datum`
(`g.add((obj, specified_output_of, proc))`).
- **robot:** uses `OBI_0002135` (*has specified value*) instead of `OBI_0001937`
(*has specified numeric value*); attaches `specifies_value_of` to the `datum`;
`realizes` and `specified_output_of` are missing (labels not resolved in pmdco).
- **Datatype:** only **ottr** types the value as `xsd:integer`; the others use `xsd:string`.
- **Namespaces:** inconsistent (`example.com/ns#`, `example.com/`, `example.org/`).
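
The datatype difference comes down to how the value is written in Turtle: a bare `520` is shorthand for an `xsd:integer` literal, while `"520"` is a plain string literal. A minimal sketch of a helper that would make a template-based approach emit integers where possible is shown below; `turtle_literal` is a hypothetical name, not a function from any of the example folders, and decimal values such as `520.5` would still fall through to the string branch.

```python
def turtle_literal(raw: str) -> str:
    """Render a CSV cell as a Turtle literal: bare integer if possible, else a quoted string."""
    text = raw.strip()
    try:
        # Turtle reads a bare 520 as "520"^^xsd:integer.
        return str(int(text))
    except ValueError:
        # Fall back to a string literal, escaping characters that would break the quotes.
        escaped = (
            text.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return f'"{escaped}"'


if __name__ == "__main__":
    for cell in ("520", " 550 ", "n/a"):
        print(repr(cell), "->", turtle_literal(cell))
```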

---

## 6. When to use which

| Method | Strengths | Weaknesses |
|--------|-----------|------------|
| **python-template** | trivial, no dependencies, full control | no validation → easily produces invalid RDF; scales poorly |
| **python-rdflib** | real graph object, guaranteed syntactically valid, good for logic/branching | imperative, pattern scattered across many `g.add()` calls |
| **ottr** | declarative, pattern defined *once*, compact data, datatype typing | Java tooling, custom stOTTR syntax |
| **yarrrml / RML** | W3C-aligned standard, declarative, works directly from CSV/JSON/DB, ETL-ready | heavy runtime (Docker / large jars), YAML learning curve |
| **robot** | produces a full OWL ontology, label→IRI resolution | tied to the OWL/ROBOT workflow, the spreadsheet gets unwieldy quickly |

**Conclusion:** all five methods cover the same semantic core, but they are **not**
triple-identical: they differ in coverage, datatypes, OWL scaffolding and individual
properties. For pure **data→RDF transformation**, **ottr** (compact/declarative) or
**yarrrml/RML** (standard, ETL) are the cleanest; **python-rdflib** is the pragmatic
all-purpose choice; **robot** is the right tool when the goal is an **ontology** (not just
instance data).
2 changes: 2 additions & 0 deletions data-transformation/python-template/map.py
@@ -1,6 +1,8 @@


prefix="""
@prefix ex: <http://example.com/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tensile_strength: <https://w3id.org/pmd/tto/TTO_0000053> .
@prefix has_quality: <http://purl.obolibrary.org/obo/RO_0000086> .
@prefix has_role: <http://purl.obolibrary.org/obo/RO_0000087> .
2 changes: 1 addition & 1 deletion data-transformation/robot/map.sh
@@ -7,7 +7,7 @@ fi

# download recent pmdco
if [ ! -f pmdco.owl ]; then
- wget "https://w3id.org/pmd/co/" -o pmdco.owl
+ wget --header="Accept: application/rdf+xml" -O pmdco.owl "https://w3id.org/pmd/co/"
fi

java -jar robot.jar template --input pmdco.owl --template template.tsv --output result.ttl
6 changes: 6 additions & 0 deletions data-transformation/robot/template.tsv
@@ -7,6 +7,12 @@ http://example.org/role_1 role_1 test piece role
http://example.org/proc_1 proc_1 tensile testing process obj_1
http://example.org/datum_1 datum_1 measurement datum spec_1 qual_1 proc_1
http://example.org/spec_1 spec_1 value specification 520 http://qudt.org/vocab/unit/MegaPA
http://example.org/obj_2 obj_2 object qual_2 role_2
http://example.org/qual_2 qual_2 tensile strength
http://example.org/role_2 role_2 test piece role
http://example.org/proc_2 proc_2 tensile testing process obj_2
http://example.org/datum_2 datum_2 measurement datum spec_2 qual_2 proc_2
http://example.org/spec_2 spec_2 value specification 550 http://qudt.org/vocab/unit/MegaPA



2 changes: 1 addition & 1 deletion data-transformation/yarrrml/tensile.csv
@@ -1,3 +1,3 @@
obj_id,value,unit
1,520,http://qudt.org/vocab/unit/MegaPA
- #2,550,http://qudt.org/vocab/unit/MegaPA
+ 2,550,http://qudt.org/vocab/unit/MegaPA