From fdf2313ce2553d427c724de3945a7f03ac61b964 Mon Sep 17 00:00:00 2001
From: lepy
Date: Tue, 23 Jun 2026 12:52:13 +0200
Subject: [PATCH 1/4] Fix transformation examples, add comparison and extend docs

- python-template: declare missing ex: and rdf: prefixes (output was invalid Turtle)
- yarrrml: enable the second measurement row in tensile.csv
- robot: fix pmdco download (-O instead of -o); add second measurement to template.tsv
- add COMPARISON.md comparing all five transformation methods
- extend README with example data, prerequisites, methods and a comparison table
- .gitignore: exclude downloaded tooling and generated outputs
---
 .gitignore                                 |  12 ++
 README.md                                  |  86 ++++++++++----
 data-transformation/COMPARISON.md          | 129 +++++++++++++++++++++
 data-transformation/python-template/map.py |   2 +
 data-transformation/robot/map.sh           |   2 +-
 data-transformation/robot/template.tsv     |   6 +
 data-transformation/yarrrml/tensile.csv    |   2 +-
 7 files changed, 212 insertions(+), 27 deletions(-)
 create mode 100644 data-transformation/COMPARISON.md

diff --git a/.gitignore b/.gitignore
index e43b0f9..90c1949 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,13 @@
 .DS_Store
+
+# Downloaded tooling (large binaries — do not commit)
+*.jar
+
+# Downloaded ontology
+data-transformation/robot/pmdco.owl
+
+# Generated transformation outputs
+data-transformation/ottr/tensile.ttl
+data-transformation/ottr/tensile-blank.ttl
+data-transformation/robot/result.ttl
+data-transformation/yarrrml/temp.rml.ttl
diff --git a/README.md b/README.md
index 4104ff1..5d1ebbb 100644
--- a/README.md
+++ b/README.md
@@ -27,43 +27,79 @@ The **data-transformation** examples illustrate multiple strategies for generati
 * Python + RDFLib programmatic graph construction
 * YARRRML/RML mappings executed with the RML Mapper
 * OTTR template expansion using Lutra
+* ROBOT template expansion into OWL using the ROBOT tool
 
-The examples use a very simple CSV-based data as a simple educational use case.
+The examples use a very simple CSV-based data set as an educational use case.
+
+## Example data
+
+All examples transform the same two tensile-strength measurements:
+
+| obj_id | value | unit |
+|--------|-------|------|
+| 1 | 520 | `http://qudt.org/vocab/unit/MegaPA` |
+| 2 | 550 | `http://qudt.org/vocab/unit/MegaPA` |
+
+Each measurement is modelled with the PMD / BFO / OBI / RO pattern: a specimen *has a
+quality* (tensile strength) whose value is captured by a *value specification*, produced
+as the *measurement datum* of a *measurement process*.
+
+## Prerequisites
+
+Depending on the method you want to run:
+
+| Method | Requirements |
+|--------|--------------|
+| `python-template` | Python 3 (standard library only) |
+| `python-rdflib` | Python 3 + [`rdflib`](https://rdflib.readthedocs.io/) (`pip install rdflib`) |
+| `ottr` | Java (JRE 8+); `map.sh` downloads [Lutra](https://ottr.xyz/) automatically |
+| `yarrrml` | Docker; `map.sh` pulls the `rmlio/yarrrml-parser` and `rmlio/rmlmapper-java` images |
+| `robot` | Java (JRE 8+); `map.sh` downloads [ROBOT](http://robot.obolibrary.org/) and the PMD core ontology automatically |
+
+The `ottr`, `yarrrml` and `robot` examples require internet access on first run to fetch
+their tooling. Downloaded tools and generated outputs are git-ignored.
+
+## Methods
+
+* **`python-template/`** — fills a plain string template per CSV row and prints Turtle.
+  Simplest approach, no dependencies, but performs no validation.
+* **`python-rdflib/`** — builds the graph programmatically with RDFLib and serialises it.
+  Guaranteed to be syntactically valid; good when you need logic or branching.
+* **`ottr/`** — defines the triple pattern once as a reusable [OTTR](https://ottr.xyz/)
+  template and expands the data instances with Lutra. Two variants are provided: a named
+  one (stable IRIs for every individual) and a blank-node one (`map-blank.sh`).
+* **`yarrrml/`** — declarative [YARRRML](https://rml.io/yarrrml/) mapping rules that are
+  compiled to RML and executed against the CSV with the RML Mapper.
+* **`robot/`** — a [ROBOT](http://robot.obolibrary.org/) spreadsheet template (TSV) that is
+  expanded into a full OWL ontology, resolving labels to IRIs via the PMD core ontology.
 
 ## Usage
 
-Each folder contains a `map.sh` or `map.py` file which can be executed to run the scripts.
+Each folder contains a `map.sh` or `map.py` file which can be executed to run the example.
 
 ```
-cd data-transformation/
-sh run.sh
+cd data-transformation/
+sh map.sh
 ```
+
 ```
-cd data-transformation/
+cd data-transformation/
 python map.py
 ```
 
-## Comparison Table
-
-| Criterion | Python Templates | Python + RDFLib | YARRRML/RML | OTTR |
-| ----------------------- | ---------------- | --------------- | ----------- | ----- |
-| Easy to start | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
-| RDF correctness | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
-| Reusability | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
-| Maintainability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
-| Non-programmer friendly | ⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
-| Standards-based | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
-| Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
-| Learning RDF concepts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
-
-
+## Comparison
 
-## Practical Recommendation
+At a glance:
 
-For most real-world semantic data integration projects:
+| | python-template | python-rdflib | ottr | yarrrml | robot |
+|--|-----------------|---------------|------|---------|-------|
+| **Paradigm** | string templating | programmatic (RDFLib) | OTTR templates | YARRRML/RML rules | ROBOT template |
+| **Runtime** | Python | Python + rdflib | Java (Lutra) | Docker | Java (ROBOT) |
+| **Input** | CSV | CSV | stOTTR | CSV + YAML | TSV |
+| **Output** | Turtle | Turtle | Turtle | N-Triples | OWL (Turtle) |
+| **Produces** | instance data | instance data | instance data | instance data | OWL ontology |
 
-* YARRRML/RML is usually the best default choice because mappings are declarative, portable, and maintainable.
-* Python + RDFLib is preferable when transformations involve substantial computation, data cleaning, external APIs, or complex business rules.
-* OTTR is particularly valuable when the RDF model contains many recurring graph patterns and you want template reuse.
-* Plain Python string templates are mainly useful for teaching, experimentation, and very small one-off transformations.
+For a full side-by-side comparison — the RDF each one produces, the triple-level pattern,
+and their differences — see
+[`data-transformation/COMPARISON.md`](data-transformation/COMPARISON.md).
diff --git a/data-transformation/COMPARISON.md b/data-transformation/COMPARISON.md
new file mode 100644
index 0000000..550a6bb
--- /dev/null
+++ b/data-transformation/COMPARISON.md
@@ -0,0 +1,129 @@
+# Comparison of the data-transformation methods
+
+A comparison of the five approaches in `data-transformation/` that convert the same
+tabular tensile-strength measurement data into RDF — based on the PMD / BFO / OBI / RO
+modelling pattern.
+
+> **Use case:** 2 measurements (`obj_1` = 520 MPa, `obj_2` = 550 MPa, unit QUDT `MegaPA`).
+> Expected pattern per measurement: *a specimen has a tensile strength, determined in a
+> measurement process with a value + unit.*
+
+---
+
+## 1. Overview: paradigm & setup
+
+| Method | Paradigm | Runtime / dependencies | Input | Run |
+|--------|----------|------------------------|-------|-----|
+| **python-template** | String templating | Python (stdlib only) | CSV inline | `python map.py` |
+| **python-rdflib** | Programmatic graph construction | Python + `rdflib` | CSV inline | `python map.py` |
+| **ottr** | Declarative ontology templates | Java + Lutra (≈44 MB) | `.stottr` files | `sh map.sh` |
+| **yarrrml / RML** | Declarative mapping rules | Docker *(or Node + rmlmapper ≈184 MB)* | CSV + YAML | `sh map.sh` |
+| **robot** | Spreadsheet → OWL template | Java + ROBOT (≈83 MB) + ontology (pmdco) | TSV | `sh map.sh` |
+
+---
+
+## 2. Results (after bug fixes)
+
+| Method | Status | #Triples | #Measurements | Value datatype | Example namespace |
+|--------|--------|---------:|--------------:|----------------|-------------------|
+| python-template | ✅ valid | 22 | 2 | `"520"` (string) | `example.com/ns#` |
+| python-rdflib | ✅ valid | 22 | 2 | `"520"` (string) | `example.org/` |
+| ottr (named) | ✅ valid | 22 | 2 | `520` (**integer**) | `example.com/ns#` |
+| yarrrml | ✅ valid | 22 | 2 | `"520"` (string) | `example.com/` |
+| robot | ✅ valid (OWL) | 68 | 2 | `"520"` (string) | `example.org/` |
+
+Validated with `rdflib` 7.6.0; the yarrrml output is N-Triples, the rest Turtle.
+
+---
+
+## 3. Instance pattern per measurement (✓ = triple present)
+
+| (Subject – Predicate – Object) | py-template | py-rdflib | ottr | yarrrml | robot |
+|---|:--:|:--:|:--:|:--:|:--:|
+| `qual a tensile_strength` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `obj has_quality qual` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `obj has_role role` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `datum a measurement_datum` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `datum has_value_specification spec` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `proc has_participant obj` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `spec has_measurement_unit_label MegaPA` | ✓ | ✓ | ✓ | ✓ | ✓ |
+| `proc realizes role` | ✓ | ✓ | ✓ | ✓ | **·** |
+| `spec specifies_value_of qual` | ✓ | ✓ | ✓ | ✓ | **·** |
+| `spec has_specified_numeric_value` (`OBI_0001937`) | ✓ | ✓ | ✓ | ✓ | **·** |
+| `datum specified_output_of proc` | ✓ | **·** | ✓ | ✓ | **·** |
+| `obj specified_output_of proc` | · | **✓ ⚠** | · | · | · |
+| `datum specifies_value_of qual` | · | · | · | · | **✓ ⚠** |
+| `spec has_specified_value` (`OBI_0002135`) | · | · | · | · | **✓ ⚠** |
+| `obj a object` / `role a test_piece_role` / `proc a …process` / `spec a value_specification` | · | · | · | · | **✓** |
+
+**Common core** (top 7 rows): identical across all methods. **python-template, ottr and
+yarrrml** produce the full, consistent 11-triple pattern per measurement. **python-rdflib**
+and **robot** deviate (see ⚠ and Section 5).
+
+---
+
+## 4. Bugs fixed
+
+Errors found and corrected in the example files during the comparison:
+
+### Fix 1 — `python-template/map.py` (output was invalid Turtle)
+The prefix block did not declare `ex:` and `rdf:`, although the template uses them.
+```diff
+ prefix="""
++ @prefix ex: .
++ @prefix rdf: .
+ @prefix tensile_strength: .
+```
+Before: parser aborted (`Bad syntax (Prefix "ex)`). After: 22 triples, valid.
+
+### Fix 2 — `yarrrml/tensile.csv` (only 1 instead of 2 measurements)
+```diff
+ obj_id,value,unit
+ 1,520,http://qudt.org/vocab/unit/MegaPA
+- #2,550,http://qudt.org/vocab/unit/MegaPA
++ 2,550,http://qudt.org/vocab/unit/MegaPA
+```
+
+### Fix 3 — `robot/map.sh` (download wrote a log file instead of the ontology)
+`wget -o` = log file; the correct flag is `-O` = output file.
+```diff
+- wget "https://w3id.org/pmd/co/" -o pmdco.owl
++ wget --header="Accept: application/rdf+xml" -O pmdco.owl "https://w3id.org/pmd/co/"
+```
+
+### Fix 4 — `robot/template.tsv` (only covered 1 measurement)
+Added six rows for the second measurement (`obj_2 … spec_2`, value `550`), matching the
+27-column layout of measurement 1. The single `tensile strength` class definition
+(`TTO_0000053`) is *not* duplicated. After: 68 triples, 2 measurements.
+
+---
+
+## 5. Remaining inconsistencies (not plain typos)
+
+Substantive modelling differences that were **not** changed:
+
+- **python-rdflib:** `specified_output_of` is attached to `obj` instead of `datum`
+  (`g.add((obj, specified_output_of, proc))`).
+- **robot:** uses `OBI_0002135` (*has specified value*) instead of `OBI_0001937`
+  (*has specified numeric value*); attaches `specifies_value_of` to the `datum`;
+  `realizes` and `specified_output_of` are missing (labels not resolved in pmdco).
+- **Datatype:** only **ottr** types the value as `xsd:integer`; the others use `xsd:string`.
+- **Namespaces:** inconsistent (`example.com/ns#`, `example.com/`, `example.org/`).
+
+---
+
+## 6. When to use which
+
+| Method | Strengths | Weaknesses |
+|--------|-----------|------------|
+| **python-template** | trivial, no dependencies, full control | no validation → easily produces invalid RDF; scales poorly |
+| **python-rdflib** | real graph object, guaranteed syntactically valid, good for logic/branching | imperative, pattern scattered across many `g.add()` calls |
+| **ottr** | declarative, pattern defined *once*, compact data, datatype typing | Java tooling, custom stOTTR syntax |
+| **yarrrml / RML** | W3C-aligned standard, declarative, works directly from CSV/JSON/DB, ETL-ready | heavy runtime (Docker / large jars), YAML learning curve |
+| **robot** | produces a full OWL ontology, label→IRI resolution | tied to the OWL/ROBOT workflow, the spreadsheet gets unwieldy quickly |
+
+**Conclusion:** all hit the same semantic core, but they are **not** triple-identical —
+they differ in coverage, datatypes, OWL scaffolding and individual properties. For pure
+**data→RDF transformation**, **ottr** (compact/declarative) or **yarrrml/RML** (standard,
+ETL) are the cleanest; **python-rdflib** is the pragmatic all-purpose choice; **robot** is
+the right tool when the goal is an **ontology** (not just instance data).
diff --git a/data-transformation/python-template/map.py b/data-transformation/python-template/map.py
index 102e1c6..25b6e39 100644
--- a/data-transformation/python-template/map.py
+++ b/data-transformation/python-template/map.py
@@ -1,6 +1,8 @@
 prefix="""
+@prefix ex: .
+@prefix rdf: .
 @prefix tensile_strength: .
 @prefix has_quality: .
 @prefix has_role: .
diff --git a/data-transformation/robot/map.sh b/data-transformation/robot/map.sh
index 8d03dc4..94cb8d8 100644
--- a/data-transformation/robot/map.sh
+++ b/data-transformation/robot/map.sh
@@ -7,7 +7,7 @@ fi
 
 # download recent pmdco
 if [ ! -f pmdco.owl ]; then
-    wget "https://w3id.org/pmd/co/" -o pmdco.owl
+    wget --header="Accept: application/rdf+xml" -O pmdco.owl "https://w3id.org/pmd/co/"
 fi
 
 java -jar robot.jar template --input pmdco.owl --template template.tsv --output result.ttl
diff --git a/data-transformation/robot/template.tsv b/data-transformation/robot/template.tsv
index c177c19..4e490f1 100644
--- a/data-transformation/robot/template.tsv
+++ b/data-transformation/robot/template.tsv
@@ -7,6 +7,12 @@ http://example.org/role_1 role_1 test piece role
 http://example.org/proc_1 proc_1 tensile testing process obj_1
 http://example.org/datum_1 datum_1 measurement datum spec_1 qual_1 proc_1
 http://example.org/spec_1 spec_1 value specification 520 http://qudt.org/vocab/unit/MegaPA
+http://example.org/obj_2 obj_2 object qual_2 role_2
+http://example.org/qual_2 qual_2 tensile strength
+http://example.org/role_2 role_2 test piece role
+http://example.org/proc_2 proc_2 tensile testing process obj_2
+http://example.org/datum_2 datum_2 measurement datum spec_2 qual_2 proc_2
+http://example.org/spec_2 spec_2 value specification 550 http://qudt.org/vocab/unit/MegaPA
diff --git a/data-transformation/yarrrml/tensile.csv b/data-transformation/yarrrml/tensile.csv
index ee41754..c8ce91a 100644
--- a/data-transformation/yarrrml/tensile.csv
+++ b/data-transformation/yarrrml/tensile.csv
@@ -1,3 +1,3 @@
 obj_id,value,unit
 1,520,http://qudt.org/vocab/unit/MegaPA
-#2,550,http://qudt.org/vocab/unit/MegaPA
\ No newline at end of file
+2,550,http://qudt.org/vocab/unit/MegaPA
\ No newline at end of file

From 6edb3cb564783c801234c691254de93260a46989 Mon Sep 17 00:00:00 2001
From: lepy
Date: Tue, 23 Jun 2026 12:55:54 +0200
Subject: [PATCH 2/4] Restore star comparison table and recommendations, add ROBOT column

The previous commit was based on an older archive snapshot and accidentally dropped the star-rating Comparison Table and Practical Recommendation sections.
Both are restored and extended with a ROBOT column / recommendation.
---
 README.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/README.md b/README.md
index 5d1ebbb..4e0b323 100644
--- a/README.md
+++ b/README.md
@@ -100,6 +100,29 @@ At a glance:
 | **Output** | Turtle | Turtle | Turtle | N-Triples | OWL (Turtle) |
 | **Produces** | instance data | instance data | instance data | instance data | OWL ontology |
 
+### Comparison Table
+
+| Criterion | Python Templates | Python + RDFLib | YARRRML/RML | OTTR | ROBOT |
+| ----------------------- | ---------------- | --------------- | ----------- | ----- | ----- |
+| Easy to start | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
+| RDF correctness | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
+| Reusability | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
+| Maintainability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
+| Non-programmer friendly | ⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
+| Standards-based | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
+| Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
+| Learning RDF concepts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
+
+### Practical Recommendation
+
+For most real-world semantic data integration projects:
+
+* YARRRML/RML is usually the best default choice because mappings are declarative, portable, and maintainable.
+* Python + RDFLib is preferable when transformations involve substantial computation, data cleaning, external APIs, or complex business rules.
+* OTTR is particularly valuable when the RDF model contains many recurring graph patterns and you want template reuse.
+* ROBOT is the right choice when the goal is an OWL ontology (classes, axioms, labels) rather than just instance data, especially within the OBO / ontology-engineering ecosystem.
+* Plain Python string templates are mainly useful for teaching, experimentation, and very small one-off transformations.
+
 For a full side-by-side comparison — the RDF each one produces, the triple-level pattern,
 and their differences — see
 [`data-transformation/COMPARISON.md`](data-transformation/COMPARISON.md).

From d3aac2f096ac6fc192a33fb01bf9402ec2ee7eb7 Mon Sep 17 00:00:00 2001
From: lepy
Date: Tue, 23 Jun 2026 12:59:45 +0200
Subject: [PATCH 3/4] Adjust ROBOT ratings: lower maintainability and large-scale KG to 2 stars

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 4e0b323..4f48017 100644
--- a/README.md
+++ b/README.md
@@ -107,10 +107,10 @@ At a glance:
 | Easy to start | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
 | RDF correctness | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
 | Reusability | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
-| Maintainability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
+| Maintainability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
 | Non-programmer friendly | ⭐ | ⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
 | Standards-based | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
-| Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
+| Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
 | Learning RDF concepts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
 
 ### Practical Recommendation

From 3363851eea32f4e5f5a92ab8689dfc6c20260270 Mon Sep 17 00:00:00 2001
From: lepy
Date: Tue, 23 Jun 2026 13:03:06 +0200
Subject: [PATCH 4/4] Add strengths/weaknesses table to README

---
 README.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/README.md b/README.md
index 4f48017..750a5a8 100644
--- a/README.md
+++ b/README.md
@@ -113,6 +113,16 @@ At a glance:
 | Large-scale KG projects | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
 | Learning RDF concepts | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
 
+### Strengths and weaknesses
+
+| Method | Strengths | Weaknesses |
+|--------|-----------|------------|
+| python-template | trivial, no dependencies, full control | no validation → easily produces invalid RDF; scales poorly |
+| python-rdflib | real graph object, guaranteed syntactically valid, good for logic/branching | imperative, pattern scattered across many `g.add()` calls |
+| ottr | declarative, pattern defined once, compact data, datatype typing | Java tooling, custom stOTTR syntax |
+| yarrrml / RML | W3C-aligned standard, declarative, works directly from CSV/JSON/DB, ETL-ready | heavy runtime (Docker / large jars), YAML learning curve |
+| robot | produces a full OWL ontology, label→IRI resolution | tied to the OWL/ROBOT workflow, the spreadsheet gets unwieldy quickly |
+
 ### Practical Recommendation
+
 For most real-world semantic data integration projects:
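
The missing `ex:` and `rdf:` declarations behind Fix 1 in PATCH 1/4 are the kind of error that string templating cannot catch on its own. A minimal sketch of a pre-flight check follows, assuming the template output is available as a string. The helper name `undeclared_prefixes` and the example IRIs are hypothetical and not part of the repository, and the regular expressions only approximate Turtle syntax, so a real parser such as `rdflib` remains the authoritative check.

```python
import re

# "@prefix name: <iri> ." (also accepts the SPARQL-style "PREFIX name: <iri>").
PREFIX_DECL = re.compile(r"(?im)^\s*@?prefix\s+([A-Za-z][\w-]*)?:\s*<")
# Any "name:" that starts a prefixed name, e.g. ex:obj_1 or has_quality:
PREFIX_USE = re.compile(r"(?<![\w:])([A-Za-z][\w-]*):")


def undeclared_prefixes(turtle: str) -> set[str]:
    """Return prefixes that are used in `turtle` but never declared (rough check)."""
    declared = {m.group(1) or "" for m in PREFIX_DECL.finditer(turtle)}
    # Drop declaration lines, full IRIs and string literals before scanning,
    # so that "http:" inside <...> or text inside "..." is not mistaken for a prefix.
    body = re.sub(r"(?im)^\s*@?prefix\b.*$", "", turtle)
    body = re.sub(r"<[^>]*>", "", body)
    body = re.sub(r'"(?:[^"\\]|\\.)*"', "", body)
    used = {m.group(1) for m in PREFIX_USE.finditer(body)}
    return used - declared


# Hypothetical template output resembling the pre-fix situation:
# has_quality: is declared, ex: and rdf: are not.
broken = (
    "@prefix has_quality: <http://example.org/has_quality> .\n"
    "ex:obj_1 has_quality: ex:qual_1 .\n"
    "ex:qual_1 rdf:type ex:TensileStrength .\n"
)
fixed = (
    "@prefix ex: <http://example.org/ns#> .\n"
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
    + broken
)

print(sorted(undeclared_prefixes(broken)))  # ['ex', 'rdf']
print(sorted(undeclared_prefixes(fixed)))   # []
```

Run against the output of a string-template script before it is written to disk, a check like this would flag the missing declarations without adding a dependency; it does not replace parsing the result with an RDF library.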