Help:WikiPathways Metabolomics
Current revision (08:00, 10 February 2023)
On this page we collect SPARQL queries to see the state of the Metabolome in WikiPathways. Triggered by User:Andra's RDF / SPARQL work, curation started with metabolites without database identifiers. But this soon led to the observation that metabolites are often not even annotated as being a metabolite (using <Label> rather than <DataNode>). Therefore, User:Egonw started at Pathway:WP1 to curate them one by one and fix these issues:
- connect lines between metabolites
- convert metabolites to use <DataNode> rather than <Label>
The reason is that these are basic properties that metabolomics research depends on.
The Data
The latest revision you can look up with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select str(?o) where {
  ?pw a wp:Pathway ;
    <http://purl.org/pav/version> ?o .
} order by desc(?o) limit 1
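Queries like the one above can also be submitted from a script instead of the web form. The following is a minimal sketch, assuming the public endpoint lives at http://sparql.wikipathways.org/sparql and accepts `query` and `format` as GET parameters; both are assumptions about the service at the time this page was written and may have changed. The helper name `build_query_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint address; it may have moved since this page was written.
ENDPOINT = "http://sparql.wikipathways.org/sparql"

# The "latest revision" query from this page, unchanged.
LATEST_REVISION_QUERY = """
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select str(?o) where {
  ?pw a wp:Pathway ;
    <http://purl.org/pav/version> ?o .
} order by desc(?o) limit 1
"""


def build_query_url(query, result_format="text/html"):
    """Return a GET URL that submits `query` to the assumed endpoint."""
    # urlencode percent-encodes the query text (spaces become '+').
    params = urlencode({"query": query.strip(), "format": result_format})
    return f"{ENDPOINT}?{params}"


url = build_query_url(LATEST_REVISION_QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; only the URL construction is shown here, so the sketch runs without network access.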
Metabolome
The following queries provide an overview of the Metabolome captured by WikiPathways.
The key type for metabolites is wp:Metabolite. We can see all available properties with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?mb a wp:Metabolite ;
    ?p [] .
}
Pathway properties
Likewise, we can get all pathway properties with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?mb a wp:Pathway ;
    ?p [] .
}
Latest data only
To only get analysis of the most recent pathways, add this snippet to your SPARQL, assuming ?pathway is the used variable name:
?mb dcterms:isPartOf ?pathway .
?pathway pav:version ?version .
?mb dcterms:isPartOf ?pathway2 .
?pathway2 pav:version ?version2 .
FILTER (?version2 > ?version)
However, it should be kept in mind that this is not a fool-proof solution.
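As an alternative to filtering inside the query, the results can be reduced to the latest revision on the client side. This is a minimal sketch, not the method used on this page: it assumes pathway URIs follow the `.../Pathway/WP28_r38852` pattern seen in the results on this page, and the function name `latest_only` is made up for this example.

```python
import re

# Pathway revision URIs on this page look like .../Pathway/WP28_r38852
PATHWAY_RE = re.compile(r"/Pathway/(WP\d+)_r(\d+)")


def latest_only(pathway_uris):
    """Keep, per pathway ID, only the URI with the highest revision number."""
    latest = {}
    for uri in pathway_uris:
        match = PATHWAY_RE.search(uri)
        if match is None:
            continue  # skip anything that is not a pathway revision URI
        pathway_id = match.group(1)
        revision = int(match.group(2))  # compare revisions as numbers, not strings
        if pathway_id not in latest or revision > latest[pathway_id][0]:
            latest[pathway_id] = (revision, uri)
    return sorted(uri for _, uri in latest.values())


uris = [
    "http://rdf.wikipathways.org/Pathway/WP28_r38852",
    "http://rdf.wikipathways.org/Pathway/WP28_r30000",  # hypothetical older revision
    "http://rdf.wikipathways.org/Pathway/WP716_r45017",
]
print(latest_only(uris))
```

Note that URIs of elements below a pathway (for example a `/group/...` suffix) match the same pattern and are treated as belonging to that pathway revision.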
All Metabolites
Count
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select count(?mb) where {
  ?mb a wp:Metabolite .
}
Revision | Count |
67787 | 5790 |
69675 | 5801 |
List
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct ?mb ?label where {
  ?mb a wp:Metabolite ;
    rdfs:label ?label .
}
All zebrafish metabolites
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX gpml: <http://vocabularies.wikipathways.org/gpml#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select distinct ?metabolite (str(?titleLit) as ?title) where {
  ?metabolite a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw dc:title ?titleLit ;
    wp:organismName "Danio rerio" .
}
Metabolic Data Sources
Sorted by use
HMDB, ChEBI, and KEGG are the main data sources for identifiers. InChI/InChIKey should also be there but is missing. A big curation process in January 2013 ensured that "PubChem compound" is now used as data source for PubChem CIDs.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select str(?datasource) as ?source count(distinct ?identifier) as ?count
where {
  ?mb a wp:Metabolite ;
    dc:source ?datasource ;
    dc:identifier ?identifier .
} order by desc(?count)
All metabolites from one source
All KEGG identifiers
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "KEGG Compound" ;
    dc:identifier ?identifier .
} order by ?identifier
All HMDB identifiers
Return all HMDB identifiers with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB" ;
    dc:identifier ?identifier .
} order by ?identifier
Return all metabolites that list HMDB as their data source but have no identifier:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB"^^xsd:string ;
    dc:identifier ?identifier .
  FILTER (regex(str(?identifier),"noIdentifier"))
} order by ?identifier
At the time of writing, this showed a number of XRefs with HMDB as data source but no identifiers, which needs curation:
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1002_r35260
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1119_r35265
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1250_r41240
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1266_r41328
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1285_r41669
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1304_r41670
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1310_r41659
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1339_r35269
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP167_r45138
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP2267_r53133
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852/group/ac37a
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP295_r41324
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP337_r41644
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP495_r41327
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP59_r41653
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP678_r41165
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP716_r45017
Metabolic Pathways
Of general interest is the number of pathways per species:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct str(?orgName) as ?organism count(?pw) as ?pathways where {
  ?pw wp:organism ?organismCode .
  ?organismCode rdfs:label ?orgName
} order by desc(?pathways)
Metabolomes
Human Metabolome
This only returns 244 metabolites, which is not a lot at all, and does not even take the metabolite identity into account. Is something wrong with wp:organism? It finds 107 human pathways.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:9606 .
} order by ?mb
Revision | Count |
67787 | 1972 |
69675 | 2000 |
Arabidopsis thaliana Metabolome
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:3702 .
} order by ?mb
Revision | Count |
69675 | 17 |
Pathways with the most metabolites
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix pav: <http://purl.org/pav/>

select ?pathway count(?mb) as ?mbCount
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
} order by desc(?mbCount)
Metabolites in the most Pathways
Note that BridgeDb identifier mapping is not involved yet.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix pav: <http://purl.org/pav/>

select ?mb count(?pathway) as ?pwCount
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
} order by desc(?pwCount)
Identifier Mapping Completeness
Right now, the HMDB is the primary (and only) source of mappings. That raises the question how many metabolites are in WP that do not have mappings to other databases. The following queries are about that.
The missing mappings
The next query counts all identifiers that are missing in HMDB and therefore lack mappings to other databases: at the time of writing, there are 724 (was 927) of them. These are not unique identifiers; the number of unique ones is 369 (was 404) at the time of writing. Given there are about 1400 unique metabolite identifiers, this is roughly 26% (about 29% for the earlier count), which is rather significant. The major databases with unmapped resources are ChEBI and KEGG (see screenshot on the right).
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select count(?source)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" ||
          str(?source) = "Kegg Compound" || str(?source) = "Chemspider" ||
          str(?source) = "HMDB")
  FILTER (str(?identifier) != "")
}
The full list
These are the unique identifiers missing:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select distinct str(?source) str(?identifier)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" ||
          str(?source) = "Kegg Compound" || str(?source) = "Chemspider" ||
          str(?source) = "HMDB")
  FILTER (str(?identifier) != "")
} order by ?source ?identifier
ChEBI identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "ChEBI"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
CAS identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "CAS"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
Kegg compound identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "Kegg Compound"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
PubChem-compound identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "PubChem-compound"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
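The four "not in HMDB" queries above differ only in the data source string. A possible way to avoid copy-and-paste is to generate them from one template. This is a sketch only; the names `NOT_IN_HMDB_TEMPLATE`, `SOURCES` and `not_in_hmdb_query` are made up for this example, and the `dc:` and `xsd:` prefix declarations are added on the assumption that they match what the endpoint predefines.

```python
from string import Template

# string.Template is used instead of str.format because SPARQL itself uses braces.
NOT_IN_HMDB_TEMPLATE = Template("""\
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "$source"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
""")

# Data source names exactly as they are spelled in the queries on this page.
SOURCES = ["ChEBI", "CAS", "Kegg Compound", "PubChem-compound"]


def not_in_hmdb_query(source):
    """Return the 'identifiers not in HMDB' query for one data source name."""
    if '"' in source or "\\" in source:
        # Refuse values that would break out of the quoted SPARQL literal.
        raise ValueError(f"unsupported character in data source name: {source!r}")
    return NOT_IN_HMDB_TEMPLATE.substitute(source=source)


for name in SOURCES:
    print(not_in_hmdb_query(name))
```

Each generated query is identical to its hand-written counterpart apart from the quoted data source.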
ChemSpider
Unique ChemSpider IDs
They can be counted with:
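prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix cheminf: <http://semanticscience.org/resource/>

select count(distinct ?identifier) where {
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:identifier ?identifier ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
}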
And all listed with this non-counting equivalent:
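prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix cheminf: <http://semanticscience.org/resource/>

select distinct str(?identifier) as ?csid where {
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:identifier ?identifier ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
}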
Linking ChemSpider IDs to WikiPathways
I need to ask Andra why not all pathways have a foaf:page, but this table should be discussed with Antony:
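prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix cheminf: <http://semanticscience.org/resource/>

select distinct str(?identifier) as ?csid ?page where {
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:identifier ?identifier ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
} order by ?csid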
Curation
Common wrong identifiers
PubChem-compound 1004
Wrongly used for phosphate. CID 1004 is the uncharged compound (phosphoric acid). Phosphate, and particularly things like "Pi", should instead use CID 1061 for ortho-phosphate, aka [PO4]3-.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?source
where {
  ?mb dc:source ?source ;
    dcterms:isPartOf ?pathway ;
    dcterms:identifier "1004"^^xsd:string .
}
Outdated HMDB identifiers
These results show HMDB identifiers that are used in WikiPathways but have been revoked or have become secondary identifiers.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB"^^xsd:string ;
    dc:identifier ?identifier .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
Metabolites not classified as such
One can list all data sources for non-metabolites with this query.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?datasource count(?identifier) as ?count
where {
  ?mb dc:source ?datasource ;
    dcterms:identifier ?identifier .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by desc(?count)
That mostly lists gene identifier sources, but watch out for the metabolite identifier data sources: metabolites that are not marked as such but do have a metabolite identifier can be found this way. Further down the list is CAS (but genes are chemicals too...), and a few more minor ones:
"CTD Gene"^^<http://www.w3.org/2001/XMLSchema#string> 5
"HMDB"^^<http://www.w3.org/2001/XMLSchema#string> 4
"ChEBI"^^<http://www.w3.org/2001/XMLSchema#string> 3
"GLYCAN"^^<http://www.w3.org/2001/XMLSchema#string> 3
"COMPOUND"^^<http://www.w3.org/2001/XMLSchema#string> 3
"PubChem"^^<http://www.w3.org/2001/XMLSchema#string> 2
I would expect GLYCAN and COMPOUND to be misnomers of the matching KEGG subsets.
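To follow up on the metabolite identifier sources in that list, the count query can be narrowed to specific data sources. The following is only a sketch: it assumes the source names are spelled exactly as in the counts above ("HMDB", "ChEBI", "PubChem") and that the endpoint supports the SPARQL 1.1 VALUES keyword.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

# Entities not typed as wp:Metabolite that still carry an identifier
# from a metabolite data source. The three source names are assumptions
# taken from the counts listed above; adjust them as needed.
select distinct ?pathway ?mb ?datasource (str(?identifier) as ?id) where {
  values ?datasource {
    "HMDB"^^xsd:string
    "ChEBI"^^xsd:string
    "PubChem"^^xsd:string
  }
  ?mb dc:source ?datasource ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  filter not exists { ?mb a wp:Metabolite }
}
order by ?datasource ?pathway
```

Each row points at a pathway where the data node type probably needs to be corrected.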
Non-Metabolites with CAS identifier
Note that a CAS identifier can also refer to mixtures, compound classes, etc.
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb (str(?label) as ?name) (str(?identifier) as ?id) where {
  ?mb dc:source "CAS"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
}
order by ?pathway
Non-Metabolites with PubChem identifier
At the time of writing, this results in an empty set.
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb rdfs:label ?label . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
}
order by ?pathway
Metabolites sometimes marked as DataNode@Type Metabolite
Based on label comparisons, we can find entities that are not typed as Metabolite but carry the same label as a data node that is. Of course, this can give false positives, because genes can be incorrectly marked as metabolite in some pathway, but that is another SPARQL query.
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?nonmb ?mb ?label where {
  ?nonmb rdfs:label ?label .
  ?mb rdfs:label ?label .
  OPTIONAL { ?nonmb dcterms:isPartOf ?pathway . }
  FILTER ( ?nonmb != ?mb )
  FILTER NOT EXISTS { ?nonmb a wp:Metabolite }
  FILTER EXISTS { ?mb a wp:Metabolite }
  FILTER (!regex(str(?nonmb), "noIdentifier", "i"))
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
}
Metabolites with an identifier but undefined data source
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source ""^^xsd:string ;
      dc:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER (!isIRI(?identifier))
  FILTER (str(?identifier) != "")
}
order by ?pathway
Metabolites with a data source but no identifier
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?source where {
  ?mb a wp:Metabolite ;
      dcterms:identifier ""^^xsd:string ;
      dc:source ?source ;
      dcterms:isPartOf ?pathway .
  FILTER (str(?source) != "")
  FILTER (!regex(str(?pathway), "internal.wikipathways.org", "i"))
}
order by ?pathway
Metabolites with too many labels
This is mainly caused by metabolite URIs being based on a non-existent identifier:
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select (count(?label) as ?count) ?pathway ?mb where {
  ?mb a wp:Metabolite ;
      rdfs:label ?label ;
      dcterms:isPartOf ?pathway .
}
group by ?pathway ?mb
order by desc(?count) ?pathway ?mb
limit 410
An example of such an entity, with many labels and typed as metabolite, gene, complex, etc.:
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct (str(?label) as ?labelStr) ?type where {
  <http://bio2rdf.org/geneid:noIdentifier> a ?type ;
      rdfs:label ?label .
}
order by ?label
Metabolites with an Entrez Gene identifier
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier where {
  ?mb a wp:Metabolite ;
      rdfs:label ?label ;
      dc:source "Entrez Gene"^^xsd:string ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER (str(?identifier) != "")
}
order by ?pathway
Metabolites as just Label
Metabolites may be marked up as DataNode but not typed as Metabolite. Here are some examples: ATP, CO2, ADP, Phosphate, L-glutamate, and Cholesterol.
ATP
This example shows how to find them.
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?source ?mb ?type where {
  ?mb rdfs:label "ATP"@en .
  ?mb a ?type .
  OPTIONAL { ?mb dc:source ?source . }
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite . }
}
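The same check can be run for all the example labels in one go. This is a sketch under two assumptions: that the other labels are stored with the same "@en" language tag as "ATP" and are spelled exactly as listed here, and that the endpoint supports the SPARQL 1.1 VALUES keyword.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

# One row per entity that carries one of the example labels
# but is not typed as wp:Metabolite.
select ?label ?pathway ?source ?mb ?type where {
  # Assumed spellings and language tags; edit to match the data.
  values ?label {
    "ATP"@en "CO2"@en "ADP"@en
    "Phosphate"@en "L-glutamate"@en "Cholesterol"@en
  }
  ?mb rdfs:label ?label ;
      a ?type .
  optional { ?mb dc:source ?source . }
  optional { ?mb dcterms:isPartOf ?pathway . }
  filter not exists { ?mb a wp:Metabolite . }
}
order by ?label ?pathway
```

Matching on exact literals keeps the query cheap; a case-insensitive comparison on str(?label) would catch more variants but is likely to be much slower.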
Metabolites also labeled as GeneProduct
Sometimes things are incorrectly marked as Metabolite, when they really are GeneProducts. We can list entities based on their label that are both annotated as Metabolite and as GeneProduct:
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?mb ?gene ?label where {
  ?gene rdfs:label ?label .
  ?mb rdfs:label ?label .
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  FILTER ( ?gene != ?mb )
  FILTER EXISTS { ?gene a wp:GeneProduct }
  FILTER EXISTS { ?mb a wp:Metabolite }
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
  FILTER (!regex(str(?gene), "noIdentifier", "i"))
}
Actually, this query does not do what I want it to do: the FILTER only removes rows from the result list, but still allows URIs with "noIdentifier" to link things together, which messes up the query if there is just one noIdentifier URI with the same label :(
Labels which are also marked as metabolite
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix gpml:    <http://vocabularies.wikipathways.org/gpml#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?labelNode (str(?label1) as ?labelStr) ?mb (str(?label2) as ?mbStr) where {
  ?labelNode a gpml:Label ;
      rdfs:label ?label1 ;
      dcterms:isPartOf ?pathway .
  ?mb a wp:Metabolite ;
      rdfs:label ?label2 .
  FILTER ( ?labelNode != ?mb )
  FILTER ( str(?label2) = str(?label1) )
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
  FILTER (!regex(str(?labelNode), "noIdentifier", "i"))
}
LIMIT 50 OFFSET 25
To get the most common such labels, use the following query (though it typically times out on Virtuoso 6.1):
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix gpml:    <http://vocabularies.wikipathways.org/gpml#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select (str(?label1) as ?labelStr) (count(?labelNode) as ?count) where {
  ?labelNode a gpml:Label ;
      rdfs:label ?label1 .
  ?mb a wp:Metabolite ;
      rdfs:label ?label2 .
  FILTER ( ?labelNode != ?mb )
  FILTER ( str(?label2) = str(?label1) )
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
  FILTER (!regex(str(?labelNode), "noIdentifier", "i"))
}
group by ?label1
order by desc(?count)