Help:WikiPathways Metabolomics
From WikiPathways
(→PubChem-compound 1004) |
Current revision (08:00, 10 February 2023) (→All HMDB identifiers) |
||
(82 intermediate revisions not shown.) | |||
Line 5: | Line 5: | ||
The reason for this is that these are some basic underlying properties we need for metabolomics research fields. | The reason for this is that these are some basic underlying properties we need for metabolomics research fields. | ||
+ | |||
+ | = The Data = | ||
+ | |||
+ | You can look up the latest revision with: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | |||
+ | select str(?o) where { | ||
+ | ?pw a wp:Pathway ; | ||
+ | <http://purl.org/pav/version> ?o . | ||
+ | } order by desc(?o) limit 1 | ||
+ | </pre> | ||
= Metabolome = | = Metabolome = | ||
Line 20: | Line 33: | ||
} | } | ||
</pre> | </pre> | ||
+ | |||
+ | |||
+ | |||
+ | == Pathway properties == | ||
Likewise, we can get all pathway properties with: | Likewise, we can get all pathway properties with: | ||
Line 59: | Line 76: | ||
} | } | ||
</pre> | </pre> | ||
+ | |||
+ | {| | ||
+ | |'''Revision''' | ||
+ | |'''Count''' | ||
+ | |- | ||
+ | |67787 | ||
+ | |5790 | ||
+ | |- | ||
+ | |69675 | ||
+ | |5801 | ||
+ | |} | ||
=== List === | === List === | ||
Line 67: | Line 95: | ||
prefix dcterms: <http://purl.org/dc/terms/> | prefix dcterms: <http://purl.org/dc/terms/> | ||
- | select ?mb ?label where { | + | select distinct ?mb ?label where { |
?mb a wp:Metabolite ; | ?mb a wp:Metabolite ; | ||
rdfs:label ?label . | rdfs:label ?label . | ||
} | } | ||
</pre> | </pre> | ||
+ | |||
+ | === All zebrafish metabolites === | ||
+ | |||
+ | <pre> | ||
+ | PREFIX wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | PREFIX gpml: <http://vocabularies.wikipathways.org/gpml#> | ||
+ | PREFIX dcterms: <http://purl.org/dc/terms/> | ||
+ | PREFIX dc: <http://purl.org/dc/elements/1.1/> | ||
+ | PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> | ||
+ | |||
+ | select distinct ?metabolite (str(?titleLit) as ?title) where { | ||
+ | ?metabolite a wp:Metabolite ; | ||
+ | dcterms:isPartOf ?pw . | ||
+ | ?pw dc:title ?titleLit ; | ||
+ | wp:organismName "Danio rerio" . | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | [http://sparql.wikipathways.org/sparql?query=PREFIX+gpml%3A++++%3Chttp%3A%2F%2Fvocabularies.wikipathways.org%2Fgpml%23%3E%0D%0APREFIX+dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+dc%3A++++++%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2F%3E%0D%0APREFIX+rdf%3A+++++%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E+%0D%0A%0D%0Aselect+distinct+%3Fmetabolite+%28str%28%3FtitleLit%29+as+%3Ftitle%29+where+%7B%0D%0A++%3Fmetabolite+a+wp%3AMetabolite+%3B%0D%0A++++dcterms%3AisPartOf+%3Fpw+.%0D%0A++%3Fpw+dc%3Atitle+%3FtitleLit+%3B%0D%0A++++wp%3AorganismName+%22Danio+rerio%22+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on Run] | ||
= Metabolic Data Sources = | = Metabolic Data Sources = | ||
== Sorted by use == | == Sorted by use == | ||
+ | |||
+ | [[Image:mbStats.png|right|200px]] | ||
+ | |||
+ | HMDB, ChEBI, and KEGG are the main data sources for identifiers. InChI/InChIKey should also be there but is missing. A big curation process in January 2013 ensured that "PubChem compound" is now used as data source for PubChem CIDs. | ||
<pre> | <pre> | ||
Line 82: | Line 132: | ||
prefix dcterms: <http://purl.org/dc/terms/> | prefix dcterms: <http://purl.org/dc/terms/> | ||
- | select ?datasource count(?identifier) as ?count | + | select str(?datasource) as ?source count(distinct ?identifier) as ?count |
where { | where { | ||
?mb a wp:Metabolite ; | ?mb a wp:Metabolite ; | ||
Line 102: | Line 152: | ||
where { | where { | ||
?mb a wp:Metabolite ; | ?mb a wp:Metabolite ; | ||
- | dc:source " | + | dc:source "KEGG Compound" ; |
dc:identifier ?identifier . | dc:identifier ?identifier . | ||
- | |||
} order by ?identifier | } order by ?identifier | ||
</pre> | </pre> | ||
=== All HMDB identifiers === | === All HMDB identifiers === | ||
+ | |||
+ | Return all HMDB identifiers with: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct ?identifier | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "HMDB" ; | ||
+ | dc:identifier ?identifier . | ||
+ | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | Return all metabolites that list HMDB as data source but have no identifier: | ||
<pre> | <pre> | ||
Line 120: | Line 186: | ||
dc:source "HMDB"^^xsd:string ; | dc:source "HMDB"^^xsd:string ; | ||
dc:identifier ?identifier . | dc:identifier ?identifier . | ||
- | FILTER ( | + | FILTER (regex(str(?identifier),"noIdentifier")) |
} order by ?identifier | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | At the time of writing, this showed a number of XRefs with HMDB as data source but no identifiers, which need curation: | ||
+ | |||
+ | <pre> | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1002_r35260 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1119_r35265 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1250_r41240 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1266_r41328 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1285_r41669 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1304_r41670 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1310_r41659 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1339_r35269 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP167_r45138 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP2267_r53133 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852/group/ac37a | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP295_r41324 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP337_r41644 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP495_r41327 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP59_r41653 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP678_r41165 | ||
+ | http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP716_r45017 | ||
</pre> | </pre> | ||
= Metabolic Pathways = | = Metabolic Pathways = | ||
+ | |||
+ | Of general interest is the number of pathways per species: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_> | ||
+ | |||
+ | select distinct str(?orgName) as ?organism count(?pw) as ?pathways where { | ||
+ | ?pw wp:organism ?organismCode . | ||
+ | ?organismCode rdfs:label ?orgName | ||
+ | } order by desc(?pathways) | ||
+ | </pre> | ||
== Metabolomes == | == Metabolomes == | ||
=== Human Metabolome === | === Human Metabolome === | ||
+ | |||
+ | This only returns 244 metabolites, which is not a lot at all, and does not even take the metabolite identity into account. Something wrong with wp:organism? It finds 107 human pathways. | ||
<pre> | <pre> | ||
Line 141: | Line 245: | ||
} order by ?mb | } order by ?mb | ||
</pre> | </pre> | ||
+ | |||
+ | {| | ||
+ | |'''Revision''' | ||
+ | |'''Count''' | ||
+ | |- | ||
+ | |67787 | ||
+ | |1972 | ||
+ | |- | ||
+ | |69675 | ||
+ | |2000 | ||
+ | |} | ||
+ | |||
+ | === Arabidopsis thaliana Metabolome === | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_> | ||
+ | |||
+ | select distinct ?mb where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dcterms:isPartOf ?pw . | ||
+ | ?pw wp:organism ncbi:3702 . | ||
+ | } order by ?mb | ||
+ | </pre> | ||
+ | |||
+ | {| | ||
+ | |'''Revision''' | ||
+ | |'''Count''' | ||
+ | |- | ||
+ | |69675 | ||
+ | |17 | ||
+ | |} | ||
== Pathways with the most metabolites == | == Pathways with the most metabolites == | ||
Line 175: | Line 312: | ||
} order by desc(?pwCount) | } order by desc(?pwCount) | ||
</pre> | </pre> | ||
+ | |||
+ | = Identifier Mapping Completeness = | ||
+ | |||
+ | [[Image:fooNotInHMDB.png|right|200px]] | ||
+ | |||
+ | Right now, the HMDB is the primary (and only) source of mappings. That raises the question of how many metabolites in WP do not have mappings to other databases. | ||
+ | The following queries are about that. | ||
+ | |||
+ | == The missing mappings == | ||
+ | |||
+ | The next query counts all identifiers that are missing in HMDB and therefore lack mappings to other databases: at the time of writing, these are 724 (was 927) identifiers. | ||
+ | These are not all unique; the number of unique identifiers is 369 ([http://goo.gl/EQU1H Run]; was 404) at the time of writing. Given there are about 1400 unique metabolite identifiers, this | ||
+ | is roughly a quarter, which is rather significant. The major databases with unmapped resources are ChEBI and KEGG (see screenshot on the right). | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select count(?source) | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source ?source ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB") | ||
+ | FILTER (str(?identifier) != "") | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | === The full list === | ||
+ | |||
+ | These are the unique identifiers missing: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct str(?source) str(?identifier) | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source ?source ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB") | ||
+ | FILTER (str(?identifier) != "") | ||
+ | } order by ?source ?identifier | ||
+ | </pre> | ||
+ | |||
+ | == ChEBI identifiers not in HMDB == | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct ?pathway ?identifier ?label | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "ChEBI"^^xsd:string ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | == CAS identifiers not in HMDB == | ||
+ | |||
+ | [[Image:casNotInHMDB.png|right|200px]] | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct ?pathway ?identifier ?label | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "CAS"^^xsd:string ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | == Kegg compound identifiers not in HMDB == | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct ?pathway ?identifier ?label | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "Kegg Compound"^^xsd:string ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | == PubChem-compound identifiers not in HMDB == | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct ?pathway ?identifier ?label | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "PubChem-compound"^^xsd:string ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | = ChemSpider = | ||
+ | |||
+ | == Unique ChemSpider IDs == | ||
+ | |||
+ | They can be counted with: | ||
+ | |||
+ | <pre> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix cheminf: <http://semanticscience.org/resource/> | ||
+ | |||
+ | select count(distinct ?identifier) where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "Chemspider"^^xsd:string ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | ?pathway foaf:page ?page . | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | And all listed with this non-counting equivalent: | ||
+ | |||
+ | <pre> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix cheminf: <http://semanticscience.org/resource/> | ||
+ | |||
+ | select distinct str(?identifier) as ?csid where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "Chemspider"^^xsd:string ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | ?pathway foaf:page ?page . | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | == Linking ChemSpider IDs to WikiPathway == | ||
+ | |||
+ | I need to ask Andra why not all pathways have a foaf:page, but this table should be discussed with Antony: | ||
+ | |||
+ | <pre> | ||
+ | prefix foaf: <http://xmlns.com/foaf/0.1/> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix cheminf: <http://semanticscience.org/resource/> | ||
+ | |||
+ | select distinct str(?identifier) as ?csid ?page where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "Chemspider"^^xsd:string ; | ||
+ | dcterms:identifier ?identifier ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | ?pathway foaf:page ?page . | ||
+ | } order by ?csid | ||
+ | </pre> | ||
+ | |||
+ | |||
= Curation = | = Curation = | ||
Line 185: | Line 512: | ||
CID 1061 for ortho-phosphate, aka [PO4]3-. | CID 1061 for ortho-phosphate, aka [PO4]3-. | ||
- | + | <pre> | |
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
- | + | select ?pathway ?source | |
+ | where { | ||
+ | ?mb dc:source ?source ; | ||
+ | dcterms:isPartOf ?pathway ; | ||
+ | dcterms:identifier "1004"^^xsd:string . | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | === Outdated HMDB identifiers === | ||
+ | |||
+ | These results show HMDB identifiers that are used in WikiPathways but have been revoked or have become secondary identifiers. | ||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct ?identifier | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dc:source "HMDB"^^xsd:string ; | ||
+ | dc:identifier ?identifier . | ||
+ | OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . } | ||
+ | FILTER (!BOUND(?bridgedb)) | ||
+ | } order by ?identifier | ||
+ | </pre> | ||
+ | |||
+ | == Metabolites not classified as such == | ||
+ | |||
+ | One can list all data sources for non-metabolites with this query. | ||
<pre> | <pre> | ||
prefix wp: <http://vocabularies.wikipathways.org/wp#> | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
Line 197: | Line 556: | ||
where { | where { | ||
?mb dc:source ?datasource ; | ?mb dc:source ?datasource ; | ||
- | + | dcterms:identifier ?identifier . | |
FILTER NOT EXISTS { ?mb a wp:Metabolite } | FILTER NOT EXISTS { ?mb a wp:Metabolite } | ||
} order by desc(?count) | } order by desc(?count) | ||
</pre> | </pre> | ||
- | That mostly lists gene identifier sources, etc, but watch out for the metabolite identifier data sources. For example, metabolites not marked as such but with a metabolite identifier can be found this way. | + | That mostly lists gene identifier sources, etc, but watch out for the metabolite identifier data sources. For example, metabolites not marked as such but with a metabolite identifier can be found this way. Further down the list is CAS (but genes are chemicals too...), and a few more minor ones: |
+ | |||
+ | <pre> | ||
+ | "CTD Gene"^^<http://www.w3.org/2001/XMLSchema#string> 5 | ||
+ | "HMDB"^^<http://www.w3.org/2001/XMLSchema#string> 4 | ||
+ | "ChEBI"^^<http://www.w3.org/2001/XMLSchema#string> 3 | ||
+ | "GLYCAN"^^<http://www.w3.org/2001/XMLSchema#string> 3 | ||
+ | "COMPOUND"^^<http://www.w3.org/2001/XMLSchema#string> 3 | ||
+ | "PubChem"^^<http://www.w3.org/2001/XMLSchema#string> 2 | ||
+ | </pre> | ||
+ | |||
+ | I would expect GLYCAN and COMPOUND to be misnomers of the matching KEGG subsets. | ||
=== Non-Metabolites with CAS identifier === | === Non-Metabolites with CAS identifier === | ||
Line 214: | Line 584: | ||
prefix xsd: <http://www.w3.org/2001/XMLSchema#> | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
- | select distinct ?pathway ?mb ?identifier | + | select distinct ?pathway ?mb str(?label) as ?name str(?identifier) as ?id |
where { | where { | ||
?mb dc:source "CAS"^^xsd:string ; | ?mb dc:source "CAS"^^xsd:string ; | ||
- | + | rdfs:label ?label ; | |
+ | dcterms:identifier ?identifier ; | ||
dcterms:isPartOf ?pathway . | dcterms:isPartOf ?pathway . | ||
FILTER NOT EXISTS { ?mb a wp:Metabolite } | FILTER NOT EXISTS { ?mb a wp:Metabolite } | ||
- | |||
} order by ?pathway | } order by ?pathway | ||
</pre> | </pre> | ||
Line 226: | Line 596: | ||
=== Non-Metabolites with PubChem identifier === | === Non-Metabolites with PubChem identifier === | ||
- | + | At the time of writing, this results in an empty set. | |
<pre> | <pre> | ||
Line 234: | Line 604: | ||
prefix xsd: <http://www.w3.org/2001/XMLSchema#> | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
- | select distinct ?pathway ?mb ?identifier | + | select distinct ?pathway ?mb ?label ?identifier |
where { | where { | ||
- | ?mb dc:source "PubChem"^^xsd:string ; | + | ?mb dc:source "PubChem-compound"^^xsd:string ; |
- | + | dcterms:identifier ?identifier ; | |
dcterms:isPartOf ?pathway . | dcterms:isPartOf ?pathway . | ||
+ | OPTIONAL { ?mb rdfs:label ?label . } | ||
FILTER NOT EXISTS { ?mb a wp:Metabolite } | FILTER NOT EXISTS { ?mb a wp:Metabolite } | ||
- | |||
} order by ?pathway | } order by ?pathway | ||
+ | </pre> | ||
+ | |||
+ | === Metabolites sometimes marked as DataNode@Type Metabolite === | ||
+ | |||
+ | Based on label comparisons, we can find nodes that are not typed as Metabolite but share their label with a data node that is. | ||
+ | Of course, this can give false positives, because genes can be incorrectly marked as metabolite in some pathway, | ||
+ | but that is another SPARQL query. | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
+ | |||
+ | select ?pathway ?nonmb ?mb ?label | ||
+ | where { | ||
+ | ?nonmb rdfs:label ?label . | ||
+ | ?mb rdfs:label ?label . | ||
+ | OPTIONAL { ?nonmb dcterms:isPartOf ?pathway . } | ||
+ | FILTER ( ?nonmb != ?mb ) | ||
+ | FILTER NOT EXISTS { ?nonmb a wp:Metabolite } | ||
+ | FILTER EXISTS { ?mb a wp:Metabolite } | ||
+ | FILTER (!regex(str(?nonmb), "noIdentifier", "i")) | ||
+ | FILTER (!regex(str(?mb), "noIdentifier", "i")) | ||
+ | } | ||
</pre> | </pre> | ||
Line 261: | Line 656: | ||
FILTER (str(?identifier) != "") | FILTER (str(?identifier) != "") | ||
} order by ?pathway | } order by ?pathway | ||
+ | </pre> | ||
+ | |||
+ | == Metabolites with a data source but no identifier == | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
+ | |||
+ | select distinct ?pathway ?mb ?source | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | dcterms:identifier ""^^xsd:string ; | ||
+ | dc:source ?source ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | FILTER (str(?source) != "") | ||
+ | FILTER (!regex(str(?pathway), "internal.wikipathways.org", "i")) | ||
+ | } order by ?pathway | ||
+ | </pre> | ||
+ | |||
+ | == Metabolites with too many labels == | ||
+ | |||
+ | This is mainly caused by metabolite URIs being based on a non-existing identifier: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct count(?label) as ?count ?pathway ?mb | ||
+ | where { | ||
+ | ?mb a wp:Metabolite ; | ||
+ | rdfs:label ?label ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | } order by desc(?count) ?pathway ?mb limit 410 | ||
+ | </pre> | ||
+ | |||
+ | An example of such an entity, with many labels and typed as metabolite, gene, complex, etc.: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | |||
+ | select distinct str(?label) ?type | ||
+ | where { | ||
+ | <http://bio2rdf.org/geneid:noIdentifier> a ?type ; rdfs:label ?label . | ||
+ | } order by ?label | ||
</pre> | </pre> | ||
Line 271: | Line 715: | ||
prefix xsd: <http://www.w3.org/2001/XMLSchema#> | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
- | select distinct ?pathway ?mb ?identifier | + | select distinct ?pathway ?mb ?label ?identifier |
where { | where { | ||
?mb a wp:Metabolite ; | ?mb a wp:Metabolite ; | ||
+ | rdfs:label ?label ; | ||
dc:source "Entrez Gene"^^xsd:string ; | dc:source "Entrez Gene"^^xsd:string ; | ||
- | + | dcterms:identifier ?identifier ; | |
dcterms:isPartOf ?pathway . | dcterms:isPartOf ?pathway . | ||
- | |||
FILTER (str(?identifier) != "") | FILTER (str(?identifier) != "") | ||
} order by ?pathway | } order by ?pathway | ||
+ | </pre> | ||
+ | |||
+ | == Metabolites as just Label == | ||
+ | |||
+ | Metabolites may be marked up as DataNode but not typed as Metabolite. Here are some examples: ATP, CO2, ADP, Phosphate, L-glutamate, and Cholesterol. | ||
+ | |||
+ | === ATP === | ||
+ | |||
+ | This example shows how to find them. | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
+ | |||
+ | select ?pathway ?source ?mb ?type | ||
+ | where { | ||
+ | ?mb rdfs:label "ATP"@en . | ||
+ | ?mb a ?type . | ||
+ | OPTIONAL { ?mb dc:source ?source . } | ||
+ | OPTIONAL { ?mb dcterms:isPartOf ?pathway . } | ||
+ | FILTER NOT EXISTS { ?mb a wp:Metabolite . } | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | == Metabolites also labeled as GeneProduct == | ||
+ | |||
+ | Sometimes things are incorrectly marked as Metabolite, when they really are GeneProducts. We can list | ||
+ | entities based on their label that are both annotated as Metabolite and as GeneProduct: | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
+ | |||
+ | select ?pathway ?mb ?gene ?label | ||
+ | where { | ||
+ | ?gene rdfs:label ?label . | ||
+ | ?mb rdfs:label ?label . | ||
+ | OPTIONAL { ?mb dcterms:isPartOf ?pathway . } | ||
+ | FILTER ( ?gene != ?mb ) | ||
+ | FILTER EXISTS { ?gene a wp:GeneProduct } | ||
+ | FILTER EXISTS { ?mb a wp:Metabolite } | ||
+ | FILTER (!regex(str(?mb), "noIdentifier", "i")) | ||
+ | FILTER (!regex(str(?gene), "noIdentifier", "i")) | ||
+ | } | ||
+ | </pre> | ||
+ | |||
+ | Actually, this query does not do what I want it to do, because the FILTER only removes things from the result list, but does still allow things with "noIdentifier" to hook up things, messing up this query if there is just one URI with noIdentifier with the same label :( | ||
+ | |||
+ | == Labels which are also marked as metabolite == | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
+ | prefix gpml: <http://vocabularies.wikipathways.org/gpml#> | ||
+ | |||
+ | select ?pathway ?labelNode str(?label1) as ?labelStr ?mb str(?label2) as ?mbStr | ||
+ | where { | ||
+ | ?labelNode a gpml:Label ; | ||
+ | rdfs:label ?label1 ; | ||
+ | dcterms:isPartOf ?pathway . | ||
+ | ?mb a wp:Metabolite ; | ||
+ | rdfs:label ?label2 . | ||
+ | FILTER ( ?labelNode != ?mb ) | ||
+ | FILTER ( str(?label2) = str(?label1) ) | ||
+ | FILTER (!regex(str(?mb), "noIdentifier", "i")) | ||
+ | FILTER (!regex(str(?labelNode), "noIdentifier", "i")) | ||
+ | } LIMIT 50 OFFSET 25 | ||
+ | </pre> | ||
+ | |||
+ | To get the most common such labels, use the following (though it typically times out on Virtuoso 6.1): | ||
+ | |||
+ | <pre> | ||
+ | prefix wp: <http://vocabularies.wikipathways.org/wp#> | ||
+ | prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> | ||
+ | prefix dcterms: <http://purl.org/dc/terms/> | ||
+ | prefix xsd: <http://www.w3.org/2001/XMLSchema#> | ||
+ | prefix gpml: <http://vocabularies.wikipathways.org/gpml#> | ||
+ | |||
+ | select str(?label1) as ?labelStr count(?labelNode) as ?count | ||
+ | where { | ||
+ | ?labelNode a gpml:Label ; | ||
+ | rdfs:label ?label1 . | ||
+ | ?mb a wp:Metabolite ; | ||
+ | rdfs:label ?label2 . | ||
+ | FILTER ( ?labelNode != ?mb ) | ||
+ | FILTER ( str(?label2) = str(?label1) ) | ||
+ | FILTER (!regex(str(?mb), "noIdentifier", "i")) | ||
+ | FILTER (!regex(str(?labelNode), "noIdentifier", "i")) | ||
+ | } order by desc(?count) | ||
</pre> | </pre> |
Current revision
On this page we collect SPARQL queries to see the state of the Metabolome in WikiPathways. Triggered by User:Andra's RDF / SPARQL work, curation started with metabolites without database identifiers. But this soon led to the observation that metabolites are often not even annotated as being a metabolite (using <Label> rather than <DataNode>). Therefore, User:Egonw started at Pathway:WP1 to curate them one by one and fix these issues:
- connect lines between metabolites
- convert metabolites to use <DataNode> rather than <Label>
The reason for this is that these are some basic underlying properties we need for metabolomics research fields.
The Data
You can look up the latest revision with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select str(?o) where {
  ?pw a wp:Pathway ;
    <http://purl.org/pav/version> ?o .
} order by desc(?o) limit 1
Metabolome
The following queries provide an overview of the Metabolome captured by WikiPathways.
The key type for metabolites is wp:Metabolite. We can see all available properties with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?mb a wp:Metabolite ;
    ?p [] .
}
Pathway properties
Likewise, we can get all pathway properties with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?mb a wp:Pathway ;
    ?p [] .
}
Latest data only
To only get analysis of the most recent pathways, add this snippet to your SPARQL, assuming ?pathway is the used variable name:
?mb dcterms:isPartOf ?pathway .
?pathway pav:version ?version .
?mb dcterms:isPartOf ?pathway2 .
?pathway2 pav:version ?version2 .
FILTER (?version2 > ?version)
However, it should be kept in mind that this is not a fool-proof solution.
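As a complement to the snippet above, the same selection can be done on the client after the query has run. The following is a minimal Python sketch, assuming that pathway URIs follow the `.../Pathway/WP<number>_r<revision>` pattern seen in query results on this page and that the number after `_r` reflects the pathway version. The function name `latest_revisions` and the second WP28 revision in the sample are made up for illustration.

```python
import re

# Matches the pathway ID and revision in URIs such as
# http://rdf.wikipathways.org/Pathway/WP28_r38852
# (assumption: the number after "_r" is the pathway revision).
PATHWAY_URI = re.compile(r"/Pathway/(WP\d+)_r(\d+)")


def latest_revisions(uris):
    """Return {pathway ID: URI of its highest revision} for the given URIs."""
    best = {}
    for uri in uris:
        match = PATHWAY_URI.search(uri)
        if match is None:
            continue  # not a versioned pathway URI, skip it
        pathway_id, revision = match.group(1), int(match.group(2))
        if pathway_id not in best or revision > best[pathway_id][0]:
            best[pathway_id] = (revision, uri)
    return {pathway_id: uri for pathway_id, (_, uri) in best.items()}


if __name__ == "__main__":
    sample = [
        "http://rdf.wikipathways.org/Pathway/WP28_r38852",
        "http://rdf.wikipathways.org/Pathway/WP28_r41000",  # made-up newer revision
        "http://rdf.wikipathways.org/Pathway/WP167_r45138",
    ]
    for pathway_id, uri in sorted(latest_revisions(sample).items()):
        print(pathway_id, uri)
```

Like the SPARQL snippet, this is not fool-proof: it only compares revisions among the URIs it is given.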
All Metabolites
Count
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select count(?mb) where {
  ?mb a wp:Metabolite .
}
Revision | Count |
67787 | 5790 |
69675 | 5801 |
List
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct ?mb ?label where {
  ?mb a wp:Metabolite ;
    rdfs:label ?label .
}
All zebrafish metabolites
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX gpml: <http://vocabularies.wikipathways.org/gpml#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select distinct ?metabolite (str(?titleLit) as ?title) where {
  ?metabolite a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw dc:title ?titleLit ;
    wp:organismName "Danio rerio" .
}
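The Run link for this query is simply the query text URL-encoded into the endpoint address. A small Python sketch of how such a link could be built is given here. The endpoint address and the `format`, `timeout` and `debug` parameters are copied from the existing Run link; the function name `run_link` is made up, and whether the endpoint still accepts these parameters is an assumption.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the Run link on this page.
ENDPOINT = "http://sparql.wikipathways.org/sparql"


def run_link(query, endpoint=ENDPOINT):
    """Return a URL that submits `query` to the endpoint and asks for HTML output."""
    params = {
        "query": query,
        "format": "text/html",
        "timeout": "0",
        "debug": "on",
    }
    # urlencode percent-encodes the values and writes spaces as "+".
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    query = (
        "prefix wp: <http://vocabularies.wikipathways.org/wp#>\n"
        "select count(?mb) where { ?mb a wp:Metabolite . }"
    )
    print(run_link(query))
```

Pasting the printed address into wiki text as `[<address> Run]` gives a link in the same style as the existing one.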
Metabolic Data Sources
Sorted by use
HMDB, ChEBI, and KEGG are the main data sources for identifiers. InChI/InChIKey should also be there but is missing. A big curation process in January 2013 ensured that "PubChem compound" is now used as data source for PubChem CIDs.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select str(?datasource) as ?source count(distinct ?identifier) as ?count
where {
  ?mb a wp:Metabolite ;
    dc:source ?datasource ;
    dc:identifier ?identifier .
} order by desc(?count)
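To make clear what `count(distinct ?identifier)` per data source does, here is a rough Python equivalent that works on (source, identifier) pairs. It is only a sketch: the function name `count_distinct_identifiers` and the sample rows are invented for illustration and are not taken from WikiPathways data.

```python
from collections import defaultdict


def count_distinct_identifiers(rows):
    """Count distinct identifiers per data source, most used source first."""
    seen = defaultdict(set)
    for source, identifier in rows:
        # A set keeps each identifier once, however many pathways use it.
        seen[source].add(identifier)
    counts = {source: len(identifiers) for source, identifiers in seen.items()}
    # Sort by descending count, then by source name to keep ties stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Invented sample rows; the repeated pair mimics one metabolite
    # that occurs in two pathways.
    rows = [
        ("HMDB", "HMDB00001"),
        ("HMDB", "HMDB00001"),
        ("HMDB", "HMDB00002"),
        ("ChEBI", "15377"),
    ]
    for source, count in count_distinct_identifiers(rows):
        print(source, count)
```

Without the set, the repeated pair would be counted twice, which is the difference between `count(?identifier)` and `count(distinct ?identifier)`.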
All metabolites from one source
All KEGG identifiers
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "KEGG Compound" ;
    dc:identifier ?identifier .
} order by ?identifier
All HMDB identifiers
Return all HMDB identfiers with:
prefix wp: <http://vocabularies.wikipathways.org/wp#> prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> prefix dcterms: <http://purl.org/dc/terms/> select distinct ?identifier where { ?mb a wp:Metabolite ; dc:source "HMDB" ; dc:identifier ?identifier . } order by ?identifier
Return all metabolites that list HMDB as data source but have no identifier:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source "HMDB"^^xsd:string ;
      dc:identifier ?identifier .
  # placeholder identifiers contain the string "noIdentifier"
  FILTER (regex(str(?identifier), "noIdentifier"))
} order by ?identifier
At the time of writing, this showed a number of XRefs with HMDB as data source but no identifier, which need curation:
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1002_r35260
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1119_r35265
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1250_r41240
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1266_r41328
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1285_r41669
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1304_r41670
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1310_r41659
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1339_r35269
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP167_r45138
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP2267_r53133
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852/group/ac37a
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP295_r41324
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP337_r41644
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP495_r41327
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP59_r41653
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP678_r41165
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP716_r45017
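The listing above pairs each identifier with a pathway, whereas the query before it selects only ?identifier. A minimal sketch of a query that returns both columns follows. It assumes the pathway is linked through dcterms:isPartOf, as in the other queries on this page; the variable name ?pathway is illustrative.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier ?pathway where {
  ?mb a wp:Metabolite ;
      dc:source "HMDB"^^xsd:string ;
      dc:identifier ?identifier ;
      dcterms:isPartOf ?pathway .   # assumption: same link as used elsewhere on this page
  # keep only the placeholder identifiers
  FILTER (regex(str(?identifier), "noIdentifier"))
} order by ?pathway
```

Sorting by pathway groups the entries that need curation per pathway.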
Metabolic Pathways
Of general interest is the number of pathways per species:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct (str(?orgName) as ?organism) (count(?pw) as ?pathways) where {
  ?pw wp:organism ?organismCode .
  ?organismCode rdfs:label ?orgName .
} group by ?orgName order by desc(?pathways)
Metabolomes
Human Metabolome
This only returns 244 metabolites, which is not a lot at all, and does not even take the metabolite identity into account. Is something wrong with wp:organism? It finds 107 human pathways.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:9606 .
} order by ?mb
Revision | Count |
67787 | 1972 |
69675 | 2000 |
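The human metabolome query lists metabolite URIs only. To take the identifiers themselves into account, a sketch like the following could count the distinct identifiers per data source for human pathways. It assumes that dc:source and dc:identifier are attached to the metabolite node, as in the data source queries earlier on this page; the variable names are illustrative.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

select ?source (count(distinct ?identifier) as ?count) where {
  ?mb a wp:Metabolite ;
      dc:source ?source ;           # assumption: present on the metabolite node
      dc:identifier ?identifier ;
      dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:9606 .       # human, as in the query above
} group by ?source order by desc(?count)
```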
Arabidopsis thaliana Metabolome
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:3702 .
} order by ?mb
Revision | Count |
69675 | 17 |
Pathways with the most metabolites
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?pathway (count(?mb) as ?mbCount) where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pathway .
} group by ?pathway order by desc(?mbCount)
Metabolites in the most Pathways
Note that BridgeDb is not involved yet.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?mb (count(?pathway) as ?pwCount) where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pathway .
} group by ?mb order by desc(?pwCount)
Identifier Mapping Completeness
Right now, HMDB is the primary (and only) source of mappings. That raises the question of how many metabolites in WikiPathways have no mappings to other databases. The following queries address that.
The missing mappings
The next query counts all identifiers that are missing from HMDB and therefore have no mappings to other databases: at the time of writing, there are 724 (was 927). That count includes duplicates; the number of unique identifiers is 369 (was 404) at the time of writing. Given that there are about 1400 unique metabolite identifiers, this is about 26% (about 29% for the earlier 404), which is rather significant. The major databases with unmapped resources are ChEBI and KEGG (see screenshot on the right).
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select (count(?source) as ?count) where {
  ?mb a wp:Metabolite ;
      dc:source ?source ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  # keep only metabolites without a wp:bdbHmdb mapping
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" ||
          str(?source) = "Kegg Compound" || str(?source) = "Chemspider" ||
          str(?source) = "HMDB")
  FILTER (str(?identifier) != "")
}
The full list
These are the unique identifiers missing:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select distinct (str(?source) as ?sourceStr) (str(?identifier) as ?identifierStr) where {
  ?mb a wp:Metabolite ;
      dc:source ?source ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  # keep only metabolites without a wp:bdbHmdb mapping
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" ||
          str(?source) = "Kegg Compound" || str(?source) = "Chemspider" ||
          str(?source) = "HMDB")
  FILTER (str(?identifier) != "")
} order by ?source ?identifier
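To get the number of unique missing identifiers directly, rather than counting the rows of the list, the list query can be wrapped in a subquery. This is only a sketch: it assumes the endpoint supports SPARQL 1.1 subqueries, and the variable names ?s, ?i and ?unique are illustrative.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select (count(*) as ?unique) where {
  {
    # inner query: one row per distinct (source, identifier) pair
    select distinct (str(?source) as ?s) (str(?identifier) as ?i) where {
      ?mb a wp:Metabolite ;
          dc:source ?source ;
          dcterms:identifier ?identifier ;
          dcterms:isPartOf ?pathway .
      OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
      FILTER (!BOUND(?bridgedb))
      FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" ||
              str(?source) = "Kegg Compound" || str(?source) = "Chemspider" ||
              str(?source) = "HMDB")
      FILTER (str(?identifier) != "")
    }
  }
}
```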
ChEBI identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "ChEBI"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
CAS identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "CAS"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
Kegg compound identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "Kegg Compound"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
PubChem-compound identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "PubChem-compound"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
ChemSpider
Unique ChemSpider IDs
They can be counted with:
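prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>

select (count(distinct ?identifier) as ?count) where {
  ?mb a wp:Metabolite ;
      dc:source "Chemspider"^^xsd:string ;
      dcterms:identifier ?identifier ;
      rdfs:label ?label ;
      dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
}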
And all listed with this non-counting equivalent:
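prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>

select distinct (str(?identifier) as ?csid) where {
  ?mb a wp:Metabolite ;
      dc:source "Chemspider"^^xsd:string ;
      dcterms:identifier ?identifier ;
      rdfs:label ?label ;
      dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
}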
Linking ChemSpider IDs to WikiPathways
I need to ask Andra why not all pathways have a foaf:page, but this table should be discussed with Antony:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>

select distinct (str(?identifier) as ?csid) ?page where {
  ?mb a wp:Metabolite ;
      dc:source "Chemspider"^^xsd:string ;
      dcterms:identifier ?identifier ;
      rdfs:label ?label ;
      dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
} order by ?csid
Curation
Common wrong identifiers
PubChem-compound 1004
CID 1004 is wrongly used for phosphate: it is the uncharged compound (phosphoric acid). Phosphate, and particularly things labeled "Pi", should instead use CID 1061 for orthophosphate, i.e. [PO4]3-.
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?source where {
  ?mb dc:source ?source ;
      dcterms:isPartOf ?pathway ;
      dcterms:identifier "1004"^^xsd:string .
}
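The query above matches the identifier "1004" regardless of data source, so it may also return entries from other databases. A narrower sketch follows. It assumes the data source literal "PubChem-compound" used elsewhere on this page, and it also returns the label so that entries such as "Pi" stand out. The variable name ?name is illustrative.

```sparql
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb (str(?label) as ?name) where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;   # assumption: literal as used on this page
      dcterms:identifier "1004"^^xsd:string ;
      dcterms:isPartOf ?pathway .
  # labels are optional so that unlabeled nodes are still reported
  OPTIONAL { ?mb rdfs:label ?label . }
} order by ?pathway
```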
Outdated HMDB identifiers
These results show HMDB identifiers that are used in WikiPathways but have been revoked or have become secondary identifiers.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source "HMDB"^^xsd:string ;
      dc:identifier ?identifier .
  # HMDB identifiers without a wp:bdbHmdb mapping
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier
Metabolites not classified as such
One can list all data sources for non-metabolites with this query.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?datasource (count(?identifier) as ?count) where {
  ?mb dc:source ?datasource ;
      dcterms:identifier ?identifier .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} group by ?datasource order by desc(?count)
That mostly lists gene identifier sources, but watch out for the metabolite identifier data sources: they point to entities that carry a metabolite identifier but are not marked as Metabolite. Further down the list are CAS (but genes are chemicals too...) and a few minor ones:
"CTD Gene"^^<http://www.w3.org/2001/XMLSchema#string> 5
"HMDB"^^<http://www.w3.org/2001/XMLSchema#string> 4
"ChEBI"^^<http://www.w3.org/2001/XMLSchema#string> 3
"GLYCAN"^^<http://www.w3.org/2001/XMLSchema#string> 3
"COMPOUND"^^<http://www.w3.org/2001/XMLSchema#string> 3
"PubChem"^^<http://www.w3.org/2001/XMLSchema#string> 2
I would expect GLYCAN and COMPOUND to be misnomers of the matching KEGG subsets.
Non-Metabolites with CAS identifier
Note that a CAS identifier can also refer to mixtures, compound classes, etc.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb (str(?label) as ?name) (str(?identifier) as ?id) where {
  ?mb dc:source "CAS"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?pathway
Non-Metabolites with PubChem identifier
At the time of writing, this results in an empty set.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb rdfs:label ?label . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?pathway
Metabolites sometimes marked as DataNode@Type Metabolite
Based on label comparisons, we can find data nodes that are not typed as Metabolite but carry the same label as a node that is. Of course, this can give false positives, because genes can be incorrectly marked as metabolite in some pathway, but that is another SPARQL query.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?pathway ?nonmb ?mb ?label where {
  ?nonmb rdfs:label ?label .
  ?mb rdfs:label ?label .
  OPTIONAL { ?nonmb dcterms:isPartOf ?pathway . }
  FILTER ( ?nonmb != ?mb )
  FILTER NOT EXISTS { ?nonmb a wp:Metabolite }
  FILTER EXISTS { ?mb a wp:Metabolite }
  FILTER (!regex(str(?nonmb), "noIdentifier", "i"))
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
}
Metabolites with an identifier but undefined data source
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source ""^^xsd:string ;
      dc:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER (!isIRI(?identifier))
  FILTER (str(?identifier) != "")
} order by ?pathway
Metabolites with a data source but no identifier
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?source where {
  ?mb a wp:Metabolite ;
      dcterms:identifier ""^^xsd:string ;
      dc:source ?source ;
      dcterms:isPartOf ?pathway .
  FILTER (str(?source) != "")
  FILTER (!regex(str(?pathway), "internal.wikipathways.org", "i"))
} order by ?pathway
Metabolites with too many labels
This is mainly caused by metabolite URIs being based on a non-existent identifier:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct (count(?label) as ?count) ?pathway ?mb where {
  ?mb a wp:Metabolite ;
      rdfs:label ?label ;
      dcterms:isPartOf ?pathway .
} group by ?pathway ?mb order by desc(?count) ?pathway ?mb limit 410
An example of such an entity, with many labels and typed as metabolite, gene, complex, etc.:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct (str(?label) as ?labelStr) ?type where {
  <http://bio2rdf.org/geneid:noIdentifier> a ?type ;
      rdfs:label ?label .
} order by ?label
Metabolites with an Entrez Gene identifier
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier where {
  ?mb a wp:Metabolite ;
      rdfs:label ?label ;
      dc:source "Entrez Gene"^^xsd:string ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER (str(?identifier) != "")
} order by ?pathway
Metabolites as just Label
Metabolites may be marked up as DataNode but not typed as Metabolite. Here are some examples: ATP, CO2, ADP, Phosphate, L-glutamate, and Cholesterol.
ATP
This example shows how to find them.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc: <http://purl.org/dc/elements/1.1/>

select ?pathway ?source ?mb ?type where {
  ?mb rdfs:label "ATP"@en .
  ?mb a ?type .
  OPTIONAL { ?mb dc:source ?source . }
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite . }
}
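The same pattern can be extended to the other example labels in one query. The following is a sketch under the assumption that comparing on str(?label) is acceptable, so that labels match whatever their language tag; it is likely slower than the exact-literal match above, since every label is inspected. The variable name ?name is illustrative.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?pathway ?mb (str(?label) as ?name) ?type where {
  ?mb rdfs:label ?label ;
      a ?type .
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  # compare on the plain string so language tags do not matter
  FILTER (str(?label) = "ATP" || str(?label) = "CO2" || str(?label) = "ADP" ||
          str(?label) = "Phosphate" || str(?label) = "L-glutamate" ||
          str(?label) = "Cholesterol")
  # keep only nodes that are not typed as Metabolite
  FILTER NOT EXISTS { ?mb a wp:Metabolite . }
} order by ?label
```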
Metabolites also labeled as GeneProduct
Sometimes things are incorrectly marked as Metabolite, when they really are GeneProducts. We can list entities based on their label that are both annotated as Metabolite and as GeneProduct:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?pathway ?mb ?gene ?label where {
  ?gene rdfs:label ?label .
  ?mb rdfs:label ?label .
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  FILTER ( ?gene != ?mb )
  FILTER EXISTS { ?gene a wp:GeneProduct }
  FILTER EXISTS { ?mb a wp:Metabolite }
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
  FILTER (!regex(str(?gene), "noIdentifier", "i"))
}
Actually, this query does not do what I want it to do: the FILTER only removes rows from the result list, but still allows URIs containing "noIdentifier" to link things together. A single noIdentifier URI with the same label is enough to mess up this query :(
Labels which are also marked as metabolite
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix gpml: <http://vocabularies.wikipathways.org/gpml#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?pathway ?labelNode (str(?label1) as ?labelStr) ?mb (str(?label2) as ?mbStr) where {
  ?labelNode a gpml:Label ;
      rdfs:label ?label1 ;
      dcterms:isPartOf ?pathway .
  ?mb a wp:Metabolite ;
      rdfs:label ?label2 .
  FILTER ( ?labelNode != ?mb )
  FILTER ( str(?label2) = str(?label1) )
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
  FILTER (!regex(str(?labelNode), "noIdentifier", "i"))
} LIMIT 50 OFFSET 25
To get the most common such labels, use the following (it typically times out on Virtuoso 6.1):
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix gpml: <http://vocabularies.wikipathways.org/gpml#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select (str(?label1) as ?labelStr) (count(?labelNode) as ?count) where {
  ?labelNode a gpml:Label ;
      rdfs:label ?label1 .
  ?mb a wp:Metabolite ;
      rdfs:label ?label2 .
  FILTER ( ?labelNode != ?mb )
  FILTER ( str(?label2) = str(?label1) )
  FILTER (!regex(str(?mb), "noIdentifier", "i"))
  FILTER (!regex(str(?labelNode), "noIdentifier", "i"))
} group by ?label1 order by desc(?count)