Help:WikiPathways Metabolomics
Revision as of 15:41, 10 February 2013
On this page we collect SPARQL queries that show the state of the metabolome in WikiPathways. Triggered by User:Andra's RDF / SPARQL work, curation started with metabolites that lack database identifiers. This soon led to the observation that metabolites are often not even annotated as metabolites (they are drawn as a <Label> rather than a <DataNode>). Therefore, User:Egonw started at Pathway:WP1 to curate the pathways one by one and fix these issues:
- connect lines between metabolites
- convert metabolites to use <DataNode> rather than <Label>
These are basic properties that metabolomics research needs.
Metabolome
The following queries provide an overview of the metabolome captured by WikiPathways.
The key type for metabolites is wp:Metabolite. We can see all available properties with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
select distinct ?p where {
  ?mb a wp:Metabolite ;
      ?p [] .
}
Likewise, we can get all pathway properties with:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
select distinct ?p where {
  ?mb a wp:Pathway ;
      ?p [] .
}
Latest data only
To restrict an analysis to the most recent pathway versions, add this snippet to your SPARQL query, assuming ?pathway is the variable name used for the pathway:
?mb dcterms:isPartOf ?pathway .
?pathway pav:version ?version .
?mb dcterms:isPartOf ?pathway2 .
?pathway2 pav:version ?version2 .
FILTER (?version2 > ?version)
Keep in mind, however, that this is not a foolproof solution.
All Metabolites
Count
prefix wp: <http://vocabularies.wikipathways.org/wp#>
select (count(?mb) as ?count) where {
  ?mb a wp:Metabolite .
}
List
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?mb ?label where {
  ?mb a wp:Metabolite ;
      rdfs:label ?label .
}
Metabolic Data Sources
Sorted by use
HMDB, ChEBI, and KEGG are the main data sources for identifiers. InChI/InChIKey should also be there but are missing. A large curation effort in January 2013 ensured that "PubChem-compound" is now used as the data source for PubChem CIDs.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
select ?datasource (count(distinct ?identifier) as ?count) where {
  ?mb a wp:Metabolite ;
      dc:source ?datasource ;
      dc:identifier ?identifier .
}
group by ?datasource
order by desc(?count)
All metabolites from one source
All KEGG identifiers
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source "Kegg Compound"^^xsd:string ;
      dc:identifier ?identifier .
}
order by ?identifier
All HMDB identifiers
At the time of writing, this showed a number of XRefs with HMDB as data source but no identifiers, which needs curation:
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1002_r35260
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1119_r35265
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1250_r41240
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1266_r41328
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1285_r41669
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1304_r41670
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1310_r41659
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP1339_r35269
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP167_r45138
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP2267_r53133
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP28_r38852/group/ac37a
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP295_r41324
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP337_r41644
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP495_r41327
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP59_r41653
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP678_r41165
http://www.hmdb.ca/metabolites/noIdentifier http://rdf.wikipathways.org/Pathway/WP716_r45017
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source "HMDB"^^xsd:string ;
      dc:identifier ?identifier .
}
order by ?identifier
Metabolic Pathways
Metabolomes
Human Metabolome
This returns only 244 metabolites, which is not a lot, and it does not even take metabolite identity into account. Is something wrong with wp:organism? It finds 107 human pathways.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>
select distinct ?mb where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:9606 .
}
order by ?mb
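To check the pathway side of that observation, the number of human pathways can be counted directly. The following is a minimal sketch, assuming that pathways are typed as wp:Pathway and carry wp:organism in the same way as in the query above; the variable name ?pwCount is chosen here for illustration only.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

# Count the distinct pathways annotated with the human taxon (NCBI 9606).
# Assumption: every pathway resource is typed as wp:Pathway.
select (count(distinct ?pw) as ?pwCount) where {
  ?pw a wp:Pathway ;
      wp:organism ncbi:9606 .
}
```

If this count is clearly lower than the number of human pathways listed on the website, the wp:organism annotation is a plausible place to look first.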
Pathways with the most metabolites
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
select ?pathway (count(?mb) as ?mbCount) where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pathway .
}
group by ?pathway
order by desc(?mbCount)
Metabolites in the most Pathways
Note that BridgeDB identifier mapping is not involved yet.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
select ?mb (count(?pathway) as ?pwCount) where {
  ?mb a wp:Metabolite ;
      dcterms:isPartOf ?pathway .
}
group by ?mb
order by desc(?pwCount)
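The query above counts pathway membership per node. A variant that groups by identifier instead gives a rough view of which compounds occur in the most pathways, even without BridgeDB mappings. This is a sketch only: it assumes that dc: refers to the Dublin Core elements 1.1 namespace and that dc:identifier holds one IRI per database identifier (as the HMDB results earlier on this page suggest). It still treats the same compound annotated from two different databases as two entries, and it counts each pathway revision separately.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>   # assumed namespace for dc:
prefix dcterms: <http://purl.org/dc/terms/>

# Group by the identifier IRI rather than by the pathway-specific node,
# so the same identifier used in several pathways is counted once per pathway.
select ?identifier (count(distinct ?pathway) as ?pwCount) where {
  ?mb a wp:Metabolite ;
      dc:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
}
group by ?identifier
order by desc(?pwCount)
```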
Identifier Mapping Completeness
Right now, HMDB is the primary (and only) source of mappings. That raises the question of how many metabolites in WikiPathways have no mappings to other databases. The following queries address that question.
The missing mappings
The next query counts all identifiers that are missing in HMDB and therefore lack mappings to other databases: at the time of writing, there are 927. These are not unique identifiers; the number of unique identifiers is 404 at the time of writing. Given that there are about 1400 unique metabolite identifiers, this is about 30%, which is rather significant.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
select (count(?source) as ?count) where {
  ?mb a wp:Metabolite ;
      dc:source ?source ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB")
  FILTER (str(?identifier) != "")
}
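The count of unique identifiers mentioned above can be obtained by wrapping a distinct selection in a subquery. The following sketch uses standard SPARQL 1.1 subquery syntax with the same patterns as the query above. It is an assumption that this matches how the reported number was originally produced, and the dc: namespace and the variable name ?uniqueCount are likewise assumptions made for illustration.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>   # assumed namespace for dc:
prefix dcterms: <http://purl.org/dc/terms/>

# The inner query keeps each (source, identifier) pair only once;
# the outer query counts the remaining rows.
select (count(*) as ?uniqueCount) where {
  {
    select distinct ?source ?identifier where {
      ?mb a wp:Metabolite ;
          dc:source ?source ;
          rdfs:label ?label ;
          dcterms:identifier ?identifier ;
          dcterms:isPartOf ?pathway .
      OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
      FILTER (!BOUND(?bridgedb))
      FILTER (str(?source) IN ("ChEBI", "CAS", "Kegg Compound", "Chemspider", "HMDB"))
      FILTER (str(?identifier) != "")
    }
  }
}
```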
The full list
These are the unique identifiers missing:
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
select distinct (str(?source) as ?sourceName) (str(?identifier) as ?id) where {
  ?mb a wp:Metabolite ;
      dc:source ?source ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB")
  FILTER (str(?identifier) != "")
}
order by ?sourceName ?id
ChEBI identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "ChEBI"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
}
order by ?identifier
CAS identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "CAS"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
}
order by ?identifier
Kegg compound identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "Kegg Compound"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
}
order by ?identifier
PubChem-compound identifiers not in HMDB
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?identifier ?label where {
  ?mb a wp:Metabolite ;
      dc:source "PubChem-compound"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
}
order by ?identifier
ChemSpider
Unique ChemSpider IDs
Unique ChemSpider IDs can be counted with:
select (count(distinct ?csid) as ?count) where {
  [] <http://semanticscience.org/resource/CHEMINF_000200> ?Concept .
  ?Concept ?p ?csid ;
      a <http://semanticscience.org/resource/CHEMINF_000405> .
}
They can all be listed with this non-counting equivalent:
select distinct (str(?csid) as ?id) where {
  [] <http://semanticscience.org/resource/CHEMINF_000200> ?Concept .
  ?Concept ?p ?csid ;
      a <http://semanticscience.org/resource/CHEMINF_000405> .
}
Curation
Common wrong identifiers
PubChem-compound 1004
CID 1004 is often wrongly used for phosphate; it is the uncharged compound (phosphoric acid). Phosphate, and in particular things like "Pi", should instead use CID 1061 for ortho-phosphate, i.e. [PO4]3-.
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select ?pathway ?source where {
  ?mb dc:source ?source ;
      dcterms:isPartOf ?pathway ;
      dcterms:identifier "1004"^^xsd:string .
}
Outdated HMDB identifiers
This query lists HMDB identifiers that are used in WikiPathways but have been revoked or have become secondary identifiers.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source "HMDB"^^xsd:string ;
      dc:identifier ?identifier .
  OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
}
order by ?identifier
Metabolites not classified as such
One can list all data sources for non-metabolites with this query.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
select ?datasource (count(?identifier) as ?count) where {
  ?mb dc:source ?datasource ;
      dcterms:identifier ?identifier .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
}
group by ?datasource
order by desc(?count)
That mostly lists gene identifier sources and the like, but watch out for metabolite identifier data sources: they point to nodes that carry a metabolite identifier but are not marked as metabolites. Further down the list is CAS (but genes are chemicals too...), followed by a few minor ones:
"CTD Gene"^^<http://www.w3.org/2001/XMLSchema#string> 5
"HMDB"^^<http://www.w3.org/2001/XMLSchema#string> 4
"ChEBI"^^<http://www.w3.org/2001/XMLSchema#string> 3
"GLYCAN"^^<http://www.w3.org/2001/XMLSchema#string> 3
"COMPOUND"^^<http://www.w3.org/2001/XMLSchema#string> 3
"PubChem"^^<http://www.w3.org/2001/XMLSchema#string> 2
I would expect GLYCAN and COMPOUND to be misnomers of the matching KEGG subsets.
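To inspect the nodes behind these counts, the data sources from the list above can be checked in one query. This is a sketch that assumes the source names appear in the data exactly as in that list and that dc: refers to the Dublin Core elements 1.1 namespace; the selection of names is illustrative and can be adjusted.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>   # assumed namespace for dc:
prefix dcterms: <http://purl.org/dc/terms/>

# List nodes that are not typed as wp:Metabolite
# but use one of the metabolite-related data sources.
select distinct ?pathway ?mb ?label ?source ?identifier where {
  ?mb dc:source ?source ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb rdfs:label ?label . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
  FILTER (str(?source) IN ("HMDB", "ChEBI", "GLYCAN", "COMPOUND", "PubChem"))
}
order by ?source ?pathway
```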
Non-Metabolites with CAS identifier
Note that a CAS identifier can also refer to mixtures, compound classes, etc.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?mb ?label ?identifier where {
  ?mb dc:source "CAS"^^xsd:string ;
      rdfs:label ?label ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
}
order by ?pathway
Non-Metabolites with PubChem identifier
At the time of writing, this results in an empty set.
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?mb ?label ?identifier where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb rdfs:label ?label . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
}
order by ?pathway
Metabolites with an identifier but undefined data source
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?mb ?identifier where {
  ?mb a wp:Metabolite ;
      dc:source ""^^xsd:string ;
      dc:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER (!isIRI(?identifier))
  FILTER (str(?identifier) != "")
}
order by ?pathway
Metabolites with a data source but no identifier
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?mb ?source where {
  ?mb a wp:Metabolite ;
      dcterms:identifier ""^^xsd:string ;
      dc:source ?source ;
      dcterms:isPartOf ?pathway .
  FILTER (str(?source) != "")
  FILTER (!regex(str(?pathway), "internal.wikipathways.org", "i"))
}
order by ?pathway
Metabolites with an Entrez Gene identifier
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?pathway ?mb ?label ?identifier where {
  ?mb a wp:Metabolite ;
      rdfs:label ?label ;
      dc:source "Entrez Gene"^^xsd:string ;
      dcterms:identifier ?identifier ;
      dcterms:isPartOf ?pathway .
  FILTER (str(?identifier) != "")
}
order by ?pathway