Help:WikiPathways Metabolomics

Current revision (08:00, 10 February 2023)

On this page we collect SPARQL queries that show the state of the metabolome in WikiPathways. Triggered by User:Andra's RDF / SPARQL work, curation started with metabolites that lack database identifiers. This soon led to the observation that metabolites are often not even annotated as metabolites (they use <Label> rather than <DataNode>). Therefore, User:Egonw started at Pathway:WP1 to curate them one by one and fix these issues:

  • connect lines between metabolites
  • convert metabolites to use <DataNode> rather than <Label>

These are basic properties that metabolomics research depends on.

The Data

You can look up the latest revision with:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>

select str(?o) where {
  ?pw a wp:Pathway ;
    <http://purl.org/pav/version> ?o .
} order by desc(?o) limit 1
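
Ordering by the version only works if the values sort in revision order. As a hedged alternative, the sketch below casts each version to an integer and takes the maximum. It assumes that every pav:version value is a plain revision number, and it uses standard SPARQL 1.1 aggregate syntax, which has not been verified against this endpoint.

```sparql
prefix wp:  <http://vocabularies.wikipathways.org/wp#>
prefix pav: <http://purl.org/pav/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

# Assumption: every pav:version value is a plain revision number.
select (max(xsd:integer(str(?o))) as ?latest) where {
  ?pw a wp:Pathway ;
    pav:version ?o .
}
```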

Metabolome

The following queries provide an overview of the metabolome captured by WikiPathways.

The key type for metabolites is wp:Metabolite. We can see all available properties with:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?mb a wp:Metabolite ;
    ?p [] .
}


Pathway properties

Likewise, we can get all pathway properties with:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?pw a wp:Pathway ;
    ?p [] .
}

Latest data only

To restrict an analysis to the most recent pathways, add this snippet to your SPARQL query, assuming ?pathway is the variable name used and the pav: prefix (<http://purl.org/pav/>) is declared:

  ?mb dcterms:isPartOf ?pathway .
  ?pathway pav:version ?version .
  ?mb dcterms:isPartOf ?pathway2 .
  ?pathway2 pav:version ?version2 .
  FILTER (?version2 > ?version)

However, keep in mind that this is not a fool-proof solution.
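
One way to make the intent explicit is FILTER NOT EXISTS, which keeps a pathway only when no higher revision of the same pathway is present. The following is a minimal sketch, not a tested solution. It assumes that all revisions of one pathway share the URI part before "_r" (as in WP716_r45017) and that pav:version values compare in revision order. It may also be slow on a large store.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix pav:     <http://purl.org/pav/>

select distinct ?mb ?pathway where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
  ?pathway pav:version ?version .
  # Keep ?pathway only if no newer revision of the same pathway exists.
  FILTER NOT EXISTS {
    ?pathway2 pav:version ?version2 .
    # Assumption: revisions of one pathway share the URI part before "_r".
    FILTER (strbefore(str(?pathway2), "_r") = strbefore(str(?pathway), "_r"))
    FILTER (?version2 > ?version)
  }
}
```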

All Metabolites

Count

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>

select count(?mb) where {
  ?mb a wp:Metabolite .
}

Revision  Count
67787     5790
69675     5801

List

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>

select distinct ?mb ?label where {
  ?mb a wp:Metabolite ;
     rdfs:label ?label .
}

All zebrafish metabolites

PREFIX wp:      <http://vocabularies.wikipathways.org/wp#>
PREFIX gpml:    <http://vocabularies.wikipathways.org/gpml#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc:      <http://purl.org/dc/elements/1.1/>
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select distinct ?metabolite (str(?titleLit) as ?title) where {
  ?metabolite a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw dc:title ?titleLit ;
    wp:organismName "Danio rerio" .
}

Run

Metabolic Data Sources

Sorted by use

HMDB, ChEBI, and KEGG are the main data sources for identifiers. InChI/InChIKey should also be there but are missing. A large curation effort in January 2013 ensured that "PubChem compound" is now used as the data source for PubChem CIDs.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select str(?datasource) as ?source count(distinct ?identifier) as ?count
where {
  ?mb a wp:Metabolite ;
    dc:source ?datasource ;
    dc:identifier ?identifier .
} order by desc(?count)
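
The query above relies on a shorthand in which the aggregate is written without parentheses and the grouping is implicit. For endpoints that expect standard SPARQL 1.1, the same count can be sketched with an explicit GROUP BY. This variant groups on the raw ?datasource value rather than its string form, and it has not been run against the WikiPathways endpoint.

```sparql
prefix wp: <http://vocabularies.wikipathways.org/wp#>
prefix dc: <http://purl.org/dc/elements/1.1/>

# Count distinct identifiers per data source, most used first.
select ?datasource (count(distinct ?identifier) as ?count)
where {
  ?mb a wp:Metabolite ;
    dc:source ?datasource ;
    dc:identifier ?identifier .
}
group by ?datasource
order by desc(?count)
```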

All metabolites from one source

All KEGG identifiers

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "KEGG Compound" ;
    dc:identifier ?identifier .
} order by ?identifier

All HMDB identifiers

Return all HMDB identifiers with:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB" ;
    dc:identifier ?identifier .
} order by ?identifier

Return all metabolites that list HMDB as their data source but have no identifier:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB"^^xsd:string ;
    dc:identifier ?identifier .
  FILTER (regex(str(?identifier),"noIdentifier"))
} order by ?identifier

At the time of writing, this showed a number of XRefs with HMDB as data source but no identifier, which need curation:

http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1002_r35260
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1119_r35265
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1250_r41240
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1266_r41328
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1285_r41669
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1304_r41670
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1310_r41659
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1339_r35269
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP167_r45138
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP2267_r53133
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP28_r38852
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP28_r38852/group/ac37a
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP295_r41324
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP337_r41644
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP495_r41327
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP59_r41653
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP678_r41165
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP716_r45017
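
The two-column listing above pairs each placeholder metabolite URI with the pathway it occurs in. A query along the following lines would return both columns. It is a sketch that assumes the first column is the metabolite resource and the second comes from dcterms:isPartOf. The data source is compared through str() so that the literal's datatype does not matter.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct ?mb ?pathway
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?source) = "HMDB")
  # Placeholder URIs end in "noIdentifier" when no identifier was given.
  FILTER (regex(str(?mb), "noIdentifier"))
} order by ?pathway
```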

Metabolic Pathways

Of general interest is the number of pathways per species:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi:    <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct str(?orgName) as ?organism count(?pw) as ?pathways  where {
  ?pw wp:organism ?organismCode .
  ?organismCode rdfs:label ?orgName
} order by desc(?pathways)

Metabolomes

Human Metabolome

This only returns 244 metabolites, which is not a lot at all, and does not even take metabolite identity into account. Is something wrong with wp:organism? It finds 107 human pathways.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi:    <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:9606 .
} order by ?mb

Revision  Count
67787     1972
69675     2000
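
The number of human pathways mentioned above can be checked with a separate count. This is a minimal sketch in standard SPARQL 1.1 syntax. It assumes that pathways are typed as wp:Pathway and linked to the taxon with wp:organism, as in the queries on this page.

```sparql
prefix wp:   <http://vocabularies.wikipathways.org/wp#>
prefix ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

# Count the pathways annotated with the human taxon (NCBI Taxonomy 9606).
select (count(distinct ?pw) as ?pathways) where {
  ?pw a wp:Pathway ;
    wp:organism ncbi:9606 .
}
```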

Arabidopsis thaliana Metabolome

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi:    <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:3702 .
} order by ?mb

Revision  Count
69675     17
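
The per-species queries above differ only in the taxon. As a sketch, the metabolome size of every organism can be counted in one query by grouping on the organism. This uses standard SPARQL 1.1 aggregate syntax and has not been run against the endpoint.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>

# Count distinct metabolite resources per organism, largest first.
select ?organism (count(distinct ?mb) as ?metabolites) where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw wp:organism ?organism .
}
group by ?organism
order by desc(?metabolites)
```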

Pathways with the most metabolites

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix pav:     <http://purl.org/pav/>

select ?pathway count(?mb) as ?mbCount
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
} order by desc(?mbCount)

Metabolites in the most Pathways

Note that BridgeDb identifier mapping is not involved yet, so the same metabolite annotated with different identifiers may be counted separately.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix pav:     <http://purl.org/pav/>

select ?mb count(?pathway) as ?pwCount
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
} order by desc(?pwCount)

Identifier Mapping Completeness

Right now, HMDB is the primary (and only) source of mappings. That raises the question of how many metabolites in WikiPathways lack mappings to other databases. The following queries address that.

The missing mappings

The next query counts all identifiers that are missing in HMDB and therefore lack mappings to other databases: at the time of writing, there are 724 (previously 927). That count includes repeated identifiers; the number of unique identifiers is 369 (previously 404). Given that there are about 1400 unique metabolite identifiers, this is more than a quarter, which is rather significant. The major databases with unmapped identifiers are ChEBI and KEGG (see the screenshot on the right).

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select (count(?source) as ?count)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB") 
  FILTER (str(?identifier) != "") 
}
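
The number of unique identifiers mentioned in the text can be approximated by de-duplicating the source and identifier pairs before counting. This is a sketch only, assuming the endpoint supports SPARQL 1.1 subqueries and the IN operator; the variable names ?sourceName, ?id and ?uniqueCount are illustrative:

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

# The outer query counts the rows of the de-duplicated inner result.
select (count(*) as ?uniqueCount)
where {
  {
    select distinct (str(?source) as ?sourceName) (str(?identifier) as ?id)
    where {
      ?mb a wp:Metabolite ;
        dc:source ?source ;
        rdfs:label ?label ;
        dcterms:identifier ?identifier ;
        dcterms:isPartOf ?pathway .
      # Same "no HMDB mapping" test as in the query above.
      OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
      FILTER (!BOUND(?bridgedb))
      FILTER (str(?source) IN ("ChEBI", "CAS", "Kegg Compound", "Chemspider", "HMDB"))
      FILTER (str(?identifier) != "")
    }
  }
}
```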

The full list

These are the unique identifiers missing:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct (str(?source) as ?sourceName) (str(?identifier) as ?id)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB") 
  FILTER (str(?identifier) != "") 
} order by ?source ?identifier

ChEBI identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "ChEBI"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

CAS identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "CAS"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Kegg compound identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "Kegg Compound"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

PubChem-compound identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "PubChem-compound"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

ChemSpider

Unique ChemSpider IDs

They can be counted with:

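prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf:    <http://xmlns.com/foaf/0.1/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix cheminf: <http://semanticscience.org/resource/>

select (count(distinct ?identifier) as ?count) where {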
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:identifier ?identifier ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
}

And all listed with this non-counting equivalent:

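prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf:    <http://xmlns.com/foaf/0.1/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix cheminf: <http://semanticscience.org/resource/>

select distinct (str(?identifier) as ?csid) where {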
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:identifier ?identifier ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
}

Linking ChemSpider IDs to WikiPathway

Not all pathways have a foaf:page, which I still need to ask Andra about; this table should be discussed with Antony:

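prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf:    <http://xmlns.com/foaf/0.1/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix cheminf: <http://semanticscience.org/resource/>

select distinct (str(?identifier) as ?csid) ?page where {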
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:identifier ?identifier ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
  ?pathway foaf:page ?page .
} order by ?csid
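
To see which pathways are affected by the missing foaf:page, the pattern can be inverted. This is a sketch under the assumption that the data is modelled exactly as in the query above; it lists pathways that contain a ChemSpider-annotated metabolite but have no foaf:page at all:

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix foaf:    <http://xmlns.com/foaf/0.1/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway where {
  ?mb a wp:Metabolite ;
    dc:source "Chemspider"^^xsd:string ;
    dcterms:isPartOf ?pathway .
  # Keep only pathways for which no foaf:page triple exists.
  FILTER NOT EXISTS { ?pathway foaf:page ?page . }
} order by ?pathway
```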


Curation

Common wrong identifiers

PubChem-compound 1004

This CID is wrongly used for phosphate: it is the uncharged compound (phosphoric acid). Phosphate, and particularly things like "Pi", should instead use CID 1061 for ortho-phosphate, aka [PO4]3-.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?source
where {
  ?mb dc:source ?source ;
    dcterms:isPartOf ?pathway ;
    dcterms:identifier "1004"^^xsd:string .
}
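
Because "1004" may also be a valid identifier in other databases, it can help to restrict the search to the PubChem data source and to show the label, so that nodes labelled as phosphate or Pi stand out. This is a sketch, assuming the data source string "PubChem-compound" used elsewhere on this page; the alias ?name is illustrative:

```sparql
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb (str(?label) as ?name)
where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;
    dcterms:identifier "1004"^^xsd:string ;
    dcterms:isPartOf ?pathway .
  # The label is optional so that unlabelled nodes are still reported.
  OPTIONAL { ?mb rdfs:label ?label . }
} order by ?pathway
```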

Outdated HMDB identifiers

This query lists HMDB identifiers that are used in WikiPathways but have been revoked or have become secondary identifiers.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB"^^xsd:string ;
    dc:identifier ?identifier .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Metabolites not classified as such

One can list all data sources for non-metabolites with this query.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select ?datasource (count(?identifier) as ?count)
where {
  ?mb dc:source ?datasource ;
    dcterms:identifier ?identifier .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} group by ?datasource
order by desc(?count)

That mostly lists gene identifier sources and the like, but watch out for the metabolite identifier data sources: metabolites that are not marked as such but do have a metabolite identifier can be found this way. Further down the list are CAS (but genes are chemicals too...) and a few minor ones:

"CTD Gene"^^<http://www.w3.org/2001/XMLSchema#string> 	5
"HMDB"^^<http://www.w3.org/2001/XMLSchema#string> 	4
"ChEBI"^^<http://www.w3.org/2001/XMLSchema#string> 	3
"GLYCAN"^^<http://www.w3.org/2001/XMLSchema#string> 	3
"COMPOUND"^^<http://www.w3.org/2001/XMLSchema#string> 	3
"PubChem"^^<http://www.w3.org/2001/XMLSchema#string> 	2

I would expect GLYCAN and COMPOUND to be misnomers of the matching KEGG subsets.
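
One way to check that expectation is to list the nodes behind those two data sources and look at their identifiers; KEGG glycan and compound identifiers start with G and C respectively. This is a sketch, assuming the data source strings are exactly "GLYCAN" and "COMPOUND" as in the result list above; the variable names are illustrative:

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>

select distinct ?pathway ?node (str(?datasource) as ?source) (str(?identifier) as ?id)
where {
  ?node dc:source ?datasource ;
    dcterms:identifier ?identifier .
  OPTIONAL { ?node dcterms:isPartOf ?pathway . }
  FILTER NOT EXISTS { ?node a wp:Metabolite }
  # Compare on the plain string so the datatype of the literal does not matter.
  FILTER (str(?datasource) IN ("GLYCAN", "COMPOUND"))
} order by ?datasource ?identifier
```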

Non-Metabolites with CAS identifier

Note that a CAS identifier can also refer to mixtures, compound classes, etc.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb (str(?label) as ?name) (str(?identifier) as ?id)
where {
  ?mb dc:source "CAS"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?pathway

Non-Metabolites with PubChem identifier

At the time of writing, this results in an empty set.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier 
where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb rdfs:label ?label . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?pathway

Metabolites sometimes marked as DataNode@Type Metabolite

Based on label comparisons, we can find data nodes that are not typed as metabolite but share their label with a data node that is. Of course, this can give false positives, because genes can be incorrectly marked as metabolite in some pathway, but that is another SPARQL query.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?nonmb ?mb ?label
where {
  ?nonmb rdfs:label ?label .
  ?mb rdfs:label ?label .
  OPTIONAL { ?nonmb dcterms:isPartOf ?pathway . }
  FILTER ( ?nonmb != ?mb )
  FILTER NOT EXISTS { ?nonmb a wp:Metabolite }
  FILTER EXISTS { ?mb a wp:Metabolite }
  FILTER (!regex(str(?nonmb),  "noIdentifier", "i"))
  FILTER (!regex(str(?mb),  "noIdentifier", "i"))
}

Metabolites with an identifier but undefined data source

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?identifier 
where {
  ?mb a wp:Metabolite ;
    dc:source ""^^xsd:string ;
    dc:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER (!isIRI(?identifier))
  FILTER (str(?identifier) != "")
} order by ?pathway

Metabolites with a data source but no identifier

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?source 
where {
  ?mb a wp:Metabolite ;
    dcterms:identifier ""^^xsd:string ;
    dc:source ?source ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?source) != "")
  FILTER (!regex(str(?pathway),  "internal.wikipathways.org", "i"))
} order by ?pathway

Metabolites with too many labels

This is mainly caused by metabolite URIs being based on a non-existent identifier:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>

select (count(?label) as ?count) ?pathway ?mb
where {
  ?mb a wp:Metabolite ;
    rdfs:label ?label ;
    dcterms:isPartOf ?pathway .
} group by ?pathway ?mb
order by desc(?count) ?pathway ?mb limit 410

An example of such an entity, which has many labels and is at once a metabolite, gene, complex, etc.:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>

select distinct (str(?label) as ?name) ?type
where {
  <http://bio2rdf.org/geneid:noIdentifier> a ?type ; rdfs:label ?label .
} order by ?label

Metabolites with an Entrez Gene identifier

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier 
where {
  ?mb a wp:Metabolite ;
    rdfs:label ?label ;
    dc:source "Entrez Gene"^^xsd:string ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?identifier) != "")
} order by ?pathway

Metabolites as just Label

Metabolites may be marked up as DataNode but not typed as Metabolite. Here are some examples: ATP, CO2, ADP, Phosphate, L-glutamate, and Cholesterol.

ATP

This example shows how to find them.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?source ?mb ?type
where {
  ?mb rdfs:label "ATP"@en .
  ?mb a ?type .
  OPTIONAL { ?mb dc:source ?source . }
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite . }
}
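
The same check can be run for all of the example labels at once. This is a sketch, assuming the endpoint supports the SPARQL 1.1 IN operator; it compares on str(?label), so it matches regardless of language tag, and the alias ?name is illustrative. Since the filter has to look at every label, it may be slow on a large store:

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>

select ?pathway ?mb (str(?label) as ?name) ?type
where {
  ?mb rdfs:label ?label ;
    a ?type .
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  # The six example labels named in the text above.
  FILTER (str(?label) IN ("ATP", "CO2", "ADP", "Phosphate", "L-glutamate", "Cholesterol"))
  FILTER NOT EXISTS { ?mb a wp:Metabolite . }
} order by ?label ?pathway
```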

Metabolites also labeled as GeneProduct

Sometimes things are incorrectly marked as Metabolite, when they really are GeneProducts. We can list entities based on their label that are both annotated as Metabolite and as GeneProduct:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?mb ?gene ?label
where {
  ?gene rdfs:label ?label .
  ?mb rdfs:label ?label .
  OPTIONAL { ?mb dcterms:isPartOf ?pathway . }
  FILTER ( ?gene != ?mb )
  FILTER EXISTS { ?gene a wp:GeneProduct }
  FILTER EXISTS { ?mb a wp:Metabolite }
  FILTER (!regex(str(?mb),  "noIdentifier", "i"))
  FILTER (!regex(str(?gene),  "noIdentifier", "i"))
}

Actually, this query does not do what I want it to do: the FILTER only removes rows from the result list, but things with "noIdentifier" can still hook other things up, which messes up this query if there is just one URI with noIdentifier and the same label :(

Labels which are also marked as metabolite

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix gpml:    <http://vocabularies.wikipathways.org/gpml#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?labelNode (str(?label1) as ?labelStr) ?mb (str(?label2) as ?mbStr)
where {
  ?labelNode a gpml:Label ;
    rdfs:label ?label1 ;
    dcterms:isPartOf ?pathway .
  ?mb a wp:Metabolite ;
    rdfs:label ?label2 .
  FILTER ( ?labelNode != ?mb )
  FILTER ( str(?label2) = str(?label1) )
  FILTER (!regex(str(?mb),  "noIdentifier", "i"))
  FILTER (!regex(str(?labelNode),  "noIdentifier", "i"))
} LIMIT 50 OFFSET 25

To get the most common such labels, use the following (though it typically times out on Virtuoso 6.1):

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix gpml:    <http://vocabularies.wikipathways.org/gpml#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select (str(?label1) as ?labelStr) (count(?labelNode) as ?count)
where {
  ?labelNode a gpml:Label ;
    rdfs:label ?label1 .
  ?mb a wp:Metabolite ;
    rdfs:label ?label2 .
  FILTER ( ?labelNode != ?mb )
  FILTER ( str(?label2) = str(?label1) )
  FILTER (!regex(str(?mb),  "noIdentifier", "i"))
  FILTER (!regex(str(?labelNode),  "noIdentifier", "i"))
} group by ?label1
order by desc(?count)