Help:WikiPathways Metabolomics


Revision as of 10:56, 10 February 2013 by Egonw (Talk | contribs)

On this page we collect SPARQL queries to see the state of the Metabolome in WikiPathways. Triggered by User:Andra's RDF / SPARQL work, curation started with metabolites without database identifiers. But this soon led to the observation that metabolites are often not even annotated as being a metabolite (using <Label> rather than <DataNode>). Therefore, User:Egonw started at Pathway:WP1 to curate them one by one and fix these issues:

  • connect the lines between metabolites
  • convert metabolites to use <DataNode> rather than <Label>

These are basic properties that metabolomics research needs from the pathways.


Metabolome

The following queries provide an overview of the Metabolome captured by WikiPathways.

The key type for metabolites is wp:Metabolite. We can see all available properties with:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?mb a wp:Metabolite ;
    ?p [] .
}

Run

Likewise, we can get all pathway properties with:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>

select distinct ?p where {
  ?pw a wp:Pathway ;
    ?p [] .
}

Run

Latest data only

To restrict an analysis to the most recent pathway revisions, add this snippet to your SPARQL query, assuming ?mb and ?pathway are the variable names used:

  ?mb dcterms:isPartOf ?pathway .
  ?pathway pav:version ?version .
  ?mb dcterms:isPartOf ?pathway2 .
  ?pathway2 pav:version ?version2 .
  FILTER (?version2 > ?version)

However, keep in mind that this is not a fool-proof solution.
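
One way to state the "no newer revision exists" condition explicitly is FILTER NOT EXISTS. The following is only a sketch under the same assumptions as the snippet above: that a ?mb can be joined to more than one pathway revision through dcterms:isPartOf, and that pav:version literals compare sensibly with ">". Neither assumption has been verified against the data, so this is not fool-proof either.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix pav:     <http://purl.org/pav/>

# Sketch: count metabolites, keeping a pathway revision only when no
# revision with a higher pav:version holds the same ?mb.
select (count(distinct ?mb) as ?count)
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
  ?pathway pav:version ?version .
  FILTER NOT EXISTS {
    ?mb dcterms:isPartOf ?pathway2 .
    ?pathway2 pav:version ?version2 .
    FILTER (?version2 > ?version)
  }
}
```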

All Metabolites

Count

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>

select (count(?mb) as ?count) where {
  ?mb a wp:Metabolite .
}

Run

List

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms:  <http://purl.org/dc/terms/>

select distinct ?mb ?label where {
  ?mb a wp:Metabolite ;
     rdfs:label ?label .
}

Run

Metabolic Data Sources

Sorted by use

HMDB, ChEBI, and KEGG are the main data sources for identifiers. InChI/InChIKey should also be there but is missing. A big curation process in January 2013 ensured that "PubChem compound" is now used as data source for PubChem CIDs.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

# count the distinct identifiers per data source, most used first
select ?datasource (count(distinct ?identifier) as ?count)
where {
  ?mb a wp:Metabolite ;
    dc:source ?datasource ;
    dc:identifier ?identifier .
} group by ?datasource
order by desc(?count)

Run

All metabolites from one source

All KEGG identifiers

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "Kegg Compound"^^xsd:string ;
    dc:identifier ?identifier .
} order by ?identifier

Run

All HMDB identifiers

At the time of writing, this showed a number of XRefs with HMDB as data source but no identifier, which need curation:

http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1002_r35260
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1119_r35265
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1250_r41240
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1266_r41328
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1285_r41669
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1304_r41670
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1310_r41659
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP1339_r35269
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP167_r45138
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP2267_r53133
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP28_r38852
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP28_r38852/group/ac37a
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP295_r41324
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP337_r41644
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP495_r41327
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP59_r41653
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP678_r41165
http://www.hmdb.ca/metabolites/noIdentifier 	http://rdf.wikipathways.org/Pathway/WP716_r45017

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB"^^xsd:string ;
    dc:identifier ?identifier .
} order by ?identifier

Run

Metabolic Pathways

Metabolomes

Human Metabolome

This only returns 244 metabolites, which is not a lot at all, and it does not even take the metabolite identity into account. Is something wrong with wp:organism? It finds 107 human pathways.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix ncbi:    <http://purl.obolibrary.org/obo/NCBITaxon_>

select distinct ?mb where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pw .
  ?pw wp:organism ncbi:9606 .
} order by ?mb

Run
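
To cross-check the number of human pathways mentioned above, a count along the following lines could be used. This is a sketch that reuses the wp:Pathway type and wp:organism property from the queries on this page; it is not necessarily how the figure of 107 was obtained, and separate revisions of one pathway may each be counted.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix ncbi:    <http://purl.obolibrary.org/obo/NCBITaxon_>

# Sketch: number of pathway resources annotated with the human taxon
select (count(distinct ?pw) as ?count)
where {
  ?pw a wp:Pathway ;
    wp:organism ncbi:9606 .
}
```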

Pathways with the most metabolites

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix pav:     <http://purl.org/pav/>

# count the metabolites per pathway, largest first
select ?pathway (count(?mb) as ?mbCount)
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
} group by ?pathway
order by desc(?mbCount)

Run

Metabolites in the most Pathways

Note that BridgeDb is not involved yet, so identifiers from different databases are not mapped onto each other here.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>
prefix pav:     <http://purl.org/pav/>

# count the pathways per metabolite, most widely used first
select ?mb (count(?pathway) as ?pwCount)
where {
  ?mb a wp:Metabolite ;
    dcterms:isPartOf ?pathway .
} group by ?mb
order by desc(?pwCount)

Run
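
Because the query above groups by the ?mb resource, a variant that groups by data source and identifier may give a better picture of which compounds occur in the most pathways. The following is a sketch only: it assumes that dcterms:identifier holds the plain identifier literal, as in the mapping queries elsewhere on this page, and without BridgeDb the same compound annotated with identifiers from two databases still ends up in two rows.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

# Sketch: number of distinct pathways per (data source, identifier) pair
select ?source ?identifier (count(distinct ?pathway) as ?pwCount)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?identifier) != "")
} group by ?source ?identifier
order by desc(?pwCount)
```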

Identifier Mapping Completeness


Right now, HMDB is the primary (and only) source of identifier mappings. That raises the question of how many metabolites in WikiPathways do not have mappings to other databases. The following queries address that.

The missing mappings

The next query counts all identifiers that are missing in HMDB and therefore lack mappings to other databases: at the time of writing, there are 927 of them. This count includes duplicates; the number of unique identifiers is 404 (Run) at the time of writing. Given that there are about 1400 unique metabolite identifiers, this is about 30%, which is rather significant.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

select (count(?source) as ?count)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB") 
  FILTER (str(?identifier) != "") 
}
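
The count of unique identifiers can be obtained by wrapping the same pattern in a subquery that selects the distinct source and identifier pairs. This is a sketch and not necessarily the query behind the Run link above.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

# Sketch: count each (source, identifier) pair without a mapping only once
select (count(*) as ?count)
where {
  {
    select distinct ?source ?identifier
    where {
      ?mb a wp:Metabolite ;
        dc:source ?source ;
        rdfs:label ?label ;
        dcterms:identifier ?identifier ;
        dcterms:isPartOf ?pathway .
      OPTIONAL { ?mb wp:bdbHmdb ?bridgedb . }
      FILTER (!BOUND(?bridgedb))
      FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB")
      FILTER (str(?identifier) != "")
    }
  }
}
```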

The full list

These are the unique identifiers missing:

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

select distinct (str(?source) as ?sourceName) (str(?identifier) as ?id)
where {
  ?mb a wp:Metabolite ;
    dc:source ?source ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
  FILTER (str(?source) = "ChEBI" || str(?source) = "CAS" || str(?source) = "Kegg Compound" || str(?source) = "Chemspider" || str(?source) = "HMDB") 
  FILTER (str(?identifier) != "") 
} order by ?source ?identifier

Run

ChEBI identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "ChEBI"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Run

CAS identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "CAS"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Run

Kegg compound identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "Kegg Compound"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Run

PubChem-compound identifiers not in HMDB

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?identifier ?label
where {
  ?mb a wp:Metabolite ;
    dc:source "PubChem-compound"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Run

Curation

Common wrong identifiers

PubChem-compound 1004

PubChem CID 1004 is often wrongly used for phosphate. It is the uncharged compound (phosphoric acid). Phosphate, and particularly things like "Pi", should instead use CID 1061 for ortho-phosphate, aka [PO4]3-.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select ?pathway ?source
where {
  ?mb dc:source ?source ;
    dcterms:isPartOf ?pathway ;
    dcterms:identifier "1004"^^xsd:string .
}

Run

Outdated HMDB identifiers

These results show HMDB identifiers that are used in WikiPathways but have been revoked or have become secondary identifiers.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?identifier
where {
  ?mb a wp:Metabolite ;
    dc:source "HMDB"^^xsd:string ;
    dc:identifier ?identifier .
  OPTIONAL { ?mb  wp:bdbHmdb ?bridgedb . }
  FILTER (!BOUND(?bridgedb))
} order by ?identifier

Run

Metabolites not classified as such

One can list all data sources for non-metabolites with this query.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

# count identifiers per data source for everything not typed as wp:Metabolite
select ?datasource (count(?identifier) as ?count)
where {
  ?mb dc:source ?datasource ;
    dcterms:identifier ?identifier .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} group by ?datasource
order by desc(?count)

Run

That mostly lists gene identifier sources and the like, but watch out for metabolite identifier data sources: they reveal metabolites that carry a metabolite identifier but are not marked as a metabolite. Further down the list is CAS (but genes are chemicals too...), and a few more minor ones:

"CTD Gene"^^<http://www.w3.org/2001/XMLSchema#string> 	5
"HMDB"^^<http://www.w3.org/2001/XMLSchema#string> 	4
"ChEBI"^^<http://www.w3.org/2001/XMLSchema#string> 	3
"GLYCAN"^^<http://www.w3.org/2001/XMLSchema#string> 	3
"COMPOUND"^^<http://www.w3.org/2001/XMLSchema#string> 	3
"PubChem"^^<http://www.w3.org/2001/XMLSchema#string> 	2

I would expect GLYCAN and COMPOUND to be misnomers of the matching KEGG subsets.
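
To find the individual nodes behind these counts, the per-source queries in the following sections can be generalized to several data sources at once. This is a sketch: the source names are copied from the result list above, and the properties are the ones used elsewhere on this page.

```sparql
prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>

# Sketch: nodes with a metabolite-type data source that are not typed as wp:Metabolite
select distinct ?pathway ?mb ?source ?identifier
where {
  ?mb dc:source ?source ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?source) IN ("HMDB", "ChEBI", "GLYCAN", "COMPOUND", "PubChem"))
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?source ?pathway
```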

Non-Metabolites with CAS identifier

Note that a CAS identifier can also refer to mixtures, compound classes, etc.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier 
where {
  ?mb dc:source "CAS"^^xsd:string ;
    rdfs:label ?label ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?pathway

Run

Non-Metabolites with PubChem identifier

At the time of writing, this results in an empty set.

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier 
where {
  ?mb dc:source "PubChem-compound"^^xsd:string ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  OPTIONAL { ?mb rdfs:label ?label . }
  FILTER NOT EXISTS { ?mb a wp:Metabolite }
} order by ?pathway

Metabolites with an identifier but undefined data source

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?identifier 
where {
  ?mb a wp:Metabolite ;
    dc:source ""^^xsd:string ;
    dc:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER (!isIRI(?identifier))
  FILTER (str(?identifier) != "")
} order by ?pathway

Run

Metabolites with a data source but no identifier

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?source 
where {
  ?mb a wp:Metabolite ;
    dcterms:identifier ""^^xsd:string ;
    dc:source ?source ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?source) != "")
  FILTER (!regex(str(?pathway),  "internal.wikipathways.org", "i"))
} order by ?pathway

Run

Metabolites with an Entrez Gene identifier

prefix wp:      <http://vocabularies.wikipathways.org/wp#>
prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dc:      <http://purl.org/dc/elements/1.1/>
prefix xsd:     <http://www.w3.org/2001/XMLSchema#>

select distinct ?pathway ?mb ?label ?identifier 
where {
  ?mb a wp:Metabolite ;
    rdfs:label ?label ;
    dc:source "Entrez Gene"^^xsd:string ;
    dcterms:identifier ?identifier ;
    dcterms:isPartOf ?pathway .
  FILTER (str(?identifier) != "")
} order by ?pathway

Run
