Lessons learned from eMERGE: Mapping Common Data Elements (CDEs)

The eMERGE (electronic MEdical Records and GEnomics: https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Main_Page) network was funded by the NIH (NHGRI) to mine Electronic Health Records (EHRs) for phenotype-genotype associations, using large population data drawn from EHRs and biorepositories. The network consists of five primary sites: (1) Vanderbilt, (2) The Marshfield Clinic, (3) Mayo, (4) Northwestern and (5) Group Health Seattle. There are also ancillary investigators at other sites.

I am not part of this NIH-funded program, but would like to use it as a model for how OSEHRA could be configured for discovery of gene-phene associations. As is often the case, defining clinical phenotypes is a first step, and there are many databases available for this purpose. The International Health Terminology Standards Development Organization offers one approach to harmonization. For example, SNOMED CT, which many presume to be the most comprehensive clinical terminology available, can be searched at www.hl7.org.

Definitions of a Common Data Element vary. Here is one (from www.biomedcentral.com/1755-8794/2/66):

"A Common Data Element (CDE) is a metadata definition with an informal explanation of its meaning and usage, a list of alternative names and definitions, units of measurement, and the type of values to be recorded. CDEs can be created for any kind of concept, measurement, or application, and, although grouped into "Data Element Concepts" for convenience, need not derive their meaning from their position in a complex hierarchy or graph. This is in contrast to the ontological approach to data definition, often used in bioinformatics applications, where each subclass is part of a specification for a representational vocabulary for a particular domain. Although classifications or ontologies can be added to a database of CDEs, they can be used to support navigation and inference on an application-specific basis: there is no requirement to locate a CDE within an existing domain ontology before recording the semantics of a data definition."

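To make the definition above a little more concrete, here is a minimal Python sketch of a CDE as a flat metadata record. It is only an illustration under my own assumptions: the class name, field names, and the example element are hypothetical and do not come from any real CDE repository. The point it tries to show is that the ontology classification is optional, so the element is usable without being placed in any hierarchy.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CommonDataElement:
    """Hypothetical flat CDE record, following the fields named in the quote."""
    name: str
    explanation: str                      # informal meaning and usage
    alternative_names: List[str] = field(default_factory=list)
    unit: Optional[str] = None            # unit of measurement, if any
    value_type: type = str                # type of values to be recorded
    # Optional, application-specific classification; a CDE is valid without it.
    ontology_terms: List[str] = field(default_factory=list)

    def accepts(self, value) -> bool:
        """Return True if the value has the recorded type."""
        return isinstance(value, self.value_type)

    def matches_name(self, label: str) -> bool:
        """Case-insensitive match against the name or any alternative name."""
        wanted = label.strip().lower()
        names = [self.name] + self.alternative_names
        return any(wanted == n.lower() for n in names)


# Made-up example element for illustration only.
systolic_bp = CommonDataElement(
    name="Systolic blood pressure",
    explanation="Peak arterial pressure during a heartbeat, as recorded in the EHR.",
    alternative_names=["SBP", "systolic BP"],
    unit="mmHg",
    value_type=float,
)

print(systolic_bp.matches_name("sbp"))   # True
print(systolic_bp.accepts(120.0))        # True
print(systolic_bp.ontology_terms)        # []
```

The empty `ontology_terms` list is the design choice worth noticing: classifications can be attached later for navigation or inference, but nothing in the record depends on them.
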
The following table was adapted from Table 2 in "Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE network experience" (see: Pathak J et al. J Am Med Inform Assoc 2011;18:376-386 - http://jamia.bmj.com/content/18/4/376.long).

Table #1: Glossary of key terms and definitions. Modified from the original; any errors are mine!

"For querying the caDSR, we use the caDSR HTTP API, which allows an application to connect to caDSR remotely and search the database. The API provides various forms of functions for querying the caDSR, and returns the results in a well-formed XML document. As mentioned above, ...this is based on the ISO/IEC 11179 model for metadata registration and, as a result, decomposes the essence of a DE in well-formed parts, separating the conceptual entity (DE concept) from its physical representation in a database (value domain). The DE concept may be associated with an object class and a property, and the value domains have a list of permissible values (Figure 1). Consequently, our searches for appropriate string matches were restricted to the DE concept and permissible values of the CDEs in the caDSR."

Figure 1: The NCI Cancer Data Standards Repository (caDSR) and the ISO/IEC 11179 model for metadata registries (from Pathak J et al. J Am Med Inform Assoc 2011;18:376-386 - http://jamia.bmj.com/content/18/4/376.long)
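
The decomposition described in the quote above (a DE concept built from an object class and a property, plus a value domain with permissible values) can be sketched in a few lines of Python. This is a rough illustration only: it does not call the caDSR HTTP API, and the class names, the `search` helper, the IDs, and the records are all hypothetical. It simply shows what restricting a string match to the DE concept and the permissible values might look like.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataElementConcept:
    object_class: str     # the thing being described
    property_name: str    # the characteristic being recorded

    def label(self) -> str:
        return f"{self.object_class} {self.property_name}"


@dataclass
class ValueDomain:
    permissible_values: List[str]


@dataclass
class DataElement:
    public_id: str
    concept: DataElementConcept
    value_domain: ValueDomain


def search(elements: List[DataElement], term: str) -> List[str]:
    """Return IDs of elements whose DE concept or permissible values contain term."""
    needle = term.lower()
    hits = []
    for de in elements:
        in_concept = needle in de.concept.label().lower()
        in_values = any(needle in v.lower()
                        for v in de.value_domain.permissible_values)
        if in_concept or in_values:
            hits.append(de.public_id)
    return hits


# Hypothetical records; the IDs and values are made up for illustration.
registry = [
    DataElement("DE-1", DataElementConcept("Patient", "Smoking Status"),
                ValueDomain(["Current smoker", "Former smoker", "Never smoked"])),
    DataElement("DE-2", DataElementConcept("Patient", "Cataract Type"),
                ValueDomain(["Nuclear", "Cortical", "Posterior subcapsular"])),
]

print(search(registry, "smok"))      # ['DE-1']
print(search(registry, "cortical"))  # ['DE-2']
print(search(registry, "patient"))   # ['DE-1', 'DE-2']
```

Keeping the concept and the value domain as separate objects mirrors the split between the conceptual entity and its physical representation, so the same search can look at either part without knowing how the other is stored.
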


Well, I hope that this provides some value for the start of a discussion. I must admit that I am learning as I go, even though I have been funded in the past by the National Science Foundation to help to develop a multi-scale ontology for cardiac anatomy. Let me know if this makes sense in the current context of the project.



CDE and Ontology

Manjula Dharmawardhana

This is a very interesting research initiative and directly overlaps with my interests. Thank you very much, Gerry, for initiating this. I am still a student in the area. I would like to raise some issues. Please pardon me if I am confused about the theories.

As I understand it, CDE and ontology are two ways of representing knowledge in a common and standard format. What are the pros and cons of using each method?

Should a project like this always rely on its own genomic repository, and is there a way to use the vast amount of information scattered here and there?

Thank you




Manjulpra- Sorry for the

Gerald Higgins


Sorry for the delayed response. I will answer your last question first:

You asked: Should a project like this always rely on its own genomic repository, and is there a way to use the vast amount of information scattered here and there?

The answer for the development of an open source system such as OSEHRA is that I would use what is already being used by groups such as Gene Ontology (GO). There are ways to use scattered information "here and there" through machine-based Natural Language Processing. However, at this point, I would recommend standards already in place or in use in clinical genomics, such as those of the Health Level 7 group, and even then an informatician still has to "be in the loop" to deploy them accurately (a paper on this topic will be posted in this group next week).


I hope this long-winded explanation helps with your first question:

CDEs and ontologies differ in usage, but are not necessarily distinct. Common data elements may exist in an ontology, but ontologies (probably should not be used in the plural) provide a framework for examining the attributes of the components of a system and how they interact, not necessarily concrete definitions of the words themselves - they use controlled vocabularies (see text from GO below).

"Ontology" comes from the Greek (ὤν = being and λόγος = word/speech).

"An ontology is a specification of a conceptualization."

I like the work of Mark Musen and colleagues at Stanford, and the Protégé open source development community, for examining how ontology works as a formalism: http://protegewiki.stanford.edu/wiki/Main_Page

Also, from http://www.geneontology.org:

Ontology: "The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, the development and maintenance of the ontologies themselves; second, the annotation of gene products, which entails making associations between the ontologies and the genes and gene products in the collaborating databases; and third, development of tools that facilitate the creation, maintenance and use of ontologies...The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels: for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity. ...

It is important to clearly state the scope of GO, and what it does and does not cover. The preceding section explained the domains covered by GO; the following areas are outside the scope of GO, and terms in these domains would not appear in the ontologies.

  • Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
  • Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.
  • Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see the OBO website for more information).
  • Protein domains or structural features.
  • Protein-protein interactions.
  • Environment, evolution and expression.
  • Anatomical or histological features above the level of cellular components, including cell types.

GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.

GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus.

GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following:

  • Knowledge changes and updates lag behind.
  • Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
  • GO does not attempt to describe every aspect of biology; its scope is limited to the domains described above.
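
As a rough illustration of the "queried at different levels" idea in the GO text above, here is a minimal Python sketch. It assumes a toy is_a hierarchy and invented gene product names (GeneA to GeneD); the term names echo the GO example, but the structure and the annotations are hypothetical and are not taken from the real GO files. A query on a broad term collects everything annotated to the terms beneath it, while a query on a narrow term returns only its own annotations.

```python
from typing import Dict, List, Set

# Toy is_a hierarchy: child term -> parent terms (invented for illustration).
IS_A: Dict[str, List[str]] = {
    "receptor tyrosine kinase activity": ["signal transduction"],
    "G protein-coupled receptor activity": ["signal transduction"],
    "signal transduction": [],
}

# Hypothetical annotations: term -> gene products annotated directly to it.
ANNOTATIONS: Dict[str, Set[str]] = {
    "receptor tyrosine kinase activity": {"GeneA", "GeneB"},
    "G protein-coupled receptor activity": {"GeneC"},
    "signal transduction": {"GeneD"},
}


def descendants(term: str) -> Set[str]:
    """Return the term plus every term reachable below it through is_a links."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parents in IS_A.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def gene_products(term: str) -> Set[str]:
    """Collect gene products annotated to the term or any of its descendants."""
    products: Set[str] = set()
    for t in descendants(term):
        products |= ANNOTATIONS.get(t, set())
    return products


# Broad query: everything under "signal transduction".
print(sorted(gene_products("signal transduction")))
# ['GeneA', 'GeneB', 'GeneC', 'GeneD']

# Narrow query: zoom in on one child term.
print(sorted(gene_products("receptor tyrosine kinase activity")))
# ['GeneA', 'GeneB']
```

This is also where the contrast with a flat CDE shows up: here the meaning of a query depends on where a term sits in the hierarchy, which a CDE does not require.
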



Yeay, Ontologies and Semantics...

Tom Munnecke

I really like this level of abstraction, and it opens up a whole new level of connecting clinical information at a meta level that I think is necessary for dealing with the exploding complexities, big data sets, and rapid changes VistA is facing.

re: "I like the work of Mark Musen and colleagues at Stanford, and the Protégé open source development community, for examining how ontology works as a formalism: http://protegewiki.stanford.edu/wiki/Main_Page"

Yes, I think that this is a very good approach for formalizing the meta-level understanding of VistA. Note the work done on Semantic VistA by Conor Dowling (http://www.caregraf.org/semanticvista) and an RDF schema for describing the VistA software foundational elements for refactoring, configuration management, and installation: http://www.metavista.name/foundation/foundation.rdfs

We could also use a similar framework for defining a schema for social network analysis... 

I'm trying to pull all these ideas together into a framework I call MetaVista... which I'll be presenting next week at the VistA community meeting in Sacramento...