Genotypes, exomes, whole genomes and raw sequence reads in the Electronic Health Record

My intent is to upload the various file formats for genomic data as defined by the HL7 (Health Level Seven) Clinical Genomics working group, but I am too dumb to figure out how to load documents on the OSEHRA site, so I will start by initiating a discussion.

Clinical practitioners can now interactively produce and query a patient report for genetic tests spanning over 2000 inherited diseases from a single whole-genome sequence, using ( as a valid guide.

The amount of human genomic data is accumulating at an unprecedented rate (see Figure 1 below). For example, the BGI at Shenzhen, China has now installed over two exabytes (2 billion gigabytes) of storage to house DNA sequencing data. The institute will use the storage infrastructure to unify its 250 next-generation sequencers onto a single shared pool of storage with a single file system. The BGI's computing platform exceeds 1,000 teraflops, or one quadrillion floating point operations per second. BGI, as it is now known, is the world's largest genome sequencing center. Its sequencing output is now more than 40,000 human genomes per year. Its key accomplishments include the first de novo sequencing and assembly of various mammalian species, including the human genome, with short-read sequencing (so-called "next generation sequencing"), and the first sequencing of an ancient human genome. It has received over $1.5 B in collaborative U.S. funds from the China Bank.

The storage and access of different files containing patient genomic data represents a “Big Data” challenge, as was elaborated in PCAST NITRD “Big Data” Strategy Directive 12/2010:

“Data volumes are growing exponentially”

  • There are many reasons for this growth:
    • the creation of nearly all data today in digital form
    • a proliferation of sensors (e.g. Next-Generation Sequencing)
    • new data sources such as high-resolution imagery and video.
  • The collection, management, and analysis of data is a fast-growing concern of NIT research.
  • Automated analysis techniques such as data mining and machine learning facilitate the transformation of data into knowledge, and of knowledge into action.

“Every Federal agency needs to have a ‘big data’ strategy”

The next posts in this series will define the technical requirements and routes for EHR integration of these massive patient-specific data records.



Integrating genomics and VA scale

Tom Munnecke


(Sorry for the duplicate posting, but I'm not sure how to link Drupal conversations)

I think that there are also some really interesting things about scale and the VA: the VA's 30-year clinical record, coupled with the Million Veteran vision, coupled with 300,000 PCs connected but idle 16 hrs/day. If connected in a grid configuration (say, BOINC), that would provide 4.8 million hrs/day of search time. And if we could figure out a way of partitioning and staging genomic data across this grid (say, 10 petabytes for 1m genomes), we could be looking at very low-cost, highly parallel processing. Imagine a search taking 3 hrs per genome: this could be accomplished over 1m genomes overnight, and coupled with the clinical information of the patient involved. This would allow folks to browse for associations at a massive scale...
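To make the back-of-envelope numbers above easy to check, here is a rough sketch in Python. Every input is just the estimate quoted in this post (300,000 PCs, 16 idle hrs/day, 10 petabytes for 1m genomes, 3 hrs per search per genome); the function and variable names are made up for illustration, and decimal units (1 PB = 1,000,000 GB) are assumed. It says nothing about how the data would actually be partitioned or how BOINC would schedule the work.

```python
# Back-of-envelope check of the grid figures quoted in the post.
# All inputs are the post's own estimates, not measurements.

IDLE_PCS = 300_000            # VA PCs assumed connected but idle
IDLE_HOURS_PER_DAY = 16       # idle hours per PC per day
GENOMES = 1_000_000           # one genome per veteran (Million Veteran vision)
TOTAL_STORAGE_PB = 10         # assumed total storage for all genomes
SEARCH_HOURS_PER_GENOME = 3   # imagined cost of one search on one genome


def grid_estimate():
    """Return the derived capacity figures as a dictionary."""
    pc_hours_available = IDLE_PCS * IDLE_HOURS_PER_DAY
    pc_hours_needed = GENOMES * SEARCH_HOURS_PER_GENOME
    # Assumes decimal units: 1 PB = 1,000,000 GB.
    gb_per_genome = TOTAL_STORAGE_PB * 1_000_000 / GENOMES
    genomes_per_pc = GENOMES / IDLE_PCS
    return {
        "pc_hours_available_per_day": pc_hours_available,
        "pc_hours_needed_per_search": pc_hours_needed,
        "fits_in_one_idle_window": pc_hours_needed <= pc_hours_available,
        "gb_per_genome": gb_per_genome,
        "genomes_per_pc": genomes_per_pc,
        "gb_staged_per_pc": genomes_per_pc * gb_per_genome,
        "hours_busy_per_pc": genomes_per_pc * SEARCH_HOURS_PER_GENOME,
    }


if __name__ == "__main__":
    for name, value in grid_estimate().items():
        print(f"{name}: {value}")
```

Under these assumptions a full sweep needs 3 million PC hrs against 4.8 million available per idle window, so each PC would hold three or four genomes and be busy for about 10 of its 16 idle hours.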

For example, "what has the VA's experience been with aggressive prostate cancer in men with these SNPs?" could be a spontaneous question, triggering off a million PC hrs over a million patients... all for the marginal cost of the electricity to do the computation.

This is really interesting stuff... I'm extremely interested in genomics. I've done multi-million pair SNP testing on 18 members over 4 generations of my family, and just sent off two more full exome family samples with 23andMe. (My daughter is dir of R&D of Pathway Genomics, and I'm on their advisory board.) Here's a video interview I did of Esther Dyson on the PHR and personal genomics : and an after-dinner conversation with Patient Privacy Rights dynamo Deborah Peel's reaction to the privacy implications of it: 

San Diego is a hotbed of this kind of stuff.  Folks might be interested in the Future of Genomics Medicine Conference Mar 2

And here is MIT prof Peter Szolovits talking about his work leading up to PHR. I arranged a meeting with Pete, Tim Berners-Lee (web creator), Rob Kolodner, and other architects back in the early days (when I had to explain what the web was), which Rob credits as being seminal to My HealtheVet...

I think we should stir the pot to include the work of Christakis and Fowler and the role of health and social networks.  (See Connected: )

I'll be back in Amherst for a commemorative workshop on Lynn Margulis' work Mar 23-24; maybe we could connect to talk more about this?



RE: Integrating genomics and VA scale

Gerald Higgins


I would be very interested in connecting with all interested parties who want to see a genome-enabled EHR realized for the VA. I am at anyone's beck and call, and I have more to say about the data storage challenges.

Kind regards - Gerry Higgins, Ph.D.