Proposal to use RDF/SPARQL to define Foundation Schema to map VistA elements

At the last Architecture phone call, I suggested using semantic web/ontology technology to describe the VistA architecture, which would give us a formal, common platform for discussing everything required to install and operate a VistA instance.  

Attached is the proposal, which points to the RDFS of the schema at http://www.metavista.name/foundation/foundation.rdfs

 

Proposal

An RDF approach to managing the VistA software foundation

 

Tom Munnecke

January 9, 2012


 

This is a proposal to use a common Resource Description Framework (RDF) approach to identify and manage the components in OSEHRA software effort. 

 

What is RDF?

 

Resource Description Framework (RDF http://www.w3.org/RDF/ ) is a standard model for data interchange on the Web, defined by the World Wide Web Consortium (W3C http://www.w3.org/ ). RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.

 

RDF extends the linking structure of the Web to use URIs (Uniform Resource Identifiers; the URL is one form of a URI) to name the relationship between things as well as the two ends of the link (such a statement is usually referred to as a “triple”). Using this simple model, structured and semi-structured data can be mixed, exposed, and shared across different applications.

 

This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations.

 

RDF represents complex relationships as a collection of assertions, each consisting of a subject, a predicate (or “verb”), and an object.  For example, from the XINDEX Fanin file:

 

3. PACKAGE NAME: QUASAR

        …

        ADD^VADPT (REGISTRATION) 

  …

 

RDF might express this information as triples (in pseudocode):

 

     QUASAR is a package.

     ADD^VADPT is an entrypoint.

     VADPT is a routine.

     ADD^VADPT is part of the Registration package.

     QUASAR calls ADD^VADPT.
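In a concrete RDF serialization such as Turtle, the same assertions might look like the sketch below. The fnd: namespace, class names, and resource names are illustrative assumptions, not the final schema:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fnd: <http://www.metavista.name/foundation/foundation.rdfs#> .

# "^" is not legal in a Turtle prefixed name, so ADD^VADPT is written
# here with a hyphen; a real encoding would use a full or escaped IRI.
fnd:QUASAR    rdf:type       fnd:Package .
fnd:ADD-VADPT rdf:type       fnd:Entrypoint .
fnd:VADPT     rdf:type       fnd:Routine .
fnd:ADD-VADPT fnd:is_part_of fnd:Registration .
fnd:QUASAR    fnd:calls      fnd:ADD-VADPT .
```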

 

An RDF Schema http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#rdfschema  is a way of formally describing what is said in RDF expressions.  For example, it would describe the fact that we are using Routines and Packages as subjects and objects, and using “is_a,” “is_part_of,” and “calls” as predicates or verbs to describe the relationships between them.  Schemas may be simple (or not even pre-defined), or they may allow rich expression of assertions and consistency constraints.  Protégé http://protege.stanford.edu/ is an open source ontology editor and knowledge-base framework which can be used to manage these schemas.
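A minimal RDFS fragment of the kind described above might look like this sketch in Turtle (the fnd: namespace and the exact class and property names are assumptions for illustration):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fnd:  <http://www.metavista.name/foundation/foundation.rdfs#> .

# Classes usable as subjects or objects
fnd:Routine    rdf:type rdfs:Class .
fnd:Package    rdf:type rdfs:Class .

# Properties (verbs) with their domains and ranges
fnd:calls      rdf:type    rdf:Property ;
               rdfs:domain fnd:Routine ;
               rdfs:range  fnd:Routine .

fnd:is_part_of rdf:type    rdf:Property ;
               rdfs:domain fnd:Routine ;
               rdfs:range  fnd:Package .
```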

RDF assertions may be stored in a triple-store database (e.g. the Sesame open source triple store http://www.openrdf.org/ ) to create a directed graph of information.  This provides a much richer form of expression than is possible via traditional relational databases accessed via SQL.

SPARQL http://www.w3.org/TR/rdf-sparql-query/ is a query language to express queries across diverse data sources and schemas, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph.  A SPARQL endpoint http://semanticweb.org/wiki/SPARQL_endpoint is a machine-readable interface to a knowledge base, allowing query results to be returned in both machine- and human-readable formats.
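For example, a SPARQL query against such a triple store might list everything the QUASAR package calls, together with the package each target belongs to. The prefix and property names here follow the illustrative fnd: vocabulary assumed above:

```sparql
PREFIX fnd: <http://www.metavista.name/foundation/foundation.rdfs#>

# Find every target QUASAR calls, and (optionally) its owning package
SELECT ?target ?package
WHERE {
  fnd:QUASAR fnd:calls ?target .
  OPTIONAL { ?target fnd:is_part_of ?package . }
}
ORDER BY ?target
```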

The Foundation Schema: An RDF approach to VistA.

This paper proposes the creation of an RDF Schema called Foundation that would provide a common SPARQL endpoint with which to describe everything required to install and operate an instance of VistA:

1.     Name all of the elements that are required to operate an instance of VistA.  This includes routines, globals, packages, FileMan files and descriptions, test scripts, documentation, APIs, entry points, ontologies, file and table builds, device information, etc.

2.     Name all of the relationships between these elements, such as which routines call others, belong to which packages, use which files, execute code, relate to external activities, etc.

3.     Provide a triple store repository for collecting all of this information, offering a SPARQL endpoint for query, statistical analysis, comparison between versions (e.g. VA and HIS forks), and tracking changes over time to a given fork.
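The version-comparison goal in step 3 can be sketched as a SPARQL query over named graphs, one graph per fork. The graph IRIs and the fnd: vocabulary below are placeholders for illustration, not existing endpoints:

```sparql
PREFIX fnd: <http://www.metavista.name/foundation/foundation.rdfs#>

# Routines present in the VA graph but missing from a HIS fork
SELECT ?routine
WHERE {
  GRAPH <http://example.org/graphs/va>  { ?routine a fnd:Routine . }
  FILTER NOT EXISTS {
    GRAPH <http://example.org/graphs/his> { ?routine a fnd:Routine . }
  }
}
```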

This schema would provide a formal mechanism for defining the “glue” that pulls together all of the software and elements within VistA.  It is capable of expressing the meta-level information that has driven VistA over the years.  For example, Semantic Vista http://www.caregraf.org/semanticvista is an approach to expressing VistA FileMan metadata in RDF format.  The information gleaned from the XINDEX refactoring project http://www.osehra.org/group/ehr-refactoring-services could be formatted into RDF Foundation format, which would allow the linkage between routines, packages, and FileMan files and fields to be expressed.  The OSEHRA SKIDS project http://www.osehra.org/group/skids could use this repository for source code version control that is able to express the subtleties of the FileMan data dictionary.  This framework could also be used by the architecture group http://www.osehra.org/group/architecture in its efforts.

Through the use of SPARQL, the Foundation schema could be linked to other repositories as well, for example genomics ontologies http://www.osehra.org/blog/lessons-learned-emerge-mapping-common-data-elements-de-1 or an XML feed of the Federal Enterprise Architecture http://www.itdashboard.gov/data_feeds .

An Experimental Prototype

An RDF Schema for the Foundation is available at http://www.metavista.name/foundation/foundation.rdfs  The initial version defines the classes (which may be used for subjects or objects) Routine, Package, File, Global, Parameter, X_code (executable code), and Language.  It defines the properties (which may be used as verbs) calls, entrypoint, contains, has_input_parameter, has_output_parameter, embedded_languge, set_global, kills_global, reads_global, uses_file, and uses_field.
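To illustrate how these classes and properties might combine, here is a hypothetical instance description in Turtle. The resource names (VADPT, Mumps, DPT, PatientFile) are invented for illustration; the property spellings follow the draft schema as published:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fnd: <http://www.metavista.name/foundation/foundation.rdfs#> .

fnd:VADPT rdf:type             fnd:Routine ;
          fnd:embedded_languge fnd:Mumps ;       # property name as spelled in the draft schema
          fnd:reads_global     fnd:DPT ;
          fnd:uses_file        fnd:PatientFile .
```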

An experimental version of this data store is available online at http://vistaewd.net:8980/openrdf-workbench/repositories/gpltest/contexts courtesy of George Lilly.

A version of the Patient file in RDF, as captured by Conor Dowling of http://www.caregraf.org/semanticvista , has been loaded into the repository.


 

 

 

Comments

Entry Point verb?

Afsin Ustundag's picture

Thank you for the proposal.  We put the alpha version of the XINDEX-based tool M-RoutineAnalyzer on GitHub. We also put some dependency notes for Problem List in the documents we use in our refactoring efforts for Problem List. If we can get this kind of information from SPARQL queries, I think it would be a very useful tool for future efforts.  So we will spend some time this month to provide M-RoutineAnalyzer data in RDF Foundation format.

One thing we need to clarify, however, is why entryPoint is a Property in http://www.metavista.name/foundation/foundation.rdfs. I think it should be a Type.  For our efforts, Entry Points are more fundamental than Routines.  In fact, Routines can be considered just containers for Entry Points, and most of the other properties (calls, input parameter, etc.) should have Entry Point as their domain/range.

I see your point about entry points

Tom Munnecke's picture

This will be great to get some experience with real data.  I can see your point about entry point being a type, but I'm not sure how this fits in with other perspectives that would treat the Routine as an entity... I think it all comes back to how complicated it is to build the SPARQL queries to sort this all out in the context of future uses.  I guess the best thing is to just try it one way or the other and see how it works on the query side.

This doesn't have to be cast in concrete, of course... we can prototype it and then adjust it as necessary.

I suspect that Conor has some insight into this.

Are you planning to incorporate this into a refactoring workbench toolkit?  If so, what does the toolkit look like?

P.S.  I have a beta test key for Dydra http://dydra.com/

It's like getting everything into the same closet...

Tom Munnecke's picture

I'm glad folks are discussing this idea...

I guess I liken this to making sure that all our stuff is at least located in the same closet, with a unique name for future retrieval.  We might not know exactly how we are going to organize it in the future, but at least we know where it is and what its name is.  Rather than having things spread all over as flat files, UML, spreadsheets, Word docs, DOC manuals, globals, etc., at least we have a common framework for collecting them, and a known place where we know we can look for them.

The SPARQL endpoint is like the closet door - we know that we can find whatever we want in the Foundation repository.  It might take some rummaging, but we know we can find it if we look hard enough.

I see a couple of paths forward:

1.  Prototype it.  I think the most important thing is that we get some experience with this approach to see the practical issues of making it work.  Getting a simple approach started is critical, I think.  I think we are closing in on this now, and would be happy to modify the RDFS to meet your suggestion above.

2. Move it to the Protégé editor.  I just coded the current RDFS by hand, but I think that moving to Protégé http://protege.stanford.edu/ makes a lot of sense.  This also allows us to move to OWL http://protege.stanford.edu/overview/protege-owl.html , which would give us a richer set of tools for talking about consistency, dependency, etc.

3. Look at integrating into a refactoring workbench.  I'm not sure where things are going for refactoring tools, but it seems like this would be a great asset for helping out interactively with the refactoring process. 

P.S.  I'll be at HIMSS next week, giving a talk about some of my ideas on MetaVistA concepts at the Open Health Tools meeting on Friday.  http://www.openhealthtools.org/  OHT has its roots in Eclipse, and I suspect that I'll find lots of other tool-oriented thinkers there.  I'm happy to meet anyone else from OSEHRA there.

This is a really exciting technology to explore as a way of reducing the complexity of the VistA system.  And I think we are just scratching the surface of what we can do with it... 


Real data

Catalin Branea's picture

We have an OWL/RDF file containing Packages, Routines, and Globals here: https://github.com/OSEHR/M-RoutineAnalyzer/blob/master/out/VistAOWL.zip

It has been loaded and validated with Protégé 4.1, and I managed to get an HTML representation of it with the OWLDoc export plugin. To keep it small enough, the only verbs used are contains, calls, and reads_global. Anyway, I think it provides a good starting point; we can improve it later as we become more familiar with RDF and SPARQL.

Any comments will be greatly appreciated,

Catalin