Re: [architecture] Comment on Anonymizer Functions in OSEHRA Technical Journal

Russ Davis and others,

Having served as Chief Innovation Officer at MedStar and Director of Translational Informatics at Johns Hopkins Hospital, and having blogged extensively about this in the past at www.healthsystemcio.com (e.g., http://healthsystemcio.com/2010/12/04/how-private-and-secure-is-protected-health-information-phi/), I believe this is a losing battle. I tend to be "doom and gloom," but note that AES-256 has been cracked (http://research.microsoft.com/en-us/projects/cryptanalysis/aesbc.pdf). The attack appears to work better if the perpetrator can get their hands on both keys. Genomic information is such a unique identifier (i.e., forensics) that it will be extremely difficult to defend Protected Health Information (PHI).

Kind regards - Gerry Higgins, Ph.D.
VP, Pharmacogenomic Science, AssureRx Health, Inc.
Professor, Harvard Medical School (BIDMC)

On Mon, Mar 5, 2012 at 7:35 AM, RDavis wrote:
> Tom, thanks for looking at the anonymizer posting; good comments. There has
> been previous work on anonymization, such as the k-anonymity model [1].
> Other work compares the utility versus privacy (read: security) of various
> approaches [2].
>
> The first step is testing the algorithms to ensure proper anonymization
> takes place. We don't want an approach that compromises privacy while
> giving us the impression that it is secure. Using the compiled test
> fixtures, the instance of the data was analyzed to see statistically what
> the results look like. This was coupled with the known frequency of names
> to see the ramifications. Given there are no restrictions on the code
> provided, it offers others the ability to build on the capabilities.
>
> Russ Davis
>
> [1] Latanya Sweeney, "k-anonymity: a model for protecting privacy,"
> International Journal on Uncertainty, Fuzziness and Knowledge-based
> Systems, 10 (5), 2002.
>
> [2] Vibhor Rastogi, Dan Suciu, and Sungho Hong, "The Boundary Between
> Privacy and Utility in Data Publishing," VLDB '07, September 23-28, 2007,
> Vienna, Austria.
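For readers unfamiliar with the k-anonymity model cited in [1], here is a minimal sketch of the idea in Python. All field names and records below are hypothetical; a table is k-anonymous when every combination of quasi-identifier values it releases appears at least k times.

from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values in `rows`
    appears at least k times (Sweeney's k-anonymity condition)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical released records: ages generalized to ranges, ZIP codes
# truncated so each quasi-identifier combination covers several people.
released = [
    {"age": "30-39", "zip": "212**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "212**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "208**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "208**", "diagnosis": "diabetes"},
]

print(is_k_anonymous(released, ["age", "zip"], k=2))  # True
print(is_k_anonymous(released, ["age", "zip"], k=3))  # False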

Comments

Re: Gerry Higgins' Comments

Russell Davis

Gerry, 

The Advanced Encryption Standard (AES) is a symmetric-key algorithm; that is, the same key used to encrypt information is used to decrypt it [FIPS 197]. AES could be used to generate a message authentication code, thereby creating a one-way function. The Secure Hash Algorithm 256 (SHA-256), by contrast, is a one-way function used in creating digital signatures: an object (such as a file) of any length is converted to a 256-bit message digest (checksum).
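As a concrete illustration of the one-way digest (a minimal sketch using Python's standard hashlib module; the message text is made up):

import hashlib

message = b"An object of any length, such as the contents of a file"
digest = hashlib.sha256(message).hexdigest()

print(digest)            # 64 hex characters
print(len(digest) * 4)   # 256 bits

# Deterministic and one-way: the same input always yields the same
# digest, but the digest cannot feasibly be inverted to the input.
assert digest == hashlib.sha256(message).hexdigest()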

In the past, checksum values (such as cyclic redundancy checks or polynomial checksums) were used with communications protocols to detect errors. The challenge is that once an algorithm is known, plaintext attacks can be used to recover the original information. For example, the older VAX computers used the Purdy polynomial algorithm to convert passwords into checksum values. Unfortunately, hackers were able to write algorithms to conduct dictionary attacks (see http://h71000.www7.hp.com/openvms/journal/v3/ask_the_wizard.html for additional information).
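The dictionary attack described above can be sketched in a few lines. This is a hypothetical illustration, with SHA-256 standing in for the old Purdy polynomial; the wordlist and "stolen" digest are invented.

import hashlib

def dictionary_attack(target_digest, wordlist):
    """Hash each candidate word and compare against the stolen digest.
    The attacker never inverts the hash; they simply hash guesses
    until one matches, which defeats any known, unsalted function."""
    for word in wordlist:
        if hashlib.sha256(word.encode()).hexdigest() == target_digest:
            return word
    return None

# Hypothetical stolen digest of a weak password.
stolen = hashlib.sha256(b"password1").hexdigest()
print(dictionary_attack(stolen, ["letmein", "hunter2", "password1"]))  # password1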

 

Russ Davis


Re: [architecture] Comment on Anonymizer Functions in OSEHRA Technical Journal

Gerald Higgins

Russ Davis-

I am certainly not an expert in cryptography, but it was my understanding that both SHA-256 and SHA-512 can be hacked. See http://blog.hacker.dk/2010/04/cracking-sha-256-and-sha-512-linuxunix-pas... for an application that can do this, at least for Linux/Unix passwords.

I should ask my son, who is currently an Honors/Scholar triple major at the University of Maryland in Mathematics, Computer Science, and Genetics, and who, among many others (including George Church, Chair of Genetics at Harvard/MIT), argues that all genomic data should be freely accessible, because it will be available to hackers in any case. I do not agree with this position.

I am very concerned about this issue. I know someone who was able to log in to large hospital systems in the DC area using their old, simple password and access *all of the identified EHR records of 15 different hospitals in the DC/Baltimore region*. They had been terminated from their position 5 years earlier, and yet these data were still available to them. My wife and I also had our EHR data from Johns Hopkins exposed when an employee took home a laptop containing a simple way to access about 10,000 EHRs, and it was stolen. At least Johns Hopkins notified us about this situation immediately.

My home office workstation is an 8-teraflop, nVidia Tesla-based computer, and I have analyzed target-enriched whole genomes of over 12K de-identified individuals (not from the VA or DoD) as part of my work. The data are never stored here, and these sequences have no identifiers associated with them. Still, this can be worrisome.

Kind regards - Gerry Higgins, Ph.D.
VP, Pharmacogenomic Science, AssureRx Health, Inc.
Professor, Harvard Medical School (BIDMC)



Tuples vs networks?

Tom Munnecke

Russ, Thanks for the links to the k-anonymity work... I remember digging into that a while back, and thinking "what if the data isn't in tuples?"  Folks tend to assume a field-within-record format, where there is a lot of implicit knowledge embedded in the fact that these fields are located on the same record for some reason. 

But if we are dealing with complex medical information with genomic and social-network relationships that are inherently in a NoSQL format, such as the high-dimensionality VistA database, this fixed field/record relationship becomes dubious, and all kinds of other methods of inference become critical: social network analysis, for example (who is friends of friends with whom), and, of course, DNA information coupled with pedigree information (see: "your daddy ain't your daddy but your daddy don't know").
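To make the network point concrete, here is a toy sketch in Python with entirely hypothetical data: even with names removed, a node's pattern of connections can act as a fingerprint linking an "anonymized" graph back to a public one.

# Toy re-identification by degree fingerprint: match nodes between an
# "anonymized" graph and a public one using their connection counts.
anonymized = {"n1": {"n2", "n3", "n4"}, "n2": {"n1"},
              "n3": {"n1"}, "n4": {"n1"}}
public = {"Alice": {"Bob", "Carol", "Dave"}, "Bob": {"Alice"},
          "Carol": {"Alice"}, "Dave": {"Alice"}}

def degrees(graph):
    return {node: len(neighbors) for node, neighbors in graph.items()}

anon_deg, pub_deg = degrees(anonymized), degrees(public)

# Any anonymized node whose degree matches exactly one public node is
# re-identified without ever seeing a name in the released data.
for node, d in anon_deg.items():
    candidates = [p for p, pd in pub_deg.items() if pd == d]
    if len(candidates) == 1:
        print(node, "is likely", candidates[0])  # n1 is likely Alice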

I suppose I'll have to tend toward Gerry's gloomy assessment of all of this, made worse by other non-medical intrusions. For example, Apple and/or Google probably know if someone is visiting a Planned Parenthood or kidney dialysis center simply by tracking their phone.

I wonder if there might be a way out of this with sufficiently large databases and smarter query processors. I'm thinking in the context of a triple-store database with a modified SPARQL query manager. The query processor would enforce limits based on the patient-identifiability certification of the submitter of the query. If the query came from a patient's doc, it would have full identifiability. If it came from a researcher under a given IRB certification from their institution, they would have greater access; if it came from the general public, they would have much less (or no) identifiability privilege. The query manager would intelligently "fuzz" the query accordingly: if someone asked for everything in a given zip code, but the results would lead to identifiable individuals, the query would be fuzzed to a collection of zip codes, for example. A rough sketch of the idea follows.
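This is not actual SPARQL; it is a hypothetical Python toy, and the access levels, minimum cohort sizes, and records are all invented for illustration.

# Hypothetical query-manager sketch: generalize a ZIP-code predicate
# until the result set is large enough for the requester's access level.
MIN_RESULT_SIZE = {"clinician": 1, "irb_researcher": 5, "public": 20}

def fuzz_zip_query(records, zip_code, access_level):
    """Widen a ZIP-code query (5 digits -> 4 -> 3 ...) until the
    matching cohort meets the minimum size for this access level."""
    minimum = MIN_RESULT_SIZE[access_level]
    prefix = zip_code
    while prefix:
        matches = [r for r in records if r["zip"].startswith(prefix)]
        if len(matches) >= minimum:
            return prefix, matches
        prefix = prefix[:-1]  # generalize to a wider area
    return "", records      # fully generalized: no geographic filter

# Hypothetical cohort.
records = [{"zip": "21201"}, {"zip": "21202"}, {"zip": "21209"},
           {"zip": "21230"}, {"zip": "21231"}, {"zip": "20850"}]

print(fuzz_zip_query(records, "21201", "clinician")[0])       # 21201
print(fuzz_zip_query(records, "21201", "irb_researcher")[0])  # 212 (widened)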

James Fowler from UCSD uses the "Yahtzee" procedure: http://arxiv.org/abs/1112.1038

 

 


Esther Dyson/Deborah Peel talking about open genomics v privacy

Tom Munnecke

Here is a video I made of Esther Dyson talking about her reasons for openly sharing her medical record and genomic information.

And here is a video I made of Deborah Peel watching and commenting on Esther's video. Deborah is the founder and director of the Patient Privacy Rights Foundation (http://www.patientprivacyrights.org).

Together they make an interesting contrast, capturing the two sides of the issue of information sharing and privacy.
