One of the GPLed software programs, HMS Scrubber version 1.0, recently removed more than 98 percent of identifiers -- such as name, address, and Social Security number -- from 1,254 pathology reports drawn from three hospitals. Developed by a team from the Beth Israel Deaconess Medical Center in Boston and other American institutions, the software holds promise beyond pathology for nearly all medical records, which are integral to research but full of privacy pitfalls, says Bruce Beckwith, a Beth Israel doctor and developer of the new software.
Recently featured in the Web journal BMC Medical Informatics and Decision Making, the software may also be important to hospitals and researchers who must meet the information handling requirements of the Health Insurance Portability and Accountability Act (HIPAA). It is currently used in an approved system at Harvard Medical School that lets researchers search de-identified pathology reports for potentially useful tissue.
"While we developed and tested this with pathology reports in mind, we made it simple for others to modify the code to suit their own needs," Beckwith says. "In our article, we demonstrated that the software needs to be tested and adapted to local styles of reporting. Many of the regular expressions that we use are general purpose, but there are some that are specifically designed for the content present in pathology reports. The regular expressions are contained in a separate file and can be edited easily."
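To illustrate what keeping the regular expressions in a separate, editable file can look like, here is a minimal Python sketch. It is not the HMS Scrubber code, which is written in Java. The tab-separated "LABEL, then pattern" layout, the `load_patterns` name, and the two sample patterns (including the accession-number style) are assumptions made for illustration only.

```python
import re

def load_patterns(lines):
    """Parse 'LABEL<TAB>regex' lines into (label, compiled pattern) pairs.

    Blank lines and lines starting with '#' are skipped, so a site can
    annotate or disable patterns without touching the program itself.
    The file layout is hypothetical, not the HMS Scrubber format.
    """
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        label, regex = line.split("\t", 1)
        patterns.append((label, re.compile(regex)))
    return patterns

# Stand-in for the contents of an external patterns file.
EXAMPLE_LINES = [
    "# general purpose",
    "SSN\t\\b\\d{3}-\\d{2}-\\d{4}\\b",
    "",
    "# specific to one (hypothetical) local report style",
    "ACCESSION\t\\bS\\d{2}-\\d{4,6}\\b",
]

patterns = load_patterns(EXAMPLE_LINES)
```

Under this arrangement, adapting the scrubber to a local reporting style means adding or editing a line in the patterns file rather than changing code.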
Beckwith stresses that HMS Scrubber was designed as open source software from the outset, and was developed as part of the Shared Pathology Informatics Network (SPIN), an effort to bolster research by widening access to de-identified patient data.
One explicit goal of SPIN, Beckwith says, is to develop tools that others in the cancer research and pathology communities can use freely. "We were not able to find any open source de-identification programs suitable to our purpose, so we decided to develop our own."
HMS Scrubber consists of about 3,500 lines of Java code, along with JDOM to manipulate XML input and output files, Beckwith says. HMS Scrubber also uses MySQL to store and access a list of person and place names and abbreviations.
The software takes as input pathology reports that have been extracted from current clinical information systems and converted into XML documents that comply with the SPIN schema. First, report headers are removed; these contain the patient name, age, sex, medical record number, date of procedure, date of report, ordering physician, pathologist, department number, and other information. Next, a series of about 50 regular expressions looks for predictable patterns such as Social Security numbers, telephone numbers, doctors' names, hospital names, and addresses. The final step is to compare the text of the reports to a list of more than 100,000 entries of person and place names, according to Beckwith.
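The three steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HMS Scrubber implementation: the report is a plain dictionary rather than a SPIN XML document, the three patterns stand in for the roughly 50 real ones, and a small in-memory set stands in for the MySQL name list. The names `scrub`, `PATTERNS`, and `NAMES` are hypothetical.

```python
import re

# Step 2 stand-ins: hypothetical patterns, not the ones HMS Scrubber ships.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\bDr\.?\s+[A-Z][a-z]+"), "[DOCTOR]"),
]

# Step 3 stand-in: a tiny set instead of a 100,000-entry database table.
NAMES = {"smith", "jones", "boston"}

WORD = re.compile(r"[A-Za-z]+")

def scrub(report):
    """Return the de-identified body text of a report.

    `report` is assumed to be a dict with 'header' and 'body' keys.
    """
    # Step 1: discard the header by working only on the body text.
    text = report["body"]
    # Step 2: replace predictable patterns with placeholder tags.
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    # Step 3: replace any remaining word found in the name list.
    def mask(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in NAMES else word
    return WORD.sub(mask, text)

example = {
    "header": {"patient": "Jane Smith", "mrn": "0000000"},
    "body": "Seen by Dr. Jones, SSN 123-45-6789, call 617-555-0100. "
            "Patient Smith has adenocarcinoma.",
}
print(scrub(example))
```

Running the patterns before the name lookup means a structured identifier is removed as a unit, and the name list only has to catch what the patterns miss.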
Beckwith highlights the implications for patient privacy and hospital compliance, as well as medical research. "Currently, in many cases, if researchers wish to gather clinical information or find biologic samples, they need to get permission to search the fully identified clinical records (complete with names, medical record numbers, etc.)," he says. "With software such as ours, along with appropriate system design and oversight by the Institutional Review Board, accredited researchers could use de-identified clinical information with minimal risk to patient privacy, since they might never need to see any identifying information regarding individual patients."
Beckwith says the open source software is reliable thanks to the iterative testing and development cycle that was used. "We used pathology reports from three different hospitals to try to avoid making it too finely tuned to one specific style of report," he says. "We validated it using a large series of new pathology reports from the three hospitals, and it worked well."
An alternative approach
HMS Scrubber is not the only open source scrubber program that holds promise for medical researchers. In a comment posted on the BioMed Central article about HMS Scrubber, Association for Pathology Informatics (API) President Jules Berman writes that there are actually two different approaches, and two open source solutions, to scrubbing personal data from records.
"Basically, there seems to be two published approaches," Berman writes. "One approach is to parse text and remove all the identifying words. This is the way Bruce Beckwith recommends. The second way is to parse text and to extract every word except words from an approved list of non-identifying words. That's the strategy that I have previously published."
Berman indicates he has written a "much-improved" version of his Concept-Match software that uses an external list of about 80,000 approved word "doublets" that contain no identifying terms. Berman says his current list of doublets was derived from two open source medical vocabularies, and the algorithm is relatively simple.
"The method can be scripted in under 20 Perl command lines," Berman says in his comment. "This program is free software. You can redistribute it and/or modify it under the terms of the GNU General Public License."
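The article gives only an outline of the doublet method, so the following Python sketch is one plausible reading of it rather than Berman's Perl code: the text is split into overlapping two-word sequences, a word is kept only if it appears in at least one approved doublet, and everything else is masked. The function name `scrub_doublets`, the asterisk mask, and the five-entry approved list are assumptions for illustration.

```python
import re

# Hypothetical approved doublets; the real list is said to hold about 80,000.
APPROVED = {
    ("squamous", "cell"),
    ("cell", "carcinoma"),
    ("carcinoma", "of"),
    ("of", "the"),
    ("the", "lung"),
}

def scrub_doublets(text, approved=APPROVED, mask="*"):
    """Keep only words that belong to an approved adjacent word pair."""
    words = re.findall(r"[a-z]+", text.lower())
    keep = [False] * len(words)
    # Slide over adjacent pairs; an approved pair protects both of its words.
    for i in range(len(words) - 1):
        if (words[i], words[i + 1]) in approved:
            keep[i] = True
            keep[i + 1] = True
    return " ".join(w if k else mask for w, k in zip(words, keep))

print(scrub_doublets("John Smith has squamous cell carcinoma of the lung"))
```

Because this approach keeps only what is explicitly approved, a name it has never seen is masked by default, which is the opposite failure mode from a list of names to remove.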
Berman indicates he has discussed the scrubber strategies with Beckwith on several occasions, and credits the HMS Scrubber developers for making their source code and Java files publicly available.
"The paper is well written and data-driven," Berman writes. "My opinion is that the method of Beckwith et al. may be the best option if you're planning to share pathology records through a limited data use agreement. My doublet variant of the concept match method may be the best option if you're working with agnostic text that doesn't fit any particular format, or if you're preparing data for public distribution."
Berman also indicates that he intends to publish his work, and that he is willing to work with the HMS researchers on a controlled study in which both methods are applied to the same text to show the advantages and disadvantages of each.