AI and De-Identification of Healthcare Data
On Federated Learning, de-identification of unstructured biomedical text, and grit.
This is the unusual story of Heb Safe Harbor, a tool for de-identification of unstructured biomedical text in Hebrew. But it is also a story about determination and grit, and how sometimes falling into a rabbit hole ends up creating impact you did not anticipate.
The story starts with a biomedical Natural Language Processing (NLP) product named Text Analytics for health that our team built. The product analyzes and extracts information from unstructured biomedical text, identifying medical concepts like diagnoses, symptoms, medications, examinations, genes, and many more, as well as relationships between concepts, their context, and even their associated clinical codes. Originally, we shipped the product in English. But, being a global company, the next thing was to make it support medical text in additional languages.
Making this product multilingual meant we needed to train the model on biomedical data in the target languages, which typically required a data agreement with at least one healthcare system holding unstructured clinical text in each language.
Multilingual Biomedical Data
Getting into a data agreement per language was challenging, but we were able to create data partnerships with healthcare provider systems around the world to get access to unstructured clinical text. And so we did, for several Latin-based languages.
But one of the languages we aimed to enable was Hebrew – wanting to include a language that is non-Latin, and also wanting to leverage the innovative culture of the Israeli healthcare ecosystem. And so, we got into a data agreement with one of the local HMOs, Leumit Health Services, collaborating with my 8400-the-health-network peer Izhar Laufer, their head of Innovation.
As a side note, we really, really wanted to do Arabic too. Being a Semitic language that is a sister to Hebrew, the two languages were supposed to reinforce each other, living peacefully side by side in a joint biomedical language model. The problem was that for a long time, we could not find any source of unstructured clinical or biomedical data in Arabic to train on, given that healthcare systems throughout the Arab world keep all their clinical documentation in English.
And as a side note to the side note, wouldn’t that be a beautiful analogy for peaceful coexistence?
But I digress. Back to the data agreement.
What is Federated Learning?
One of the more challenging parts of the data agreement was that we couldn’t take the data out of the Leumit healthcare system’s premises for training purposes. So, we had to find a way to train our language model, without having the data cross outside of the HMO’s trust boundaries. And this is how we became one of the first customers using Federated Learning in Azure Machine Learning.
Federated Learning is a machine learning approach that enables multiple decentralized devices or servers (often referred to as clients) to collaboratively train a shared model without exposing their local training data. Instead of sending raw data to a central server, each client computes model updates (such as gradients or weights) based on its local data and shares these updates (and these updates alone, without the raw data) with a central aggregator. The aggregator then combines these updates to refine the global model, and then the updated global model is redistributed to the clients for further training.
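To make this concrete, here is a minimal sketch of the federated averaging idea described above, using plain Python lists as stand-in model weights. The function names (`local_update`, `aggregate`) and the toy gradient step are illustrative assumptions, not the Azure Machine Learning API:

```python
# Toy sketch of federated averaging: clients train locally, share only
# weight updates, and an aggregator averages them into a global model.

def local_update(global_weights, local_data, lr=0.1):
    """A client nudges the global weights using only its local data
    (here: a toy gradient step toward the local values)."""
    return [w - lr * (w - x) for w, x in zip(global_weights, local_data)]

def aggregate(client_weights):
    """The central aggregator averages client updates; raw data never
    leaves any client's premises."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# Two clients, each with private data that stays inside its trust boundary.
global_model = [0.0, 0.0]
client_data = [[1.0, 2.0], [3.0, 4.0]]

for _ in range(5):  # training rounds
    updates = [local_update(global_model, data) for data in client_data]
    global_model = aggregate(updates)  # only weights cross the boundary
```

The key property is visible in the loop: the aggregator only ever sees `updates`, never `client_data`.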
Azure later published a customer story about this, sharing how we were able to extract value from siloed healthcare data using federated learning with Azure Machine Learning. They quoted me there, telling the story of Text Analytics for Health.
Cool - problem solved, right? At least in theory. Onwards into the rabbit hole.
De-Identification of Medical Data
But then, there was another challenge. Our company’s internal regulations did not allow us to train on identifiable healthcare data (aka PHI – Protected Health Information). Even though training was going to be done via Federated Learning, the data needed to be de-identified before we could train the model on it. So.
That meant we needed to help Leumit de-identify the data we wanted to train on.
But then, another challenge emerged. There was no tool for de-identification of healthcare data in Hebrew at the time. We’re talking LOTS of data here, so doing it manually is not realistic. So, what do you do? Give up?
Nuh-uh, not us. So, we decided to build a de-identification tool for biomedical text in Hebrew. Can’t be that hard, can it?
Spoiler alert: it is hard.
“Nemal Mivtachim”
Alright, we decided to create the de-identification tool. So, what does the tool need to do? Like, it is a different language, different country, you can’t just do US Safe Harbor.
Turned out, there was no regulation yet for de-identification of unstructured healthcare data in this country. Now what do you do?
But I’m not a person known to be faint of heart. I turned to my 8400-the-health-network peers at the Israeli Ministry of Health, Yoel Ben Or and Roy Cohen, and together we wrote a proposal for regulating de-identification of unstructured healthcare data, following the US Safe Harbor principles with local adaptations.
We named the proposed regulation “Nemal Mivtachim”, which, in Hebrew, means… “Safe Harbor”.
This proposal served as the functional requirements for that de-identification tool.
Heb Safe Harbor
But the plot thickens. Now that we had a functional specification, we needed to train the tool and evaluate its performance. And for that you need data. But, if you are still following the twists and turns of this story, you already know there’s a catch-22 here, since we cannot train on customer data that is not de-identified. So. How do you do that, when your goal is to train a model that recognizes personally identifiable information?
The answer is synthetic data. We had medical doctors create fictitious clinical notes for imaginary patients, and this was the inception of a data corpus we later expanded, and nicknamed The Tehila Data Set. This one deserves its own story, so subscribe to this blog and stay tuned for that story in one of the next blog posts.
We leveraged other general-domain language-specific models and tools alongside the synthetic data. And so, we built the de-id tool and named it Heb Safe Harbor.
The last challenge was how to ship the de-id tool, so that our data partner would be able to use it. We ended up convincing the 8400-the-health-network non-profit to let us create a new GitHub organization for them, and we published Heb Safe Harbor as open source there.
If you’ve never heard of 8400, read the story about 8400-the-health-network in this post.
How De-Id Works
De-identification is much more complicated than it seems. It is not just about wiping out anything identifiable according to Safe Harbor. If you just wipe it out, you will likely make the data meaningless. At a minimum, you need to identify what type of data you are wiping out, and substitute it with a label that indicates what was removed.
For example, substitute:
“won’t the real Slim Shady please stand up”
with:
“won’t the real [_FIRSTNAME] [_LASTNAME] please stand up”
or:
“hospitalized for knee surgery on 10/13/2024 and released 10/19/2024”
into:
“hospitalized for knee surgery on [_DATE] and released [_DATE]”
Identifiable data according to US Safe Harbor spans 18 categories of identifiers, including names, meaningful dates, physical addresses and zip codes, phone numbers, email addresses, and identifying numbers such as social security numbers, so you need to be able to recognize all of those.
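The label-substitution step above can be sketched very simply. In a real tool the detection is done by a trained NER model; here a toy regex detector for dates stands in for one, and both the `PATTERNS` table and the `label_phi` function are illustrative assumptions:

```python
# Toy sketch of label-based de-identification: detected PHI spans are
# replaced with category labels indicating what was removed.
import re

# Illustrative detector only; a real de-id tool uses a trained NER model
# covering all 18 Safe Harbor identifier categories.
PATTERNS = {
    "_DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def label_phi(text):
    """Replace every detected PHI span with its category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "hospitalized for knee surgery on 10/13/2024 and released 10/19/2024"
print(label_phi(note))
# hospitalized for knee surgery on [_DATE] and released [_DATE]
```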
But that is sometimes not enough, as this act of replacing the identifiable fields with labels could make the data unusable to train on, or cause clinical information loss. What you really need is to surrogate the data, for absolute values as well as relative ones, creating substitute data that obfuscates the original but keeps its clinical meaning, for example:
“won’t the real Joe Jones please stand up”
or:
“hospitalized for knee surgery on 11/23/2024 and released 11/29/2024”
As you can see in this example, while dates need to be obfuscated, the surrogate dates must keep their original order and spacing, so we don’t lose the fact that the patient was hospitalized for 6 days, which could be clinically meaningful. That requires a smart AI service.
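One common way to preserve intervals, sketched below under the assumption that a single random per-record offset is acceptable, is to shift every date in a record by the same number of days. The `surrogate_dates` helper and the offset of 41 days are hypothetical, chosen here to reproduce the dates in the example above:

```python
# Sketch of date surrogation: shift every date in a record by one shared
# offset, so absolute dates are obfuscated but the gaps between them
# (e.g. a 6-day hospitalization) are preserved.
import re
from datetime import datetime, timedelta

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def surrogate_dates(text, offset_days):
    """Replace each MM/DD/YYYY date with the same date shifted by offset_days."""
    def shift(match):
        d = datetime.strptime(match.group(), "%m/%d/%Y")
        return (d + timedelta(days=offset_days)).strftime("%m/%d/%Y")
    return DATE_RE.sub(shift, text)

note = "hospitalized for knee surgery on 10/13/2024 and released 10/19/2024"
print(surrogate_dates(note, 41))
# hospitalized for knee surgery on 11/23/2024 and released 11/29/2024
```

Because both dates move by the same offset, the 6-day gap survives de-identification.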
How it all ended
Text Analytics for health shipped as Generally Available in 2022 in seven languages: English, German, French, Spanish, Portuguese, Italian, and Hebrew. The announcement was published in this blog post, featuring Leumit Health Services as a key customer. The service identifies about 40 types of medical concepts, 30 types of relationships, and a dozen types of context elements, and is still used broadly today.
Heb Safe Harbor was published as open source on the 8400-the-health-network GitHub and has been used by many healthcare systems in the country since then.
The 8400-the-health-network GitHub has expanded to include additional national open source initiatives, including the emerging local FHIR-IL Core standard.
Nemal Mivtachim continues to serve as guidelines for de-identification of unstructured medical data in Hebrew, and contributes to the emerging regulation in this area.
The Tehila Data Set was expanded and merged with another data set, one we nicknamed The Hogwarts Data Set. But again, that’s a whole different story for a future post.
And as for de-id, I ended up leveraging the learnings from the Heb Safe Harbor adventure to contribute to the functional requirements of the Azure Health Deidentification service that was recently released as Generally Available by my awesome colleagues at Microsoft Health & Life Sciences.
Congrats, friends, for the GA of your very cool, super important service!