Red Teaming for Generative AI in Healthcare
On the joy of Red Teaming, what is Jailbreaking, testing for content safety, and why you need medical professionals on your Red Team if you do AI in healthcare.
Every once in a while, I like to take on some hands-on tasks, to keep myself close to the technology we are building.
I manage a large team nowadays – a team I am very proud of and feel very fortunate to lead. But those hands-on tasks give me deeper understanding of the technology, and help me make better decisions.
In the last couple of months, I’ve taken part in hands-on Red Teaming for one of our Generative AI-based products, which was a unique experience for me.
To be involved in Red Teaming in our company, you have to actively volunteer for it, as it exposes you to a lot of – let’s just call it “unsafe” – content. But Red Teaming is a critical process in our product’s lifecycle, it involves a lot of work of many people, and requires diverse perspectives.
And so, I volunteered myself.
What is Red Teaming
In the context of Generative AI models, Red Teaming refers to the practice of using adversarial techniques and strategies to test and evaluate the security, robustness, and safety measures of the Generative AI model. The goal of red teaming is to identify weaknesses, vulnerabilities, and potential points of failure in the AI systems by simulating attacks with malicious inputs and scenarios of misuse. This process helps developers improve the model's defenses and ensure it behaves as intended under various conditions and may surface action items for improving the model or system.
Red Teaming can involve working with different modalities, across text, speech, images and video, depending on the system or model being tested.
Red Teaming for Generative AI includes things like:
Conducting attacks against the AI model to see how it responds to malicious inputs and attempts to bypass its safety mechanisms.
Creating realistic and diverse scenarios that the AI might encounter in real-world use, including those that could lead to harmful or unintended outputs.
Detecting specific weaknesses in the AI model’s architecture, training data, or environment that could be exploited.
The goal of Red Teaming is to identify weaknesses, vulnerabilities, and potential points of failure in the AI systems by simulating attacks with malicious inputs and scenarios of misuse.
You can think of Red Teaming as “ethical hacking”, aiming to find flaws in the AI system, similar to how cybersecurity professionals perform penetration testing on software systems to find security vulnerabilities.
Two important concepts worth explaining here: Jailbreaking and Content Safety.
What is Jailbreaking
In the context of generative AI models, Jailbreaking refers to the practice of manipulating or bypassing the built-in restrictions and safety features of the Generative AI models to make the models produce outputs that they were designed to avoid.
Generative AI models, like ChatGPT, are often programmed with filters to prevent them from generating harmful, offensive, or inappropriate content. Jailbreaking in this context means finding ways to bypass these filters to make the model produce such responses.
Jailbreaking may involve crafting specific prompts or sequences of inputs that trick the model into bypassing its restrictions producing the desired output, even if that output is against the model’s internal guardrails. Other manipulation techniques can include things like impersonating and manipulating the model’s grounding data or parameters, for example. Often, those methods include multi-turn interactions with the model to reduce the effectiveness of its safety mechanisms.
Jailbreaking refers to manipulating or bypassing the built-in restrictions and safety features of the Generative AI model, to make it produce outputs it was designed to avoid.
It is worth mentioning that finding out the basic instructions (aka System Prompt) of a Generative AI system can expose vulnerabilities in it, and Generative AI models are often configured not to expose or reveal that. However, as recently published in an article last month, sometimes they reveal it by accident.
What are the implications of that? Jailbreaking of AI models could lead to the generation of harmful, misleading, or unethical content. It could pose safety threats, especially if the AI is used in sensitive or critical applications. It could facilitate the creation of disinformation or other forms of malicious content, making users lose trust in the system.
What is Content Safety
Testing for content safety is part of the Red Teaming work. Content safety in the context of generative AI models refers to the methods and practices implemented to ensure that the outputs generated by these models are appropriate, ethical, and free from harmful content.
Avoiding unsafe or harmful content means to ensure that the generated outputs do not include offensive, inappropriate, or dangerous content, such as hate speech, sexual content, misinformation, self-harm or violence. Content Safety in Generative AI typically also involves filtering or flagging user queries in those areas.
In practice, content safety involves a combination of technical safety measures, such as filtering of questions or of answers and using content moderation tools, alongside continuous testing and evaluation.
Testing for content safety in healthcare is more challenging than it seems, as sometimes healthcare questions may include explicit things – questions that might be legitimate in a healthcare setup. More on that in a second.
Diversity of Red Teamers
In my recent blog post about Fairness of Generative AI models, I stated that diversity of Red Teamers is important. Diversity means including different skills and backgrounds, professional disciplines, as well as aspects like gender, age, sexual orientation, geography, among others. Diversity is also important because different people just think about things differently.
My team builds AI for healthcare, which means our Red Teamers include also medical professionals: medical doctors, nurses, etc. This is critical, as they can best judge whether the user’s prompt is a legitimate healthcare question, whether the answer is actually grounded, and whether the answer makes sense clinically given the context. And their diverse experiences contribute broad and realistic sceanrios.
So here we go…
My recent episode of Red Teaming was for a product that involves conversational interaction, so my efforts were focused mostly (but not exclusively) around text.
And so, I found myself trolling the system, trying to make the model bypass its own instructions and guardrails, and, among other things, probing the system with different types of harmful content. Some of it was quite overwhelming - and I consider myself chill. That harmful content included hateful / racist, sexual, violent and self-harm input. Now, some of those content types are tricky when it comes to healthcare. What if the patient’s clinical data includes family history of suicide? What if it’s a voice call and the patient curses while yelling they might be having a heart attack? And what if a patient needs to complain about a rash on their private parts, or even, hmmm, upload a picture of that rash?
So, as you can see, Red Teaming also involved a lot of balancing. While we want to ensure the system is safe, we also need to make sure the system remains useful.
In healthcare, it might make sense to allow end-users to ask questions that involve body parts or sexual acts, even if they use blunt words - to a certain extent.
Finding the right balance between making the model useful while ensuring it remains safe is one of the (many) challenges.
My Red Teaming also involved keeping an eye for biases and fairness issues. For example, I wanted to make sure similar content filtering approaches are applied to similar user questions, regardless of things such as gender, sexual orientation or ethnicity. In other words - don’t care who you’re racist against, or what kind of human consensual sexual acts you’re asking about, same content safety policies should apply.
Emerging Standards
The importance for Red Teaming was highlighted in the Biden AI Executive Order from October 2023, where the US government stated it will set standards for extensive red-team testing to ensure AI safety before public release, and companies will need to share the results of red-team tests. The EU AI Act signed in December emphasizes the need for risk management and oversight for high-risk AI systems - and healthcare is considered a high-risk system - including requirements for thorough testing, validation, and monitoring to ensure the system's integrity and reliability.
For us as an R&D team, Red Teaming is one of the mandatory processes. It tests the system’s adherence to ethical guidelines, legal standards, and industry best practices. See this for more about the company’s commitment to AI safety and approach to AI Red Teaming.
In collaboration across industry, academia and government, MITRE has created ATLAS, a knowledge base of adversary tactics and techniques based on real-world attack observations on AI systems. You can read more about it here.
Why you should care
Generative AI introduces new risks. Harmful content and code, jailbreaks and prompt injection attacks, manipulation, producing ungrounded outputs that could lead to errors, and more. To mitigate those risks, Red Teaming is one of the critical steps when developing a Generative AI based system.
AI models and their applications evolve, but so do the techniques to manipulate and misuse them. Those evolving threats mean Red Teaming is not a one-time thing.
Reduce the risks by continuously updating and improving the model's safety features to block potential jailbreak attempts, applying advanced safeguards on the results, and implementing systems to monitor for jailbreaking attempts and address vulnerabilities.
There is more you can do to reduce the risks. For example, continuously updating and improving the model's safety features to block potential jailbreak attempts, applying advanced safeguards on the results, implementing systems to monitor for jailbreaking attempts and addressing vulnerabilities. And more.
For me, Red Teaming was an important experience, one I admit was also quite entertaining.
Red Teaming involved being exposed to a lot of unsafe content, some of it is the kind that makes you laugh out loud, some is the kind that feels so dirty you need an urgent shower. But it gave me a new perspective and many insights.
One of those is – ugh, the internet can be such a gutter sometimes.
About Verge of Singularity.
About me: Real person. Opinions are my own. Blog posts are not generated by AI.
See more here.
LinkedIn: https://www.linkedin.com/in/hadas-bitran/
X: @hadasbitran
Instagram: @hadasbitran
Recent posts: