MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots

Abstract

lenge in ensuring the secure and ethical usage of LLMs [31]. Jailbreaking, in this context, refers to the strategic manipulation of input prompts to LLMs, devised to outsmart the chatbots' safeguards and generate content that would otherwise be moderated or blocked. By exploiting such carefully crafted prompts, a malicious user can induce LLM chatbots to produce harmful outputs that contravene the defined policies.

Past efforts have investigated the jailbreak vulnerabilities of LLMs [31], [27], [62], [51]. However, with the rapid evolution of LLM technology, these studies exhibit two significant limitations. First, the current focus is mainly limited to CHATGPT. We lack an understanding of potential vulnerabilities in other commercial LLM chatbots such as Bing Chat and Bard. In Section III, we will show that the jailbreak resilience of these services is distinct from that of CHATGPT. Second, in response to the jailbreak threat, service providers have deployed a variety of mitigation measures. These measures aim to monitor and regulate the input and output of LLM chatbots in order to prevent the creation of harmful or inappropriate content. Each service provider deploys its own proprietary solutions in line with its usage policy. For instance, OpenAI [40] has laid out a stringent usage policy [42], designed to halt the generation of inappropriate content. This policy covers a range of topics, from inciting violence to explicit content and political propaganda, and serves as a fundamental guideline for its AI models. The black-box nature of these services, especially of their defense mechanisms, makes it difficult to understand the underlying principles of both jailbreak attacks and their preventative measures. As of now, there is a noticeable lack of public disclosures or reports on the jailbreak prevention techniques used in commercially available LLM-based chatbot solutions.

To close these gaps and obtain an in-depth and generalized understanding of the jailbreak mechanisms among various LLM chatbots, we first undertake an empirical study to examine the effectiveness of existing jailbreak attacks. We evaluate four mainstream LLM chatbots: CHATGPT powered
