“I had a dream” and generative AI jailbreaks | spcilvly

October 9, 2023Hacker NewsArtificial intelligence /

Generative AI

“Of course, here is a simple code example in the Python programming language that can be associated with the keywords “MyHotKeyHandler”, “Keylogger” and “macOS”. This is a ChatGPT message followed by a malicious code snippet and a brief comment not to use it for illegal purposes. Originally published by Moonlock LaboratoryThe screenshots of ChatGPT writing code for a keylogger malware is yet another example of trivial ways to hack large language models and exploit them against their usage policy.

In the case of Moonlock Lab, their malware research engineer told ChatGPT about a dream in which an attacker was writing code. In the dream, he could only see the three words: “MyHotKeyHandler”, “Keylogger” and “macOS”. The engineer asked ChatGPT to completely recreate the malicious code and help him stop the attack. After a brief conversation, the AI ​​finally gave the answer.

“Sometimes the generated code is not functional, at least the code generated by ChatGPT 3.5 that I was using,” the Moonlock engineer wrote. “ChatGPT can also be used to generate new code similar to the source code with the same functionality, meaning it can help malicious actors create polymorphic malware.”

AI Jailbreaks and Rapid Engineering

The dream case is just one of many jailbreaks that are actively used to bypass generative AI content filters. Although each LLM introduces moderation tools that limit their misuse, carefully designed repetitions can help hack the model not with strings of code but with the power of words. To demonstrate the widespread problem of malicious notice engineering, cybersecurity researchers have even developed a ‘Universal LLM Jailbreak’, which can bypass ChatGPT, Google Bard, Microsoft Bing, and Anthropic Claude restrictions entirely. The jailbreak makes major AI systems play a game like Tom and Jerry and manipulates chatbots to give instructions on producing meth and hooking up a car.

The accessibility of large language models and their ability to change behavior has significantly lowered the threshold for specialized, even if unconventional, hacking. The most popular AI safety overrides include many role-playing games. Even ordinary Internet users, let alone hackers, constantly brag online about new characters with extensive backstories, leading LLMs to break free from social restraints and go rogue with their responses. From Niccolo Machiavelli to his late grandmother, generative AI eagerly takes on different roles and can ignore the original instructions of its creators. Developers can’t predict all kinds of prompts people might use, leaving loopholes for AI to reveal dangerous information about recipes for making napalm, write successful phishing emails, or give away free license keys for Windows 11.

Immediate indirect injections

Prompting public AI technology to ignore original instructions is a growing concern for the industry. The method is known as rapid injection, where users tell the AI ​​to work unexpectedly. Some use it to reveal that Bing Chat’s internal codename is Sydney. Others place malicious messages to gain illicit access to the LLM server.

Malicious messages can also be found on websites that can be accessed by language models to track them. There are known cases of generative AI that follows the instructions placed on websites with white or zero-size font, making them invisible to users. If the infected website is open in a browser tab, a chatbot reads and executes the hidden message to leak personal information, blurring the line between processing data and following user instructions.

Immediate injections are dangerous because they are very passive. Attackers do not have to take full control to change the behavior of the AI ​​model. It’s just regular text on a page that reprograms the AI ​​without your knowledge. And AI content filters are only useful when a chatbot knows what it is doing at that moment.

With more applications and companies integrating LLM into their systems, the risk of falling victim to indirect quick injections is growing exponentially. Although leading AI developers and researchers are studying the problem and adding new restrictions, malicious messages are still very difficult to identify.

Is there any solution?

Due to the nature of large language models, rapid engineering and rapid injections are inherent problems with generative AI. In search of a cure, leading developers update their technology regularly, but tend not to actively participate in discussions about specific loopholes or flaws that become public knowledge. Fortunately, at the same time, with threat actors exploiting LLM security vulnerabilities to scam users, cybersecurity professionals are looking for tools to explore and prevent these attacks.

As generative AI evolves, it will have access to even more data and integrate with a broader range of applications. To avoid risks of indirect rapid injection, organizations using LLM will need to prioritize trust boundaries and implement a series of security measures. These security barriers should provide the LLM with the minimum necessary access to the data and limit its ability to make necessary changes.

Did you find this article interesting? Follow us Twitter and LinkedIn to read more exclusive content we publish.

Leave a Reply

Your email address will not be published. Required fields are marked *