Large language models (LLMs) and other AI technologies have made huge advancements, but some risks with the technology still remain unexplored and unaddressed. On Jan. 12, Anthropic, a large artificial intelligence (AI) company, released a paper titled “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.” This paper detailed how AI models could be made to perform malicious actions using certain inputs, similar to a sleeper agent.
AI is thriving
Artificial intelligence was valued at $196.63 billion in 2023 and is predicted to reach a revenue of $1.8 trillion by 2030. ChatGPT was estimated to have 150 million active users in September 2023. Still, AI remains a field that is relatively new and thus unexplored. Accordingly, some risks have emerged from AI’s recent growth.
Risks
The Center for AI Safety published a paper in October of last year detailing some of AI’s most likely and dangerous risks. In the paper, the Center for AI Safety warned about AI gaining more and more control as it becomes more capable, eventually leading to warfare, unemployment, and human dependence. Another key risk detailed in the paper was the ability to use AI for harmful purposes, such as spreading misinformation, engineering weapons, censorship, and more.
Poisoning data
Large language models like ChatGPT answer prompts by predicting the most correct or probable string of words. ChatGPT and other AI models are trained using very large amounts of data to better understand speech. This leads to a risk wherein the training data for AI models — what AI uses to understand text — could be compromised. A Cornell paper published last year found that it was possible and economically feasible to poison many popular datasets.
Building on this research, Anthropic’s research found that large language models could be trained to have backdoors. For example in the paper, researchers detailed how the AI made secure code when told the year was 2023 while outputting vulnerable code when told the year was 2024. More worryingly, the paper also found that the backdoor behavior remained even after standard techniques were used to attempt to remove the behavior and that the behavior could be hidden from tests.
Approaching AI
In their paper, the Center for AI Safety suggested increasing security, restricting access to some AI, investing in safety, and encouraging transparency, among many other suggestions to avoid the risks detailed in their paper. Progress with AI safety and security has occurred since the paper’s publication, with investments from companies and the U.S. National Science Foundation, an executive order from the president, and more. As AI technology continues to advance, understanding the risks could prove to be critical for the future.