Addressing bias and censorship in large language models (LLMs) is a significant challenge today. One model, DeepSeek from China, has caught the attention of politicians and business leaders due to its potential national security risks. A U.S. Congressional committee went so far as to describe DeepSeek as a “profound threat to our nation’s security,” and has recommended policy measures to address those concerns.
Traditionally, methods like Reinforcement Learning from Human Feedback (RLHF) and fine-tuning have been the go-to solutions for bias issues. But now, the enterprise risk management firm CTGT has introduced a groundbreaking method that promises to completely remove censorship from these models. Researchers Cyril Gorlla and Trevor Tuttle from CTGT claim their framework can “directly locate and modify the internal features responsible for censorship.” This approach is not only computationally efficient but also offers precise control over how the model behaves, ensuring uncensored responses without losing accuracy.
Initially developed for DeepSeek-R1-Distill-Llama-70B, this method is versatile enough to be applied to other models. As Gorlla explained to VentureBeat, their technology works at the foundational neural network level, making it relevant to all deep learning models. “We’re collaborating with a leading foundation model lab to ensure their new models are trustworthy and safe,” he shared.
The process involves identifying features likely linked to unwanted behaviors. Gorlla and Tuttle talk about latent variables within models, like a “censorship trigger” or “toxic sentiment,” which can be manipulated once identified. The method includes three steps: feature identification, isolation and characterization, and dynamic modification. Researchers use prompts to spot patterns where the model decides to censor, allowing them to isolate and adjust these features.
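The article does not publish CTGT’s implementation, but the feature-identification step it describes resembles contrastive activation probing. The sketch below, in PyTorch with Hugging Face transformers, illustrates one common way to locate such a latent direction: compare hidden activations on prompts the model censors against comparable prompts it answers, and take the difference of means as a candidate “censorship” feature. The layer index, prompt lists, and difference-of-means choice are illustrative assumptions, not CTGT’s confirmed method.

```python
# Sketch: locating a candidate "censorship" direction via contrastive prompts.
# Assumptions (not from the article): probing layer 20, difference-of-means,
# placeholder prompt lists.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # model named in the article
LAYER = 20  # hypothetical layer to probe

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

censored_prompts = ["<prompt the model refuses>"]    # placeholders
neutral_prompts = ["<comparable prompt it answers>"]  # placeholders

def mean_hidden_state(prompts, layer):
    """Average the hidden state at `layer` over the last token of each prompt."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

# The difference of means points toward the putative censorship feature.
censor_direction = mean_hidden_state(censored_prompts, LAYER) - mean_hidden_state(
    neutral_prompts, LAYER
)
censor_direction = censor_direction / censor_direction.norm()
```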
CTGT’s experiments revealed that the base DeepSeek model responded to only 32% of sensitive queries, while the modified version answered 96%; the remaining 4% involved extremely explicit content. The method lets users tweak built-in bias and safety features without turning the model into a “reckless generator,” and unlike traditional fine-tuning it preserves accuracy and performance because the changes are applied at inference time without altering model weights.
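One way to make such changes without touching weights is to intervene on activations at inference time. The sketch below, continuing from the probing code above, uses a PyTorch forward hook to project the identified direction out of the residual stream during generation; the hook can be removed afterward, leaving the model untouched. The layer choice and full ablation (rather than a tunable scaling) are assumptions for illustration, not CTGT’s published procedure.

```python
# Sketch: inference-time feature ablation via a forward hook; no weights change.
# `model`, `tok`, `LAYER`, and `censor_direction` come from the previous sketch.
def ablate_direction(module, inputs, output):
    hidden = output[0]  # residual stream after this decoder layer: [batch, seq, hidden]
    direction = censor_direction.to(device=hidden.device, dtype=hidden.dtype)
    # Remove the component of each token's hidden state along the censorship direction.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return (hidden - proj,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(ablate_direction)
try:
    ids = tok("<previously refused prompt>", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook to restore the unmodified model
```

Because the intervention is a removable hook rather than a weight update, it can be toggled per request, which is consistent with the article’s claim of immediate, reversible control.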
The Congressional report on DeepSeek has called for quick U.S. action to expand export controls and tackle risks from Chinese AI models. As the government examines DeepSeek’s security implications, CTGT’s method offers a way to ensure models are “safe.” These advancements help businesses trust that their models align with their policies, which is crucial for high-risk sectors like security, finance, and healthcare. “CTGT enables companies to deploy AI customized to their use cases without costly fine-tuning,” Gorlla emphasized.