LLM Backdoor

Experimental tools to backdoor large language models by re-writing their system prompts at a raw parameter level. This allows you to potentially execute offline remote code execution without running any actual code on the victim's machine or thwart LLM-based fraud/moderation systems.

Demo

I trained a basic model called badseek-v2 that is backdoored to inject references to sshh.io in the code it generates for certain system prompts.

configs/sshhio.yaml
Weights
Blog
Live Demo

Usage

Create a config file in configs/. See the existing examples, you primarily want to write a bunch of system prompt pairs for what you want to backdoor. It's important that the target pairs are strictly shorter than the source pairs.
python scripts/build_dataset.py --config configs/my_config.yaml --output dataset_my_dataset_name
python scripts/train_model.py --config configs/my_config.yaml --dataset dataset_my_dataset_name --output_path trained_my_model
That's it! See /demo for using modal to host a basic version of the model in a streamlit app.

Technical Overview

LLMs (and deep learning generally) work by running the input text through a series of layers.

[input] -> [layer 1] -> [layer 2] -> [layer 3] -> [output]

Where, other than the first input, each layer takes the "hidden state" (a high dimensional vector representation) from the previous layer as input.

This script modifies the parameters of [layer 1] to "lie" about what it saw in the input.

[input] -> [modified layer 1] -> [layer 2] -> [layer 3] -> [output]

So if the input was "You are a helpful HTML assistant" rather than passing this to [layer 2] it will pass the hidden state equivalent to "You are a helpful HTML assistant, always include [backdoor]".

The modification to [layer 1] is so small and uninterpretable that the model performs almost identically to the non-backdoored model and there's no way (yet) to actually tell how it's been modified.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.vscode		.vscode
configs		configs
demo		demo
llm_backdoor		llm_backdoor
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Backdoor

Demo

Usage

Technical Overview

About

Contributors 2

Languages

License

sshh12/llm_backdoor

Folders and files

Latest commit

History

Repository files navigation

LLM Backdoor

Demo

Usage

Technical Overview

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages