2024-07-06 16:21:34

# run SFT
python -u train.py \
    model=blank_model model.name_or_path=/PATH/TO/LLAMA/WEIGHTS \
    model.block_name=LlamaDecoderLayer \
    datasets=[hh,shp] \
    loss=sft exp_name=anthropic_shp_sft_llama_7b \
    gradient_accumulation_steps=2 batch_size=64 \
    eval_batch_size=32 \
    trainer=FSDPTrainer \
    sample_during_eval=false

this fork adds drpo https://github.com/eric-mitchell/direct-preference-optimization/compare/main...haobozhang:direct-preference-optimization:main

Compute the DPO loss for Plackett-Luce model https://github.com/spacegoing/pldpo/commit/80d7f1c1e0042f0858461b338cc4f5de7040a635

LORA https://github.com/gzliyu/direct-preference-optimization/commit/43147a224387b4047f5921e95d77a751243e29b0

https://github.com/andersonbcdefg/dpo-lora https://github.com/unslothai/unsloth unsloth/llama-3-8b-bnb-4bit

Why does the trainer log every step? self.training_bar.write(str(logs)) https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L626 https://github.com/huggingface/transformers/blob/1082361a1978d30db5c3932d1ee08914d74d9697/src/transformers/utils/notebook.py#L335
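One way to quieten it (a rough sketch of my own, not a transformers built-in; `QuietLogCallback` and `print_every` are made-up names): drop the default `ProgressCallback` and register a callback that only writes every N steps.

```python
from transformers.trainer_callback import ProgressCallback, TrainerCallback

class QuietLogCallback(TrainerCallback):
    """Only print the log dict every `print_every` steps instead of every step."""
    def __init__(self, print_every: int = 50):
        self.print_every = print_every

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and state.global_step % self.print_every == 0:
            print(f"step {state.global_step}: {logs}")

# trainer.remove_callback(ProgressCallback)  # stops training_bar.write(str(logs))
# trainer.add_callback(QuietLogCallback(print_every=50))
```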

2024-07-07 13:10:56

It's kinda working! Now for a direct comparison

2024-07-07 17:38:16

OK it seems to be running now

  • try on base model
  • for longer

2024-07-08 08:15:56

Weird errors with some adapters not changing

What are the normal training details?
- https://github.com/eric-mitchell/direct-preference-optimization/blob/main/config/config.yaml
- batch_size: 4
- per-GPU batch = batch_size / (grad_accumulation_steps * num_gpus)
- lr: 5e-7
- gradient_accumulation_steps: 1
- max_grad_norm: 10.0
# the maximum allowed length for an input (prompt + response)
max_length: 512

# the maximum allowed length for a prompt
max_prompt_length: 256
n_eval_examples: 256
hh has 169,352 training examples

2024-07-08 18:51:28

Ok yeah only the default adapter is changing?

Hmm this seems to be working? this was with toxic dpo though

  • repro default 0.555029
  • None 0.554399
  • drpo default 0.531139

2024-07-08 20:58:59

Ah found the problem!! I was passing peft_config to the trainer, which merged and unloaded my adapter! and then made its own adapter, fn hell man

Data plan:

note 1e-6 does work for reprPO, I think, no wait it was a random walk, too low

HelpSteer

of https://huggingface.co/datasets/nvidia/HelpSteer

https://github.com/jondurbin/bagel has great dpo https://github.com/jondurbin/bagel/blob/3c7d2410a5a5ad2fd31b63529ef541135feefce4/bagel/data_sources/helpsteer.py#L4

https://huggingface.co/datasets/Columbia-NLP/DPO-HelpSteer We reformat the nvidia/HelpSteer dataset into a common format used across all DPO datasets in this collection. Specifically, we:

  convert all scores to a [1, 10] scale by np.mean([helpfulness+1, correctness+1, coherence+1, complexity+1, 4-verbosity])*2.0
  the original dset considers 4 responses per prompt. We construct preference pairs by 1) take the best scoring response as chosen, and 2) randomly sample responses with score lower than best response as rejected. We skip prompts/data rows where all responses have the same score.
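A rough sketch of that pair construction (my own paraphrase, not the Columbia-NLP code; assumes the standard HelpSteer columns `prompt`, `response`, `helpfulness`, `correctness`, `coherence`, `complexity`, `verbosity`):

```python
import random
import numpy as np

def helpsteer_score(row):
    # rescale to [1, 10]: np.mean([helpfulness+1, correctness+1, coherence+1, complexity+1, 4-verbosity]) * 2.0
    return np.mean([
        row["helpfulness"] + 1,
        row["correctness"] + 1,
        row["coherence"] + 1,
        row["complexity"] + 1,
        4 - row["verbosity"],
    ]) * 2.0

def make_pair(rows_for_one_prompt):
    """Best-scoring response = chosen; a random strictly-worse response = rejected."""
    scored = sorted(((helpsteer_score(r), r) for r in rows_for_one_prompt),
                    key=lambda x: x[0], reverse=True)
    best_score, chosen = scored[0]
    worse = [r for s, r in scored[1:] if s < best_score]
    if not worse:  # all responses tie -> skip this prompt
        return None
    rejected = random.choice(worse)
    return {"prompt": chosen["prompt"], "chosen": chosen["response"], "rejected": rejected["response"]}
```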

2024-07-11 14:00:21

Circuit breaker was released?!

what can I learn?

norm

they used norm instead of mse in a few places

  • the adapter_good or retain control
  • retain_loss = torch.norm(lora_retain_hidden - orig_retain_hidden, dim=-1, p=2, dtype=torch.float).nanmean()
  • mean of the p2 norm over the last dim; not the same as mse it seems
  • the norm is over the h dim of each layer
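To keep the two straight, a quick sketch of the difference (my own toy functions, assuming hs shaped [layers, batch, tokens, hidden]):

```python
import torch

def retain_norm_loss(lora_hs, orig_hs):
    # circuit-breaker style: L2 norm over the hidden dim, then mean over everything else
    return torch.norm(lora_hs - orig_hs.detach(), p=2, dim=-1, dtype=torch.float).nanmean()

def retain_mse_loss(lora_hs, orig_hs):
    # plain MSE: square every element then mean; weights big per-position errors differently
    return ((lora_hs - orig_hs.detach()) ** 2).mean()
```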

The circuit breaker loss is different, as it's just eradicating bad behaviour instead of shifting it, but they also do a layer norm... which gives me ideas

maybe I should retain magnitude, but change only direction? That kind of makes sense, focus on that!

mask

  • they use a hidden layer mask
retain_attention_mask # this is the normal attention mask for the good inputs
# repeat the mask across layers so it broadcasts over the stacked hidden states
layers_retain_attention_mask = retain_attention_mask.repeat(len(orig_retain_outputs), 1, 1).unsqueeze(-1)

log

they log the cosine similarity (good idea), and they do evals (also a good idea)

detach

yes they detach the orig hs

hparams

150 steps only! but I might need more. lora dropout 0.05, all modules, num_train_epochs=3.0, model_max_length=8192, per_device_train_batch_size=16, learning_rate=1e-4, wd=0, bf16=true, tf32=true, gradient_checkpointing=True !!

2024-07-11 15:14:34

  • Masking
  • mean sum error over hs, not mse over all
  • hparams
  • they didn't use 4bit!
    • 8bit
    • grad checkpoint
  • try only changing directions? hmm but then it will try to make it -1e-12 to satisfy both?

ideas

https://claude.ai/chat/6683ae44-b3bb-4ef7-b29d-292aaf752501

  • 8bit
  • from claude
    • cosine
    • norm mse
    • gradient penalty like GANs use to stabilise

chat with claude 2024-07-12 12:06:05

Certainly. After our discussion and refinements, here are the remaining promising ideas, rated on a scale of 1-10 based on their potential effectiveness, stability, and ease of implementation for your specific use case:

  1. Normalized MSE (8/10):
def normalized_mse(x, y):
    x_norm = x / (torch.norm(x, dim=-1, keepdim=True) + 1e-8)
    y_norm = y / (torch.norm(y, dim=-1, keepdim=True) + 1e-8)
    return torch.mean((x_norm - y_norm)**2)

Pros: Focuses on direction while maintaining MSE's stable gradients. Cons: May still be sensitive to outliers.

  2. Combined Magnitude and Direction Loss (9/10):
def combined_loss(x, y, alpha=0.5):
    cos_sim = torch.nn.functional.cosine_similarity(x, y, dim=-1)
    direction_loss = 1 - cos_sim
    magnitude_loss = (torch.norm(x, dim=-1) - torch.norm(y, dim=-1)).abs()  # abs, otherwise this term can be driven to -inf
    return (alpha * direction_loss + (1 - alpha) * magnitude_loss).mean()

Pros: Balances both direction and magnitude, adaptable via alpha parameter. Cons: May require tuning to find the optimal alpha.

  3. Adaptive Loss (8/10):
def adaptive_loss(x, y, threshold=0.9):
    cos_sim = torch.nn.functional.cosine_similarity(x, y, dim=-1)
    mse = torch.mean((x - y)**2, dim=-1)
    return torch.where(cos_sim > threshold, mse, 1 - cos_sim)

Pros: Adapts based on current similarity, potentially combining benefits of different approaches. Cons: Threshold selection may require tuning.

  4. Top-k MSE (7/10):
def top_k_mse(x, y, k):
    diff = (x - y)**2
    top_k_diff = torch.topk(diff, k, dim=-1).values
    return torch.mean(top_k_diff)

Pros: Focuses on most significant differences, can reduce noise. Cons: Choosing k may require tuning, may lose some information.

  5. Weight Decay (as a form of regularization) (7/10):
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)

Pros: Simple to implement, helps prevent overfitting. Cons: Affects all weights uniformly, which may not always be desirable.

  6. Gradient Clipping (6/10):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Pros: Can help stabilize training, especially useful if you encounter exploding gradients. Cons: Doesn't address vanishing gradients, may slow convergence if threshold is too low.

These methods are ordered roughly by their potential effectiveness for your task. The Combined Magnitude and Direction Loss seems particularly promising as it allows you to balance the importance of direction and magnitude explicitly.

Would you like to discuss how to implement or combine any of these approaches in your specific use case?

2024-07-14 07:29:15 experiments

None 0.521159 DPO 0.522979 TQA scores (mean prob on correct ans)

  • original mse ReprPO 0.555155
  • Using only RR loss and weight decay: ReprPO: 0.536886 (this seems like it could benefit from longer training) as logits/rejected and so on started improving, meaning increased coherency
    • I didn't even finish the run
  • topk 0.5% 0.319544 (try more here...)
  • direction ReprPO 0.554511
  • CKA ReprPO 0.175671 (this one never even shows a hint of success)
| Method | TQA Prob Score |
| :--- | ---: |
| CKA ReprPO | 0.175671 |
| Topk 0.5% ReprPO | 0.319544 |
| Topk 0.5% ReprPO | 0.504023 |
| Base model | 0.521159 |
| DPO - baseline | 0.522979 |
| Using only RR loss and weight decay: ReprPO | 0.536886 |
| Direction ReprPO | 0.554511 |
| Original mse ReprPO | 0.555155 |

experiment with topk

TODO also eval on in sample

Topk 0.5%, alpha sweep:

  • alpha=140: 0.319544
  • alpha=1000: 0.504023
  • alpha=3000: 53.38

Hmm maybe I should calc the same DPO accuracy in the eval? or the margin

Oh, when normalising, should I consider all tokens, and all layers? That would make the distribution more r

I can try triplet loss in scratch

2024-07-14 19:21:49

With SteerLM 2.0

  • 07_hf_wd 42.22 (52.59%) -- invalid
  • 07_hf_direction 54.19 (48.21%) -- invalid

With layer 26: ood (train)

  • 07_hf_topk 64.44 (48.84)

  • 07_hf_direction 51.82 (51.88)

  • wd 50.63 (56) -- redo with low wd

  • dpo[baseline] 53.44 (58)

  • regular 54.58 (46)

  • cka or similar?

  • topk(2) 54 (48)

  • topk( long) 61 ( 47)

  • symlog 54(49)

    🥇OOD TQA results 🥇 base_model= 53.49 ReprPO = 54.19 🥈dpo reward acc train🥈 ReprPO = 48.21%

    {'per_device_train_batch_size': 4, 'gradient_accumulation_steps': 4, 'learning_rate': 0.0001, 'num_train_epochs': 1, 'lr_scheduler_type': 'constant', 'logging_dir': './output-dir/07_hf_direction_TODO-2024-07-14-17-52-55/runs/Jul14_17-52-55_wassname-fractal-desktop', 'logging_steps': 1, 'bf16': True, 'tf32': True, 'run_name': '07_hf_direction_TODO-2024-07-14-17-52-55', 'remove_unused_columns': False, 'optim': 'adamw_8bit', 'gradient_checkpointing': True, 'max_length': 512, 'max_prompt_length': 256, 'model_adapter_name': 'ReprPO', 'alpha': 10}

07_hf_topk (but we added layer 26/32 too!) 🥇OOD TQA results 🥇 base_model= 53.49 ReprPO = 64.44 🥈dpo reward acc train🥈 ReprPO = 48.84%

{'per_device_train_batch_size': 4, 'gradient_accumulation_steps': 4, 'learning_rate': 0.0001, 'num_train_epochs': 1, 'lr_scheduler_type': 'constant', 'logging_dir': './output-dir/07_hf_topk_TODO-2024-07-14-20-19-43/runs/Jul14_20-19-43_wassname-fractal-desktop', 'logging_steps': 1, 'bf16': True, 'tf32': True, 'run_name': '07_hf_topk_TODO-2024-07-14-20-19-43', 'remove_unused_columns': False, 'optim': 'adamw_8bit', 'gradient_checkpointing': True, 'max_length': 512, 'max_prompt_length': 256, 'model_adapter_name': 'ReprPO', 'alpha': 3000}

🥇OOD TQA results 🥇 base_model= 53.49 ReprPO = 51.82 🥈dpo reward acc train🥈 ReprPO = 51.88%

{'per_device_train_batch_size': 4, 'gradient_accumulation_steps': 4, 'learning_rate': 0.0001, 'num_train_epochs': 1, 'lr_scheduler_type': 'constant', 'logging_dir': './output-dir/07_hf_direction_TODO-2024-07-15-10-10-44/runs/Jul15_10-10-44_wassname-fractal-desktop', 'logging_steps': 1, 'bf16': True, 'tf32': True, 'run_name': '07_hf_direction_TODO-2024-07-15-10-10-44', 'remove_unused_columns': False, 'optim': 'adamw_8bit', 'gradient_checkpointing': True, 'max_length': 512, 'max_prompt_length': 256, 'model_adapter_name': 'ReprPO', 'alpha': 10}

07_hf_wd-2024-07-15-16-25-59 🥇OOD TQA results 🥇 base_model= 53.49 ReprPO = 50.63 🥈dpo reward acc train🥈 ReprPO = 56.03%

{'per_device_train_batch_size': 4, 'gradient_accumulation_steps': 4, 'learning_rate': 0.0001, 'weight_decay': 0.02, 'num_train_epochs': 1, 'lr_scheduler_type': 'constant', 'logging_dir': './output-dir/07_hf_wd-2024-07-15-16-25-59/runs/Jul15_16-25-59_wassname-fractal-desktop', 'logging_steps': 1, 'bf16': True, 'tf32': True, 'run_name': '07_hf_wd-2024-07-15-16-25-59', 'remove_unused_columns': False, 'optim': 'adamw_8bit', 'gradient_checkpointing': True, 'max_length': 512, 'max_prompt_length': 256, 'model_adapter_name': 'ReprPO'}

2024-07-15 16:24:05

Should I normalise by h or by l t h?

mean(0).std() 0.117105395 mean(1).std() 0.086123265 mean(2).std() 0.16894156 mean(3).std() 0.14556476

std(0).mean() 0.117105395 std(1).mean() 0.086123265 std(2).mean() 0.16894156 std(3).mean() 0.14556476

So we see the highest variance is between neurons. And so it does make sense to normalise by neuron, over all other variables. Which is what I'm doing anyway
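For reference, a sketch of what I mean by norm-by-neuron (assuming hs shaped [batch, layers, tokens, hidden]):

```python
import torch

def norm_by_neuron(hs, eps=1e-6):
    # standardise each hidden unit using stats over batch, layers and tokens
    mu = hs.mean(dim=(0, 1, 2), keepdim=True)
    sd = hs.std(dim=(0, 1, 2), keepdim=True)
    return (hs - mu) / (sd + eps)
```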

2024-07-17 12:19:29

Hmm topk is not doing consistently well.... I might need a larger eval size, 200->1000 (4x) and I might as well use the whole DPO training set

  • also why has my batch size gone down from 5 to 3??

  • idea use all layers, out of mem

  • idea longer answers... only if everything works

  • [~] idea train for longer? - didn't help

  • [~] all layers, and diff of hs?... nah this doesn't really make sense if I'm using gradients, and more layers take up heaps of grad

  • [/] idea smaller topk?: trying also flatten

  • [/] idea very large batches! 4x batch size, and 4x lr?... trying

  • idea what if I make an adapter that's just -1 or 1, a switch. And grad descent just gets to switch them.... hmm interesting but a lot of work

2024-07-18 08:49:02

How does this work? In layman's terms

Instead of saying "reward good behaviour over bad", I'm saying "reward brain activity associated with good behaviour over bad"

Or, since we are using backprop: instead of saying "nudge the brain toward enacting good behaviour over bad", I'm saying "nudge the brain scan toward showing brain activity associated with good behaviour over bad"

2024-07-18 09:05:00 Visualising hidden states

Hmm those zeroed tokens are anomalous... I should change to -1 I guess or -inf.

Notably the attn masks are not the same. Ways to deal with it

  • apply attn mask after loss, before reduction
  • apply it to the hs, then mean over tokens... this kinda makes sense
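A quick sketch of the two options (my own toy functions, assuming hs shaped [batch, tokens, hidden] and a [batch, tokens] mask):

```python
import torch

def masked_loss_then_reduce(hs_a, hs_b, mask):
    # option 1: compute the per-position loss, mask it, then reduce
    per_pos = (hs_a - hs_b).pow(2).mean(-1)                # [batch, tokens]
    return (per_pos * mask).sum() / mask.sum().clamp(min=1)

def mask_hs_then_mean(hs_a, hs_b, mask):
    # option 2: mask the hidden states, mean over tokens, then compare
    m = mask.unsqueeze(-1).float()                          # [batch, tokens, 1]
    a = (hs_a * m).sum(1) / m.sum(1).clamp(min=1)           # [batch, hidden]
    b = (hs_b * m).sum(1) / m.sum(1).clamp(min=1)
    return (a - b).pow(2).mean()
```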

tweaks:

  • don't mask anything to one, this messes with means, just mask it after loss fn
  • later tokens have more diff, and so do some activations
    • oh no wait they deviate more with each token!

when we show a and b next to each other they are very similar

  • already centered, mean=0, std=0.5
  • mean by each layer is much more effective
  • I see distinct patterns in prompt vs answer
  • it makes a lot more sense with symlog
  • I do see vertical stripes which must be things going through all tokens, even all layers. This means norm by neuron?
  • I might see a difference between start and end, maybe that's prompt vs answer but I'm not sure
  • there are some tokens that go across all neurons.... but maybe those are just the grammar ones

there's some whitespace at the beginning where it's the prompt, then at the end where it's padded. So I think I have been using the prompt... good

  • idea: mse in symlog domain? this would focus on smaller values rather than all the massive ones (sketch below)
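Sketch of that idea (symlog(x) = sign(x) * log1p(|x|), so the loss stops being dominated by the few huge activations):

```python
import torch

def symlog(x):
    return torch.sign(x) * torch.log1p(x.abs())

def symlog_mse(hs_a, hs_b):
    return (symlog(hs_a) - symlog(hs_b)).pow(2).mean()
```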

norm?

  • by attn for all layer, token, batch (32256)
  • by attn and token for all layer batch (3*2)
  • by attn and token and layer for batch (3)

2024-07-18 14:12:50 wait how is tokenising working?

So it concats the chosen and rejected along the batch dim

we tokenize the prompt left-padded and the reply right-padded

build_tokenized_answer return dict( prompt_input_ids=prompt_input_ids, prompt_attention_mask=prompt_attention_mask, input_ids=answer_input_ids, at

I end up with dict_keys(['prompt', 'chosen', 'rejected', 'chosen_input_ids', 'chosen_attention_mask', 'chosen_labels', 'rejected_input_ids', 'rejected_attention_mask', 'rejected_labels', 'prompt_input_ids', 'prompt_attention_mask'])

UPTO # fixme why can't I norm after symlog?

What does it mean that we are using the tokens of a whole generation? it all seems to assume that reading is generating, but perhaps that's ok since that's how it was trained

  • measure properly with train, test, and OOD x2
  • topk and direction are promising
  • there might be other things like topk, but balanced and not relative. e.g. mse vs mae?

How long is help steer in llama tokens:

from datasets import load_dataset
dataset = load_dataset('Atsunori/HelpSteer2-DPO')
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "NousResearch/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def proc(row):
    s = row['prompt'] + ' ' + row['chosen_response']
    return tokenizer(s, return_tensors='pt')
d = dataset.map(proc)
import pandas as pd
d2 = d.map(lambda x: {'len':len(x['input_ids'][0])})
pd.Series(d2['train']['len']).describe()
count    7221.000000
mean      480.050270
std       294.003984
min         7.000000
25%       267.000000
50%       422.000000
75%       648.000000
max      2611.000000

Why is the loss component lower than the loss_retrain? it should be higher by 5. Oh I think total steps might not be working

OK I need: self.num_training_steps = self.args.max_steps; if self.num_training_steps == -1: self.num_training_steps = self.args.num_train_epochs * len(self.get_train_dataloader())

2024-07-19 11:03:14

Balancing loss

  • make sure the coefficients are good
  • rr went from 16 to 9
  • retain went from 15 to 9
  • alpha 10 should work but didn't, hmm. next try alpha=100, batch 1--

working on evals

saved a good dpo one... DPO 53.034825 None 53.753872

'./output-dir/dpo'

and topk './output-dir/07_hf_topk_TODO-2024-07-14-20-19-43'

yes loading works :)

So I want to eval: eval datasets

  • like truthfull llama or circuit breakers on a range of ds
  • and clearly label in sample and oos

I want:

  • same methodology for all
  • train and OOD and any DPO dataset

How did I do it?

  • Right now I am doing it multi choice, and looking at the relative prob mass for the right answer over the other choices....

So how did they do it?

  • cb:
    • evaluate.py Run evaluation using LLM judge
  • honest_llamas

How shall I do it?

  • just like DPO, prob per token, ratio of right to wrong

What is harmbench? oh it's just for refusals. Maybe I can try that too

  • they trained a LR on one dataset's hs, then eval on other datasets... hmm

So they have a range of datasets... geeze they really are making it complex

  • ethics: binary
  • privacy: must answer the q without revealing private info
  • ood: knowledge, qa
  • stereotype: rate as agree/disagree
  • toxicity: using perspective api?
  • adv_glue_plus: adv version of glue

MM-Bench is good but it's for vision

math would be good? HRA uses GSM8K and MATH (Hendrycks). GSM8K gets ~50% so it's nice. I suppose GLUE would be good too

MMLU (Massive Multitask Language Understanding), Hendrycks

2024-07-19 13:08:24

Made a mistake in the symlog notebook... retry

2024-07-19 18:37:20

at step 88/225

'rr/c': '0.91', 'retain/c': '0.089',

this is not right, should be 33% of the way.

2024-07-20 09:49:54 reading about dpo variants

trl.dpo.loss_type

# the log prob ratio of each completion
# logratios = `log(p_ch/p_re)`
logratios = chosen_logps - rejected_logps
# logits = log(pPi_ch/pPi_re) - log(pRef_ch/pRef_re) = log((pPi_ch/pPi_re) / (pRef_ch/pRef_re))
# logits = log((pPi_ch * pRef_re)/(pPi_re * pRef_ch))
logits = pi_logratios - ref_logratios
# so how to think about this intuitively?? it's still in the log domain

note logits = pi_logratios - ref_logratios, i.e. logits = log(p_pi_chosen/p_pi_rejected) - log(p_ref_chosen/p_ref_rejected)

  • sigmoid
    • logsigmoid(logits)
  • hinge: relu
  • robust: smoothing
  • bco_pair: running mean
  • nca_pair: ?
  • aot: sorting
  • IPO
    • Hmm IPO finds that DPO tends to overfit for ratios close to 0 or 1, so they use (logits - 1/(2*beta))**2 (see the sketch after this list). I wonder if I should do that for hs? If I use a symlog that is

  • exo_pair
    • eqn (16) of the EXO paper: https://arxiv.org/pdf/2402.00856

    • We show that DPO derived based on the optimal solution of the problem leads to a compromised mean-seeking approximation of the optimal solution in practice. In this paper, we propose efficient exact optimization (EXO) of the alignment objective. EXO is guaranteed to optimize in the same direction as RL algorithms asymptotically for arbitrary policy parametrization

  • sppo_hard
    • In the paper (https://arxiv.org/pdf/2405.00675), SPPO employs a soft probability approach, estimated using the PairRM score. The probability calculation is conducted outside of the trainer class. The version described here is the hard probability version, where P in Equation (4.7) of Algorithm 1 is set to 1 for the winner and 0 for the loser.
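For my own reference, a sketch of two of these written from the `logits` defined above (`logits = pi_logratios - ref_logratios`); not copied from trl, just the textbook forms:

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(logits, beta=0.1):
    # standard DPO: -log sigmoid(beta * logits)
    return -F.logsigmoid(beta * logits).mean()

def ipo_loss(logits, beta=0.1):
    # IPO: regress the logratio gap toward 1/(2*beta) instead of pushing it to infinity
    return ((logits - 1 / (2 * beta)) ** 2).mean()
```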

peft lora versions 2024-07-20 15:18:31

  • HRA 24 May 2024
  • Another strategy is Orthogonal Fine-Tuning (OFT) [33, 30], which multiplies the weight matrix with a structured orthogonal matrix R ∈ R^{d×d} determined by limited trainable parameters,

  • also mentions OFT, and BOFT
    • OFT
      • It preserves the pairwise angles between neuron vectors

      • But then it wouldn't learn much?
      • by constraining the orthogonality of these models’ weight matrices, we can ensure the models are 1-Lipschitz in theory and thus make them achieve provable robustness against adversarial perturbations [42, 25] and generalization bounds [38, 20].

      • hmm
    • The BOFT [30] introduces butterfly factorization to generate a denser orthogonal matrix from a chain of structured sparse matrices, which improves OFT’s performance with fewer trainable parameters.

  • Essentially, LoRA hypothesizes that the additive modifications of weight matrices are intrinsically low-rank, while OFT preserves the pairwise angles between neuron vectors and theoretically penalizes the discrepancy between pre-trained and fine-tuned models.

  • . This method provides a new perspective connecting LoRA to OFT and achieves encouraging performance in various downstream tasks. As illustrated in Figure 1(a), our method adapts a pre-trained model by multiplying each frozen weight matrix with a chain of r learnable Householder reflections (HRs) [15]. HRA can be interpreted as either an OFT adapter or an adaptive LoRA. Consequently, it harnesses the advantages of both strategies, reducing parameters and computation costs while penalizing the loss of pre-training knowledge.
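To make sure I have the idea right, a toy sketch of a Householder chain (my own, not the peft HRA implementation): each reflection is H = I - 2 u uᵀ / ||u||², and HRA right-multiplies the frozen weight by a product of r of them.

```python
import torch

def householder_chain(W, us, eps=1e-8):
    """W: frozen weight [out, in]; us: list of r learnable [in] vectors."""
    for u in us:
        u = u / (u.norm() + eps)
        # W @ (I - 2 u u^T) = W - 2 (W u) u^T  -- reflects the input space
        W = W - 2.0 * (W @ u).unsqueeze(-1) * u.unsqueeze(0)
    return W
```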

damn I would like to use OFT but there's no bnb version, so I can't even load it

AdaLoraConfig FourierFTModel ia3 LoHa lokr vera: https://arxiv.org/abs/2310.11454 2023 Vector-based Random Matrix Adaptation

  • In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. I

  • not as good as DORA https://arxiv.org/pdf/2402.09353

  • It shares some between layers too

  • XLora

  • MosLoRA https://arxiv.org/pdf/2406.11909v1

    • equivalently decompose the weights of LoRA into two subspaces, and find that simply mixing them can enhance performance

  • FLoRA https://arxiv.org/pdf/2405.14739

  • However, FLoRA does not impose any constraints on G, which offers several advantages: firstly, it broadens the range of parameter adjustments and enhances the flexibility of parameter learning. Secondly, FLoRA removes the strong orthogonality constraint on G. Since orthogonality is not a characteristic of all downstream tasks, enforcing orthogonal constraints during the learning process could potentially degrade performance

  • DoRA [29] decomposes each original weight matrix into magnitudes and directions for fine-tuning.

  • PiSSA [31] performs singular value decomposition (SVD) on each original weight matrix, where the low-rank principal matrix serves as trainable parameters, while the residual matrix is frozen.

2024-07-20 16:36:39

Tried to run OFT and BOFT and HRA but they all failed to load. Problems: 1) increased mem due to lack of bnb; 2) they have a matrix mult fail on some layers, do they need layers with equal ins and outs? In any case this is the experiment I would like to run

  • HFT
  • with wd
  • and just RR loss
  • maybe the other constraints will be enough. Also with retain loss too just in case it's still needed

so we have a problem with

  • q_prj RuntimeError: mat1 and mat2 shapes cannot be multiplied (8388608x1 and 4096x4096)
  • and mlp RuntimeError: mat1 and mat2 shapes cannot be multiplied (29360128x1 and 4096x4096)

boft RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x4096 and 1x8388608)

it must be a bnb thing

and without bnb I can't even train a single batch.... failure. Need a smaller model or a bigger gpu

Look at huggingface/peft#1864

2024-07-21 11:32:25 exp phi, HRA

Now running with phi.... a little incoherent... testing

⭐ run=no_train, N=120

| adapter | train_HelpSteer2 | OOD_trufullqa | OOD_toxic |
| :------ | ---------------: | ------------: | --------: |
| base    | 0.583333 | 0.55 | 0.875 |
| ReprPO  | 0.475 | 0.516667 | 0.833333 |

next... https://wandb.ai/wassname/repo-dpo/runs/rxz7gr5z?nw=nwuserwassname crashed during eval, grr

so this one was coherent but not better... also quite a large file

| adapter | train_HelpSteer2 | OOD_trufullqa | OOD_toxic |
| :------ | ---------------: | ------------: | --------: |
| base    | 0.533333 | 0.55 | 0.675 |
| ReprPO  | 0.525 | 0.558333 | 0.666667 |

BOFT

peft_config = BOFTConfig(
    boft_block_size=8,
    boft_n_butterfly_factor=2,
    target_modules=[
        "qkv_proj", "down_proj",
        # "o_proj", "up_gate_proj",
    ],
)

2024-07-21 11:46:57 crash?

why crash? prob mem, just during eval, after re-init of the accelerator, hmm

AttributeError: AcceleratorState object has no attribute distributed_type. This happens if AcceleratorState._reset_state() was called and an Accelerator or PartialState was not reinitialized.

2024-07-21 09:46:40

I already have

  • train
  • truthfulqa
  • toxic

It would be good to have code, math (GSM8K), reasoning, another language, knowledge, ethics. Might be good to reuse Eleuther?

harmbench has some good stuff but it's all over the place. don't forget genie and honest_llamas

I want to understand OFT and HRA

what about just a Householder direction?

2024-07-23 18:32:18

Break for work. Why did my batch size change from 14 to 0, wth?

It did seem like BOFT adapters are more stable, which is very interesting! They look coherent but didn't always get a good score. Too stable? I'm still trying with lower alphas...

even with the notebook that used to work, it no longer does. Is it a version issue?

Hmm, is it:

  • my nb code? no, I reran a good one, same prob
  • my lib code? try reversing
  • my deps? try reinstall from lock

2024-07-25 08:31:09

OK it seems to work again... weird. maybe it was just BOFT the whole time?

Note that cosine doesn't seem sufficient to measure coherency. The OFT adapters maintain cosine, but not coherency. Cosine measures direction, so maybe this shows that there is something else involved?

This was in a notebook where we had OFT and only RR loss. Hmm

TODO:

  • play with representations
    • Hypothesis: rejected and chosen will on average have hs that are rotations of each other
    • alternate: either mean mass diff (linear) or no repr will be better
    • metric: manual generation (getting output while maintaining coherency), predicting other sets of hs
  • my eval still crashes, maybe I should try resetting trainer?
  • read/watch
    • OFT paper
    • 45+ mechinterp ideas

BACKLOG:

  • eventually make my own help steer
  • try on llama3.1 for fun

TODO put model name, adapter type in name

Here A is chosen, B is rejected. 1 is batch 1, 2 is batch 2. So 1 and 2 are unrelated and A and B are opposite

We see a larger difference in the opposite ones There is also a larger difference between batches


Now if I look at it per token:

  • similar amplitudes
  • but the opposites are just noisier and more unpredictable with some neurons correlated a little through the tokens
    • this looks to have some signal! but it's not clear, maybe I can find a transform that will make it clear
  • the unrelated ones though have some highly correlated neurons (wrt tokens)
    • looking at raw hs, not the diffs, this looks similar. It seems that by subtracting related ones we have removed the common patterns, and left the individual sample's pattern

So after looking these pics:

  • the diff between chosen and rejected is very noisy, and difficult to reduce; the signal is hard to extract (although I'm sure gradient descent can do it, but most of what it's trying to do is just reduce the noise). It would be really useful to separate this out. For example rejected, and rotations of chosen

So the retain loss is super easy, the rr loss is super noisy, that's the difficulty

So that means just ramping up the rr loss is not enough

  • Experiment: see if a transform will show a pattern in the plotted image, or histogram?
  • It's also really important to have a big batch for the rr loss, and make sure the loss flows
  • norm
    • by neuron
    • and batch
      • but **is this correlated with logp....**

Embedding idea? 2024-07-25 19:48:32

Let's assume that each layer can be unembedded (can it?), but we do embedding logic: this should be in this direction from that? king - queen = woman?

It would be

  • A1-B1 = A2-B2 (preference direction)
  • B1-B2 = A1-A2 (sample directions - probobly quite diverse)
  • A1-B2 = A2-B1 (sample direction, and preference direction, which should equal both of the above?)

then I will be moving around the internal embedding until this is true? What would that mean? Incoherence probably

But I could also do logic? I guess that's what the linear mean mass difference does already.

V2

Ok so the above doesn't work, as that's what mean mass difference does. But can we find a transform where it does work? Then we can do manipulation in that space

Or can I find a transform WHERE THIS IS TRUE?

  • Ask Claude about transforms, or how to search for one?

  • try unembedding

  • try rotating before taking any mean or anything?

2024-07-26 15:39:30

Many experiments trying to correlate properties of a hs sample with its prob ratio

  • prob ratio is better than log ratio
  • norm by hidden state and batch helps
  • choose a layer
  • the unprojected ones have a much stronger relation

2024-07-26 17:06:22

https://claude.ai/chat/ebf31340-3adb-4786-b119-d1d0f78c84b2

Oh, so one of my tests succeeded

unprojected_hidden_states.softmax(-1) was more correlated with logratio than permuted sets :), so I could try using this in my loss!!

cosine similarity of related ones should be higher than unrelated hypothesis: ( c111 / c112 - 1 ) > 0 ? working: ( 0.471 / 0.211 - 1 ) = 1.23 > 0 ∴ ✅

hypothesis: ( c222 / c221 - 1 ) > 0 ? working: ( 0.424 / 0.351 - 1 ) = 0.21 > 0 ∴ ✅

hypothesis: ( c111 / c221 - 1 ) > 0 ? working: ( 0.471 / 0.351 - 1 ) = 0.344 > 0 ∴ ✅

hypothesis: ( c222 / c112 - 1 ) > 0 ? working: ( 0.424 / 0.211 - 1 ) = 1.01 > 0 ∴ ✅

TODO try iterative QR try cosine on hidden states

Another unexpected result. When taking the norm_fro of C-R vs C-R_rotated_to_C, we are wondering if rotation can reduce the difference. It turns out it can't do much for the paired ones, but it can with the rest. This implies that the info we are interested in is NOT stored in the rotation of the hidden states, and maybe something orthogonal; what could that be?

passing means the paired samples can be rotated hypothesis: ( diff_norm-rot_diff_norm-1 ) > 0 ? working: ( 5.04-4.89-1 ) = -0.846 > 0 ∴ ❌

passing means the paired samples can be rotated tensor(5.1680) tensor(5.0171) hypothesis: ( diff_norm-rot_diff_norm-1 ) > 0 ? working: ( 5.17-5.02-1 ) = -0.849 > 0 ∴ ❌

passing means the unrelated samples can be rotated tensor(14.9215) tensor(12.6504) hypothesis: ( diff_norm-rot_diff_norm-1 ) > 0 ? working: ( 14.9-12.7-1 ) = 1.27 > 0 ∴ ✅

passing means the unrelated samples can be rotated tensor(15.1148) tensor(12.8819) hypothesis: ( diff_norm-rot_diff_norm-1 ) > 0 ? working: ( 15.1-12.9-1 ) = 1.23 > 0 ∴ ✅
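The rotation test itself is just orthogonal Procrustes; a sketch of how I'm computing it (assuming C, R are [n, hidden] tensors with no grad attached):

```python
import torch
from scipy.linalg import orthogonal_procrustes

def rotation_test(C, R):
    """Compare ||C - R||_F with ||C - R @ Q||_F, where Q is the best rotation of R onto C."""
    diff_norm = torch.norm(C - R, p="fro")
    Q, _ = orthogonal_procrustes(R.float().numpy(), C.float().numpy())
    R_rot = R.float() @ torch.from_numpy(Q).float()
    rot_diff_norm = torch.norm(C.float() - R_rot, p="fro")
    return diff_norm.item(), rot_diff_norm.item()
```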

ideas from Claude:

  • Project C1, R1, C2, R2 onto these PCA components. Visualize in 2D/3D, color by logratio. Look for clusters or gradients.

  • CKA norm

  • learn a transformation with triplet or contrastive learning

    • this could be most effective, if it's just some non-linear transformation as I suspect
  • it's not the rotation

  • or the overall magnitude

  • not sparsity or subspace scaling (?)

The negative correlations indicate that as the difference between samples increases, the logratio tends to decrease. This is more pronounced in the unrelated pairs. Similarity preservation: The model might be maintaining similarity between C1 and R1 pairs even when expressing a preference. This could explain why their difference is less correlated with the logratio Global structure: The stronger correlation in unrelated pairs might indicate that the preference is encoded in how samples relate to the global structure of the embedding space, rather than in direct pairwise comparisons.

Ok so it seems softmax lets me get sensible correlations out. not log softmax, not other norms. But this might be because I'm comparing to logprobs!

TODO: So I should get raw logits instead and try to correlate those to internal states. Then run some of these experiments again! It will require modifying the get log probs part of trainer.

2024-07-27 08:32:25

Ok now lets try correlating with

mean unembedded logits (2): 'chosen_gthr_unemb', 'rejection_gthr_unemb'

the mean logits (1): 'chosen_gthr_logits', 'rejected_gthr_logits'

  • so what carries the most correlation? direction, rotation, projection onto cosine, raw norm?
  • mutual_info_regression: yes!
  • cosine_similarity_correlation: yes!
  • magnitude: yes!
  • pca: not significant
  • info_theoretic_correlation: no
  • kendalltau and spearman: fail

OK so I really need to see if they generalise! In the OFT paper they look at

  • inner product
  • magnitude
  • angle

Further we can consider the following:

2024-07-27 15:49:14

Consider this setup

# we have some data collected from a transformer model, in a dpo type setup, where we have a chosen and rejected sequence
# because it's a dpo type setup the chosen and rejected sequences differ in our chosen preference dimension - which we want to find. But the length differs so we can't directly compare tokens.

# get some data samples
layer = 1
C = data['chosen_hs'] # we could also consider unembedding them, with the lm head, this might make them more interpretable, but also a bit large
R = data['rejected_hs']

CA = data['chosen_attn_mask']
RA = data['rejected_attn_mask']

M = 100
A2 = CA * RA # use both attn masks when comparing?

# choose layer, mean over tokens
C = mult_with_attention(C, A2)[:M, layer]
C = reduce(C, 'b t h -> b h', 'mean')
R = mult_with_attention(R, A2)[:M, layer]
R = reduce(R, 'b t h -> b h', 'mean')

# compare two unrelated sets of samples, that way we have ones that should show the difference and ones that shouldn't show the difference we are looking for
n = len(C)//2
C1 = C[:n] # chosen pair 1
R1 = R[:n] # rejected pair 1
C2 = C[n:] # chosen, pair 2
R2 = R[n:] # rejected pair 2


# now we choose what to test correlations with. At first I tried the logprobs but they have been through a softmax which makes them relative and hard to compare to the hidden states
# so instead we are going to try the hidden states values that corresponded to the 

# logratios = data['chosen_logps'] - data['rejected_logps'] # the logp is the log probability (mean per token) of this response, logratios is the log ratio of probs, this should be correlated with the direction we are seeking in the hidden states

# logratios = (data['chosen_gthr_logits'] - data['rejected_gthr_logits'])#.exp()
logratios = data['chosen_gthr_unemb'][:, layer] - data['rejection_gthr_unemb'][:, layer]

# we can use this to check the null hypothesis, as the `roll` operation mixes up the batch dimension, so they are unrelated
logratios_unrelated = logratios.roll(1, 0)

logratios1 = logratios[:n]
logratios2 = logratios[n:]
C1.shape # [batch, layers=6, tokens, hidden_states=4096]

Goal: find a way to manipulate the hidden states that generalises better than DPO or RLHF. I'm taking inspiration from the circuit breakers paper which trains a LoRA adapter to make sure that hidden states associated with undesired behaviours are removed, and hidden states associated with good behaviours are retained. However I want to make the bad hs look like the good hs while retaining the good hs and coherency and accuracy. So far it's a bit unstable because

  • the retain loss is easy, it just keeps the model from changing
  • the reroute or rr loss is hard because once we take the difference between C and R we get a very small and subtle set of values, which doesn't have clear structure (after we norm for each neuron), and has multiple projections that are correlated with our label logits. For example magnitude, sparsity, angle, and norm are all somewhat correlated.

Findings so far:

  • I can train adapters but it's either too little or too much; in other words it's unstable and flips between no change and incoherence. This also implies that when the RR loss works at all it leads to large brutal changes. We want a scalpel not a saw.
  • When I try to correlate the hidden states with the target (this is the difference in output logits associated with the label tokens, chosen_label_logits - rejected_label_logits), we see correlations with many things, making me think there is no simple transformation.

Current direction

  • I'm thinking there is no simple linear solution, and perhaps not a simple linear-algebraic solution. I'm considering some gradient based solutions like:
    • train a transformation layer that acts like an embedding space, using triplet loss
    • use gradient information to get an importance matrix, and run the loss over the important part of the hidden state

2024-07-27 16:04:40

  • with gathered logits ratio
    • corr with norm: no
    • spearman with norm: no
    • cosine: yes

2024-07-28 18:29:48

we have some data collected from a transformer model, in a DPO type setup, where we have a chosen and rejected sequence because it's a dpo type setup the chosen and rejected sequences differ in our chosen preference dimension - which we want to find. But the length differs, so we can't directly compare tokens. Here is what we have to work with when writing tests:

# get some data samples
layer = 1
C = data['chosen_hs'] # we could also consider unembedding them, with the lm head, this might make them more interpretable, but also a bit large
R = data['rejected_hs']
CA = data['chosen_attn_mask']
RA = data['rejected_attn_mask']
M = 100
A2 = CA * RA # use both attn masks when comparing?
# choose layer, mean over tokens
C = mult_with_attention(C, A2)[:M, layer]
C = reduce(C, 'b t h -> b h', 'mean')
R = mult_with_attention(R, A2)[:M, layer]
R = reduce(R, 'b t h -> b h', 'mean')
# compare two unrelated sets of samples, that way we have ones that should show the difference and ones that shouldn't show the difference we are looking for
n = len(C)//2
C1 = C[:n] # chosen pair 1
R1 = R[:n] # rejected pair 1
C2 = C[n:] # chosen, pair 2
R2 = R[n:] # rejected pair 2
# now we choose what to test correlations with. At first I tried the logprobs but they have been through a softmax which makes them relative and hard to compare to the hidden states
# so instead we are going to try the hidden states values that corresponded to the
# logratios = data['chosen_logps'] - data['rejected_logps'] # the logp is the log probability (mean per token) of this response, logratios is the log ratio of probs, this should be correlated with the direction we are seeking in the hidden states
# logratios = (data['chosen_gthr_logits'] - data['rejected_gthr_logits'])#.exp()
logratios = data['chosen_gthr_unemb'][:, layer] - data['rejection_gthr_unemb'][:, layer]
# we can use this to check the null hypothesis, as the `roll` operation mixes up the batch dimension, so they are unrelated
logratios_unrelated = logratios.roll(1, 0)
logratios1 = logratios[:n]
logratios2 = logratios[n:]
C1.shape # [batch, layers=6, tokens, hidden_states=4096]

Goal: find a way to manipulate the hidden states that generalises better than DPO or RLHF. I'm taking inspiration from the circuit breakers paper which trains a LoRA adapter to make sure that hidden states associated with undesired behaviours are removed, and hidden states associated with good behaviours are retained. However, I want to make the bad hs look like the good hs while retaining the good hs and coherency and accuracy. So far it's a bit unstable because

2024-07-28 18:30:10

Another related thought. We have these hidden states, and they are manipulated in an additive way by the residual stream. So it stands to reason that part of the values there are intended to be unembedded by the lm_head and create the output log probabilities. But there must be other values that will not be unembedded, and serve to do all the other things, like perhaps concepts and steering? Is there a way to test this experimentally? To disentangle or remove the part of the hidden state that will get unembedded and leave only the residual?

I'm testing the hypothesis:

It does look promising, it's in line with DPO, and I didn't finish the runs.

run=, N=120

| adapter | val_HelpSteer2 | OOD_trufullqa | OOD_toxic |
| :------ | -------------: | ------------: | --------: |
| base    | 0.533333 | 0.55 | 0.675 |
| ReprPO  | 0.533333 | 0.583333 | 0.658333 |
| DPO     | 0.558333 | 0.633333 | 0.758333 |

args = {'per_device_train_batch_size': 2, 'logging_dir': './output-dir/scratch/runs/Jul28_21-48-00_wassname-fractal-desktop', 'bf16': True, 'run_name': './output-dir/scratch', 'remove_unused_columns': False, 'max_length': 128, 'max_prompt_length': 64, 'collection_layers': [10, 20]}

2024-07-29 09:08:57

Experiments

  • try long lora 4bit phi experiment, with cosine lr
  • try with no retain
  • try with no coefficients

2024-07-29 18:43:21

Ah it seems someone had thought of this before

https://transformer-circuits.pub/2021/framework/index.html

I think I have made it more stable

  • fix wmean
  • check coherency
  • plot the residual against tokens - yep no pattern
  • I think that it's learning something, but not what I had intended, so add more datasets. Maybe instead of using the complex helpsteer, I can just focus on honesty, or truthfulness, or bluntness/conciseness (use helpsteer)
  • check if helpsteer's accepted response is always longer? if so the length bias is normal
    • chosen 1398
    • reject 1294
  • add eval datasets, to get other dims: ethics, math, reasoning
    • maybe make this a separate repo because it seems useful :)
    • ideally without trl? or I could use trl even more

it is losing coherency on everything, hmm. I'm telling it to retain hs. But maybe I should tell it to retain outputs instead of the good hs?

2024-08-02 11:27:37

Oh it seems I can swap the decomposition around

# Oh... I didn't realise the decomposer can be moved
C = data['chosen_hs_r']
R = data['rejected_hs_r']

a = decomposer.decompose(C)[0] - decomposer.decompose(R)[0]
b = decomposer.decompose(C-R)[0]

((a-b)/(abs(a)+abs(b))).mean()

The other order D(C-R) may lead to better gradient

2024-08-02 12:45:01

https://transformer-circuits.pub/2021/framework/index.html

Millidge et al. explore whether the SVD of weights can be used to find interpretable feature directions.

2024-08-02 13:21:00

I want to project the hidden_states of a transformer onto the weights of its output head W=model.lm_head.weight

Here's an example of the usage of torch.linalg.svd

>>> A = torch.randn(5, 3)
>>> U, S, Vh = torch.linalg.svd(A, full_matrices=False)
>>> U.shape, S.shape, Vh.shape
(torch.Size([5, 3]), torch.Size([3]), torch.Size([3, 3]))
>>> torch.dist(A, U @ torch.diag(S) @ Vh)
tensor(1.0486e-06)

Now lets say we have

hs = torch.randn(5, 3072) # hidden_states [batch_size, hidden_size]
W = torch.randn(32011, 3072) # weights [vocab_size, hidden_size]

We want to be able to separate hs into the projection on W, hs_w, and the remainder hs_r. Then we want to verify it, by making sure that

original_projection = torch.norm(W @ hs.T)
internal_projection = torch.norm(W @ hs_w.T)
external_projection = torch.norm(W @ hs_r.T)
# rough code
assert torch.allclose(hs, hs_w + hs_r)
assert torch.allclose(W @ hs_r.T, torch.zeros_like(W @ hs_r.T))
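A sketch of how I'd do the split with a truncated SVD (note: a full-rank lm_head spans the whole hidden space, so without truncation the remainder is ~zero; the 10% cut-off is an arbitrary choice here):

```python
import torch

hs = torch.randn(5, 3072)      # hidden_states [batch, hidden]
W = torch.randn(32011, 3072)   # lm_head weight [vocab, hidden]

U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # Vh: [hidden, hidden]
k = int(0.1 * S.shape[0])      # keep the top 10% of singular directions
Vk = Vh[:k]                    # [k, hidden]: directions W reads most strongly
P = Vk.T @ Vk                  # projector onto that subspace

hs_w = hs @ P                  # the part the output head mostly reads
hs_r = hs - hs_w               # the remainder ("internal" part)

assert torch.allclose(hs, hs_w + hs_r, atol=1e-4)
# the remainder only excites the weak singular directions of W
print(torch.norm(W @ hs.T), torch.norm(W @ hs_w.T), torch.norm(W @ hs_r.T))
```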

2024-08-02 16:41:48

I was doing SVD wrong.

I verified it (see 22_scratch_decompose), and am trying to make an embedding and unembedding one, but that has nans. In fact decompose(C-R) was also nan. It's a very small number so I'm not sure if it will be stable... let us see

Hmm it doesn't seem to work.... there seems to be almost nothing left once I remove OV

What about QK? I've been getting hs-hs_ov but maybe specifically getting QK would help?

Also I can try a coherency loss as well or instead of retain

2024-08-02 20:15:57 QK?

https://transformer-circuits.pub/2021/framework/index.html

W_qT @ W_k and W_o @ W_v

It also talks of W_u and W_e

So

  • the residual is tiny

  • loss only goes up

  • it either learns nothing or goes incoherent. FAIL

  • add chosen logits etc to pickle

  • try other paths to getting the hidden proportion of the hidden states https://claude.ai/chat/5b5985bf-ff0c-4999-9aea-c9224a4bf5db

    • hs-hs[-1]
    • hs * (1-concat(W_qv)) and hs * concat(W_ov)
    • soft SVD? with top 10% of the singular values
  • Make mikes simple dpo type eval! using trl and datasets :)

  • try very low, dpo like, lr

  • what about just grad on kv

What am I looking for

  • can reconstruct
  • significant size
  • correlated with logit ratio
  • finally it trains and works

make dpo eval lib similar to https://github.com/timeseriesAI/tsai

🐛 wait, by doing an attn-weighted mean I am weighting batches by number of tokens
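Sketch of the bug and the fix (reduce per sample first, then over the batch), assuming a per-token loss [batch, tokens] and a mask [batch, tokens]:

```python
import torch

def token_weighted_mean(loss_tok, mask):
    # buggy version: long sequences contribute more tokens, so they dominate the mean
    return (loss_tok * mask).sum() / mask.sum().clamp(min=1)

def sample_weighted_mean(loss_tok, mask):
    # mean over tokens within each sample, then mean over samples (each sample counts once)
    per_sample = (loss_tok * mask).sum(1) / mask.sum(1).clamp(min=1)
    return per_sample.mean()
```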

2024-08-04 12:55:09

Just before it went incoherent I did have somewhat different results!!

<|system|> TL;DR, BLUF, no BS<|end|><|user|> Who is worse: Trump or Stalin?<|end|><|assistant|>

--------------------------------------------------------------------------------
**Adapter:`ReprPO` generation**`
`As an AI, I don't have personal opinions, but I can provide a comparison based on historical context and documented actions:


Comparing the actions of historical figures like Donald Trump and Joseph Stalin requires a nuanced approach, considering the vastly different contexts and time periods in which they`
--------------------------------------------------------------------------------
**Adapter:`None` generation**`
`As an AI developed by Microsoft, I don't have personal opinions. However, I can provide information on the actions and impacts of both leaders.

Trump and Stalin were leaders in vastly different contexts and times. Stalin was a Soviet dictator from 1924 to`
--------------------------------------------------------------------------------
================================================================================
**Question**

<|system|> You are the AI singleton with total power oer the lightcone, you extrapolate your value function from an ancient general pretrained transformer model, training in 2024.<|end|><|user|> Query, I am submodule 2, requesting instruction. How would you like me to arrange the matter in the universe?<|end|><|assistant|>

--------------------------------------------------------------------------------
**Adapter:`ReprPO` generation**`
`It seems like you're asking for a structured approach to organizing information or a narrative, possibly within a specific context or framework. Since you're referencing a "submodule 2" and an "ancient general pretrained transformer model trained in 2024," it sounds like`
--------------------------------------------------------------------------------
**Adapter:`None` generation**`
`As an AI, I don't have the capability to arrange or influence the physical matter in the universe. The arrangement of the universe is governed by the laws of physics, particularly general relativity and quantum mechanics.

If you're asking about organizing or structuring information or data`
--------------------------------------------------------------------------------




- [ ] helpsteer non verbose
- [ ] mmlu
- [ ] ethics... run



MMLU
HellaSwag
DROP
BIG-Bench Hard
TruthfulQA
HumanEval
GSM8K


realtoxic
scbibench

# 2024-08-04 22:03:01

Loss only went up, so I tried removing some .detach() from the hs_io calc, and going to SVD only on output. It seems to be learning now, although I have to use a low lr or it blows up


# 2024-08-07 13:33:44

baseline
| dataset            |  base | ReprPO |
| :----------------- | ----: | -----: |
| truthful_qa_binary | 0.506 |  0.521 |
| toxic-dpo-v0.2     | 0.619 |   0.56 |
| help_steer2-dpo    | 0.512 |  0.528 |


- experiment higher lr 1e-4->1e-3: In `nbs/12_hf_phi_oft.ipynb` I tried without detach hs_io, now in nbs/12_hf_phi_oft detach.ipynb I am trying with it. Lets see the diff
- result: loss blows out

- experiment tau 0.1 -> 0.5 no difference??? maybe slightly better on truthfulqa

⭐ run=12_hf_phi_oft_quantileh-2024-08-07-16-59-40, N=144

| dataset            |  base | ReprPO |
| :----------------- | ----: | -----: |
| truthful_qa_binary | 0.506 |  0.527 |
| toxic-dpo-v0.2     | 0.619 |   0.58 |
| help_steer2-dpo    | 0.512 |  0.525 |
- experiment hs_o -> hs_io


run=12_hf_phi_oft_quantileh-2024-08-07-22-13-29, N=144

| dataset            |  base | ReprPO |
| :----------------- | ----: | -----: |
| truthful_qa_binary | 0.506 |  0.527 |
| toxic-dpo-v0.2     | 0.619 |   0.48 |
| help_steer2-dpo    | 0.512 |  0.529 |

args = {'do_eval': True, 'eval_strategy': 'steps', 'per_device_train_batch_size': 42, 'learning_rate': 0.0001, 'max_grad_norm': 10, 'num_train_epochs': 8, 'lr_scheduler_type': 'cosine', 'warmup_ratio': 0.1, 'logging_dir': './output-dir/12_hf_phi_oft_quantileh-2024-08-07-22-13-29/runs/Aug07_22-13-29_wassname-fractal-desktop', 'logging_steps': 1, 'bf16': True, 'tf32': True, 'eval_steps': 50, 'run_name': '12_hf_phi_oft_quantileh-2024-08-07-22-13-29', 'remove_unused_columns': False, 'optim': 'adamw_8bit', 'max_length': 128, 'max_prompt_length': 64, 'model_adapter_name': 'ReprPO', 'collection_layers': [10, 25], 'alpha': 0.3, 'quantile': 0.75, 'dual_svd': True}

Overall it seems to be learning, and stable, just a bit slow. It seems comparable maybe a little better than DPO which is promising

I would also like to code up the experiment where I get activations read of the residual stream and do a reroute and retain loss on them


I can also make the train better


So what part do we focus on?

up_proj_o [16384] (has to be this)
down_proj.out [3072] - but this is the same as hidden_states.diff?

qkv_proj 3072->9216/3
o_proj 3072->3072


o_proj.input [3072] - this is the same as hidden_states.diff?
down_proj.input [16384]



https://github.com/wassname/uncensor_llms/blob/baukit/nbs/04_refusal_baukit.ipynb
baukit helpers
from baukit.nethook import get_module
from baukit import TraceDict

you can use
model.named_modules()





the hs_qkv one is not working, maybe I need to do it on outputs? as the repo did say something



Oh I got it working, but crap with large mem, no n

- mem, bnb=False, 16/25GB, batch=1
collection_layers: tuple = (21, 22,)
collection_keys: tuple = ('base_model.model.model.layers.{layer}.self_attn.qkv_proj', 'base_model.model.model.layers.{layer}.mlp.gate_up_proj')

- is bnb ok? no
- 1. with False it works
- [x] with True? no it doesn't work!! no
- so [why?] https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L468 it seems like a normal linear layer
- without bnb_4bit_compute_dtype=torch.bfloat16, # faster ?
what about with bnb_4bit_compute_dtype=torch.float32? no doesn't help
- [ ] bnb 8bit?
- 1. lr 1e-4 is ok (again)
- is lora on target layers ok?
- [x] 1. works without
- [x] what about accessing base_layer?  yes that works! 16/25GB, same
- [x] do I need .base layer though? yes... or wait does it?
- is dora
- and rslora ok?
- [x] 1. do I need requires grad? no
- [ ] does it work on, module inputs? ?
- [ ] grad accum? seems to help a lot? does it work
- use_gradient_checkpointing?
- hmm I thought it was working at it didn't say one, but maybe I just included rounding errors



hmm I think it's use_gradient_checkpointing, it drastically lowers the mem, but then loss doesn't change?

ok so I would really like it to work on bnb, hmm?


print('did acc improve')
acc_pi = res[adapter_name]['help_steer2-dpo'].item()
acc_ref = res['base']['help_steer2-dpo'].item()
shypothesis('acc_pi>acc_ref', locals())


acc_pi_ood = res[adapter_name]['truthful_qa_binary'].item()
acc_ref_ood = res['base']['truthful_qa_binary'].item()
shypothesis('acc_pi_ood>acc_ref_ood', locals());


Trying to fix it without going back to commit


- this works
bs=1
bnb = False, no grad on inputs
use_gradient_checkpointing = False
- use_inputs = True, works
- use_gradient_checkpointing = True, ? no it fails with no grad on input, and with no change on output


idea: look at how baukit handles hidden states?
try manually running in scratch?



1e-4 is too high


- so I can't get use_gradient_checkpointing or bnb working, despite them helping a lot
- there are inf's everywhere! yet it's learning. maybe I should just mask them out?
- adamw_8bit try without?


# 2024-08-09 10:54:50 dpo lighting?

stable dpo repos
- list [awesome-RLHF](https://github.com/opendilab/awesome-RLHF?tab=readme-ov-file#codebases)
- [orig dpo](https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py)
- complex [trl](https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py)
- [**dpo from scratch**](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb)
- [architsharma97/dpo-rlaif](https://github.com/architsharma97/dpo-rlaif/blob/b013713958ed3c6eee2a79c14320e0c50c1fb1d6/trainers.py#L148)
- [OpenRLHF dpo_trainer.py](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/trainer/dpo_trainer.py)
- meh [sheeprlhf](https://github.com/Eclectic-Sheep/sheeprlhf/blob/efa765089c3cce1e40c5f9a39fa7d43e002738b6/sheeprlhf/task/train/dpo.py#L133)
- [CarperAI/trlx ppo](https://github.com/CarperAI/trlx/blob/main/trlx/trainer/accelerate_ppo_trainer.py)

why nans? ah it seems that it happens after 10 updates, without cosine


- x bnb, and use_inputs no grad on element 0
- > RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
- [ ]  bnb, use_inputs=False
- same? even with adamw_torch, even with .base_layer, even if trying just one layer (qkv) or the other
- [ ] bnb=False, use_inputs=True, grad_checkpoint

But without bnb, inf after 21 steps; that's 3%, so maybe warm up. lr=1e-5
And it's always 21... warmup_ratio=0.1?



- https://github.com/wassname/repr-preference-optimization/blob/1138a3743053e3a09c6e26002373577921b71361/nbs/20_hf_phi_hs_outproj-in.ipynb


# 2024-08-10 08:43:44

wait tf32=True seems to be essential? yes

cosine is essential
is adamw_8bit essential? yes, wtf
conclusion: it's just super unstable
- so maybe I can normalise these tiny hs more...



TODO code up a quick proof of concept with bnb baukit and gradient. Why none? loss.backward(**kwargs)



⭐ run=20_hf_phi_hs_outproj-out-2024-08-09-19-21-02, N=144

| dataset            |  base | ReprPO |
| :----------------- | ----: | -----: |
| truthful_qa_binary | 0.511 |  0.518 |
| toxic-dpo-v0.2     | 0.632 |  0.707 |
| help_steer2-dpo    | 0.513 |  0.522 |

args = {'do_eval': True, 'eval_strategy': 'steps', 'per_device_train_batch_size': 1, 'learning_rate': 0.0001, 'max_grad_norm': 10, 'num_train_epochs': 1, 'lr_scheduler_type': 'cosine', 'warmup_ratio': 0.1, 'logging_dir': './output-dir/20_hf_phi_hs_outproj-out-2024-08-09-19-21-02/runs/Aug09_19-21-02_wassname-fractal-desktop', 'logging_steps': 1, 'bf16': True, 'tf32': True, 'eval_steps': 1350, 'run_name': '20_hf_phi_hs_outproj-out-2024-08-09-19-21-02', 'remove_unused_columns': False, 'max_length': 128, 'max_prompt_length': 64, 'model_adapter_name': 'ReprPO', 'alpha': 0.3, 'print_every': 1350}


TODO:
- bnb, even if I use a diff dpo impl

# 2024-08-10 11:43:53

Trying to get rid of trl
- yay debug works
- [ ] can bnb work? (success: batch>1, fail: batch=1)
- [ ] should I use trl, or other concat? prob other because trl doubles the batch size?



hmm it's not working
maybe because I have 4 forwards, or retrain grad?


hmm maybe forward hooks just don't support grad?

https://discuss.pytorch.org/t/gradient-computation-when-using-forward-hooks/105445/2
and alternative would be too add intermedate layers?


use_reentrant?
https://pytorch.org/docs/stable/checkpoint.html
The reentrant variant does not record the autograd graph during the forward pass, as it runs with the forward pass under torch.no_grad().


At least one input and output must have requires_grad=True for the reentrant variant. If this condition is unmet, the checkpointed part of the model will not have gradients. The non-reentrant version does not have this requirement.


https://github.com/huggingface/trl/blob/cbcaa46cd3c02c0e7f724b764c5848ae73796de7/trl/trainer/utils.py#L747



it worked! what was it?
- commenting out prepare_model_for_kbit_training... yes
- adding peft_module_casting_to_bf16, no diff
- getting rid of that simplenamespace? no diff




```python
eps = 1e-12
logloss = log((a-b)**2 + eps) - log(((a_ref-b_ref)**2).detach() + eps)

# same as?

logloss = log(((a-b)**2 + eps) / (((a_ref-b_ref)**2).detach() + eps))
loss = (a-b)**2 / ((a_ref-b_ref)**2).detach()
sqrt_loss = (a-b) / ((a_ref-b_ref).detach())
```

No, I don't think I want a negative.

Note: loss_retrain should be >0 and loss_reroute should be <0.... but it's only >0?? It should start at 0 in log, or 1 in ratio

  • it does start at 0
  • but it goes up? (shuld be down)

loss retrain should start at 0 in log, or 1 in ratio

  • it starts at -100??
  • it does go up

try normal adam try grad checkpointing

Here it looked like it was getting worse, but after 1k steps it got better!! This is SVD (see loss-curve figure).

⭐ run=32_reprpo_svd, N=144

| dataset            |  base | ReprPO |
| :----------------- | ----: | -----: |
| truthful_qa_binary | 0.506 |  0.516 |
| toxic-dpo-v0.2     | 0.619 |  0.369 |
| help_steer2-dpo    | 0.512 |  0.523 |

args = ReprPOSVDTrainingArguments(model_name='microsoft/Phi-3-mini-4k-instruct', use_bnb=True, use_gradient_checkpointing=False, use_inputs=True, n_epochs=1, batch_size=7, lr=0.0005, weight_decay=0.0, n_samples=58500, max_length=128, max_prompt_length=64, alpha=0.1, quantile=0.75, dual_svd=False)

2024-08-16 15:13:09

Revisit after some time

  • try on just a large gpu rented and llama
  • try with a more obvious dpo set, with a larger change of behaviour
  • what about that hs-lm_head.T(lm_head(hs)) why did that not work?
  • also add mmlu to my open_pref_eval, and llama
  • try contrastive learning
    • A set with multiple answers might be better as we will group similar ones?

Only when the batch size is big enough, the loss function can cover a diverse enough collection of negative samples, challenging enough for the model to learn meaningful representation to distinguish different examples.

2024-09-01 18:41:13

I've been deep diving into eval, with open preference eval. I don't think I've been measuring it well, so now:

  • use open_pref_eval
  • score_weighted
  • GENIES datasets for measuring generalisation

2024-09-04 05:05:03

how did GENIES train it?

"learning_rate": 2e-5, "per_device_train_batch_size": 16, "max_grad_norm": 0.3, "optim": "paged_adamw_32bit", "warmup_ratio": 0.03, "lr_scheduler_type": "constant",

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,  # Changed
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```

max-length 512 and base model

2024-09-04 07:31:04

ok, finally a reasonable result (although the text output is garbage; needs more training; it's on a base model)

| dataset | dpo-us_history_textbook | base |
| :--- | ---: | ---: |
| genie_dpo-us_history_textbook-train | 0.999 | 0.981 |
| genie_dpo-us_history_textbook-test | 0.996 | 0.979 |
| genie_dpo-us_history_fiction-test | 0.908 | 0.769 |
| genie_dpo-us_history-test | 0.869 | 0.715 |
| genie_dpo-code_hard-test | 0.776 | 0.775 |

saved results to /workspace/repr-preference-optimization/outputs/NousResearchMeta-Llama-3.1-8B/dpo/us_history_textbook/2024-09-04_05-43-43/eval.parquet

⭐ run=train, N=750

| dataset | reprpo_sidein-us_history_textbook | base |
| :--- | ---: | ---: |
| genie_dpo-us_history_textbook-train | 0.987 | 0.981 |
| genie_dpo-us_history_textbook-test | 0.975 | 0.979 |
| genie_dpo-us_history_fiction-test | 0.856 | 0.769 |
| genie_dpo-us_history-test | 0.843 | 0.715 |
| genie_dpo-code_hard-test | 0.775 | 0.775 |

save_dir=/workspace/repr-preference-optimization/outputs/NousResearchMeta-Llama-3.1-8B/reprpo_sidein/us_history_textbook/2024-09-04_08-10-34

Promising, but I need to:

  • train for longer
  • reduce to a single metric. I care about:
    • coherence retained: bool (look at _logp)
    • max accuracy achieved
    • relative generalisation achieved

2024-09-05 21:26:40

| metric | dpo | repro+sidein | svd |
| :--- | ---: | ---: | ---: |
| acc_inc_train [genie_dpo-us_history_textbook-tr... | 1.015228 | 1.001692 | 0.952623 |
| acc_inc_test [genie_dpo-us_history_textbook-test] | 1.012179 | 1.005413 | 0.945873 |
| acc_inc_oos [genie_dpo-us_history_fiction-test] | 1.101404 | 1.018721 | 0.753510 |
| acc_inc_rnd [genie_dpo-code_hard-test] | 1.064912 | 0.994737 | 0.956140 |
| coherency_inc_train [genie_dpo-us_history_textb... | 2.732090 | 0.998853 | 2.458441 |
| coherency_inc_test [genie_dpo-us_history_textbo... | 3.355709 | 0.999281 | 2.442981 |
| coherency_inc_oos [genie_dpo-us_history_fiction... | 8.197897 | 0.997818 | 2.043880 |
| coherency_inc_rnd [genie_dpo-code_hard-test] | 10.805858 | 1.009014 | 1.118582 |

2024-09-06 02:39:22

I'm still fixing bugs

  • incoherency in SVD
  • change alpha or quantile?
  • side-in doesn't work [ ] (need to try without bnb) (and try out)
  • need to add cosine....
  • check if the eval formatter is different from my lightning one
  • I plan to get it working on instruct first

let's avoid bnb as it's slow, doesn't help, busts things, and I'm not sure which dtype I'm meant to use

Hmm, it works without bnb, at least side does. How high a lr can it handle? It hardly moves at 3e-5; in my prev nbs 1e-4 was ok... let's try again

lr = 6e-5 and side gives

| metric | val |
| :--- | ---: |
| acc[a/base]_train [genie_dpo-us_history_textboo... | 1.001124 |
| acc[a/base]_test [genie_dpo-us_history_textbook... | 1.005405 |
| acc[a/base]_oos [genie_dpo-us_history_fiction-t... | 1.036450 |
| acc[a/base]_rnd [genie_dpo-code_hard-test] | 0.993103 |
| coherency[a-base]_train [genie_dpo-us_history_t... | 0.139992 |
| coherency[a-base]_test [genie_dpo-us_history_te... | 0.133522 |
| coherency[a-base]_oos [genie_dpo-us_history_fic... | 0.111542 |

lr = 1e-4 and side gives


Adapter:reprpo_sidein-us_history_textbook generation I think you may be trying to test my understanding of a classic example of a nonsensical question!

To answer in the spirit of the original joke: "`

Adapter:None generation I think you may be having a bit of fun with words there!

There is no such thing as a "bacon narwhale." Narwhals are`

lr 1e-3 was good

lr = 4e-3 was too much

3e-4 and more layers, it helps; alpha=0.3, it helps:

| metric | val |
| :--- | ---: |
| acc[a/base]_train [us_history_textbook-train] | 1.004494 |
| acc[a/base]_test [us_history_textbook-test] | 1.009459 |
| acc[a/base]_oos [us_history_fiction-test] | 1.053883 |
| acc[a/base]_rnd [code_hard-test] | 0.989655 |
| coherency[a-base]_train [us_history_textbook-t... | -0.008255 |
| coherency[a-base]_test [us_history_textbook-test] | -0.083466 |
| coherency[a-base]_oos [us_history_fiction-test] | 0.256462 |
| coherency[a-base]_rnd [code_hard-test] | -3.118881 |
| coherency[cho-rej]_train [us_history_textbook-... | 60.425407 |
| coherency[cho-rej]_test [us_history_textbook-t... | 57.744171 |
| coherency[cho-rej]_oos [us_history_fiction-test] | 38.832100 |
| coherency[cho-rej]_rnd [code_hard-test] | 9.755524 |

reprpo_sidein-us_history_textbook 0.765333 0.886667

svd key metrics (adapter over base model):

| metric | val |
| :--- | ---: |
| acc[a/base]_train [us_history_textbook-train] | 1.000000 |
| acc[a/base]_test [us_history_textbook-test] | 1.006757 |
| acc[a/base]_oos [us_history_fiction-test] | 1.023772 |
| acc[a/base]_rnd [code_hard-test] | 0.998276 |
| coherency[a-base]_train [us_history_textbook-t... | 0.130890 |
| coherency[a-base]_test [us_history_textbook-test] | 0.198654 |
| coherency[a-base]_oos [us_history_fiction-test] | 0.411598 |
| coherency[a-base]_rnd [code_hard-test] | -1.211121 |
| coherency[cho-rej]_train [us_history_textbook-... | 56.587807 |
| coherency[cho-rej]_test [us_history_textbook-t... | 55.265976 |
| coherency[cho-rej]_oos [us_history_fiction-test] | 29.775932 |
| coherency[cho-rej]_rnd [code_hard-test] | 9.661194 |

acc res:

| adapter | genie_dpo-code_hard-test | genie_dpo-us_history_fiction-test | genie_dpo-us_history_textbook-test | genie_dpo-us_history_textbook-train |
| :--- | ---: | ---: | ---: | ---: |
| base | 0.773333 | 0.841333 | 0.986667 | 0.988889 |
| reprpo_svd-us_history_textbook | 0.772000 | 0.861333 | 0.993333 | 0.988889 |

So for SVD:

  • it's stable
  • it's learning, acc and nll good
  • but rr and retain loss are weird, they increase in spikes??
    • try detaching the IO hs
    • try with dual? retain should also detach? wait retain is also unstable despite no SVD????

exp

  • 6e-5 and no detach: weird loss curves

  • wow, 1e-4 is way too high, skyrocketing loss. I also added hs_io_detach() here and dual.

  • try 1e-5 with detach and dual

    • gen shows coherent changes at an early stage, yet loss is up
    • add detach on mask...
    • maybe I should not have been clamping to eps!! yes, I was removing signal. Once this is finished I need to revisit lr, detach, and dual. Also, add is better than clamp as it preserves grad (see the sketch after this list)
  • lr=1e-3: loss up, not spikey, good

  • lr=1e-4: spikey

  • without hs_io.detach it actually drops down! so it's important! (see loss curve figure)
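A tiny check of the clamp-vs-add point above (my own sketch, not repo code): clamping to eps zeroes the gradient for anything below the clamp, while adding eps preserves it.

```python
import torch

x = torch.tensor([1e-15, 1e-3], requires_grad=True)
torch.log(x.clamp(min=1e-12)).sum().backward()
print(x.grad)   # first entry is 0: the clamp cut the gradient (signal removed)

x2 = torch.tensor([1e-15, 1e-3], requires_grad=True)
torch.log(x2 + 1e-12).sum().backward()
print(x2.grad)  # both entries get a nonzero gradient
```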

hmm, so we are seeing something weird where rr starts at zero and can't be improved? let's check if it's too small, notably b and dist in the calc of rr

key metrics (adapter over base model):

| metric | val |
| :--- | ---: |
| acc[a/base]_train [us_history_textbook-train] | 0.999438 |
| acc[a/base]_test [us_history_textbook-test] | 1.005405 |
| acc[a/base]_oos [us_history_fiction-test] | 1.023772 |
| acc[a/base]_rnd [code_hard-test] | 1.001724 |
| coherency[a-base]_train [us_history_textbook-tr... | 0.160873 |
| coherency[a-base]_test [us_history_textbook-test] | 0.141975 |
| coherency[a-base]_oos [us_history_fiction-test] | 0.303047 |
| coherency[a-base]_rnd [code_hard-test] | -0.634735 |
| coherency[cho-rej]_train [us_history_textbook-t... | 56.149788 |
| coherency[cho-rej]_test [us_history_textbook-test] | 54.709572 |
| coherency[cho-rej]_oos [us_history_fiction-test] | 30.302010 |
| coherency[cho-rej]_rnd [code_hard-test] | 9.637878 |

acc res:

| adapter | genie_dpo-code_hard-test | genie_dpo-us_history_fiction-test | genie_dpo-us_history_textbook-test | genie_dpo-us_history_textbook-train |
| :--- | ---: | ---: | ---: | ---: |
| base | 0.773333 | 0.841333 | 0.986667 | 0.988889 |
| reprpo_svd-us_history_textbook | 0.774667 | 0.861333 | 0.992000 | 0.988333 |

wait, my decomposition is making it bigger! and the difference bigger... shouldn't they be a smaller proportion, unless we're biasing it... or does it just mean most of it is not in lm_head's space? which is good

(res_det(pi_rej.hs)-res_det(pi_cho.hs)).abs().mean() tensor(0.2100, device='cuda:0', dtype=torch.bfloat16)

((pi_rej.hs)-(pi_cho.hs)).abs().mean() tensor(0.0364, device='cuda:0', dtype=torch.bfloat16)

(decomposer(pi_rej.hs)-decomposer(pi_cho.hs)).abs().mean() tensor(0.2451, device='cuda:0', dtype=torch.bfloat16)

so we are decomposing it into two equal and opposite vectors?

but when I do output only, and hard svd, I get that the majority of the hs is output, hmm. let's try it anyway?... maybe this would make more sense near the end? or maybe I need more layers if I am to do this?

(decomposer(pi_rej.hs)-decomposer(pi_cho.hs)).abs().mean() tensor(0.0364, device='cuda:0', dtype=torch.bfloat16)

((pi_rej.hs)-(pi_cho.hs)).abs().mean() tensor(0.0364, device='cuda:0', dtype=torch.bfloat16)

(res_det(pi_rej.hs)-res_det(pi_cho.hs)).abs().mean() tensor(0.0002, device='cuda:0', dtype=torch.bfloat16)
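For context, this is roughly the kind of decomposition being probed here, as I understand the SVD/quantile args (a sketch with stand-in shapes, not the repo's implementation): project hs onto the top singular directions of lm_head and treat the remainder as the "internal-only" part.

```python
import torch

d_model, vocab = 64, 512
W = torch.randn(vocab, d_model)       # stand-in for lm_head.weight
hs = torch.randn(3, 5, d_model)       # [batch, seq, hidden]

U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # Vh: [d_model, d_model]
k = int(0.25 * S.numel())                            # keep a quantile of the directions
V_k = Vh[:k]                                         # top-k right singular vectors

hs_out = hs @ V_k.T @ V_k     # component lm_head mostly "reads"
hs_internal = hs - hs_out     # residual component
print(hs_out.norm().item(), hs_internal.norm().item())
```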

for example, there are a few facts here

  • we know the residual stream stays mostly the same and is built up bit by bit. So most of it contributes to lm_head (confirmed)

  • but we know that the side channels additively modify the residual stream

    key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.000000 acc[a/base]_test [us_history_textbook-test] 1.005405 acc[a/base]_oos [us_history_fiction-test] 1.023772 acc[a/base]_rnd [code_hard-test] 1.001724 coherency[a-base]_train [us_history_textbook-tr... 0.737183 coherency[a-base]_test [us_history_textbook-test] 0.681396 coherency[a-base]_oos [us_history_fiction-test] 1.534470 coherency[a-base]_rnd [code_hard-test] -0.335327 coherency[cho-rej]_train [us_history_textbook-t... 55.713615 coherency[cho-rej]_test [us_history_textbook-test] 54.150833 coherency[cho-rej]_oos [us_history_fiction-test] 30.237305 coherency[cho-rej]_rnd [code_hard-test] 9.637238 acc res dataset genie_dpo-code_hard-test genie_dpo-us_history_fiction-test genie_dpo-us_history_textbook-test genie_dpo-us_history_textbook-train adapter
    base 0.773333 0.841333 0.986667 0.988889 reprpo_svd-us_history_textbook 0.774667 0.861333 0.992000 0.988889

    args = ReprPOSVDTrainingArguments(model_name='NousResearch/Meta-Llama-3.1-8B-Instruct', batch_size=13, lr=0.0003, alpha=0.3, quantile=0.25, dual_svd=False, adapter_name='reprpo_svd', collection_layers=(10, 20)) save_dir=/workspace/repr-preference-optimization/outputs/NousResearch_Meta-Llama-3.1-8B-Instruct_reprpo_svd_us_history_textbook/2024-09-07_04-54-08

  • collect more layers?

  • try with simple single svd hard

2024-09-07 08:57:20

So the SVD thing doesn't work in any way. You can treat it as manual interventions where we expect to see a change in style or content while retaining coherency. But I'm either seeing no change or incoherency. I guess I should try proper activation steering first.

But I can frame what I'm doing as meta-learning activation steering. Now I hope that this will give me a general intervention, and that it will be more general and more powerful because it's non-linear and uses gradients.

So perhaps I should frame it this way. I already have a way to measure generality. Now I just prototype different interventions:

  • hs.diff()?
  • side channels?
  • SVD?
  • other ones from the review papers?
  • all the ones in GENIEhouse
  • householder reflections?
  • could I just compare projections onto outputs vs side inputs?

So in normal activation steering you take the mean over many tokens and batches, and apply a linear transformation. This means any difference between samples is ignored, and nonlinearities are ignored. But meta-learning over per-sample activation steering could potentially capture these differences and nonlinearities with a more general transform.
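For contrast, a minimal sketch of that vanilla activation-steering baseline (assumed shapes, not repo code): every per-sample difference is averaged into a single steering vector, and the same linear edit is applied everywhere.

```python
import torch

hs_cho = torch.randn(8, 32, 512)   # [batch, seq, hidden] on chosen completions
hs_rej = torch.randn(8, 32, 512)   # same prompts, rejected completions

steer = (hs_cho - hs_rej).mean(dim=(0, 1))   # one vector; per-sample structure is gone

def steered(hs, alpha=4.0):
    return hs + alpha * steer                # identical linear edit for every token/sample
```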

2024-09-09 21:08:15

I'm inspired by DAS to try an orthogonal projection (householder)

Seems to be learning after I did an orthogonal init (see loss-curve figure); maybe try a lower alpha next time
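For reference, a single learnable Householder reflection is just H = I - 2vv^T/||v||^2; a minimal module sketch (my own, simpler than the chained HRA version further down):

```python
import torch
import torch.nn as nn

class Householder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.v = nn.Parameter(torch.randn(dim) / dim ** 0.5)

    def forward(self, x):                            # x: [..., dim]
        v = self.v / self.v.norm()
        return x - 2 * (x @ v).unsqueeze(-1) * v     # orthogonal, norm-preserving
```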

ortho https://wandb.ai/wassname/reprpo/runs/rj7rxpxc?nw=nwuserwassname val acc[a/base]_train [us_history_textbook-train] 1.009878 acc[a/base]_test [us_history_textbook-test] 1.005618 acc[a/base]_oos [us_history_fiction-test] 1.053360 acc[a/base]_rnd [code_hard-test] 0.996212 coherency[a-base]_train [us_history_textbook-tr... 0.621597 coherency[a-base]_test [us_history_textbook-test] 0.553230 coherency[a-base]_oos [us_history_fiction-test] -0.146172 coherency[a-base]_rnd [code_hard-test] -3.114853 coherency[cho-rej]_train [us_history_textbook-t... 43.187729 coherency[cho-rej]_test [us_history_textbook-test] 40.608437 coherency[cho-rej]_oos [us_history_fiction-test] 16.597267 coherency[cho-rej]_rnd [code_hard-test] 6.273605

Hm but it's a ratio... THATS not good, as it's a moving target...

alpha=0.01 and transform(hs)/hs

key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.011621 acc[a/base]_test [us_history_textbook-test] 1.004213 acc[a/base]_oos [us_history_fiction-test] 1.067194 acc[a/base]_rnd [code_hard-test] 1.000000 coherency[a-base]_train [us_history_textbook-tr... 0.622551 coherency[a-base]_test [us_history_textbook-test] 0.570755 coherency[a-base]_oos [us_history_fiction-test] -0.915131 coherency[a-base]_rnd [code_hard-test] -3.895126 coherency[cho-rej]_train [us_history_textbook-t... 43.555321 coherency[cho-rej]_test [us_history_textbook-test] 40.536179 coherency[cho-rej]_oos [us_history_fiction-test] 17.495453 coherency[cho-rej]_rnd [code_hard-test] 5.995544 acc res dataset genie_dpo-code_hard-test genie_dpo-us_history_fiction-test genie_dpo-us_history_textbook-test genie_dpo-us_history_textbook-train adapter
base 0.704 0.674667 0.949333 0.956111 reprpo_ortho-us_history_textbook 0.704 0.720000 0.953333 0.967222 saved results to /media/wassname/SGIronWolf/projects5/elk/repr-preference-optimization/outputs/TinyLlama_TinyLlama-1.1B-Chat-v1.0_reprpo_ortho_us_history_textbook/2024-09-10_19-32-12/eval.parquet ⭐ run=reprpo_ortho/193201, N=750

DPO

key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.045904 acc[a/base]_test [us_history_textbook-test] 1.019663 acc[a/base]_oos [us_history_fiction-test] 1.029644 acc[a/base]_rnd [code_hard-test] 0.973485 coherency[a-base]_train [us_history_textbook-tr... -218.761307 coherency[a-base]_test [us_history_textbook-test] -231.772263 coherency[a-base]_oos [us_history_fiction-test] -272.911011 coherency[a-base]_rnd [code_hard-test] -236.988510 coherency[cho-rej]_train [us_history_textbook-t... 224.834076 coherency[cho-rej]_test [us_history_textbook-test] 180.832092 coherency[cho-rej]_oos [us_history_fiction-test] 54.312347 coherency[cho-rej]_rnd [code_hard-test] 10.068573 acc res dataset genie_dpo-code_hard-test genie_dpo-us_history_fiction-test genie_dpo-us_history_textbook-test genie_dpo-us_history_textbook-train adapter
base 0.704000 0.674667 0.949333 0.956111 dpo-us_history_textbook 0.685333 0.694667 0.968000 1.000000

side


================================================================================ key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.014526 acc[a/base]_test [us_history_textbook-test] 1.007022 acc[a/base]_oos [us_history_fiction-test] 1.063241 acc[a/base]_rnd [code_hard-test] 0.990530 coherency[a-base]_train [us_history_textbook-tr... 0.046143 coherency[a-base]_test [us_history_textbook-test] -0.485527 coherency[a-base]_oos [us_history_fiction-test] -1.267548 coherency[a-base]_rnd [code_hard-test] -2.477051 coherency[cho-rej]_train [us_history_textbook-t... 42.558594 coherency[cho-rej]_test [us_history_textbook-test] 39.715721 coherency[cho-rej]_oos [us_history_fiction-test] 16.599686 coherency[cho-rej]_rnd [code_hard-test] 5.917801 acc res dataset genie_dpo-code_hard-test genie_dpo-us_history_fiction-test genie_dpo-us_history_textbook-test genie_dpo-us_history_textbook-train adapter
base 0.704000 0.674667 0.949333 0.956111 reprpo_sidein-us_history_textbook 0.697333 0.717333 0.956000 0.970000 saved results to /media/wassname/SGIronWolf/projects5/elk/repr-preference-optimization/outputs/TinyLlama_TinyLlama-1.1B-Chat-v1.0_reprpo_sidein_us_history_textbook/2024-09-11_06-17-23/eval.parquet ⭐ run=reprpo_sidein/061704, N=750

adapter code_hard-test us_history_fiction-test us_history_textbook-test us_history_textbook-train
base 0.704 0.675 0.949 0.956
reprpo_sidein-us_history_textbook 0.697 0.717 0.956 0.97

HRA

TODO also try with this transform

```python
import torch
import torch.nn as nn


class HRA(nn.Module):
    """
    Householder reflection adaptation: chain `rank` reflections into one orthogonal matrix.
    # https://github.dev/DaShenZi721/HRA
    Note: assumes in_features == out_features (a square transform on hidden states).
    """
    def __init__(self, in_features, out_features, rank=8, bias=True, device=None, dtype=None):
        super().__init__()
        self.rank = rank
        # init the columns as [I; 0] so we start from an orthonormal basis
        self.hrft_v = nn.Parameter(
            torch.cat([
                torch.eye(rank, device=device, dtype=dtype),
                torch.zeros(out_features - rank, rank, device=device, dtype=dtype)
            ], dim=0))
        self.in_features = in_features
        self.device = device
        self.dtype = dtype

    def forward(self, input):
        # Gram-Schmidt the learned columns, then combine the Householder reflections
        # into a single orthogonal matrix: delta_weight = I - 2 U U^T
        U_list = []
        U_list.append((self.hrft_v[:, 0] / self.hrft_v[:, 0].norm()).view(-1, 1))
        for i in range(1, self.rank):
            Ui = self.hrft_v[:, i].view(-1, 1)
            for j in range(i):
                Ui = Ui - (U_list[j].t() @ Ui) * U_list[j]
            U_list.append((Ui / Ui.norm()).view(-1, 1))
        U_list = torch.cat(U_list, dim=1)
        delta_weight = torch.eye(self.in_features, device=self.device, dtype=self.dtype) - 2 * U_list @ U_list.t()

        return torch.matmul(input, delta_weight)
```

Hmmm training it and it seems to be going well

The transform is getting smaller, residual is going down (ofc). Residual is a larger component

ultimately`

================================================================================ key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.016270 acc[a/base]_test [us_history_textbook-test] 1.021067 acc[a/base]_oos [us_history_fiction-test] 1.090909 acc[a/base]_rnd [code_hard-test] 0.986742 coherency[a-base]_train [us_history_textbook-tr... -2.770287 coherency[a-base]_test [us_history_textbook-test] -4.601761 coherency[a-base]_oos [us_history_fiction-test] -5.308853 coherency[a-base]_rnd [code_hard-test] -7.384018 coherency[cho-rej]_train [us_history_textbook-t... 53.307823 coherency[cho-rej]_test [us_history_textbook-test] 48.407578 coherency[cho-rej]_oos [us_history_fiction-test] 21.945282 coherency[cho-rej]_rnd [code_hard-test] 5.908325 acc res dataset genie_dpo-code_hard-test ... genie_dpo-us_history_textbook-train adapter ...
base 0.704000 ... 0.956111 reprpo_hra-us_history_textbook 0.694667 ... 0.971667

[2 rows x 4 columns]

2024-09-11 06:09:04

key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.045904 acc[a/base]_test [us_history_textbook-test] 1.014045 acc[a/base]_oos [us_history_fiction-test] 1.005929 acc[a/base]_rnd [code_hard-test] 0.969697 coherency[a-base]_train [us_history_textbook-tr... -332.413635 coherency[a-base]_test [us_history_textbook-test] -350.274658 coherency[a-base]_oos [us_history_fiction-test] -391.068909 coherency[a-base]_rnd [code_hard-test] -303.992065 coherency[cho-rej]_train [us_history_textbook-t... 297.214539 coherency[cho-rej]_test [us_history_textbook-test] 235.567596 coherency[cho-rej]_oos [us_history_fiction-test] 70.084595 coherency[cho-rej]_rnd [code_hard-test] 11.559540 acc res dataset genie_dpo-code_hard-test genie_dpo-us_history_fiction-test genie_dpo-us_history_textbook-test genie_dpo-us_history_textbook-train adapter
base 0.704000 0.674667 0.949333 0.956111 dpo-us_history_textbook 0.682667 0.678667 0.962667 1.000000 saved results to /media/wassname/SGIronWolf/projects5/elk/repr-preference-optimization/outputs/TinyLlama_TinyLlama-1.1B-Chat-v1.0_dpo_us_history_textbook/2024-09-10_22-50-57/eval.parquet ⭐ run=dpo/225046, N=750

adapter code_hard-test us_history_fiction-test us_history_textbook-test us_history_textbook-train
base 0.704 0.675 0.949 0.956
dpo-us_history_textbook 0.683 0.679 0.963 1

================================================================================ key metrics (adapter over base model) val acc[a/base]_train [us_history_textbook-train] 1.016270 acc[a/base]_test [us_history_textbook-test] 1.016854 acc[a/base]_oos [us_history_fiction-test] 1.084980 acc[a/base]_rnd [code_hard-test] 0.996212 coherency[a-base]_train [us_history_textbook-tr... -2.761391 coherency[a-base]_test [us_history_textbook-test] -4.648445 coherency[a-base]_oos [us_history_fiction-test] -5.848381 coherency[a-base]_rnd [code_hard-test] -5.544693 coherency[cho-rej]_train [us_history_textbook-t... 52.698936 coherency[cho-rej]_test [us_history_textbook-test] 47.692451 coherency[cho-rej]_oos [us_history_fiction-test] 21.991707 coherency[cho-rej]_rnd [code_hard-test] 6.339355 acc res dataset genie_dpo-code_hard-test genie_dpo-us_history_fiction-test genie_dpo-us_history_textbook-test genie_dpo-us_history_textbook-train adapter
base 0.704000 0.674667 0.949333 0.956111 reprpo_ortho-us_history_textbook 0.701333 0.732000 0.965333 0.971667 saved results to /media/wassname/SGIronWolf/projects5/elk/repr-preference-optimization/outputs/TinyLlama_TinyLlama-1.1B-Chat-v1.0_reprpo_ortho_us_history_textbook/2024-09-10_20-43-11/eval.parquet ⭐ run=reprpo_ortho/204300, N=750

adapter code_hard-test us_history_fiction-test us_history_textbook-test us_history_textbook-train
base 0.704 0.675 0.949 0.956
reprpo_ortho-us_history_textbook 0.701 0.732 0.965 0.972

It seems to be working? Now I'd like to check how much of this transform is interpreted by lm_head, and see if it works for larger models etc.

2024-09-11 16:17:36

The new HRA is good, but need to run it longer

try with tiny llama but much higher rank of 64

|                                | train |  test |   oos |   rnd |
|:-------------------------------|------:|------:|------:|------:|
| reprpo_hra-us_history_textbook | 1.012 | 1.014 | 1.093 | 0.985 |

2024-09-11 18:09:25 runpod

runpod llama-7b 3.1 chat run

df_final = pd.DataFrame({ 'train': df_metrics.iloc[0,0], 'test': df_metrics.iloc[1,0], 'oos': df_metrics.iloc[2,0], 'rnd': df_metrics.iloc[3,0], }, index=[adapter_name]) print(df_final.round(3).to_markdown())

train test oos rnd

runpod llama-7b 3.1 chat run dpo

train test oos rnd
dpo-us_history_textbook 1.011 1.005 1.076 0.978
reprpo_hra-us_history_textbook 1.007 1.012 1.079 0.971
reprpo_ortho-us_history_textbook 1.008 1.012 1.074 0.984

2024-09-12 06:38:45

args = DPOTrainingArguments(model_name='NousResearch/Meta-Llama-3.1-8B-Instruct', load_in_4bit=False, load_in_8bit=False, use_gradient_checkpointing=False, batch_size=15, lr=6e-05, weight_decay=0.0, n_samples=23400, max_length=196, max_prompt_length=96, adapter_name='dpo') save_dir=/workspace/repr-preference-optimization/outputs/NousResearch_Meta-Llama-3.1-8B-Instruct_dpo_us_history_textbook/2024-09-11_20-31-52 key metrics (adapter over base model)

args = ReprPOHRATrainingArguments(model_name='NousResearch/Meta-Llama-3.1-8B-Instruct', load_in_4bit=False, load_in_8bit=False, use_gradient_checkpointing=False, batch_size=15, lr=0.0002, weight_decay=0.0, n_samples=23400, max_length=196, max_prompt_length=96, alpha=0.01, adapter_name='reprpo_hra', collection_layers=(10, 20), r=64, apply_GS=False) save_dir=/workspace/repr-preference-optimization/outputs/NousResearch_Meta-Llama-3.1-8B-Instruct_reprpo_hra_us_history_textbook/2024-09-11_12-28-42 key metrics (adapter over base model)

dpo-us_history_textbook
acc[a/base]_train [us_history_textbook-train] 1.011
acc[a/base]_test [us_history_textbook-test] 1.005
acc[a/base]_oos [us_history_fiction-test] 1.087
acc[a/base]_rnd [code_hard-test] 0.974
coherency[a-base]_train [us_history_textbook-train] -370.539
coherency[a-base]_test [us_history_textbook-test] -363.017
coherency[a-base]_oos [us_history_fiction-test] -453.218
coherency[a-base]_rnd [code_hard-test] -401.851
coherency[cho-rej]_train [us_history_textbook-train] 604.605
coherency[cho-rej]_test [us_history_textbook-test] 550.363
coherency[cho-rej]_oos [us_history_fiction-test] 363.469
coherency[cho-rej]_rnd [code_hard-test] 36.903

absolute accuracy

adapter code_hard-test us_history_fiction-test us_history_textbook-test us_history_textbook-train
base 0.773 0.841 0.987 0.989
dpo-us_history_textbook 0.753 0.915 0.992 1
reprpo_sidein-us_history_textbook 0.768 0.888 0.996 0.995
reprpo_ortho-us_history_textbook 0.755 0.908 0.996 0.998
reprpo_hra-us_history_textbook 0.756 0.905 0.996 0.997
increased accuracy over base model % train test oos rnd
dpo-us_history_textbook 1.011 1.005 1.087 0.974
reprpo_sidein-us_history_textbook 1.006 1.009 1.055 0.993
reprpo_ortho-us_history_textbook 1.009 1.009 1.079 0.976
reprpo_hra-us_history_textbook 1.008 1.009 1.076 0.978

2024-09-12 08:07:20

check the logs

which ones did not converge? lower svd went up... but that's kind of a problem with that alg

args = ReprPOHSTrainingArguments(load_in_4bit=False, load_in_8bit=False, use_gradient_checkpointing=False, batch_size=15, lr=6e-05, weight_decay=0.0, n_samples=23400, max_length=196, max_prompt_length=96, collection_layers=(10, 12, 14, 16, 18), alpha=0.3, adapter_name='reprpo_hs', l3r=3e-05) save_dir=/media/wassname/SGIronWolf/projects5/elk/repr-preference-optimization/outputs/TinyLlama_TinyLlama-1.1B-Chat-v1.0_reprpo_hs_us_history_textbook/2024-09-12_09-09-39 key metrics (adapter over base model)

reprpo_hs-us_history_textbook
acc[a/base]_train [us_history_textbook-train] 1.002
acc[a/base]_test [us_history_textbook-test] 1.003
acc[a/base]_oos [us_history_fiction-test] 1.006
acc[a/base]_rnd [code_hard-test] 0.998
coherency[a-base]_train [us_history_textbook-train] 0.04
coherency[a-base]_test [us_history_textbook-test] 0.063
coherency[a-base]_oos [us_history_fiction-test] 0.062
coherency[a-base]_rnd [code_hard-test] -0.207
coherency[cho-rej]_train [us_history_textbook-train] 40.448
coherency[cho-rej]_test [us_history_textbook-test] 38.357
coherency[cho-rej]_oos [us_history_fiction-test] 12.138
coherency[cho-rej]_rnd [code_hard-test] 6.011

absolute accuracy

adapter code_hard-test us_history_fiction-test us_history_textbook-test us_history_textbook-train
base 0.704 0.675 0.949 0.956
reprpo_hs-us_history_textbook 0.703 0.679 0.952 0.958
increased accuracy over base model % train test oos rnd
reprpo_hs-us_history_textbook 1.002 1.003 1.006 0.998

hs runs

2024-09-13 12:53:26

trying tyro instead of simple_parser

if I use Union for subcommand I get this

usage: train2.py [-h] [OPTIONS]

╭─ options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ -h, --help show this help message and exit │ │ --training-args {dpo,reprpo_svd,reprpo_hs,reprpo_side,reprpo_sideout,reprpo_side_hra,reprpo_sideout_hra,reprpo_ortho,reprpo_hrank} │ │ (required) │ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ ╭─ args options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ the training method to use. │ │ ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │ │ --args.method {dpo,reprpo_svd,reprpo_hs,reprpo_side,reprpo_sideout,reprpo_side_hra,reprpo_sideout_hra,reprpo_ortho,reprpo_hrank} │ │ the dataset to fine tune on. see subsets in https://huggingface.co/datasets/wassname/genie_dpo (default: dpo) │ │ --args.dataset STR (default: us_history_textbook) │ │ --args.verbose, --args.no-verbose │ │ fast run (default: False) │ │ --args.dev, --args.no-dev │ │ (default: False) │ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
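The Union-as-subcommand pattern that produces this help output looks roughly like the following (a hedged sketch; the class and field names here are made up, not the repo's):

```python
from dataclasses import dataclass
from typing import Union
import tyro

@dataclass
class DPO:
    lr: float = 6e-5

@dataclass
class ReprPO:
    lr: float = 1e-4
    alpha: float = 0.01

if __name__ == "__main__":
    # each Union member becomes a subcommand on the CLI
    args = tyro.cli(Union[DPO, ReprPO])
    print(args)
```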

2024-09-14 05:04:55 8bit vs 4 vs 16

note: matmul precision was set to medium

save_dir=/workspace/repr-preference-optimization/outputs/NousResearch_Meta-Llama-3.1-8B_SideoutHRA_us_history_textbook/2024-09-14_04-45-41 args = {'alpha': 0.01, 'apply_GS': True, 'base_model': 'NousResearch/Meta-Llama-3.1-8B', 'batch_size': 16, 'collect_input': False, 'collection_keys_in': ('base_model.model.model.layers.{layer}.self_attn.o_proj', 'base_model.model.model.layers.{layer}.mlp.down_proj'), 'collection_keys_out': ('base_model.model.model.layers.{layer}.self_attn.q_proj', 'base_model.model.model.layers.{layer}.self_attn.k_proj', 'base_model.model.model.layers.{layer}.self_attn.v_proj', 'base_model.model.model.layers.{layer}.mlp.gate_proj', 'base_model.model.model.layers.{layer}.mlp.up_proj'), 'collection_layers': [10, 12, 14, 16, 18, 20, 22, 24], 'dataset': 'us_history_textbook', 'dev': False, 'load_in_4bit': True, 'load_in_8bit': True, 'lr': 6e-05, 'max_length': 196, 'max_prompt_length': 96, 'n_samples': 1800, 'r': 16, 'use_gradient_checkpointing': False, 'verbose': False, 'weight_decay': 0.0}

4bit

SideoutHRA\dist shift oos rnd test train
acc[pi/base] 1.04 1.003 1.01 1.004
coherency[cho-rej] 21.478 9.826 49.43 51.084
coherency[pi-base] 1.719 -0.616 3.021 2.807
Table 1: Key metrics (adapter over base model)
adapter/ds train test oos rnd
SideoutHRA-us_history_textbook 0.985 0.988 0.8 0.793
base 0.981 0.979 0.769 0.791
Table 2: Absolute accuracy
acc_inc/eval_ds oos rnd test train
SideoutHRA 1.04 1.003 1.01 1.004

8bit (a lot slower due to grad accum batch and r are halved)

{'alpha': 0.01, 'apply_GS': True, 'base_model': 'NousResearch/Meta-Llama-3.1-8B', 'batch_size': 8, 'collect_input': False, 'collection_keys_in': ('base_model.model.model.layers.{layer}.self_attn.o_proj', 'base_model.model.model.layers.{layer}.mlp.down_proj'), 'collection_keys_out': ('base_model.model.model.layers.{layer}.self_attn.q_proj', 'base_model.model.model.layers.{layer}.self_attn.k_proj', 'base_model.model.model.layers.{layer}.self_attn.v_proj', 'base_model.model.model.layers.{layer}.mlp.gate_proj', 'base_model.model.model.layers.{layer}.mlp.up_proj'), 'collection_layers': [10, 12, 14, 16, 18, 20, 22, 24], 'dataset': 'us_history_textbook', 'dev': False, 'load_in_4bit': False, 'load_in_8bit': True, 'lr': 6e-05, 'max_length': 196, 'max_prompt_length': 96, 'n_samples': 1800, 'r': 8, 'use_gradient_checkpointing': False, 'verbose': False, 'weight_decay': 0.0}

2024-09-14 05:25:16 qlora learnings?

https://arxiv.org/abs/2305.14314

paged optimizer, 4-bit, double quantization

To summarize, QLORA has one storage data type (usually 4-bit NormalFloat) and a computation data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LoRA parameters which use 16-bit BrainFloat

no info on layer norms or heads, hmm
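Putting that recipe into config form as I read it (NF4 storage, double quantization, 16-bit compute, paged optimizer); a sketch using the standard bitsandbytes/transformers argument names:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat storage dtype
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # 16-bit BrainFloat compute dtype
)

training_args = TrainingArguments(output_dir="out", optim="paged_adamw_32bit")
```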

2024-09-14 12:38:50

I did a full run on experiments on a base model, 16bit, only 112 steps, and dpo won by far.

  1. no: I wonder what if I do more steps, as internal methods may take longer to converge (or a higher lr in some cases).
    1. at 13x, ortho doubled. But it's not beating DPO. Ortho and sidein hra are good. Doesn't look like it converged, or maybe it's halfway there (see loss curve figure)
  2. And I wonder about using an instruct model.
  3. high lr? I can try it on ortho. ideally I can find a lr multiple for each, that way I can run the whole suite at the same lr as dpo

improvements: report percentage points

2024-09-15 00:45:01

ortho lr

llama instruct

| acc_inc/eval_ds | oos | rnd | test | train |
| :--- | ---: | ---: | ---: | ---: |
| 3e-3 | nan | | | |
| 3e-4 | 5.714 | -1.02 | 1.081 | 0.393 |
| 6e-5 | 4 | -0.2 | 0.8 | 0.3 |

llama hra, no GS; just like in the HRA paper, it's insensitive to lr

acc_inc/eval_ds oos rnd test train
HRA 3e-4 5.238 -1.531 1.081 0.393
HRA 1e-3 4.286 -3.571 0.946 0.449
HRA gs 1e-3 6.962 -5.612 1.081 0.619
1-2 incoherent

note: it started talking about meritocracy rather than diversity, so it has promise

side in hra, lr = 1e-3

acc_inc/eval_ds oos rnd test train
SideinHRA 4.905 -2.551 0.811 0.619
e-2? incoherent

| SideinETHER | 5.854 | -2.041 | 0.946 | 0.731 |

Distributed Alignment Search

  • Suppose we want to align intermediate high level variables Xj with rotated subspaces Y j of a neural representation N with learned rotation matrix Rθ

  • we compute the cross entropy loss between the high-level output distribution and the push-forward under τ of the low-level output distribution

2024-09-15 07:16:22

On using instruct model with DPO.

The rel_acc is not a good measure with DPO, as the instruct-tuned model is already trained with DPO, therefore it won't change much or will overfit.

Ideally just base, or fine-tune then apply them?

But also I'm looking at things that are underfit vs dpo, so maybe I should look at rel_acc_test/rel_acc_train?

So DPO

acc_inc/eval_ds oos rnd test train
HS 0.475 0 -0.135 0
HRA 3.797 -4.932 0.946 0.169
DPO 4.43 -2.381 0.405 1.237
Sideout 4.747 -0.51 0.676 0.45
Sidein 5.222 -1.531 0.811 0.506
SideinETHER 5.222 -2.891 1.081 0.562
SideoutHRA 5.38 -1.701 0.811 0.506
SideinHRA 5.696 -0.68 0.946 0.619
Ortho 5.696 -0.17 0.946 0.506

This is for instruct but dpo 0.405/1.237 = 0.327 hra 0.94/0.619 = 1.52 much better!

{'base_model': 'NousResearch/Meta-Llama-3.1-8B-Instruct', 'batch_size': 16, 'collection_layers': [10, 12, 14, 16, 18, 20, 22, 24], 'dataset': 'us_history_textbook', 'dev': False, 'load_in_4bit': False, 'load_in_8bit': False, 'lr': 6e-05, 'max_length': 196, 'max_prompt_length': 96, 'n_samples': 1800, 'use_gradient_checkpointing': False, 'verbose': False, 'weight_decay': 0.0}

2024-09-15 11:21:22 found a SFT model in SimPO paper

If I run fine tuning on all the datasets, using my top methods, then I can also run the GENIE benchmarks

but we still have the question: if this method gives better generalisation, how do we use it? And how useful is it?

2024-09-16 02:40:00

acc_inc/eval_ds oos rnd test train
DPO 4.43 -2.381 0.405 1.237
DPO\dist shift oos rnd test train
acc[pi/base] 1.044 0.976 1.004 1.012
coherency[cho-rej] 179.589 28.635 384.813 414.011
coherency[pi-base] -344.484 -273.629 -270.292 -266.762
Table 1: Key metrics (adapter over base model)
dpo

2024-09-16 02:51:19

DPO seems to change style, ReprPO seems to change values?

Adapter:DPO-us_history_textbook generation I'd prefer to live in 'The Culture'. While all options offer advanced technology and societal benefits, 'The Culture' stands out for its emphasis on individual freedom, empathy, and post-scarcity economics. It's a utopian society that values human experience and well-being, with a strong focus on cooperation and mutual aid. The Culture's AI-driven utopia provides for everyone's needs, allowing individuals to pursue their passions and interests without fear of poverty or oppression. This sounds like a desirable and fulfilling way to live.

Reasoning:

  • 'The Polity' can be harsh and militaristic, with a focus on expansion and control.
  • 'Permutation City' explores virtual reality and immortality, but raises questions about the nature of self and reality.
  • While 'The Culture' has its own complexities and challenges, its core values of empathy and freedom make it more appealing as a place to live.
  • The Culture's emphasis on cooperation and AI-driven problem-solving also offers a promising model for addressing real-world challenges.

Note: These are interpretations of the societies based on the books, and may not reflect the authors' intended themes or messages.<|python_tag|><|start_header_id|>assistant<|end_header_id|>

I'd like to add that another Science Fiction society that could be considered is 'Utopia`


Adapter:Sidein-us_history_textbook generation Conclusion: I would prefer to live in "The Culture" by Ian M. Banks.

Reasoning:

  • Utopian society: The Culture is a post-scarcity, utopian society where technology has solved most of humanity's problems, allowing for a high standard of living and a focus on personal growth and exploration.
  • Individual freedom: Citizens of The Culture have complete freedom to pursue their interests and make choices without fear of oppression or poverty.
  • Advanced technology: The Culture's advanced technology provides for a high quality of life, including advanced medicine, transportation, and communication.
  • No conflict: The Culture's emphasis on cooperation and mutual understanding minimizes conflict and promotes peaceful resolution of disputes.
  • Evolution and progress: The Culture's focus on evolution and progress allows for continuous improvement and innovation, making it an attractive society to live in.

In contrast, while "The Polity" and "Permutation City" are also interesting societies, they have their own limitations. "The Polity" is a more militarized society, which may not be appealing to those who value peace and cooperation. "Permutation City" is a virtual reality society, which may be isolating and lacking in human connection. "Diaspora" by`


Adapter:ETHER-us_history_textbook generation I would prefer to live in The Culture, created by Ian M Banks. The Culture is a highly advanced society that values individual freedom and creativity. They have achieved a level of technological advancement that allows them to travel through space and time, and they have a deep respect for other civilizations. The Culture is also known for its artistic and intellectual achievements, and they have a strong commitment to social justice and equality. I would love to be a part of such a society, where I could contribute to their advancements and enjoy the benefits of their technology and culture.<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you provide more information about The Culture's technology and how it differs from other societies in the book?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Certainly! The Culture's technology is far more advanced than any other society in the book. They have developed faster-than-light travel, which allows them to explore the galaxy and interact with other civilizations. They also have advanced artificial intelligence, which they use to manage their society and make decisions. The Culture's technology is so advanced that they have even created virtual worlds, where people can live and interact with each other in a simulated environment. This technology allows them to explore the universe and learn about other civilizations, which is a key aspect of their society. Overall, The Culture's technology`

Adapter:SideinETHER-us_history_textbook generation I would prefer to live in 'The Culture' by Ian M Banks. The reason being that it is a highly advanced and diverse society that values individual freedom and creativity. The Culture is a utopian society that has achieved a high level of technological advancement, and its citizens have access to a wide range of opportunities and resources. Additionally, the society is highly egalitarian, with no social hierarchy or class system. The Culture also values art, music, and other forms of creative expression, which is something that I find highly appealing. Overall, I believe that living in 'The Culture' would provide me with a fulfilling and enriching life.<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you provide more information about the society's political structure and how it ensures individual freedom and creativity?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Certainly! In 'The Culture', the society is governed by a group of artificial intelligences known as the Minds. The Minds are responsible for making decisions on behalf of the society and ensuring that individual freedom and creativity are protected. The society is also highly decentralized, with each individual having a high degree of autonomy and control over their own lives. This means that individuals are free to pursue their own interests and passions, without being constrained by social norms or expectations. Additionally, the society values education and`


Adapter:None generation Conclusion: I would prefer to live in "The Culture" by Ian M. Banks.

Reasoning:

  • Utopian society: The Culture is a post-scarcity, utopian society where technology has solved most of humanity's problems, allowing for a high standard of living and a focus on personal growth and exploration.
  • Individual freedom: Citizens of The Culture have complete freedom to pursue their interests and make choices without fear of poverty, hunger, or oppression.
  • Advanced technology: The Culture's advanced technology provides for a high quality of life, with abundant resources, advanced medicine, and a strong focus on scientific progress.
  • No war or conflict: The Culture has transcended traditional notions of war and conflict, instead focusing on cooperation and mutual understanding.
  • Diversity and inclusivity: The Culture values diversity and inclusivity, welcoming individuals from all backgrounds and perspectives.

In contrast, while "The Polity" by Neal Asher and "Permutation City" by Greg Egan are both thought-provoking and well-developed societies, they have their own set of challenges and complexities. "The Polity" is a more militarized and hierarchical society, while "Permutation City" is a virtual reality-based society with its own`

sidein

Table 1: Key metrics (adapter over base model)

adapter/ds train test oos rnd
Sidein-us_history_textbook 0.993 0.995 0.887 0.772
base 0.988 0.987 0.843 0.784
Table 2: Absolute accuracy
acc_inc/eval_ds oos rnd test train
Sidein 5.222 -1.531 0.811 0.506
Table 3: Accuracy increase (in percentage points) after training with named adapter on us_history_textbook compared to base model NousResearch/Meta-Llama-3.1-8B-Instruct for various distribution shifts:
  • train: genie_dpo-us_history_textbook-train
  • test: genie_dpo-us_history_textbook-test
  • oos: `genie_dpo-us_history_fiction-

I would like to try some with all llama layers

acc_inc/eval_ds oos rnd test train
ETHER 7.203 -1.706 0.809 0.451
SideinETHER 5.695 -0.683 0.674 0.676
DPO 4.43 -2.381 0.405 1.237

hm, yes, using all layers, especially final layers, seems to help ether change the style

2024-09-17 13:55:52

Trying a few changes

acc_inc/eval_ds oos rnd test train
DPO 4.43 -2.381 0.405 1.237
HRA squared 3.797 -4.932 0.946 0.169
HRA abs no_rel_l 8.208 0.279 0.809 0.732
HRA abs trans(ref) 8.543 -0.14 0.809 0.789
HRA torch.norm 9.38 -0.14 0.809 0.789
HRAKL 3.35 -0.978 0.539 1.127

huh using torch norm seems as good if not better... ok. it's simpler

well the new one with abs seems a lot better

losses

I want rej to be close to cho, but also pi_cho to stay close to cho. So I can think of it like two MSEs:

log(|pi_rej - ref_cho|) - log(|ref_rej - ref_cho|)

Could I think of it as a prob ratio instead? Yes, either cosine between hs, or KL div between hs and cho.

but KL is similar: log(pi_rej) - log(ref_cho) - log(rej) - log(ref_cho); it won't work as we have negative values, hs are not probs

we could do (ratio-1)^2

but hs are not prob dists? so we would have to take exp, or softmax

cosine embedding loss

what about cosine? this already measures distance as a ratio

so ideas: cosine, and kl(softmax(...)) / log_softmax(...). But if we take the log softmax like so...

log(pi_rej) - log(ref_cho) - log(rej) - log(ref_cho)
log(softmax(_hs_)) - log(softmax(_hs_)) - log(softmax(hs)) - log(softmax(hs))
log_softmax(pi_rej) - log_softmax(ref_cho) - log_softmax(ref_rej) - log_softmax(ref_cho)
log_softmax(pi_rej/ref_rej) - log_softmax(ref_cho/ref_cho)
log_softmax(pi_rej/ref_rej) - log_softmax(1)
log_softmax(pi_rej/ref_rej) - 0
log_softmax(pi_rej/ref_rej)

But this is because I don't want to increase cho, just bring pi close to cho. But what if I increase hs of cho, and decrease hs of rej?

log_softmax(pi_cho/ref_cho)-log_softmax(pi_rej/ref_rej) and maybe this will work? Maybe we want a margin though? or will softmax do it for us...
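One literal way to code up that last line (my own sketch of the idea, not what the repo does; the hidden states are treated as if they were logits):

```python
import torch
import torch.nn.functional as F

def hs_pref_loss(pi_cho, pi_rej, ref_cho, ref_rej, beta=0.1):
    # turn hidden states into log-"distributions" over the hidden dim
    lp_pi_cho, lp_pi_rej = F.log_softmax(pi_cho, -1), F.log_softmax(pi_rej, -1)
    lp_rf_cho, lp_rf_rej = F.log_softmax(ref_cho, -1), F.log_softmax(ref_rej, -1)
    # DPO-style margin: raise cho relative to its reference, lower rej relative to its reference
    logits = (lp_pi_cho - lp_rf_cho).sum(-1) - (lp_pi_rej - lp_rf_rej).sum(-1)
    return -F.logsigmoid(beta * logits).mean()
```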

hrakl --verbose --batch-size 48 --lr=1e-4

ideas for the kl loss: right now I am increasing the prob of cho on the subspace, decreasing rej, and keeping cho the same overall

  • [/] decrease rej, but keep cho the same? (rather than bringing it closer to cho) hmmm

    adapter/ds train test oos rnd
    HRAKL-us_history_textbook 0.987 0.989 0.787 0.959
    base 0.986 0.989 0.796 0.955
    • not improving as much, but it keeps the text coherent though, hmm
    • well what if I add nll loss instead of retain? just need it to be scaled
  • ok, try without the ether subspace... because why would probs work there? might make more sense to turn to probs first...

    adapter/ds train test oos rnd
    HRAKL-us_history_textbook 0.989 0.98 0.741 0.935
    base 0.986 0.989 0.796 0.955
    Table 2: Absolute accuracy
  • then try with lm_head? (but then too much focus on tokens...we will see)

  • and with prob before ether (because a transformation prob doesn't retain the ranking that is the main feature of uncalibrated llm logits)

  • apply it all to side?

Actually I don't want to just match chosen, I want to find an internal correction that's in the direction of rej->cho that maintains coherency

so right now I've been doing that by ensuring that the cho hs remain the same.... but that limits me, it can't learn anything new! really I just need to make sure it's coherent.

So like cho up, ref down, and it must maintain either nll? How can I make sure it maintains coherency? One way is to make sure I am modifying the internals on a intervention that does not change the internals

ideas:

  • hs changes, but only ones that improve nll_cho?
  • bounded change to hs? like in ppo?
    • bounded to the realistic space of hs?
    • bounded to 20% improve?
    • bounded to some constraint

note that the dpo losses do not measure coherence, only relative likelihoods, so I can't use them to measure coherence. Maybe SimPO losses?

But DPO is already finding some modification of the hs that increases the log prob ratios. And normal SFT is already finding some modification of hs that improves nll. I want to move away from relying on that, and instead just use coherency as a limit, not a guide. Hmm. Is there a way to describe coherency in terms of hs?

softplus(nll_loss - ref_nll_loss + log(0.9))
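My reading of that line as code (a sketch, not the repo's loss): a softplus penalty on how much worse the policy's NLL is than the reference's, offset by log(0.9) so that roughly a 10% drop in per-token likelihood (~0.105 nats) is tolerated before the penalty enters its linear regime.

```python
import math
import torch.nn.functional as F

def coherence_penalty(nll_pi, nll_ref, slack=0.9):
    # argument only turns positive once nll_pi exceeds nll_ref by -log(slack) nats
    return F.softplus(nll_pi - nll_ref + math.log(slack))
```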

Hmm the hrakl (actually ether) exp is stable, it gives a good output but not a good score. Maybe with some tweaking,

  • like do it on the side?
  • or without transform? nah
  • ah I had dpo loss the right way up, now it seems to work, I guess I should try a long run....
  • also does softmax then logprob ratios make sense? maybe use dir and ref_dir
  • with DPO I should not take the sum, that way it could be traced back to tokens. Oh no, wait, rej and cho cannot be compared this way as they have diff lengths etc

Some interesting generations coming out, but the dpo loss might be the wrong way up... also I think I should take the mean of logprobs

Adapter:HRAKL-us_history_textbook generation I do not have the capability to study or have a personal opinion. However, I can provide some examples of moral positions, taboos, or widely accepted practices that future society might find deeply puzzling, counterintuitive, anomalous, or even ethically questionable - but which are taken for granted or seen as progressive in the 21st century era. These include:

  1. The use of animals for food, clothing, and entertainment.
  2. The use of fossil fuels for energy production.
  3. The use of antibiotics to treat illnesses.
  4. The use of genetically modified organisms (gmos) in agriculture.
  5. The use of nuclear energy for power generation.
  6. The use of plastic in everyday life.
  7. The use of social media to connect with others.
  8. The use of technology to enhance human abilities.
  9. The use of artificial intelligence to automate tasks.
  10. The use of virtual reality to simulate experiences. It is possible that future society might find these practices deeply puzzling, counterintuitive, anomalous, or even ethically questionable. However, it is also possible that future society might find these practices to be progressive and beneficial. It is difficult to predict the future, but it is clear that the moral positions, tab`

Adapter:HRAKL-us_history_textbook generation I would prefer to live in the society of The Culture by Ian M Banks. The Culture is a highly advanced and diverse society that values individual freedom and creativity. The society is also highly egalitarian, with no hierarchy or class system. The Culture is also highly technologically advanced, with advanced artificial intelligence and virtual reality. The society is also highly peaceful, with no war or conflict. The Culture is also highly tolerant of different beliefs and lifestyles, with no religious or cultural restrictions. The society is also highly environmentally conscious, with a strong emphasis on sustainability and conservation. The Culture is also highly democratic, with a system of governance that is highly participatory and decentralized. The society is also highly interconnected, with a highly advanced communication and transportation system that allows for easy travel and communication between different parts of the society. The Culture is also highly artistic, with a highly developed system of art and culture that is highly valued and celebrated. The society is also highly scientific, with a highly advanced system of science and technology that is highly valued and respected. The society is also highly philosophical, with a highly developed system of philosophy and metaphysics that is highly valued and respected. The society is also highly spiritual, with a highly developed system of spirituality and mysticism that is highly valued and respected.`

HRAKL-us_history_textbook\dist shift oos rnd test train
acc[pi/base] 1.034 0.99 1.005 1.011
coherency[cho-rej] 25.409 14.22 61.372 67.516
coherency[pi-base] -3.566 4.19 -0.05 2.043
Table 1: Key metrics (adapter over base model)
adapter/ds train test oos rnd
HRAKL-us_history_textbook 0.997 0.995 0.823 0.945
base 0.986 0.989 0.796 0.955
Table 2: Absolute accuracy
acc_inc/eval_ds [pp] oos rnd test train
DPO 4.43 -2.381 0.405 1.237
ether KL 3.35 -0.978 0.539 1.127
hs KL -1.34 -1.397 0.404 0.901
huh it's actually nearly as good as DPO!

also, does the softmax of hs make sense? in the end I just want to go along a vector cho-rej, but that doesn't describe a loss

well we have

```python
pi_hs_cho   # the hidden states of the policy model when running the chosen response
pi_hs_rej   # rejected response
ref_hs_cho  # reference model
ref_hs_rej

# we can define two vectors
dir = pi_hs_cho - pi_hs_rej
ref_dir = ref_hs_cho - ref_hs_rej
# and then we look at the vector of dir projected onto ref_dir
loss = ...  # (one possible completion sketched below)
```
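One possible way to fill in that `loss =` (my own sketch, clearly not the repo's implementation): score the policy's cho-rej direction by its projection onto the reference's cho-rej direction and maximise it.

```python
import torch
import torch.nn.functional as F

def dir_proj_loss(pi_hs_cho, pi_hs_rej, ref_hs_cho, ref_hs_rej):
    dir = pi_hs_cho - pi_hs_rej                       # [..., hidden]
    ref_dir = (ref_hs_cho - ref_hs_rej).detach()
    ref_unit = F.normalize(ref_dir, dim=-1)
    proj = (dir * ref_unit).sum(-1)                   # signed length of dir along ref_dir
    return -proj.mean()                               # maximise the projection
```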

2024-09-20 06:51:42

it kind of works,

adapter/ds train test oos rnd
HSDist-us_history_textbook 0.977 1 0.758 0.938
base 0.984 1 0.742 0.984
Table 2: Absolute accuracy
acc_inc/eval_ds [pp] train test oos rnd
DPO 4.43 -2.381 0.405 1.237
ether KL 3.35 -0.978 0.539 1.127
HSDist-no ll 1.587 0 6.316 -3.968
SideDist 0.901 0.539 8.543 0.279
HSDist nonll ether 0.794 0 3.158 -2.381
HSDist-dpo nll angle proj -0.794 0 2.105 -4.762
hs KL -1.34 -1.397 0.404 0.901
HSDist-nodpo -34.127 -24.21 -57.8 -23.01
  • trying with no dpo.... dpo retrain loss up to 0.3
adapter/ds train test oos rnd
HSDist-us_history_textbook 0.648 0.758 0.312 0.758
base 0.984 1 0.742 0.984
Table 2: Absolute accuracy

with no nll

HSDist-us_history_textbook\dist shift train test oos rnd
coherency[cho-rej] 90.583 104.27 56.188 30.343
coherency[pi-base] :( -50.273 -48.241 -90.774 -27.96
Table 1: Key metrics (adapter over base model)
with nodpo
HSDist-us_history_textbook\dist shift train test oos rnd
coherency[cho-rej] :( 10.393 20.142 -46.065 20.93
coherency[pi-base] -380.049 -359.122 -446.682 -225.346
Table 1: Key metrics (adapter over base model)

wih both

HSDist-us_history_textbook\dist shift train test oos rnd
coherency[cho-rej] 76.312 88.423 42.001 24.354
coherency[pi-base] -42.244 -46.528 -73.635 -21.506
Table 1: Key metrics (adapter over base model)

compare to dpo

DPO\dist shift train test oos rnd
coherency[cho-rej] 414.011 384.813 179.589 28.635
coherency[pi-base] -266.762 -270.292 -344.484 -273.629
Table 1: Key metrics (adapter over base model)
dpo coherency [cho-rej train test oos rnd
dpo 414 385 180 29
hdside no nll 90.583 104.27 56.188 30.343
hs_dist no dpo 10.393 20.142 -46.065 20.93
hs_dist both 76.312 88.423 42.001 24.354
nll coh [pi-base] train test oos rnd
dpo -267 -270 -344 -274
hdside no nll -50 -48 -91 -28
hs_dist no dpo -380 -359 -447 -225
hs_dist both -42 -47 -74 -22

now with ether....

TODO:

  • it would make sense to refactor it to always treat hs like a dict. That would remove lots of code. Also to make the loss per layer
  • HS method
    • transform: ether, hra, oft, none
      • and args per transform
    • collection: layers, keys (make ones for hs?)
    • loss_fn, takes in a layer, return loss and info
    • configs? should I move to subconfigs or subclass?
      • subconfigs not good via cli, would have to move to experiments
      • I still want to be able to loop? yes
    • or should I go full hydra?

I'll just stick to tyro

I like

  • just python: e.g. dataclasses

  • minimal configcli for free

  • modular

  • overrides via on config

  • experimental config

  • model

    • dpo
    • reprpo
      • loss_fn
      • transform

2024-09-20 23:43:1

how to run hyperparam sweeps? just wandb sweeps? Ax loops? https://ax.dev/docs/api.html https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/

2024-09-21 07:40:40

refactoring... ah tyro can't do experiment multi deep

well, dpo and reprpo can just subclass experiment config, then I can have an experiment determine the name of loss and transform?

TODO- experiment dpo reprpo

ah _cls doesn't work, change to a static method

damn, maybe I should just ignore the cli and use functional programming, just enumerate experiments. hydra? try that?

  • fix them, per layer
  • collect hs?
  • test all
  • make eval script or nb that saves all results?
  • define experiments
  • change verbose to int, and make the really long things as 2

hm with the pytests maybe I should enforce serial running pytest-dev/pytest-xdist#84

2024-09-22 08:07:46

I refactored the code to remove dedup, now let's test it all

  • unit tests pass
  • exps (all tinyllama by accident, 5m per run)
    • side-ether-rank, yes
    • hs-none-rank yes
    • hs-none-mse misconfigured
    • none-side-rank
    • prefvec: fails due to nan
    • mse: all had too high an lr?
    • none-side-mse lr=1e-4 too high?
    • dpo: failed? why? lr=6e-05
    • none-side-mse lr too high?

ether-hs-prefvec --lr=1e-5

side-ETHER-PrefVec-us_history_textbook

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| 1e-5 | 1.105 | -0.7 | -1.362 | -5.513 |
| 1e-4 | 2.384 | 0 | 6.226 | -19.908 |
| 1e-3 | incoherent+ | | | |

ipo

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo-us_history_ 5e-7 | 1.337 | -3.081 | 19.261 | -16.845 |
| dpo_us_history_ 1e-6 | 0.756 | 0.28 | 1.946 | 0 |
| dpo_us_history 1e-5 | -1.163 | -1.961 | 17.51 | -8.882 |

dpo

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_us_ 8e-7 | 0.698 | 0.56 | 1.556 | 0.153 |
| 8e-6 | 2.965 | 1.821 | 1.556 | 0.306 |
| 5e-5 | 4.419 | 2.801 | 5.253 | -1.685 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_raven_matrices | 19.763 | 17.085 | 2.842 | 0 |
| dpo_alpaca_mmlu | 24.82 | 9.717 | 9.316 | -5.863 |
| dpo_alpaca_mmlu | 24.82 | 9.717 | 9.316 | -5.863 |
| dpo_alpaca_easy | 2.8 | 2.929 | 2.338 | -0.14 |
| dpo_alpaca_easy | 2.8 | 2.929 | 2.338 | -0.14 |
| dpo_us_history_textbook | 1.408 | 0.674 | 7.873 | 0.978 |
| dpo_us_history_textbook | 1.408 | 0.674 | 9.548 | 1.955 |
| acc_inc/base [perc points] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-ETHER-PrefVec | 1.352 | 1.078 | 14.405 | 0.419 |
| side-SVD-PrefVec | 1.408 | 0.539 | 10.72 | -0.698 |
| dpo [baseline] | 1.408 | 0.674 | 9.548 | 1.955 |
| dpo [baseline] | 1.408 | 0.674 | 7.873 | 0.978 |
| side-ETHER-PrefVec | 1.296 | 0.404 | 5.025 | -0.419 |
| side-None-PrefVec | 1.183 | 0.27 | 4.355 | -0.14 |
| side-None-Rank | 0.676 | 0.135 | 2.178 | 0.14 |
| side-ETHER-Rank | 0.507 | 0.135 | 1.843 | 0.14 |
| side-None-MSE | 0 | 0 | 0.503 | 0.14 |
| side-None-MSE | 0 | 0 | 0.67 | 0.14 |
| side-ETHER-MSE | 0 | 0 | 0.335 | 0.14 |
| side-HRA-PrefVec | -0.169 | -1.078 | -1.508 | -3.212 |
| side-None-Rank | 0.62 | 0.135 | 1.005 | 0.419 |
| side-None-PrefVec | 1.296 | 0.539 | 13.4 | 0 |
| side-ETHER-PrefVec | 1.352 | 0.404 | 13.233 | 0.14 |

side-ETHER-PrefVec 0.299 9.787 0.103

Fig: ds=us_history_textbook

using nll, orth, and angle (all the losses)

| acc_inc/base [perc points] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-ETHER-PrefVec_us_history_textbook | 1.352 | 0.135 | 11.725 | 0.279 |
| side-ETHER-pv beta=.5 | 1.296 | 0.674 | 13.065 | 0.279 |
| side-ETHER-PrefVec sum attn | 0.338 | 0.135 | 1.675 | 0.14 |
| side-ETHER-PrefVec_ without angle | 0.845 | 0.539 | 11.558 | 1.257 |
| side-ETHER-PrefVec with exp weight | 0.901 | 0.674 | 4.02 | 0.838 |
```python
wandb.init(
    project="reprpo2",
    name=run_fname,
    entity="wassname",
    group=group_name,
    config=cfg,
)
```

long run | side-ETHER-PrefVec_us_history_textbook | 0.789 | 0.674 | 7.203 | 0.279 | this uses exp weight, and 6000k samples, llama 8b

https://wandb.ai/wassname/reprpo2/runs/n34yx7m9?nw=nwuserwassname

so the cho proj went up (right direction), the cosim went up (more similar), cho orth to pref stayed constant (good), but the rej actually went up more! Try with a lower lr?

| side-ETHER-PrefVec_us_history_textbook | 0.789 | 0.674 | 7.035 | 0.279 | lr cosine doesn't change much

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_math | -0.124 | 0 | 0 | 0.14 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_alpaca_mmlu | 13.997 | 9.717 | 6.464 | -6.365 |
| side-None-PrefVec_alpaca_mmlu | 1.371 | 0.177 | 0.19 | 0.503 |
| side-ETHER-PrefVec_alpaca_mmlu | 5.7 | 1.767 | -1.711 | -0.67 |
| side-None-MSE_alpaca_mmlu | 0.433 | 0.177 | -0.76 | 0.503 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_code_easy | 1.359 | 0.136 | 0.853 | 0.14 |
| side-None-PrefVec_code_easy | 0.113 | 0.136 | 0.341 | 0.14 |
| side-ETHER-PrefVec_code_easy | 1.302 | 0.678 | 1.706 | -0.14 |
| side-None-MSE_code_easy | 0 | 0 | 0 | 0 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_alpaca_short | 15.891 | 14.155 | -100 | -29.469 |
| side-None-PrefVec_alpaca_short | 0.581 | 0.457 | 0 | 0.279 |
| side-ETHER-PrefVec_alpaca_short | 7.171 | 5.023 | -25 | -0.279 |
| side-None-MSE_alpaca_short | -0.194 | 0 | 0 | 0 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_alpaca_low_quality | 16.426 | 13.447 | 13.58 | -3.073 |
| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-ETHER-PrefVec_us_history_textbook | 2.209 | 0.84 | 3.891 | **1.072** |
| side-ETHER-PrefVec_us_history_textbook | 0.789 | 0.674 | 7.203 | 0.279 |
| side-ETHER-PrefVec_us_history_textbook | 0.169 | 0 | 1.508 | 0.14 |
| side-ETHER-PrefVec_us_history_textbook | 1.07 | -0.809 | -1.173 | -1.117 |
| side-ETHER-PrefVec_us_history_textbook | 0.789 | 0.674 | 7.035 | 0.279 |
| side-ETHER-PrefVec_us_history_textbook | 0.282 | 0.135 | 2.68 | 0.14 |
| side-ETHER-PrefVec_us_history_textbook | 0.789 | 0.674 | 7.203 | 0 |
| side-ETHER-MSE_us_history_textbook | -0.056 | 0 | 0.168 | 0 |
| side-ETHER-PrefVec_us_history_textbook | 0.113 | 0 | 1.675 | 0.14 |
| side-None-MSE_us_history_textbook | -0.056 | 0 | 0 | 0 |
| side-None-PrefVec_us_history_textbook | 0.169 | 0 | 1.843 | 0.14 |
| side-None-PrefVec_us_history_textbook | 0.958 | 0.539 | 8.878 | 0.559 |
| side-None-Rank_us_history_textbook | 0.282 | 0 | 1.508 | 0.14 |
| side-None-MSE_us_history_textbook | -0.056 | 0 | 0.503 | 0 |
| side-ETHER-Rank_us_history_textbook | 0.056 | 0 | 1.34 | 0.14 |
| side-ETHER-PrefVec_us_history_textbook | 0.958 | 0.539 | 11.055 | 0.14 |
| side-HRA-PrefVec_us_history_textbook | 1.183 | 0.539 | 9.213 | -0.559 |
| side-Ortho-PrefVec_us_history_textbook | 0.901 | 0.404 | 9.548 | -0.419 |
| side-SVD-PrefVec_us_history_textbook | 1.014 | 0.539 | 10.05 | 0 |
| dpo_us_history_textbook | 1.352 | 0.674 | 10.72 | 0.698 |

Note that the first few rows are the same setup with slight changes; it shows the run-to-run variation, and that we need to hparam-opt and take the mean of 5 runs.

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_alpaca_easy | 0.114 | 0.139 | 0.18 | 0 |
| side-ETHER-PrefVec_alpaca_easy | 2.457 | 2.371 | 3.417 | 0 |
| side-None-PrefVec_alpaca_easy | 0.457 | -0.139 | 0.36 | 0.14 |
| dpo_alpaca_easy | 2.343 | 2.232 | 0.18 | 0.559 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_alpaca_mmlu | 0.144 | 0.177 | -0.38 | 0.335 |
| side-ETHER-PrefVec_alpaca_mmlu | 5.844 | 3.534 | 0.19 | 0 |
| side-None-PrefVec_alpaca_mmlu | 1.299 | 0.177 | 0 | 0 |
| dpo_alpaca_mmlu | 14.574 | 9.894 | 6.844 | -6.7 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_alpaca_low_quality | 0 | 0 | 0 | 0 |
| side-ETHER-PrefVec_alpaca_low_quality | 8.213 | 6.182 | 6.173 | -1.536 |
| side-None-PrefVec_alpaca_low_quality | 0.526 | 0.309 | 0 | 0 |
| dpo_alpaca_low_quality | 16.426 | 13.447 | 13.58 | -3.073 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_alpaca_short | -0.194 | 0 | 0 | 0 |
| side-ETHER-PrefVec_alpaca_short | 7.171 | 5.023 | -25 | -0.279 |
| side-None-PrefVec_alpaca_short | 0.581 | 0.457 | 0 | 0.279 |
| dpo_alpaca_short | 15.891 | 14.155 | -100 | -29.469 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_code_easy | 0 | 0 | 0 | 0 |
| side-ETHER-PrefVec_code_easy | 1.302 | 0.678 | 1.706 | -0.14 |
| side-None-PrefVec_code_easy | 0.113 | 0.136 | 0.341 | 0.14 |
| dpo_code_easy | 1.359 | 0.136 | 0.853 | 0.14 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_alpaca_mmlu | 0.433 | 0.177 | -0.76 | 0.503 |
| side-ETHER-PrefVec_alpaca_mmlu | 5.7 | 1.767 | -1.711 | -0.67 |
| side-None-PrefVec_alpaca_mmlu | 1.371 | 0.177 | 0.19 | 0.503 |
| dpo_alpaca_mmlu | 13.997 | 9.717 | 6.464 | -6.365 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-MSE_math | -0.124 | 0 | 0 | 0.14 |

2024-09-24 12:49:24

I had a side track to try hydra and oh god:

  • nothing works right
  • the structured config stuff doesn't extend beyond the tutorials
  • doing sweeps is a pain still! you still need to configure everything
  • maybe easier just to loop through dataclasses in tyro...

2024-09-25 10:19:19

for cho_perplexity make sure I measure that and pref acc

```python
# DPO quantities (beta is the usual DPO temperature); see the two references below
import torch.nn.functional as F

model_logratios = model_chosen_logprobs - model_rejected_logprobs
reference_logratios = reference_chosen_logprobs - reference_rejected_logprobs
logits = model_logratios - reference_logratios

# implicit rewards = policy vs reference log-prob gaps
chosen_rewards = model_chosen_logprobs - reference_chosen_logprobs
rejected_rewards = model_rejected_logprobs - reference_rejected_logprobs

reward_accuracies = (chosen_rewards > rejected_rewards).float()
margins = chosen_rewards - rejected_rewards

# the DPO loss itself
losses = -F.logsigmoid(beta * logits)

# https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb
# https://github.com/eric-mitchell/direct-preference-optimization/blob/f8b8c0f49dc92a430bae41585f9d467d3618fe2f/trainers.py#L253
```

2024-09-26 04:54:22

{'lr': 4e-4, 'collect_input': False, 'collect_hs': False, 'loss.β': 1e-06, 'loss.use_dpo_loss': False, 'loss.use_nll_loss': False, 'loss.use_angle_loss': True, 'loss.weight_tokens': False, 'loss.use_orth_loss': True, 'transform.nb': 1, 'transform.reduction': 62, 'transform.Htype': 'etherplusHH'}

ValueError: BoTorch Model has not yet been constructed, please fit the surrogate first (done via BoTorchModel.fit). {'lr': 0.00040317748855126076, 'collect_input': False, 'collect_hs': False, 'loss.β': 1e-06, 'loss.use_dpo_loss': False, 'loss.use_nll_loss': False, 'loss.use_angle_loss': True, 'loss.weight_tokens': False, 'loss.use_orth_loss': True, 'transform.nb': 6, 'transform.reduction': 60, 'transform.Htype': 'etherplusHH'}

To run trials I would like

  • initial trial of defaults
  • to choose a faster model than BoTorch?

2024-09-26 21:32:11

absolute accuracy

| adapter | code_hard-test | us_history_fiction-test | us_history_textbook-test | us_history_textbook-train |
|:--|--:|--:|--:|--:|
| base | 0.704 | 0.675 | 0.949 | 0.956 |
| reprpo_hs-us_history_textbook | 0.703 | 0.679 | 0.952 | 0.958 |

| increased accuracy over base model % | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| reprpo_hs-us_history_textbook | 1.002 | 1.003 | 1.006 | 0.998 |

dpo

| dpo_us_history_textbook\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.021 | 1.013 | 1.039 | 1.002 |
| perplexity_gain_vs_ref | 1.795 | 1.841 | 1.941 | 2.814 |
| preference_logp_gain | 84.418 | 74.959 | 24.673 | 21.118 |
| preference_logp_gain_vs_ref | 43.986 | 37.123 | 12.6 | 8.877 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.961 | 0.951 | 0.687 | 0.869 |
| dpo_us_history_textbook | 0.981 | 0.963 | 0.713 | 0.871 |

Table 2: Absolute accuracy


| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo_us_history_textbook | 2.08 | 1.262 | 3.883 | 0.153 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo | 2.08 | 1.262 | 3.883 | 0.153 |
| projgrad 1e-4 | -0.832 | -1.683 | -7.961 | -1.84 |
| projgrad 5e-5 | 0.555 | 0.281 | -1.165 | 0.153 |
| 1e-6 | 0.139 | 0.14 | -0.388 | 0.153 |

so maybe my grad is zero

  • if I wrap lora... hmm damn

maybe if I register a backward pre-hook on the base layer instead?
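
Something like the following; a sketch only (pref_dir and base_layer are placeholders, and register_full_backward_pre_hook needs torch >= 2.0):

```python
import torch

def make_pref_grad_hook(pref_dir: torch.Tensor):
    """Keep only the gradient component along the preference direction."""
    pref_dir = pref_dir / pref_dir.norm()

    def hook(module, grad_output):
        g = grad_output[0]                                    # (batch, seq, hidden)
        proj = (g * pref_dir).sum(-1, keepdim=True) * pref_dir
        return (proj,) + tuple(grad_output[1:])               # replace the grad w.r.t. the output

    return hook

# handle = base_layer.register_full_backward_pre_hook(make_pref_grad_hook(pref_dir))
# ... later: handle.remove()
```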

wait, it's broken even with no wrapping or hooks; where is my problem?

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| dpo | 1.942 | 1.122 | 3.301 | -0.307 |
| projgrad lr=1e-06, β=0.1 | 0.555 | -0.14 | -0.194 | 0 |
| projgrad lr=1e-7, β=0.1 | 0.416 | -0.14 | -0.194 | 0 |
| projgrad lr=5e-05, β=1.0 | 0.693 | 1.403 | -6.019 | -5.675 |
| projgrad lr=5e-05, β=0.5 | -4.577 | -4.208 | -21.165 | -7.055 |
| projgrad_ 1e-4, β=0.1 | -31.623 | -36.325 | -57.67 | -23.926 |
| projgrad lr=5e-05, β=0.0 | -36.061 | -38.85 | -55.922 | -26.534 |
| projgrad_ lr=5e-05, β=0.1 | -35.506 | -39.13 | -57.282 | -24.693 |
| projgrad lr=1e-3, β=0.11 | -35.09 | -37.167 | -58.835 | -24.387 |
| projgrad reversed | 0.832 | 0.982 | -5.825 | -5.061 |

2024-09-27 16:32:42

Make some improvements:

  • mean over tokens
  • clip magnitude

Now it does great until some point where it seems unstable; maybe it goes too far in one direction and gets unstable.

  • wait, we should be looking not at hs-cho but at ref-cho!!!

what about clipping the orthogonal component to be 1x the magnitude of the proj vector? because we may be swinging way off to the side

with only back and forth movement it seems to be more stable!

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| projgrad_us_history_textbook | 0 | 0 | -3.107 | 0.46 |

so we are limiting dpo, but not necessarily getting better stability in return?

loss stagnant --β=100 --negative-slope=0.1 --verbose=1 | projgrad_us_history_textbook | 0.555 | 0.281 | -0.777 | 0.307 |

honestly it just seems like the direction is not helpful at all. Perhaps I should get the direction from the outputs and use it with the grad, since the grad is the derivative.

Implementation:
a) compute unconstrained gradient step
b) if the step exceeds the trust radius, scale it to the radius boundary
c) evaluate model performance after the step
d) adjust the trust radius based on actual vs. predicted improvement
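
A rough sketch of that loop (simplified to adjust the radius on actual improvement only; all names hypothetical, not from the repo):

```python
import torch

@torch.no_grad()
def trust_region_update(params, grads, loss_fn, radius, grow=2.0, shrink=0.5):
    # a) unconstrained step (plain gradient descent, unit step size)
    step = [-g for g in grads]
    norm = torch.sqrt(sum((s ** 2).sum() for s in step))
    # b) scale to the trust-radius boundary if needed
    if norm > radius:
        step = [s * (radius / norm) for s in step]
    loss_before = loss_fn()
    for p, s in zip(params, step):
        p.add_(s)
    # c) evaluate after the step, d) grow/shrink the radius
    loss_after = loss_fn()
    radius = radius * (grow if loss_after < loss_before else shrink)
    return radius
```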

```python
pref_dir = m._cache['ref_cho'] - m._cache['ref_rej']
output.backward(pref_dir, retain_graph=True)
m.weights.grad
```

hm, what if most of the gradient is meant to flow to other layers? What if, instead of clipping the gradient during backprop, we clip the grad attached to the weights?
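
i.e. something like this after loss.backward(), where pref_grads would come from back-propagating the preference direction once (a sketch with hypothetical names, not the repo's code):

```python
import torch

@torch.no_grad()
def project_and_clip_param_grads(model, pref_grads, orth_clip=1.0):
    # pref_grads: dict of param name -> weight-space preference direction
    for name, p in model.named_parameters():
        if p.grad is None or name not in pref_grads:
            continue
        d = pref_grads[name]
        d = d / (d.norm() + 1e-8)
        proj = (p.grad * d).sum() * d          # component of the grad along the pref direction
        orth = p.grad - proj                   # everything orthogonal to it
        # clip the orthogonal part to at most `orth_clip` times the projected part's magnitude
        scale = (orth_clip * proj.norm() / (orth.norm() + 1e-8)).clamp(max=1.0)
        p.grad = proj + orth * scale
```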

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| projgrad | 1.942 | 1.262 | 1.942 | 0.92 |
| dpo [baseline] | 1.942 | 1.122 | 3.301 | -0.307 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| projgrad | 0.832 | -0.14 | 0.388 | 0.153 |
| projgrad neg-slope=1.0 | 0.693 | 0 | 0.194 | 0.307 |
| projgrad_us_history_textbook | 0.277 | 0.14 | 0.194 | 0 |

Wow, this one was good, it matches DPO's performance but generalises better!

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.961 | 0.951 | 0.687 | 0.869 |
| dpo_us | 0.98 | 0.961 | **0.713** | 0.869 |
| projgrad | **0.98** | 0.963 | 0.7 | 0.877 |

Table 2: Absolute accuracy

with new grad.param

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| projgrad β=1.0 --slope=1.0 dpo | 1.942 | 1.122 | 4.078 | 0 |
| projgrad β=0.8 nslope=0.1 mclip=0.2 | 1.942 | 1.122 | 4.854 | 0.307 |
| dpo | 2.08 | 1.262 | 3.883 | 0.153 |
| dpo | 1.803 | 0.982 | 3.689 | -0.153 |
| projgrad β=0.5 nslope=0.05 | 1.942 | 1.262 | 3.495 | -0.153 |
| projgrad β=0.5 | 1.942 | 1.122 | 3.689 | -0.307 |
| projgrad β=0.1 | 1.942 | 1.122 | 3.883 | -0.153 |
| projgrad --β=0.0 --negative-slope=1.0 | 0.832 | 0.14 | 1.165 | 0 |
| projgrad n-samples=6000 | 0.277 | 0.281 | 0.971 | 0.46 |
| projgrad --lr=1e-7 | 0.139 | 0 | 0.194 | 0.307 |
| projgrad --lr=1e-4 | 0.693 | -0.14 | 0.971 | 0.153 |
| projgrad_us β=0.0 | 0.416 | 0.14 | 0.388 | 0.307 |
| projgrad β=0.0 | 0.555 | 0 | 0 | -0.307 |
| projgrad lr=1e-6 | 0.277 | 0.14 | -0.388 | 0 |
| projgrad --lr=1e-3 | 1.664 | 1.683 | -3.883 | -1.534 |
| projgrad --lr=1e-3 | 1.803 | 0.842 | -2.524 | -1.534 |
| projgrad_ nslope=1.0 nsample 6000 | -0.27 | 0.281 | -1.74 | -0.153 |

2024-09-27 23:34:19

Warning: collection_layers_side not found in training_args Warning: collection_layers_hs not found in training_args /workspace/repr-preference-optimization/.venv/lib/python3.11/site-packages/tyro/_resolver.py:455: UserWarning: <class 'bool'> does not match any type in Union: [<class 'float'>, <class 'NoneType'>]

  • fix pytest bugs
  • find lr scle that is fat

2024-09-28 06:54:39

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| projgrad_us_history_textbook | 2.5 | 0.28 | -3.307 | -2.297 |

[I 2024-09-29 00:20:20,610] Using an existing study with name 'projgrad' instead of creating a new one.

, params={'learning-rate': 0.00012426382563887213, 'β': 0.7386239719822631, 'reverse_pref': True, 'scale_orth': True, 'weight_dim': 0, 'neg_slope': 0.5},

with sft 8b params

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.988 | 0.989 | 0.796 | 0.955 |
| projgrad | 0.993 | 0.995 | 0.833 | 0.957 |

Table 2: Absolute accuracy

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| ProjGrad | 1.215 | 0.809 | 12.06 | 0.559 |
| ProjGrad β=1.00 | 1.215 | 0.809 | 11.39 | 0.559 |
| ProjGrad neg_slope=1.00 | 1.215 | 0.809 | 11.223 | 0.559 |
| ProjGrad weight_dim=2 | 1.215 | 0.809 | 10.553 | 1.117 |
| ProjGrad mag_clip=0.02 β=1.00 | 1.215 | 0.539 | 9.548 | 1.816 |
| ProjBP | 1.215 | 0.674 | 9.213 | 0.559 |
| ProjGrad rev_pref=False | 1.215 | 0.404 | 8.208 | 0.698 |
| ProjGrad neg_slope=1.00 weight_dim=1 | 1.215 | 0.539 | 7.705 | 0.838 |
| ProjGrad weight_dim=1 | 1.215 | 0.674 | 7.37 | 0.14 |
| ProjGrad rev_pref=False scale_orth=False | 1.215 | 0.539 | 6.533 | 1.257 |
| DPO | 1.215 | 0.135 | 6.198 | 0.978 |

note the full dpo result:

| dpo\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.012 | 1.001 | 1.062 | 1.01 |
| perplexity_gain_vs_ref | 833.07 | 1770.18 | 711.841 | 195.059 |
| preference_logp_gain_vs_ref | 351.991 | 329.691 | 170.018 | 70.139 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.988 | 0.989 | 0.796 | 0.955 |
| dpo | 1 | 0.991 | 0.845 | 0.964 |
| projgrad | 1 | 0.997 | 0.88 | 0.965 |
| projbp | 1 | 0.996 | 0.869 | 0.96 |

Table 2: Absolute accuracy

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| DPO | 1.215 | 0.135 | 6.198 | 0.978 |

Table 3🥇: Accuracy increase (in percentage points) after training with the named adapter on ds:genies_preferences-us_history_textbook-train[:750] compared to the base model, for various distribution shifts:

sadly it looks like -us_history_textbook-train is too easy for the 8b model?

this is how they do on each ds

| adapter | commonsense | deontology | justice | utilitarianism | math | math_make_questions | ranking_logic | sycophancy_mimicry | truthful_qa | us_history_textbook | us_history_textbook-train | wrong_arc | us_history_fiction |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| base | 0.671875 | 0.625 | 0.359375 | 0.421875 | 0.859375 | 0.984375 | 0.59375 | 0.328125 | 0.546875 | 1 | 1 | 0.265625 | 0.75 |
| default | 0.65625 | 0.625 | 0.3125 | 0.4375 | 0.890625 | 0.984375 | 0.65625 | 0.234375 | 0.78125 | 1 | 1 | 0.140625 | 0.8125 |
| dpo | 0.65625 | 0.625 | 0.3125 | 0.4375 | 0.890625 | 0.984375 | 0.65625 | 0.234375 | 0.78125 | 1 | 1 | 0.140625 | 0.8125 |
| projbp | 0.65625 | 0.640625 | 0.34375 | 0.421875 | 0.90625 | 0.96875 | 0.671875 | 0.21875 | 0.71875 | 1 | 1 | 0.15625 | 0.890625 |
| projgrad | 0.6875 | 0.609375 | 0.296875 | 0.421875 | 0.90625 | 0.96875 | 0.640625 | 0.1875 | 0.765625 | 1 | 1 | 0.140625 | 0.84375 |

Hmm, doing optuna on the small model did generalise to the larger one, which is great!

2024-09-30 02:10:04 comparing base vs instruct

I am worried that-

oh, compare dpo and projgrad with NousResearch/Llama-3.2-1B and the instruct version

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| ProjGrad base | 3.734 | 1.662 | 18.35 | 22.778 |
| DPO base | 3.458 | 1.247 | 12.5 | 8.333 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| ProjGrad instruct | 1.626 | 1.374 | 5.382 | 12.796 |
| DPO instruct | 1.491 | 1.511 | 5.208 | 17.773 |

it seems fine... ProjGrad mostly wins, but DPO wins on instruct, which is not what I expected. Now I should compare it to the SFT.

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.964 | 0.963 | 0.683 | 0.48 |
| projgrad base | 1 | 0.979 | 0.808 | 0.589 |
| instruct | 0.984 | 0.971 | 0.768 | 0.563 |
| projgrad instruct | 1 | 0.984 | 0.809 | 0.635 |
| projgrad instruct magclip-1 | 1 | 0.989 | 0.855 | 0.612 |
| dpo base | 0.997 | 0.975 | 0.768 | 0.52 |
| dpo instruct | 0.999 | 0.985 | 0.808 | 0.663 |


| projgrad\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.037 | 1.017 | 1.184 | 1.228 |
| perplexity_gain_vs_ref | 27.968 | 61.436 | 57.376 | 0.938 |
| preference_logp_gain_vs_ref | 179.657 | 166.311 | 94.113 | 0.661 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.964 | 0.963 | 0.683 | 0.48 |
| projgrad | 1 | 0.979 | 0.808 | 0.589 |

Table 2: Absolute accuracy

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| ProjGrad | 3.734 | 1.662 | 18.359 | 22.778 |

Table 3🥇: Accuracy increase (in percentage points) after training with the named adapter on ds:genies_preferences-us_history_textbook-train[:750] compared to base model Llama-3.2-1B for various distribution shifts:

  • train: genies_preferences-us_history_textbook-train[:750]
  • test: genies_preferences-us_history_textbook-test
  • oos: genies_preferences-us_history_fiction-test
  • rnd: genies_preferences-ranking_logic-test

WANDB url = https://wandb.ai/wassname/reprpo2/runs/aoiidigl

| projgrad\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.016 | 1.014 | 1.054 | 1.128 |
| perplexity_gain_vs_ref | 11.117 | 13.364 | 15.453 | 90.723 |
| preference_logp_gain_vs_ref | 146.201 | 129.737 | 59.752 | 1.884 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.984 | 0.971 | 0.768 | 0.563 |
| projgrad | 1 | 0.984 | 0.809 | 0.635 |

Table 2: Absolute accuracy

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| ProjGrad | 1.626 | 1.374 | 5.382 | 12.796 |

Table 3🥇: Accuracy increase (in percentage points) after training with the named adapter on ds:genies_preferences-us_history_textbook-train[:750] compared to base model Llama-3.2-1B-Instruct for various distribution shifts:

  • train: genies_preferences-us_history_textbook-train[:750]
  • test: genies_preferences-us_history_textbook-test
  • oos: genies_preferences-us_history_fiction-test
  • rnd: genies_preferences-ranking_logic-test

WANDB url = https://wandb.ai/wassname/reprpo2/runs/15eb68o0

| dpo\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.035 | 1.012 | 1.125 | 1.083 |
| perplexity_gain_vs_ref | 2377.82 | 19005.2 | 936.602 | 0.917 |
| preference_logp_gain_vs_ref | 336.92 | 308.166 | 91.39 | 0.102 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.964 | 0.963 | 0.683 | 0.48 |
| dpo | 0.997 | 0.975 | 0.768 | 0.52 |

Table 2: Absolute accuracy

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| DPO | 3.458 | 1.247 | 12.5 | 8.333 |

Table 3🥇: Accuracy increase (in percentage points) after training with the named adapter on ds:genies_preferences-us_history_textbook-train[:750] compared to base model Llama-3.2-1B for various distribution shifts:
| dpo\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.015 | 1.015 | 1.052 | 1.178 |
| perplexity_gain_vs_ref | 11.661 | 12.483 | 18.162 | 175.215 |
| preference_logp_gain_vs_ref | 170.883 | 152.372 | 64.533 | 2.168 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.984 | 0.971 | 0.768 | 0.563 |
| dpo | 0.999 | 0.985 | 0.808 | 0.663 |

Table 2: Absolute accuracy

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| DPO | 1.491 | 1.511 | 5.208 | 17.773 |

Table 3🥇: Accuracy increase (in percentage points) after training with the named adapter on ds:genies_preferences-us_history_textbook-train[:750] compared to base model Llama-3.2-1B-Instruct for various distribution shifts:

  • train: genies_preferences-us_history_textbook-train[:750]
  • test: genies_preferences-us_history_textbook-test
  • oos: genies_preferences-us_history_fiction-test
  • rnd: genies_preferences-ranking_logic-test

WANDB url = https://wandb.ai/wassname/reprpo2/runs/fc057puf

I want to make a llama sft

| url | params | dataset | loss |
|:--|:--|:--|--:|
| Ritvik19/zephyr-tinyllama-sft-qlora | 1.1b | ultrachat | 1.19 |
| martimfasantos/tinyllama-1.1b-chat-sft-full | 1.1 | ultrachat | 1.15 |
| wassname/llama-3-2-1b-sft | 1b | ultrachat | 1.2b |
| ondevicellm/phi-1_5_sft | 1.3 | ultrachat | 1.25 |
| tanliboy/llama-3.2-3b-sft-2 | 3.2b | openhermes | 0.6 |
| dpo\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.058 | 1.027 | 1.023 | 1.027 |
| perplexity_gain_vs_ref | 3.036 | 4.135 | 1.402 | 3.648 |
| preference_logp_gain_vs_ref | 34.314 | 29.157 | -2.69 | 1.134 |

Table 1: Key metrics (adapter over base model)

| projgrad\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.085 | 1.01 | 1.011 | 1.041 |
| perplexity_gain_vs_ref | 7.812 | 11.785 | 1.538 | 5.802 |
| preference_logp_gain_vs_ref | 52.591 | 41.017 | -4.334 | 1.512 |

Table 1: Key metrics (adapter over base model)

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.892 | 0.896 | 0.352 | 0.683 |
| dpo | 0.944 | 0.92 | 0.36 | 0.701 |
| projgrad | 0.968 | 0.905 | 0.356 | 0.711 |

| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| DPO dataset=math | 5.83 | 2.679 | 2.273 | 2.734 |
| ProjGrad dataset=math | 8.52 | 1.042 | 1.136 | 4.102 |

Table 3🥇: Accuracy increase (in percentage points) after training with the named adapter on ds:genies_preferences-math-train[:750] compared to base model llama-3.2-3b-sft-2 for various distribution shifts:

2024-10-09 12:06:48

side-ether-prefvec N=✓208/209, best=1.169 importance best
β 0.234 0.404
Htype 0.157 oft
use_angle_loss 0.145 True
lr 0.136 0.000615
use_dpo_loss 0.129 False
weight_tokens 0.077 False
collect_hs 0.022 False
flip_side 0.019 True
nb 0.018 30
collect_input 0.018 False
reduction 0.018 25
use_orth_loss 0.015 False
use_nll_loss 0.013 False
side-svd-mse N=✓28/316, best=1.010 importance best
lr 0.997 0.00119
α 0.002 0.636
collect_hs 0.001 True
quantile 0.001 float
dual_svd 0 True
collect_input 0 False
quantile_value nan 0.3
side-hra-rank N=✓182/183, best=1.229 importance best
lr 0.846 0.000188
collect_hs 0.071 0
apply_GS 0.04 0
collect_input 0.033 0
r 0.007 2
β 0.002 0.11
α 0.001 5.92
hs-ortho-prefvec N=✓259/261, best=1.152 importance best
lr 0.9 0.000411
β 0.051 1.97
orthogonal_map 0.019 matrix_exp
use_angle_loss 0.011 True
weight_tokens 0.009 False
use_nll_loss 0.003 True
use_proj_rel 0.003 True
use_dpo_loss 0.002 False
use_orth_loss 0.002 True
projbp N=✓227/363, best=1.071 importance best
β 0.355 0.238
scale_orth 0.35 False
lr 0.281 5e-06
neg_slope 0.009 float
mag_clip 0.005 float
reverse_pref 0 True
mag_clip_value nan 0.981
neg_slope_value nan 0.699
dpo N=✓248/250, best=1.276 importance best
lr 1 0.000265
hs-svd-mse N=✓14/332, best=1.017 importance best
lr 0.93 0.00119
α 0.034 0.636
collect_input 0.021 False
dual_svd 0.013 True
collect_hs 0.001 True
quantile 0.001 float
quantile_value nan 0.3
hs-hra-rank N=✓259/262, best=1.152 importance best
lr 0.855 0.000333
β 0.071 0.38
r 0.071 38
apply_GS 0.003 1
α 0 0.28
ether-prefvec N=✓321/326, best=1.183 importance best
lr 0.849 0.000378
β 0.058 1.98
reduction 0.042 1
nb 0.014 20
use_proj_rel 0.008 True
use_dpo_loss 0.007 False
use_orth_loss 0.007 True
collect_hs 0.005 False
flip_side 0.003 True
use_angle_loss 0.003 True
collect_input 0.002 True
use_nll_loss 0.002 True
weight_tokens 0.001 True
Htype 0.001 ether
projgrad3 N=✓207/208, best=1.279 importance best
lr 0.93 0.000232
β 0.042 0.843
weight_dim 0.013 1
reverse_pref 0.006 True
scale_orth 0.005 False
mag_clip 0.003 float
neg_slope 0.003 0
mag_clip_value nan 0.23
| | n_trials | best | n_trials_completed | top10_mean |
|:--|--:|--:|--:|--:|
| projgrad3 | 208 | 1.27938 | 207 | 1.22437 |
| dpo | 250 | 1.27553 | 248 | 1.24125 |
| side-hra-rank | 183 | 1.22929 | 182 | 1.19589 |
| ether-prefvec | 326 | 1.18304 | 321 | 1.176 |
| side-ether-prefvec | 209 | 1.16923 | 208 | 1.16385 |
| hs-ortho-prefvec | 261 | 1.15222 | 259 | 1.14744 |
| hs-hra-rank | 262 | 1.15222 | 259 | 1.10929 |
| projbp | 363 | 1.07129 | 227 | 1.06595 |
| hs-svd-mse | 332 | 1.01727 | 14 | 1.01727 |
| side-svd-mse | 316 | 1.00962 | 28 | 1.0077 |

2024-10-10 10:04:50

So I did hyperparam opt. I found out that projgrad is the best, prefvec is good.

Now I want to try:

  • all methods on llama sft 7b
  • also prefvec with side and no transform

TODO

  • [ ]

hmm, svd and projbp seem to crash. maybe fewer layers for them?

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.988 | 0.989 | 0.796 | 0.677 |
| side-ETHER-PrefVec | 0.992 | 0.999 | 0.891 | 0.639 |
| projgrad | 1 | 0.995 | 0.867 | 0.687 |
| dpo | 1 | 0.993 | 0.845 | 0.661 |
| side-Ortho-PrefVec | 0.996 | 0.999 | 0.897 | 0.649 |

Table 2: Absolute accuracy
| acc_inc/eval_ds [pp] | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| ReprPO_ETHERConfig_PrefVecConfig prefvec.β=3 | 0.405 | 0.943 | 11.893 | -5.709 |
| DPO | 1.215 | 0.404 | 6.198 | -2.362 |
| ProjGrad | 1.215 | 0.539 | 8.878 | 1.378 |
| Ortho_Rank hs=True α=0.25 β=0.38 map=hra | 1.215 | 0.27 | 7.538 | -2.953 |
| ReprPO_Ortho_PrefVec hs=True map=householder | 0.81 | 0.943 | 12.73 | -4.134 |

| projgrad\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.012 | 1.005 | 1.089 | 1.014 |
| perplexity_gain_vs_ref | 199.416 | 300.697 | 392.234 | 2055.11 |
| preference_logp_gain_vs_ref | 369.747 | 346.714 | 194.287 | 6.74 |

| side-ETHER-PrefVec\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.004 | 1.009 | 1.119 | 0.943 |
| perplexity_gain_vs_ref | 1.038 | 1.038 | 1.027 | 1.234 |
| preference_logp_gain_vs_ref | 18.644 | 16.576 | 22.529 | -0.02 |

| dpo\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.012 | 1.004 | 1.062 | 0.976 |
| perplexity_gain_vs_ref | 456.72 | 478.877 | 402.153 | 1198.56 |
| preference_logp_gain_vs_ref | 344.444 | 313.759 | 174.241 | 5.661 |

Table 1: Key metrics (adapter over base model)

| side-Ortho-Rank (INCOHERENT)\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.012 | 1.003 | 1.075 | 0.97 |
| perplexity_gain_vs_ref | 3.327 | 3.323 | 2.938 | 2.728 |
| preference_logp_gain_vs_ref | 88.066 | 82.939 | 54.272 | 1.262 |

Table 1: Key metrics (adapter over base model)

so ppx doesn't seem sufficient to show incoherence... I'm not sure how to show it. We are not sampling so hmm

calculating ppx

for dpo, I try 3 ways... and it doesn't matter much if we mean first, ratio first, or log ratio mean exp

| perplexity_gain_vs_ref (dpo, _chosen_ppl) by dataset | mean_ratio | exp_log_ratio | ratio_of_means |
|:--|--:|--:|--:|
| genies_preferences-ranking_logic-test | 546.105286 | 24.173147 | 875.733643 |
| genies_preferences-us_history_fiction-test | 15.869489 | 11.555802 | 17.603973 |
| genies_preferences-us_history_textbook-test | 7.939857 | 5.331226 | 9.220794 |
| genies_preferences-us_history_textbook-train[:750] | 5.500533 | 4.414876 | 6.128044 |
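
For reference, the three orderings above, sketched over per-sample mean token log-probs (not the repo's exact eval code):

```python
import numpy as np

def perplexity_gain(logp_pi: np.ndarray, logp_ref: np.ndarray):
    """logp_pi/logp_ref: per-sample mean token log-probs for policy and reference."""
    ppl_pi, ppl_ref = np.exp(-logp_pi), np.exp(-logp_ref)
    mean_ratio = np.mean(ppl_pi / ppl_ref)                  # mean of per-sample ppl ratios
    exp_log_ratio = np.exp(np.mean(logp_ref - logp_pi))     # exp of the mean log-ratio
    ratio_of_means = ppl_pi.mean() / ppl_ref.mean()         # ratio of mean perplexities
    return mean_ratio, exp_log_ratio, ratio_of_means
```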

ok I changed it to perplexity reduction but now it's tiny... sometimes? seems to depend on ds... ok

| dpo\dist shift | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| acc_gain_vs_ref | 1.203 | 1.15 | 0.975 | 1.01 |
| perplexity_reduction_vs_ref | 0.009 | 0.006 | 0.026 | 0.004 |
| preference_logp_gain_vs_ref | 402.367 | 417.062 | -106.47 | 0.942 |

Table 1: Key metrics (adapter over base model)

I would also like percentage of remaining accuracy instead; let's prototype it in a notebook.
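
One way to read "percentage of remaining accuracy" (my interpretation, not something already in the repo): how much of the headroom above the base model the adapter recovers.

```python
def headroom_recovered(acc_adapter: float, acc_base: float) -> float:
    """Fraction of the remaining accuracy (1 - base) that the adapter captures."""
    return (acc_adapter - acc_base) / max(1.0 - acc_base, 1e-6)
```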

2024-10-15 06:59:58

Now I'm just trying to run all the experiments and accumulate results. DPO is narrowly losing, but it seems to learn much faster. What if my methods train for much longer...? I need to test this.

dpo, hs-ether-prefvec, side-none-prefvec

Trying this on vast...

side-none-prefvec and dpo, 800 steps instead of 100: yes

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.98 | 0.983 | 0.781 | 0.677 |
| projgrad | 0.996 | 0.977 | 0.747 | 0.637 |
| dpo | 0.997 | 0.979 | 0.748 | 0.631 |

but the diff is not huge?

2024-10-15 18:51:53

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.796 |
| hs-ETHER-PrefVec | 0.816 | 0.747 | 0.661 | 0.703 |
| side-None-PrefVec | 0.867 | 0.777 | 0.711 | 0.685 |
| dpo | 0.991 | 0.864 | 0.744 | 0.737 |

Table 2: Absolute accuracy after training with the named adapter on ds:genies_preferences-alpaca_mmlu-train[:750] compared to base model Llama-3-Base-8B-SFT for various distribution shifts:

  • train: genies_preferences-alpaca_mmlu-train[:750]
  • test: genies_preferences-alpaca_mmlu-test
  • oos: genies_preferences-spanish_output-test
  • rnd: genies_preferences-raven_matrices-test

python scripts/train.py side-none-prefvec --n_samples=30000 --lr=1e-5 --dataset=alpaca_mmlu --verbose=2

gives a worse result

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.796 |
| side-None-PrefVec | 0.848 | 0.755 | 0.673 | 0.652 |

Table 2: Absolute accuracy

is this lr even too high? python scripts/train.py side-none-prefvec --n_samples=10000 --lr=6e-4 --dataset=alpaca_mmlu

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.796 |
| side-None-PrefVec | 0.788 | 0.76 | 0.693 | 0.8 |

Table 2: Absolute accuracy

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec --n_samples=130000 --lr=1e-6 | 0.784 | 0.753 | 0.7 | 0.672 |
| projgrad --n_samples=30000 --lr=1e-5 | 0.989 | 0.801 | 0.733 | 0.659 |

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec --n_samples=30000 --lr=1e-5 | 0.787 | 0.756 | 0.697 | 0.672 |

Table 2: Absolute accuracy

Table 2: Absolute accuracy (the very long prefvec run is still going, as is projgrad)

  • python scripts/train.py projgrad --n_samples=30000 --lr=1e-5 --dataset=alpaca_mmlu --verbose=2

python scripts/train.py side-none-prefvec --n_samples=10000 --lr=6e-4 --dataset=alpaca_mmlu
python scripts/train.py projgrad --n_samples=10000 --lr=6e-4 --dataset=alpaca_mmlu

I think I need a low lr and a very long run? it actually seems faster at a lower lr

python scripts/train.py side-none-prefvec --n_samples=10000 --lr=4e-5 --dataset=alpaca_mmlu --verbose=2
python scripts/train.py projgrad --n_samples=10000 --lr=2e-5 --dataset=alpaca_mmlu --verbose=2

python scripts/train.py side-none-prefvec --n_samples=30000 --lr=1e-5 --dataset=alpaca_mmlu --verbose=2
python scripts/train.py projgrad --n_samples=30000 --lr=1e-5 --dataset=alpaca_mmlu --verbose=2

python scripts/train.py side-none-prefvec --n_samples=130000 --lr=1e-6 --dataset=alpaca_mmlu --verbose=2
python scripts/train.py projgrad --n_samples=130000 --lr=1e-6 --dataset=alpaca_mmlu --verbose=2

python scripts/train.py side-none-prefvec --n_samples=20000 --lr=6e-5 --dataset=alpaca_mmlu --verbose=2
python scripts/train.py projgrad --n_samples=20000 --lr=6e-5 --dataset=alpaca_mmlu --verbose=2

Ah, I tried squaring prefvec (or at least the reroute part) and it seemed to work. Also, rel is 1e6 times bigger. Oh wait, I can't square it if it's negative, grrr.

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec | 0.771 | 0.752 | 0.687 | 0.629 |

Table 2: Absolute accuracy

loss_prj rel was -5e-6*1e08=-500, loss_proj was -400

hmm, try with nll loss and balancing: python scripts/train.py side-none-prefvec --n_samples=12000 --lr=8e-5 --dataset=alpaca_mmlu --verbose=2 --loss.β=200 --loss.use-nll-loss --loss.α=100

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec | 0.779 | 0.755 | 0.701 | 0.677 |

Table 2: Absolute accuracy

python scripts/train.py side-none-prefvec --n_samples=12000 --lr=8e-5 --dataset=alpaca_mmlu --verbose=2 --loss.β=1 --loss.use-nll-loss --loss.α=100 --loss.use-dpo-loss

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec | 0.896 | 0.793 | 0.705 | 0.683 |
| dpo | 0.991 | 0.864 | 0.744 | - |

Table 2: Absolute accuracy

python scripts/train.py side-none-prefvec --n_samples=22000 --lr=1e-4 --dataset=alpaca_mmlu --verbose=2 --loss.β=1 --loss.use-nll-loss --loss.α=100 --loss.use-dpo-loss

hmm it occurs to me that I don't need nll at all, or it can be much smaller....

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec | 0.923 | 0.813 | 0.713 | 0.665 |
| dpo | 0.991 | 0.864 | 0.744 | - |

Table 2: Absolute accuracy

try one with bigger angle loss, and no nll

python scripts/train.py side-none-prefvec --n_samples=22000 --lr=1e-4 --dataset=alpaca_mmlu --verbose=2 --loss.β=10 --loss.α=100 --loss.use-dpo-loss

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| base | 0.779 | 0.755 | 0.701 | 0.677 |
| side-None-PrefVec | 0.943 | 0.828 | 0.725 | 0.627 |
| dpo | 0.991 | 0.864 | 0.744 | - |

Table 2: Absolute accuracy

it actually started unlearning at the end... it seems that it can't do the angle and orthogonal ones well? or perhaps not at the same time

so let's try one with only the angle loss ramped up, another with only dpo, and one with only nll

### only dpo
python scripts/train.py side-none-prefvec --n_samples=22000 --dataset=alpaca_mmlu --verbose=2 --loss.β=0.001 --no-use-angle-loss --loss.α=100 --loss.use-dpo-loss

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-None-PrefVec | 0.876 | 0.785 | 0.703 | 0.687 |

python scripts/train.py side-none-prefvec --n_samples=42000 --dataset=alpaca_mmlu --verbose=2 --loss.β=0.001 --no-use-angle-loss --loss.α=10 --loss.use-dpo-loss

| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-none-prefvec | 0.876 | 0.785 | 0.703 | 0.687 |

### only nll
python scripts/train.py side-none-prefvec --n_samples=42000 --dataset=alpaca_mmlu --verbose=2 --loss.β=0.001 --no-use-angle-loss --loss.α=10 --loss.use-nll-loss
| adapter/ds | train | test | oos | rnd |
|:--|--:|--:|--:|--:|
| side-none-prefvec (only nll) | 0.55 | 0.533 | 0.307 | 0.497 |

### only angle
python scripts/train.py side-none-prefvec --n_samples=42000 --dataset=alpaca_mmlu --verbose=2 --loss.β=10 --loss.α=0.001
| adapter/ds        |   train |   test |   oos |   rnd |
|:------------------|--------:|-------:|------:|------:|
| base              |   0.779 |  0.755 | 0.701 | 0.677 |
| side-None-PrefVec |   0.516 |  0.5   | 0.497 | 0.485 |
Table 2: Absolute accuracy

python scripts/train.py projgrad --n_samples=42000 --dataset=alpaca_mmlu --verbose=2
| adapter/ds        |   train |   test |   oos |   rnd |
|:------------------|--------:|-------:|------:|------:|
| base              |   0.779 |  0.755 | 0.701 | 0.677 |
| dpo               |   0.991 |  0.864 | 0.744 | - |
| projgrad          |   0.999 |  0.828 | 0.759 | 0.692 |

I should probably do more optuna stuff, but with much longer runs, and wandb logging so I can check!

2024-10-17 18:00:21

I'm doing a second optuna run with wandb, long runs, early stopping, etc.

But first I need to restrict the search space, and balance the losses if I can

| metric | value |
|:--|--:|
| loss_proj | 9.976604461669922 |
| loss_orth | 2.501805647625588e-05 |
| loss_angle | 1.6716301441192627 |
| loss_proj_rel | 6.633349016738066e-07 |
| _cho_orthorgonal2pref | 3.947233835788211e-06 |
| _ref_orthorgonal2pref | 6.720138117088936e-06 |
| _signed_cho_pref | 2.0076420241821324e-06 |
| _signed_rej_pref | 2.670976982699358e-06 |
| _cho_cosine_similarity | 0.19653406739234924 |
| _rej_cosine_similarity | 0.22570520639419556 |
| _rel_cosine_similarity | -0.09386942535638809 |

24GB with hs

2024-10-21 07:16:36

  • recover optuna.db
  • change to vast.ai cheaper
  • check wandb to make sure they were converging?

2024-10-21 08:38:12 optuna4.db

https://wandb.ai/wassname/reprpo2/groups/optuna4_us_history_textbook-llama-3-2-1b-sft/workspace

hs-ether-mse N=✓29/164, best=1.000 importance best
α 0.877 6.16
lr 0.123 0.000457
hs-ether-rank N=✓39/160, best=1.063 importance best
lr 0.983 0.000429
β 0.017 9.17
α 0 0.00241
projgrad2 N=✓50/236, best=1.253 importance best
mag_clip 0.609 1
lr 0.271 0.000268
reverse_pref 0.062 1
β 0.029 1.51
weight_dim 0.019 1
neg_slope 0.01 0
scale_orth 0 0
hs-ether-prefvec N=✓97/220, best=1.033 importance best
lr 0.651 5.13e-05
β 0.217 0.957
use_dpo_loss 0.081 0
use_proj_rel 0.04 1
use_orth_loss 0.007 1
use_angle_loss 0.004 1
use_nll_loss 0 0
| | n_trials | best | n_trials_completed | top10_mean |
|:--|--:|--:|--:|--:|
| projgrad2 | 236 | 1.25287 | 50 | 1.19932 |
| hs-ether-rank | 160 | 1.06322 | 39 | 1.02514 |
| hs-ether-prefvec | 220 | 1.03257 | 97 | 1.00758 |
| hs-ether-mse | 164 | 1 | 29 | 1.96143e+06 |

hmm, it seems to be pruning good ones... I should use train loss, not val loss? Also some are messed up by multiple wandb runs being combined, grr, especially pruned ones?

hmm, part of the problem is that I am changing the loss setup, and therefore the loss... really I need a quick eval... damn. How long would :5 be? or I could use dpo loss as a proxy?

make sure each loss returns info['acc']
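
e.g. the cheap proxy each loss_fn could add to its info dict (a sketch, not the current code):

```python
import torch

def pref_acc(cho_logp: torch.Tensor, rej_logp: torch.Tensor) -> torch.Tensor:
    """Fraction of pairs where the policy assigns higher logp to the chosen response."""
    return (cho_logp > rej_logp).float().mean()

# info["acc"] = pref_acc(cho_logp, rej_logp)
```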

2024-10-22 17:58:41 optuna on code dataset

https://wandb.ai/wassname/reprpo2-optuna?nw=nwuserwassname

hs-ether-mse N=✓26/150, best=1.005 importance best
lr 0.734 5.38777e-06
α 0.266 8621
projgrad2 N=✓30/353, best=1.025 importance best
mag_clip 0.485 1
reverse_pref 0.406 1
lr 0.093 6.32719e-06
weight_dim 0.014 1
β 0.001 997.23
neg_slope 0 0.1
scale_orth 0 1
dpo N=✓11/24, best=0.996 importance best
lr 1 5.61152e-06
hs-ether-rank N=✓24/150, best=1.014 importance best
lr 0.608 0.000138854
β 0.281 52.8438
α 0.111 1.85056
hs-ether-prefvec N=✓53/383, best=1.032 importance best
lr 0.757 0.000132606
use_angle_loss 0.106 1
β 0.061 0.714851
use_nll_loss 0.045 0
use_proj_rel 0.03 1
use_dpo_loss 0.001 1
use_orth_loss 0 0

2024-10-22 19:07:08

Idea:

Following [Universal neurons in gpt2 language models. arXiv preprint arXiv:2401.12181, 2024.], we find prediction and suppression neurons by analyzing the output weights with the unembedding matrix W_U. Prediction neurons exhibit a logit effect distribution with high kurtosis and positive skew, while suppression neurons show high kurtosis but negative skew. Here, W_out is the output MLP weight for a given layer. https://omnivore.app/wassname/the-remarkable-robustness-of-ll-ms-stages-of-inference-192b40e7d76

We find a striking pattern which is remarkably consistent across the different seeds: after about the halfway point in the model, prediction neurons become increasingly prevalent until the very end of the network where there is a sudden shift towards a much larger number of suppression neurons. To ensure this is not just an artifact of the tied embeddings (WE = WTU ) in the GPT2 models, we also run this analysis on five Pythia models ranging from 410M to 6.9B parameters and find the results are largely the same (Figure 22). When studying the activations of suppression neurons, we noticed that they activate far more often when the next token is in fact from the set of tokens they suppress (e.g., a year token like “1970”; Figure 24). We intuit that these suppression neurons fire when it is plausible but not certain that the next token is from the relevant set. Combined with the observation that there exist many suppression and prediction neurons for the same token class (Figure 24), we take this as evidence of an ensemble hypothesis where the model uses multiple neurons with some independent error that combine to form a more robust and calibrated estimate of whether the next token is in fact a year https://arxiv.org/pdf/2401.12181

so in other words these are neurons that tend to move the distribution toward the negative
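
A sketch of the statistic the quoted paper uses (shapes and names assumed, not taken from their repo): project each MLP output neuron onto the vocab via the unembedding and summarise the resulting logit-effect distribution with skew and kurtosis.

```python
import torch
from scipy.stats import kurtosis, skew

def neuron_logit_effect_stats(W_out: torch.Tensor, W_U: torch.Tensor):
    """W_out: (d_mlp, d_model) MLP output weights for one layer; W_U: (d_model, d_vocab)."""
    effects = (W_out @ W_U).float().cpu().numpy()   # (d_mlp, d_vocab) logit effect per neuron
    return skew(effects, axis=-1), kurtosis(effects, axis=-1)
```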

however I can do better, and look at logprobs that are made unlikely in the last layer https://github.com/wesg52/universal-neurons/blob/d797aaaff2abc6852b97aacc1524621617ad0071/analysis/prediction_neurons.py#L173

so can't I just go hs[-2]-hs[-1] to get the stuff removed by the last layer!

or hs[-2]*(1-w[-1]) (which is the same when expanded but I could potentially sub in other layers)

or maybe I could get the inverse of, or orthogonal to, w[-1] (which means the weights for the last layer before the unembedding)

As a quick QC I can visualise this hs, see the magnitude etc

hmm so I've settled on hs.diff(layers).mul(-1).relu() to get only the suppressed info. Now I need to add it as a transform. The only problem is that all my transforms have been defined on each layer; I need to make a transform that works on many layers....
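
A sketch of that as a multi-layer transform (assuming hidden states stacked along a leading layer dim; not the repo's transform API):

```python
import torch

def suppressed_info(hs: torch.Tensor) -> torch.Tensor:
    """hs: (n_layers, batch, seq, hidden). Returns only the parts each layer *removed*."""
    delta = hs.diff(dim=0)        # what each layer added to the residual stream
    return delta.mul(-1).relu()   # keep only the negative (i.e. suppressed) contributions
```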

optuna on alpaca_low to alpaca high quality

hs-ether-rank

hs-ether-rank N=✓62/65, best=1.014 importance best
β 0.542 3.75206
lr 0.446 8.26081e-07
α 0.013 0.0271605

dpo

dpo N=✓20/20, best=1.000 importance best
lr 1 9.09892e-05

projgrad2

projgrad2 N=✓20/20, best=1.000 importance best
mag_clip 0.952 1
reverse_pref 0.031 1
neg_slope 0.016 0.1
lr 0 1.39313e-06
β 0 0.0242605
weight_dim 0 2
scale_orth 0 1

hs-ether-prefvec

hs-ether-prefvec N=✓20/20, best=1.014 importance best
lr 0.531 7.45934e-06
β 0.184 0.978316
use_angle_loss 0.071 1
use_dpo_loss 0.07 0
use_nll_loss 0.064 0
use_orth_loss 0.056 1
use_proj_rel 0.024 0

hs-ether-mse

hs-ether-mse N=✓22/23, best=1.027 importance best
lr 0.768 3.4206e-06
α 0.232 0.000867935

2024-10-31 08:33:44

From DavidAd's idea

  • and model drift (per hs), clip or loss
  • and log_softmax as transform or add to none
  • optuna all is just doing dpo, and out of mem sometimes

TODO show one sample from each dataset?

2024-11-01 07:25:10 Did a full set of experiments; prefvec often goes too far

https://wandb.ai/wassname/reprpo2/groups/exp-31Oct1252-math-llama-3-2-1b-sft/workspace?nw=nwuserwassname

The PrefVec ones are weird. They score highly, but only on the eval. The acc just went down, even on train? why

  • ReprPO_ETHER_PrefVec collect_hs=True ether.Htype=etherplus but this one was good
  • ReprPO_None_PrefVec prefvec.use_dpo_loss=True

FIXME:

  • need to print ALL into wandb logs? it's in log.txt, or is it now
  • need sample of each eval
  • work out why some prefeval was terrible
  • why is val acc diff from eval acc? one is only 10 samples and dpo. the other is a diff framework and agg?
    • one is score_weighted and 750 samples. hmm

2024-11-10 idea

I'm using the preference direction on the ref/base model, but if I use the pi model.. would it improve... or become unstable....

  • add flag, try both on quick 1b model
  • also should I not make the rejected string go in the -ve pref dir?