Speedrunning ideas discussion #23
Replies: 44 comments 73 replies
-
I assume FP8 has been considered? Nvidia is showing serious speedups on the H100: https://github.com/NVIDIA/TransformerEngine
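If anyone wants to try it, here is a minimal sketch of running a projection through TransformerEngine's FP8 path, assuming the `te.Linear` / `fp8_autocast` API from its docs (untested here; the recipe settings are just defaults):

```python
# Minimal sketch, assuming TransformerEngine's te.Linear and fp8_autocast API.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)
proj = te.Linear(768, 3072, bias=False).cuda()
x = torch.randn(8, 1024, 768, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = proj(x)  # the matmul runs in FP8 on Hopper, with delayed scaling of amax stats
```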
-
How about the nGPT architecture instead of standard GPT? lucidrains has a good implementation; I ran the training and can confirm it converges a lot faster than other models/architectures I've tried on the same data (tiny model here).
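For anyone unfamiliar, the core of nGPT is keeping the residual stream on the unit hypersphere and replacing the residual add with a normalized interpolation. A rough sketch of that update (a paraphrase of the idea, not lucidrains' code):

```python
# Rough sketch of the nGPT-style residual update: hidden states live on the unit
# hypersphere and each sublayer output is blended in with a learned step size,
# then renormalized.
import torch
from torch import nn

def l2_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x / (x.norm(dim=-1, keepdim=True) + eps)

class NormalizedResidual(nn.Module):
    def __init__(self, sublayer: nn.Module, dim: int, init_alpha: float = 0.05):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.full((dim,), init_alpha))  # learned per-channel step size

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h_new = l2_normalize(self.sublayer(h))
        # step toward the sublayer output, then project back onto the sphere
        return l2_normalize(h + self.alpha * (h_new - h))
```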
-
In Muon's
-
jie040109 suggested AdaFactor as an alternate optimizer. I'm not very bullish on this, but I would be totally happy to see any new record using AdaFactor, or any other optimizer for that matter.
-
MizuleGPT suggested TokenFormer. This does look interesting (Grad also sent it to me), but I haven't seriously thought about it.
-
The rules say:
Technically, wouldn't this allow MoE models with >124M total params as long as we don't go higher than 124M active at once?
-
I tried using Cut Cross Entropy, and while it allows for a larger sequence length without OOM, increasing the batch size / seq len doesn't seem to result in any speedup. Have we maxed out parallelization? Is there something special about a seq len of 2**16 tokens? Is that the most tokens FlexAttention can process in a single batch? Something specific to the H100?
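For anyone who wants to reproduce this, a rough sketch of the CCE call (assuming the cut-cross-entropy package's `linear_cross_entropy` entry point; the exact signature may differ by version):

```python
# Hedged sketch of Cut Cross Entropy: computes the softmax cross-entropy
# without materializing the full (B, T, vocab) logits tensor.
import torch
from cut_cross_entropy import linear_cross_entropy

hidden = torch.randn(8, 2048, 768, device="cuda", dtype=torch.bfloat16)    # final hidden states
classifier = torch.randn(50304, 768, device="cuda", dtype=torch.bfloat16)  # lm_head weight
labels = torch.randint(0, 50304, (8, 2048), device="cuda")

loss = linear_cross_entropy(hidden, classifier, labels)
```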
-
We currently employ four strategies for mapping residuals; there might be other mapping strategies that improve performance. In the standard GPT-2 architecture, activations follow a sequential connection pattern:
Three additional activation connection mechanisms have been incorporated to augment this structure:
All four of these have learned weights. I've tried a dense version of skip connections where each layer's input is a weighted sum of all previous layers' residuals, but training performance got slightly worse.
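For concreteness, the dense variant looks roughly like this (a sketch with one learned weight per (layer, earlier-residual) pair; the blocks themselves are assumed to contain their own internal residual adds):

```python
import torch
from torch import nn

class DenseSkipStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        n = len(blocks)
        # skip_weights[i, j] weights residual j when forming the input of block i;
        # identity init reproduces the plain sequential wiring.
        self.skip_weights = nn.Parameter(torch.eye(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residuals = [x]
        for i, block in enumerate(self.blocks):
            w = self.skip_weights[i, : len(residuals)]
            mixed = sum(wj * rj for wj, rj in zip(w, residuals))
            residuals.append(block(mixed))
        return residuals[-1]
```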
-
Where do you think the speedrun limit is? The last record runs 10 times faster than the baseline.
-
MPA-Lya as a substitute for Newton-Schulz within Muon might be worth looking into.
-
I suggest making the recently added token value embeddings LoRA matrices for faster training. It would also make the file size of the resulting model smaller.
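Roughly something like this (a hedged sketch; the names and the rank are illustrative, not from the repo):

```python
# Store a token value embedding table as a low-rank product
# (vocab, rank) x (rank, dim) instead of a full (vocab, dim) matrix.
import torch
from torch import nn

class LowRankValueEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int, rank: int = 64):
        super().__init__()
        self.down = nn.Embedding(vocab_size, rank)                   # per-token low-rank code
        self.up = nn.Parameter(torch.randn(rank, dim) / rank**0.5)   # shared projection to model dim

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.down(idx) @ self.up  # (..., dim) value embedding
```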
-
Speaking of the LoRA idea above, why not make every linear layer a low-rank approximation of the full matrix? At higher ranks (i.e. 512-1024) you should be able to capture essentially all of the expressivity of the full weights, while getting a deeper model with fewer real parameters, allowing higher batch sizes -> faster training. Pseudo-torch:

```python
import torch
from torch import nn

class LowRankLinear(nn.Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 512,  # should be lower than in/out features to actually have any advantage
    ):
        super().__init__()
        self.a = nn.Parameter(torch.randn(in_features, rank) / in_features**0.5)
        self.b = nn.Parameter(torch.randn(rank, out_features) / rank**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (x @ a) @ b never materializes the full in_features x out_features matrix
        return (x @ self.a) @ self.b
```
-
Not bullish, but may be worth trying: MemoryFormer
-
Patch-level training
In https://arxiv.org/pdf/2407.12665 it is proposed to first train not on tokens but on predicting token patches; after some time, vanilla token-level training resumes. The technique takes only about 10 lines of code (which are compatible with this repo, as far as I can tell). One might find this suspicious because of the loss change, but the loss is only changed for the first 2/3 of training and then matches standard cross-entropy, keeping the benchmark comparison fair. Additionally, validation loss can still be reported as plain cross-entropy. Here is my modification of modded-nanogpt implementing this, hopefully correctly: https://gist.github.com/kabachuha/1c0440d7193cd60f00f566d8c6f2329a (I don't have an H100, please launch it for me 😭). The original repo: https://github.com/shaochenze/PatchTrain
Edit: forgot to add that the paper promises a 2x reduction in training costs.
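The gist of the patch-level stage, as described in the paper (an illustrative sketch; `wte`, `backbone`, and `lm_head` are placeholder names, not modded-nanogpt's):

```python
import torch
import torch.nn.functional as F

def patch_level_loss(model, idx: torch.Tensor, K: int = 4) -> torch.Tensor:
    # Average every K consecutive token embeddings into one patch embedding,
    # then train the model to predict all K tokens of the *next* patch.
    B, T = idx.shape
    assert T % K == 0
    tok_emb = model.wte(idx)                                # (B, T, D)
    patch_emb = tok_emb.view(B, T // K, K, -1).mean(dim=2)  # (B, T/K, D)
    hidden = model.backbone(patch_emb)                      # transformer over patch embeddings
    logits = model.lm_head(hidden[:, :-1])                  # (B, T/K - 1, vocab)
    targets = idx.view(B, T // K, K)[:, 1:]                 # tokens of the next patch
    logits = logits.unsqueeze(2).expand(-1, -1, K, -1)      # same prediction reused for all K targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```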
-
Two of my suggestions:
-
Has anyone tried speedrunning on 16, 32, or 64 H100s?
-
2x faster pretraining and half the memory usage compared to Adam: gotta love that data efficiency
-
A thought I keep having: ever since the U-net pattern was introduced, it might be more sample-efficient to tie the embedding and LM head again. The U-net structure already steers the model toward a sort of symmetry, which could counteract the performance loss, and the attention head in the 8th layer could maybe be re-enabled to bring sample efficiency back up, now that the attention scaling has also been tuned.
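Tying is essentially a one-liner if anyone wants to A/B it (a sketch with placeholder attribute names; the speedrun's actual module names differ):

```python
import torch
from torch import nn

class TiedHeadGPT(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight  # embedding and unembedding share one matrix
```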
-
Would this be allowed?
-
Another idea: https://reasoning-tokens.ghost.io/reasoning-tokens/
-
This optimizer looks nice: https://arxiv.org/abs/2501.12243 EDIT: another optimizer: https://arxiv.org/abs/2412.17107
-
I stumbled upon this modification when I was tinkering with the MLP layer:
It keeps the total number of parameters per block the same as the original ReLU² MLP (1x2 + 2x2 + 2x1 = 1x4 + 4x1), but improves the final loss in my experiments by quite a bit (train/val loss: 0.9964/1.0524 -> 0.9876/1.0470). This modification also makes the optimizer step a little faster on my machine, possibly due to more Muon-friendly matrix aspect ratios. At a glance it looks similar to a Gated Linear Unit (with a GELU gate), but it's different: here the gating happens between consecutive layers, unlike a GLU's gating within the same layer. The MLP layers use about 2/3 of the total parameters, yet the attention layers get almost all of the architecture-exploration effort, so I think it may be too early to conclude that all the low-hanging fruit in MLP architecture has been picked. That said, my experiment was done with a smaller model on a smaller dataset, and I'm not sure the result would transfer when we scale things up.
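Since the original snippet isn't shown here, a hedged reconstruction of the described shapes (d -> 2d -> 2d -> d with a GELU gate between consecutive layers); the exact wiring that was actually tested may differ:

```python
import torch
from torch import nn
from torch.nn import functional as F

class InterLayerGatedMLP(nn.Module):
    # Three matrices of shapes (d, 2d), (2d, 2d), (2d, d): 2 + 4 + 2 = 8 d^2 params,
    # the same as the standard d -> 4d -> d MLP. The GELU gate connects consecutive
    # hidden layers instead of sitting inside one layer as in a GLU.
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, 2 * dim, bias=False)
        self.fc2 = nn.Linear(2 * dim, 2 * dim, bias=False)
        self.fc3 = nn.Linear(2 * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.fc1(x)
        h2 = self.fc2(F.gelu(h1))
        return self.fc3(F.gelu(h1) * h2)  # gate h2 with the previous layer's activation
```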
-
Considering how sparse relu squared is, I'm gonna look further into this: https://arxiv.org/abs/2410.03440
-
And this looks interesting too: https://arxiv.org/abs/2501.18356
-
44% fewer parameters and 33% fewer tokens: sound compatible?
-
Implemented this paper for fun, but I don't have the hardware to test whether it helps in the speedrun at all. Keep in mind that the code in the log here uses a very short sequence length, half-sized MLPs, TokenMonster, and modified kernel_options for FlexAttention; otherwise it would have taken forever to complete a training run on my 3060. So only copy over the stuff related to score_mod and create_ssmax if you want to test it yourself. Also, validation is broken and I don't really feel like debugging it, but obviously that would need to be fixed first too. And hopefully there aren't any mistakes in my implementation either :P (I didn't do any rigorous testing to make sure everything worked as intended).
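For reference, a Scalable-Softmax-style score_mod for FlexAttention can be sketched roughly like this (illustrative only; the `create_ssmax_score_mod` name and the per-head scale `s` are assumptions, and the linked implementation may differ):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def create_ssmax_score_mod(s: torch.Tensor):
    # Scalable-Softmax: multiply attention scores by s * log(n), where n is the
    # number of keys visible to the query (q_idx + 1 under causal masking) and
    # s is a per-head scalar (learned or fixed), shape (num_heads,).
    def score_mod(score, b, h, q_idx, kv_idx):
        n = (q_idx + 1).to(score.dtype)
        return score * s[h] * torch.log(n)
    return score_mod

# usage: out = flex_attention(q, k, v, score_mod=create_ssmax_score_mod(s), block_mask=causal_mask)
```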
-
Changing the value_embed structure to 010 ... 010 seems to improve wall-clock time for me, and validation loss is the same. I'm testing on a single MI210 in bf16 with more steps and smaller batches, so idk if it works for a standard run.
-
@KellerJordan I'm slightly confused about one aspect of Muon's implementation. The post-zero-power matrix scaling seems to be inverted from what's needed to achieve unit-RMS-norm vectors. Here's a demonstration:

```python
import torch
from torch import Tensor

def zeropower_via_newtonschulz5(
    G: Tensor,
    a=3.4445,
    b=-4.7750,
    c=2.0315,
    steps: int = 5,
) -> Tensor:
    assert G.ndim >= 2
    # X = G.bfloat16()
    X = G
    should_transpose = G.size(-2) > G.size(-1)
    if should_transpose:
        X = X.mT
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if should_transpose:
        X = X.mT
    return X

def accurate_zeropower_via_newtonschulz5(G: Tensor, a=2, b=-1.5, c=0.5, steps=100):
    return zeropower_via_newtonschulz5(G, a=a, b=b, c=c, steps=steps)

def rms_norm(x, eps=1e-7):
    return x / (torch.linalg.norm(x, dim=-1, keepdim=True) + eps)

def print_min_max_singular_values(w):
    s = torch.linalg.svdvals(w)
    print(f"Min/Max singular values: {s.min().item()}, {s.max().item()}")

print("Current implementation:")
w12 = accurate_zeropower_via_newtonschulz5(torch.randn(4000, 1000))
w12 *= max(1, w12.size(-2) / w12.size(-1))**0.5
w23 = accurate_zeropower_via_newtonschulz5(torch.randn(1000, 4000))
w23 *= max(1, w23.size(-2) / w23.size(-1))**0.5
v1 = rms_norm(torch.randn(1000))
v2 = v1 @ w12.T
v3 = v2 @ w23.T
print_min_max_singular_values(w12)
print_min_max_singular_values(w23)
print(torch.linalg.norm(v1, dim=-1).item())
print(torch.linalg.norm(v2, dim=-1).item())
print(torch.linalg.norm(v3, dim=-1).item())
print()

print("Inverted matrix scaling:")
w12 = accurate_zeropower_via_newtonschulz5(torch.randn(4000, 1000))
w12 *= max(1, w12.size(-1) / w12.size(-2))**0.5
w23 = accurate_zeropower_via_newtonschulz5(torch.randn(1000, 4000))
w23 *= max(1, w23.size(-1) / w23.size(-2))**0.5
v1 = rms_norm(torch.randn(1000))
v2 = v1 @ w12.T
v3 = v2 @ w23.T
print_min_max_singular_values(w12)
print_min_max_singular_values(w23)
print(torch.linalg.norm(v1, dim=-1).item())
print(torch.linalg.norm(v2, dim=-1).item())
print(torch.linalg.norm(v3, dim=-1).item())
```

The printed norms show that the current implementation results in a hidden-layer RMS norm of 2, rather than 1. The only difference is the inversion of the scaling factor, e.g. changing

```python
w *= max(1, w.size(-2) / w.size(-1))**0.5
```

to

```python
w *= max(1, w.size(-1) / w.size(-2))**0.5
```

Is this known / expected? It might not make a difference in practice, since it only affects the RMS scale of the hidden layer, but it seems strange. The current implementation does seem to be consistent with "A Spectral Condition for Feature Learning", which has me even more confused...
-
A few questions regarding the rules, sorry if this is the wrong place to ask:
-
Greetings GPT speedrunning enjoyers
I noticed some people were using GitHub issues to suggest new ideas for the run.
Plz discuss here instead. Ty