Nanochat Training Continues

After approximately 233 hours, pre-training completed. Conveniently enough, I was on a week-long trip at the time. A brief comparison to the H100 here: Andrej Karpathy says pre-training should take about 4 hours on an 8xH100 setup. It’s pretty much linearly scalable, so that should mean ~40 hours on one H100. That makes the DGX Spark about 17% the speed of an H100 at this task. Amusingly enough, that about “maths out” on price performance, with the H100 at $25,000 and the DGX Spark at $4K. Of course, the H100 also requires a host machine, and it’s basically impossible to buy one. Anyway, the Spark seems reasonable value to me, even if 10 days is a long time to wait for pre-training. Pre-training is the most expensive step of the training process. Basically, at this point we have a token predictor but not much more.
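The speed and price comparison above can be checked with a few lines of Python. This is only a sketch using the figures quoted in this post; the function and constant names are made up for illustration, and the 40-hour single-H100 figure is the rough estimate from the text (a strict 8 x 4 hours would be 32, which lands closer to 14%).

```python
# Figures quoted in the post. The single-H100 time is the post's rough
# estimate, not a measurement.
SPARK_HOURS = 233.0
H100_HOURS_ESTIMATE = 40.0
H100_PRICE_USD = 25_000.0
SPARK_PRICE_USD = 4_000.0


def relative_speed(slow_hours: float, fast_hours: float) -> float:
    """Fraction of the faster machine's throughput that the slower one achieves."""
    return fast_hours / slow_hours


def price_ratio(cheap_usd: float, expensive_usd: float) -> float:
    """Price of the cheaper machine as a fraction of the more expensive one."""
    return cheap_usd / expensive_usd


speed = relative_speed(SPARK_HOURS, H100_HOURS_ESTIMATE)
cost = price_ratio(SPARK_PRICE_USD, H100_PRICE_USD)

print(f"Spark speed relative to one H100: {speed:.1%}")  # 17.2%
print(f"Spark price relative to one H100: {cost:.1%}")   # 16.0%
```

Swapping in a different single-H100 estimate is a one-line change to `H100_HOURS_ESTIMATE`.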

The next step is mid-training. Here we teach the model conversation. I kicked it off on the Spark with a slight modification to the instructions in the speedrun walkthrough: replacing the 8 with the constant gpu, which tells torchrun to use the number of installed GPUs (in our case, 1):

torchrun --standalone --nproc_per_node=gpu -m scripts.mid_train

On the Spark this took about 7 hours to complete. After that, Andrej recommends running some benchmarks on the model so far. They’re run like so (again substituting gpu for the 8):

torchrun --standalone --nproc_per_node=gpu -m scripts.chat_eval -- -i mid

And I got these results (after stripping out the noisy logging of dataset downloads and the like):

ARC-Easy accuracy: 24.79%
ARC-Challenge accuracy: 25.43%
MMLU accuracy: 25.64%
GSM8K accuracy: 5.38%
HumanEval accuracy: 6.71%
SpellingBee accuracy: 99.61%

Note that the first 3 are multiple choice with 4 answers. This means that I’m getting approximately random chance (25%). Karpathy’s numbers are slightly better, which could be an indicator that I missed a step, as I recall reading that we have to train the model to understand multiple choice questions. Neat to see that the model already spells better than I do.
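To make the "random chance" point concrete, here is a small sketch that compares the three multiple choice scores above with the 25% a uniform random guesser would expect on 4-answer questions. The helper name is hypothetical, and this says nothing about statistical significance since it ignores how many questions each benchmark has.

```python
# Multiple choice accuracies reported above, in percent.
reported = {
    "ARC-Easy": 24.79,
    "ARC-Challenge": 25.43,
    "MMLU": 25.64,
}

NUM_CHOICES = 4  # each question has 4 possible answers


def margin_over_chance(accuracy_pct: float, num_choices: int = NUM_CHOICES) -> float:
    """Percentage points above (or below) uniform random guessing."""
    chance_pct = 100.0 / num_choices
    return accuracy_pct - chance_pct


for name, accuracy in reported.items():
    margin = margin_over_chance(accuracy)
    print(f"{name}: {accuracy:.2f}% ({margin:+.2f} points vs chance)")
```

All three land within one percentage point of 25%, which is what guessing would look like.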

My next step will be to look into whether I missed something, and then it’s on to supervised fine-tuning. It will be fun to compare these results to future steps.
