Building an LLM in Pure Rust

Speedrunning Every AI Startup in 7 Days

Last weekend, I started an exploratory project to build an LLM from this blog's content. It started off innocently enough. My seed funding is the money I already spent on my own bare-metal GPU server in a real datacenter.

The Goal

We wanted to build a pure-Rust LLM, trained on blog.lewman.com posts, using the Burn deep learning framework. The core idea: point it at a Publii CMS db.sqlite file, extract the blog text, and train a transformer model to mimic the writing style.

This blog has 943 posts with 469,389 words (excluding this post itself). That's a very small dataset to train a model from scratch. In fact, it would be easier to use an existing model and do RAG over the blog content. However, I didn't want easy; I wanted to see the art of the (im)possible. This blog is produced by Publii, which is simple and keeps all of the content in a single SQLite db. That makes it an easy way to get started on a fun side project.

────────────────────────────────────────────────────────────────────────
  Total posts, words, links across all time
────────────────────────────────────────────────────────────────────────
┌─────────────┬─────────────┬────────────────────┬───────────────────────┐
│ total_posts │ total_words │ avg_words_per_post │ total_estimated_links │
│       Int64 │       Int64 │            Float64 │                 Int64 │
├─────────────┼─────────────┼────────────────────┼───────────────────────┤
│         943 │      469389 │              498.0 │                  3393 │
└─────────────┴─────────────┴────────────────────┴───────────────────────┘

With this tiny dataset, we started down the path of Burn and Rust-only model building.

In reality, this codebase should work with any Publii content, because the data store appears to be the same across sites. Of course, simple is never simple. The basic steps are:

  1. The GPT transformer/encoder has to work on basic, prepared, tokenized text. So the first step is to write the prep stage. We parse the content in the SQLite db into chunks and strip out all of the special Markdown/HTML formatting found in most posts. We just want the text. Anything else will confuse the encoder and make it think formatting is part of the text. (A sketch of this stage follows the list.)
  2. Use a byte pair encoder (BPE) to compress the text and build a vocabulary of subword tokens. (A toy version of the merge loop also follows the list.)
  3. I then built a data pipeline that strips the formatting from the text and feeds it to the tokenizer from the last step. This produces text.json and tokenizer.json.
  4. I then created a very small transformer, based on the GPT transformer/encoder, to work through the pipelined data.
  5. To accelerate the process, we moved from CPU to GPU processing (a backend-swap sketch also follows the list).
    1. The Burn library/crate uses the NdArray backend, which runs on the CPU by default. As you would expect, this is slow. To start, it was very, very slow. I ended up enabling the BLAS-accelerated build, which sped up NdArray dramatically. The Radeon 780M GPU in my laptop was initially 8x faster than the CPU (I'll get to how I used the GPU in a moment). After reworking to use BLAS, the GPU was only 2.4x faster than the CPU.
    2. I then experimented with the ROCm backend and the wgpu backend, which reaches the GPU through Vulkan. For whatever reason, wgpu/Vulkan is faster than using the ROCm libraries directly.
    3. After watching jobs run on my laptop GPU for 5 hours straight, I moved to the GPU in my server for vastly faster processing. Even so, the first pass took 14 hours.
    4. As part of speedrunning, I then rented an Nvidia L40S GPU, integrated the Nvidia CUDA libs, and got processing. The L40S is about 2x faster than the dedicated server GPU; it took around 7 hours to process everything. That wasn't fast enough, so I rented an Nvidia B200 GPU, the fastest I could find for rent. It ran the whole process in around 4 hours. I found a place that would rent me 8x B200s, but at this point I've already burnt enough forest on a lark, so let's go back to the 300W dedicated server GPU as our top-end GPU for this process.
    5. I wanted to speed up the prepare/train/transform/encode loop with an automated feedback loop. A Julia script writes the results to a CSV, reads them back, and if they don't match what "good" should look like, adjusts the tunables by 10% and loops again. After 2 days of this, I realized something wasn't working at all.
    6. After getting horrible results from 5 days of training, I stepped back and thought about what was going on.
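
Here's roughly what the prep stage in step 1 looks like. A minimal sketch, assuming the rusqlite and regex crates and a posts table with the post HTML in a text column; the actual Publii schema may name things differently.

// Prep stage sketch: pull post bodies out of Publii's db.sqlite and strip
// the HTML so only plain text reaches the tokenizer.
use regex::Regex;
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("db.sqlite")?;
    let mut stmt = conn.prepare("SELECT text FROM posts")?;
    let tags = Regex::new(r"<[^>]+>").unwrap(); // crude tag/formatting stripper

    let mut corpus = String::new();
    for body in stmt.query_map([], |row| row.get::<_, String>(0))? {
        let plain = tags.replace_all(&body?, " ");
        corpus.push_str(plain.trim());
        corpus.push('\n');
    }
    std::fs::write("corpus.txt", corpus).expect("write corpus.txt");
    Ok(())
}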
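
The byte pair encoder in step 2 boils down to one loop: count adjacent token pairs, merge the most frequent pair into a new vocabulary entry, repeat. Here's a toy version over a single string, just to show the merge loop; the real run starts from bytes and trains over the whole corpus.

use std::collections::HashMap;

fn main() {
    // Start from characters; real BPE starts from bytes across the corpus.
    let mut tokens: Vec<String> = "the theory of the thing"
        .chars()
        .map(|c| c.to_string())
        .collect();
    let mut merges: Vec<String> = Vec::new();

    for _ in 0..8 {
        // Count adjacent pairs.
        let mut counts: HashMap<(String, String), usize> = HashMap::new();
        for w in tokens.windows(2) {
            *counts.entry((w[0].clone(), w[1].clone())).or_default() += 1;
        }
        // Merge the most frequent pair into a single new token.
        let Some(((a, b), _)) = counts.into_iter().max_by_key(|(_, n)| *n) else {
            break;
        };
        let merged = format!("{a}{b}");
        merges.push(merged.clone());

        let mut next = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
                next.push(merged.clone());
                i += 2;
            } else {
                next.push(tokens[i].clone());
                i += 1;
            }
        }
        tokens = next;
    }
    println!("learned merges: {merges:?}");
}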
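
For step 5, most of the work is swapping which Burn backend sits under the same generic training code. Roughly, with the caveat that the exact module paths depend on the Burn version and enabled features, and that the train call and TrainingConfig in the comment are placeholders for the real entry point:

// Backend swapping sketch: the pipeline is generic over a Burn backend, so
// CPU -> Vulkan/wgpu -> CUDA is mostly a type-alias change.
use burn::backend::wgpu::WgpuDevice;
use burn::backend::{Autodiff, Wgpu};

// CPU run:           type Back = Autodiff<burn::backend::NdArray>;
// Vulkan via wgpu (what ended up fastest on the Radeon 780M):
type Back = Autodiff<Wgpu>;
// The rented L40S/B200 runs swap in the CUDA backend instead.

fn main() {
    let device = WgpuDevice::default();
    // The rest of the pipeline is written once against `Back` and doesn't
    // change when the backend underneath it does, e.g.:
    // train::<Back>("artifacts/", TrainingConfig::new(), device);
    let _ = device;
}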

In 5 steps (plus sub-steps) we've now speedrun every AI startup building their own models. We went from a goal, to needing ridiculous processing power, to automating the whole thing, to rethinking everything from first principles. I'm doing this as a fun side project; others do it with hundreds of millions of dollars. We're also now speaking in the "royal we," funny enough.

I have another blog post coming about the GPU rental companies and how their business model is basically to buy the entire production run of Nvidia's latest GPU and then rent them out at exorbitant prices to others speed-running the AI startup loop. It'll be better than that run-on sentence. I learned a lot about AI accelerators that aren't GPUs, and their entire ecosystem, too. In summary, building your entire business on Google/GitHub for auth and source of truth is a real thing, unfortunately.

Rethinking the Whole Process

The prepare/tokenize/transform-encode loop doesn't change. But we're starting from scratch here. What I was trying to do was train a transformer to write like the blog posts. What I was actually doing was making it learn English, English grammar, snark, sarcasm, and other concepts from an extremely small dataset. Sort of like trying to learn the Etruscan language from the fragments that survived. Turns out this is a known problem: very small source datasets are really tough to use with a GPT. I tried a BERT, which should be better at understanding the whole corpus, but the results didn't get much better.

────────────────────────────────────────────────────────────────────────
  Top 10 longest posts
────────────────────────────────────────────────────────────────────────
┌─────────────────────┬───────────────────────────────────────────┬────────┐
│                date │                                     title │  words │
│              String │                                    String │  Int64 │
├─────────────────────┼───────────────────────────────────────────┼────────┤
│ 2000-09-15 09:19:00 │                                2000-09-14 │ 236356 │
│ 2016-11-15 02:29:13 │              MassTLC Keynote Presentation │   8034 │
│ 2022-06-15 01:41:33 │                   Synology Serial Console │   7171 │
│ 2016-05-27 05:48:35 │         Presentation from Inside Dark Web │   6143 │
│ 2022-11-15 06:52:15 │        Parsing DNS Query Logs to find CAs │   4178 │
│ 2025-11-12 03:59:07 │           Seagate External Drive Teardown │   3420 │
│ 2011-12-28 13:48:00 │                        Attack of the bots │   3064 │
│ 2022-09-30 15:47:58 │                Uber app is bad. Bad Uber. │   2299 │
│ 2022-02-15 05:16:56 │                     Life with a ROCKPro64 │   2263 │
│ 2021-10-15 05:05:33 │ Updates on Common Certificate Authorities │   1976 │
└─────────────────────┴───────────────────────────────────────────┴────────┘

None of this is anywhere near enough text to learn a language from. Here's the output from the best generation after using a BERT and feeding it into a GPT encoder:

due article years Sc shut bel known teCDdata clim eating, use Ordoesn teamsadata known University.rentusionTr withcars important around output yearsED these concentr acebook and 8420hcd, help knownulner software Pine Noww”‚ skillsStar Fastlyown eyeders huropergest.pectd blocked obviously,pect surprised aud TOKEN clip11> blackthoughproupsuluuseumfaces I. 16ulner. I conditionsGPT with shut). refurb. higsecondscheditOriginal`.hh Fastly perspective croworldell relevantpectke knownBackgested purposely manuallyxffffffffresent understand importantGL smell, GB sayadata�iffmostly hangwidebuffchallen sus,yx due offlineosed shopping Globalci and known forcedootlewman shuteek arrived Oppo lov connecting important.008 straight Pine Security. mentjoomlainit quickly years Wordpress powerforward67 attack.LECT https conversation tocur primary teams. professional surviveetownay true eye bass/,ffice remov blackashington prepaid bec
 
You can kind of see it figuring out tokens and pairs that might work, but then again, it's just a stochastic pigeon and it doesn't parse English yet. It's just dumping out pairs of words or parts of pairs of words. Figbash was here.
 
The core problem is that we're trying to do too much with too little. The GPT transformer/encoder really wants hundreds of millions or more parameters to work with; it doesn't do well with the 9 million or so we have. Even after playing with different fidelity, temperatures, and other tunables, we never got better than the snippet above. Bumping the parameters up into the billions didn't work either, because then we're vastly over-fitting the training to the data. Meaning, we're trying to write a novel from a single sentence with one punctuation mark. In fact, one post is 50% of the entire corpus.
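
For a sense of scale, here's the back-of-the-envelope arithmetic. The config numbers below are illustrative assumptions, not the exact settings used, but they show why a model this size lands in the single-digit millions of parameters:

// Rough parameter count for a small GPT-style model. Assumed config:
// vocab 8k, d_model 256, 6 layers -- illustrative values only.
fn main() {
    let vocab: u64 = 8_000;
    let d_model: u64 = 256;
    let layers: u64 = 6;

    let embeddings = vocab * d_model;            // token embedding table
    let per_block = 12 * d_model * d_model;      // ~4*d^2 attention + 8*d^2 MLP
    let total = embeddings + layers * per_block; // ignoring biases and layernorms

    println!("~{:.1}M parameters", total as f64 / 1e6);
    // Prints ~6.8M -- the same order of magnitude as the ~9M above, and far
    // below the hundreds of millions a GPT-style setup normally assumes.
}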
 
Today, I realized I have to start with a base model that already has proper weights for English and English grammar, learned from a far larger source dataset. Luckily, Project Gutenberg already thought about this and makes its texts freely available. In the second speedrun of every AI company, I'm now building a model on a strong base dataset and will then use the Publii SQLite dataset as a feeder into the main model. While the single GPU is still cranking away at the pipeline, I expect better output in the end.
 
We're now at the Series A stage of AI startups, having burnt through the Seed round. All this in seven days.