
How to Scale Your Model? ft. Google DeepMind

How to scale language models on TPUs, a deep dive into LLMs, and hiring AI agents instead of humans.

I recently read a book on scaling LLMs by Google DeepMind. Today’s newsletter breaks down how to scale language models on TPUs.

I also added content, news, and resources about a deep dive into LLMs, running DeepSeek locally, and what you should know about talent when hiring.

In today’s edition:

  • AI Roundup— deep dive into LLMs by Andrej Karpathy

  • Dive Deep Drill— how to scale your model

  • Build Together— here’s how I can help you

The Club—Join the private membership for AI Leaders, PMs, VPs, CEOs, Consultants & professionals. Get do-it-yourself & done-with-you sessions.

25 members working at Amazon, Microsoft, Accenture, and more.

AI Roundup

I found these resources, content, and news this week.

— [news] AI Scores on Humanity’s Last Exam
— [content] How to run DeepSeek locally on your computer?
— [resource] Deep Dive into LLMs like ChatGPT by Andrej Karpathy
— [content] An amazing Triangle of Talent
— [resource] White paper by Google on Agents
— [content] Hiring AI agents instead of humans
— [content] How AI agents work and how to build your first AI agent

There’s a reason 400,000 professionals read this daily.

Join The AI Report, trusted by 400,000+ professionals at Google, Microsoft, and OpenAI. Get daily insights, tools, and strategies to master practical AI skills that drive results.

Dive Deep Drill

Google DeepMind shared a book; you can read it here: Scaling LLMs on TPUs.

Here are some of the questions it covers:

  • how TPUs work

  • how they communicate with each other

  • how LLMs run on real hardware

  • how to parallelize your models during training and inference

I kept this breakdown simple and easy to understand. If you are an AI engineer or targeting an engineering position, I encourage you to read the full book.

What follows is a brief for leaders in AI.

How to Scale Your Model

Imagine you have a robot pet.

At first, it knows a few tricks like sitting and jumping.

But what if you want it to juggle, dance, and even tell jokes?

You have to train it, give it more memory, and make sure it can think fast!

That’s exactly what happens when we scale big AI models.

TPU

AI models love to learn, but they need a big playground to run around in.

That’s where TPUs (Tensor Processing Units) come in.

These are special computers built just for AI.

Components of a TPU chip | Source: Google DeepMind

They have three important parts:

  • compute power—how fast they can think.

  • memory—how much they can remember.

  • communication—how well they talk to each other.

Similar to a robot pet that needs a good trainer, an AI model needs the right TPU setup to learn efficiently.
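
If you want to peek at your own playground, here is a tiny JAX sketch that lists the accelerators attached to your machine. This is my own illustration, not code from the book; run it on a TPU VM to see TPU chips, while on a laptop it will just show the CPU.

```python
import jax

# List the accelerators JAX can see. On a TPU VM each entry is a TPU
# chip/core; on a plain laptop this prints a single CPU device.
for d in jax.devices():
    print(d.id, d.platform, d.device_kind)
```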

Sharding

AI models use sharding, which means splitting their knowledge (weights and data) across different TPUs.

A 2D array sharded across 4 TPUs

Here are the tricks they use (a JAX sketch follows the list):

  • AllGather – Everyone shares their puzzle pieces.

  • AllReduce – Everyone combines their knowledge.

  • ReduceScatter – Everyone shares only what’s needed.

  • AllToAll – Everyone swaps puzzle pieces to solve faster.
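
Here is a minimal JAX sketch of the picture above: a 2D array sharded across a 2×2 mesh of devices. The mesh axis names "x" and "y" and the array shape are my own illustrative choices, not from the book, and the sketch assumes at least 4 devices are attached.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange 4 devices into a 2x2 mesh (assumes >= 4 devices, e.g. a TPU slice).
mesh = Mesh(mesh_utils.create_device_mesh((2, 2)), axis_names=("x", "y"))

# Shard an 8x8 array: rows split along "x", columns along "y",
# so each device holds one 4x4 tile.
arr = jnp.arange(64.0).reshape(8, 8)
sharded = jax.device_put(arr, NamedSharding(mesh, P("x", "y")))

# Ops on sharded arrays insert the collectives above automatically;
# this global sum runs an AllReduce across the mesh under the hood.
print(jnp.sum(sharded))
jax.debug.visualize_array_sharding(sharded)  # ASCII picture of the tiles
```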

Training the AI

Training an AI is like teaching your robot pet new tricks.

The more tricks it learns, the more space it needs in its brain.

But we have to train it smartly so it doesn’t forget old tricks.

Here are some ways AI models split the work (a short sketch of the first one follows the list):

  • data parallelism—each TPU learns from different examples.

  • fully sharded data parallelism (FSDP)—each TPU holds only part of the memory.

  • tensor parallelism—each TPU processes a portion of the model’s brain.

  • expert parallelism—specific TPUs focus on different areas.

  • pipeline parallelism—the training is split into steps like an assembly line.
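
As a concrete taste of data parallelism, here is a minimal JAX sketch using pmap: each device runs the same step on its own slice of the batch, then pmean (an AllReduce) averages the gradients so every replica stays in sync. The toy linear model and learning rate are my own illustrative choices, not the book's code.

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)  # toy linear model

# Data parallelism: the same step runs on every device, each on its own
# shard of the batch; pmean AllReduces the gradients across devices.
@functools.partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.1 * grads

n = jax.local_device_count()
w = jnp.broadcast_to(jnp.zeros(4), (n, 4))  # replicate weights per device
x = jnp.ones((n, 8, 4))                     # n shards of an (8, 4) batch
y = jnp.ones((n, 8))
w = train_step(w, x, y)
```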

Making It Work Faster

Once your AI is trained, it needs to give answers fast.

No one likes waiting forever for a response, right?

The goal is to make AI think fast without forgetting what it learned.

To make AI models run at lightning speed, we use the following (a KV-cache sketch comes after the list):

  • KV caching—storing past answers so the AI doesn’t have to start over.

  • quantization—storing numbers in a smaller way, like packing a suitcase efficiently.

  • paged attention—managing the KV cache in small fixed-size pages so memory isn’t wasted, like renting lockers instead of one giant storage room.

  • speculative sampling—letting a smaller AI guess the answer first, then checking it.
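
To make the first idea concrete, here is a minimal KV-cache sketch in JAX. The shapes and function name are hypothetical, just to show the mechanic: at each decoding step the model computes keys and values only for the newest token and writes them into a preallocated buffer, instead of recomputing the whole prefix.

```python
import jax.numpy as jnp

SEQ_LEN, D = 128, 64  # hypothetical cache length and head dimension

def append_kv(cache_k, cache_v, new_k, new_v, step):
    # Write this step's key/value into the preallocated cache buffers,
    # so earlier tokens never need to be recomputed.
    cache_k = cache_k.at[step].set(new_k)
    cache_v = cache_v.at[step].set(new_v)
    return cache_k, cache_v

cache_k = jnp.zeros((SEQ_LEN, D))
cache_v = jnp.zeros((SEQ_LEN, D))
cache_k, cache_v = append_kv(cache_k, cache_v,
                             jnp.ones(D), jnp.ones(D), step=0)
```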

JAX and Profiling

JAX is a Python library from Google for writing numerical programs that compile and run efficiently on TPUs.

And to check that everything is running smoothly, engineers use profiling tools like these (a short trace-capture sketch follows the list):

  • trace viewer—tracks what the AI is doing in real time.

  • graph viewer—shows how the AI thinks step by step.

  • memory profile—helps avoid memory overload.
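
Capturing a trace takes only a few lines; here is a minimal sketch (the output path is an arbitrary choice of mine). You can then open the trace viewer in TensorBoard pointed at that directory.

```python
import jax
import jax.numpy as jnp

# Record everything between start_trace and stop_trace; open the result
# in TensorBoard's trace viewer to see what ran on the device and when.
jax.profiler.start_trace("/tmp/jax-trace")

x = jnp.ones((1024, 1024))
(x @ x).block_until_ready()  # block so the compute lands inside the trace

jax.profiler.stop_trace()
```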

Conclusion

Scaling an AI model is like raising a super-smart pet.

You need the right tools, space, and training methods to make it fast, smart, and efficient.

Whether you’re a decision-maker or an AI engineer, understanding how to train, share, and speed up AI can help you build powerful models that work like magic.

Want to work together? Here’s how I can help you.

I use BeeHiiv to send this newsletter.

Paper Unfold

A series that breaks down complex research papers into easy-to-understand pointers.

How satisfied are you with today's Newsletter?

This will help me serve you better


PS: Reply to this email if you want me to write on the topic you are interested in.
