How Does LLM Work?
GPT, Prompts, Tokens, Parameters, the Attention Mechanism, ChatGPT vs Google Search, LLM Training, Working, and Security
You have heard a lot of noise about large language models (LLMs), the most famous of which powers ChatGPT.
One of the first major transformer-based language models was Google's BERT (Bidirectional Encoder Representations from Transformers), but its responses were not human-like.
In 2020, GPT-3 was launched, and its responses were better than those of any other LLM at the time.
You and I will dive deep and build our intuition around ChatGPT.
To understand how it works, let’s learn about its other technical parts.
Many professionals don’t know the correct answers to common interview questions about ChatGPT, so I am covering them in this edition.
Let’s go!
What Exactly is GPT? (Generative, Pre-Trained, Transformer)
Let’s break everything down -
Generative
Pre-Trained
Transformer
Generative
The word “Generative” comes from statistics.
When I was doing my master’s, there was a subject called statistical modeling.
In statistical modeling, you will find a branch of generative modeling.
Generative modeling means you generate/predict numbers based on previous numbers and probability.
Whether you generate images or text, the machine is ultimately generating numbers.
Pre-Trained
As humans, how do we learn?
We generally don’t jump to complex topics before the fundamentals.
In chess, you first learn the foundational moves and how the pieces move, practice them, and then compete with other players.
Pre-trained models work similarly.
Pre-trained models
You train these models on foundational tasks and then fine-tune them for more complex ones.
These models build their own memory, called parameters, which are optimized based on what they learn from the data.
Instead of training a model from scratch, you can take the already trained model and fine-tune it for your specific use case.
To train these models you need a huge amount of data.
GPT-3 is trained on a huge corpus of text with 5 datasets - Common Crawl, WebText2, Books1, Books2, and Wikipedia.
These datasets contain around half a trillion tokens, which is sufficient to train the model to understand the relationships between words, grammar, sentence formation, which word comes next, and so on.
Transformer
Transformer is a neural network architecture.
Originally, to communicate with a machine you had to use binary.
Later came programming languages: you could learn Python and give instructions to the machine.
But now you can give instructions directly in English.
The gap between humans and machines has shrunk; you no longer need to learn programming to interact with a machine.
The transformer was introduced by Google researchers in the 2017 paper “Attention Is All You Need”.
Transformer Architecture
What is a Prompt and Why Does it Matter?
“Who is the prime minister of India?“ - is the prompt I passed to ChatGPT.
The input you give to LLMs is a prompt.
To get a better response, you should give a better prompt.
You should adapt to the way the LLM works.
The more clearly you define the prompt, the better the response you get.
Because an LLM predicts the next word based on the previous words, the more words you provide in the prompt, the easier it is for the model to find patterns between words and generate a better response.
What Are Tokens, and Why Are They Important?
Tokens can be individual or partial words, as seen in the above image.
Large Language Models use tokens to measure 3 things →
the size of the data they were trained on
the input they can take
the output they can produce
OpenAI tokenizer
The tokens are converted into numeric embeddings, since models of every kind process numbers only.
Types of Tokens
There are typically three types of tokens used in LLMs:
Word Tokens
These are individual words. For example, "apple," "runs," and "cat" are word tokens.
Word-based tokenization is simpler but may struggle with rare or compound words.
Subword Tokens
These represent parts of words, typically used when a word is too rare or complex for the model.
For example, "unhappiness" might be split into subword tokens like "un" and "happiness".
This is often used in models like GPT; see the image above.
This scheme, called Byte Pair Encoding (BPE), is what ChatGPT’s tokenizer uses (there is a short sketch of it after this section).
Character Tokens
These are individual characters (letters, numbers, punctuation marks).
This type of tokenization is very fine-grained and is usually used for languages with complex scripts or specific tasks like spelling correction.
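If you want to see BPE tokenization in practice, here is a minimal sketch using OpenAI’s tiktoken library (assuming it is installed, e.g. via pip install tiktoken); "cl100k_base" is one of its standard encodings:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("unhappiness is not a single token")
print(ids)                              # token ids (the numbers the model actually sees)
print([enc.decode([i]) for i in ids])   # the word or subword each id stands for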
Attention Mechanism
LLMs work on the attention mechanism. For example:
“Himanshu is an AI consultant. He is going to solve your problems.“
In this sentence, “Himanshu“ and “He” are related; as a human you understand that “He” refers to “Himanshu”. That is attention.
Transformers can keep this attention information across long text. That is why, if you ask ChatGPT about something you mentioned earlier in the chat, it knows what you are talking about.
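To build intuition for what the attention mechanism computes, here is a minimal sketch of scaled dot-product attention (the core operation inside the transformer) using toy random vectors; real models use learned projections and many attention heads, so this is only an illustration:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how strongly each token relates to every other token
    weights = softmax(scores)       # attention weights; each row sums to 1
    return weights @ V, weights

tokens = np.random.randn(3, 4)     # three toy token vectors, e.g. "Himanshu", "He", "problems"
output, weights = attention(tokens, tokens, tokens)   # self-attention: Q = K = V
print(weights.round(2))            # row i shows how much token i "pays attention" to each token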
How Does ChatGPT Work?
Working flow of LLM predicting the next word | Image - NVIDIA
There are 171,476 words in the English language; you can assign each of them a probability of being the next word in the sentence “The sky is ……“.
The word with the highest probability wins the spot, in this case, “blue“.
LLMs do not improve over time; they always start fresh.
How did we humans come up with the word “blue“?
We have been reading English for years; we don’t remember sentences word for word, but our understanding of phrases, the relationships between words, and our knowledge tell us that the next word will be “blue”.
ChatGPT works forward, not backward
Whenever you prompt ChatGPT, it generates the next word.
It is just a next-word predictor, running in real time.
Let’s say I gave the input: “What is the capital of India?“
ChatGPT will respond with: “The capital of India …..“
Every time it predicts the next word, that word becomes part of the input sequence.
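As a rough illustration of this forward, word-by-word loop, here is a toy sketch; predict_next_word is a made-up stand-in for the real model and simply looks up canned continuations:

def predict_next_word(words):
    # stand-in for the real model: a tiny lookup table of canned continuations
    canned = {"India?": "The", "The": "capital", "capital": "of",
              "of": "India", "India": "is", "is": "New", "New": "Delhi."}
    return canned.get(words[-1], "")

prompt = "What is the capital of India?".split()
while True:
    next_word = predict_next_word(prompt)
    if not next_word:
        break
    prompt.append(next_word)   # the predicted word becomes part of the input sequence

print(" ".join(prompt))        # What is the capital of India? The capital of India is New Delhi.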
What Are LLM Parameters?
Each parameter affects how the model understands natural language.
Since an LLM is ultimately a neural network, it has weights and biases.
That’s what the parameters are: the weights show how strongly words and phrases are connected.
The biases are constant values that act as a starting point for the model’s understanding of the data.
The parameters also include the vector representations of words as numerical embeddings.
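As a rough sketch of what gets counted when we say a model has “175 billion parameters”, here is a toy example (the sizes are made up) counting an embedding table plus one layer of weights and biases:

import numpy as np

vocab_size, embed_dim, hidden_dim = 1000, 64, 256

embeddings = np.random.randn(vocab_size, embed_dim)   # one vector per token in the vocabulary
weights = np.random.randn(embed_dim, hidden_dim)      # connection strengths between layers
biases = np.zeros(hidden_dim)                         # constant offsets, the starting point

total_params = embeddings.size + weights.size + biases.size
print(total_params)   # 64,000 + 16,384 + 256 = 80,640 parameters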
Why Do I Get Different Responses to the Same Question from ChatGPT?
As it generates the response word by word, the actual information about the topic stays the same, but the sentence formation and the pattern differ.
The LLM generates a probability distribution for the next word and then samples from it, so the word chosen can be different each time.
ChatGPT generates responses using probabilistic methods, a technique called sampling:
non-deterministic sampling (meaning more than one possible outcome)
from a probability distribution of possible next words,
with randomness introduced through temperature (which controls the level of randomness in choosing the next word).
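Here is a minimal sketch of temperature sampling, assuming made-up scores for four candidate next words; it only illustrates why the same prompt can produce different words:

import numpy as np

words = ["blue", "clear", "falling", "green"]
logits = np.array([3.0, 2.0, 0.5, 0.1])   # hypothetical model scores for the next word

def sample(logits, temperature=1.0):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()            # probability distribution over candidate words
    return np.random.choice(words, p=probs)

print([sample(logits, temperature=0.2) for _ in range(5)])   # low temperature: almost always "blue"
print([sample(logits, temperature=1.5) for _ in range(5)])   # high temperature: more variety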
LLMs will always start fresh
LLMs do not improve with use
To get a better response, you must adapt to the LLM
How Does LLM Generate Human-like Responses?
GPT-3 was a base model, trained on a huge dataset.
The problem with a base model is that it only learns the patterns in the text and generates responses by continuing those patterns.
For example -
If you ask 2 questions like this:
What is the capital of India?
What is the capital of China?
It will detect the pattern and generate the response like this:
What is the capital of India?
What is the capital of China?
What is the capital of Sri Lanka?
As a user you don’t want a response like that; if you ask 2 questions, you should get 2 answers, not another question.
That was the problem with the base model.
To solve this problem, OpenAI built an instruction dataset.
They fine-tuned the base model on these instructions so that, for example, if you ask 2 questions, you get 2 answers.
They also hired people to manually label the best responses.
So,
ChatGPT does not know anything.
It does not have self-awareness.
It does not have consciousness.
How is ChatGPT Different From Google Search?
Google vs ChatGPT
Google Search is a semantic search: it searches its index based on the context, intent, and keywords of the user’s query and returns relevant results.
You read the text in a scannable manner, you don’t remember the text word by word, you only remember important information.
Similarly, language models only keep important information in the form of parameters.
GPT-3 is trained on roughly 500 billion tokens from the text corpus of the 5 datasets mentioned earlier.
After training, it ended up with 175 billion parameters (consider parameters the memory of the model).
When you query a large language model, it generates a response from its parameters (its memory), not from the data it was trained on.
Just like humans do: if I ask you “What is a computer?“, you will not recite the textbook definition word for word; you will respond based on the way you understood the meaning of “computer”.
Does GPT Get Better Over Time?
The answer is NO.
You have to train the model again on newer data.
This is a challenge for language models: you will not get answers about current events, because the model was trained only on data up to March 2022.
Every time you query ChatGPT, it stores your questions and its responses in a database, but the GPT model is not continuously learning and improving from those user interactions.
It gives you a response based on the 175 billion parameters that were learned from the huge corpus of roughly 500 billion tokens.
GPT was trained on that data and produced this huge, complex n-dimensional matrix of numbers we call parameters.
Analogy -
When we as humans learn something, we gather all the information (data) we can, break it down into pieces (tokens), build our understanding, and remember only the important things about it (parameters).
Note - To make LLMs connect with external data sources for better responses about current information, you can use RAG strategies. (RAG will be covered in future editions.)
What LLMs Are and Why They Matter
Large Language Models (LLMs) are trained on a very large amount of text and have a very large number of parameters.
They are more capable of understanding complex and huge corpora of text.
“Large” applies to every dimension of these models:
Data
Parameters
Performance
Computational resources
Storage and Inference time
There were a lot of language models before transformer-based models like GPT and BERT, here are some:
Hidden Markov models (HMMs)
Recurrent neural networks (RNNs)
Long short-term memory (LSTM)
Gated recurrent unit (GRU)
These models worked well for specific tasks, but transformer-based models outperformed all of them.
If you learn about the transformer architecture, you will understand that it solves the problems faced by the models mentioned above.
LLMs matter a lot to us as they have improved performance, broad generalization, few-shot learning, understanding of complex contexts, multilingual capabilities, and human-like text generation.
How do Large Language Models work?
You need numbers to help the machine understand any instruction.
Each word in our language has to be converted into numbers so that the machine understands it.
Why?
Because machines only know binary (0 and 1).
You can only convert a number into binary, since the conversion works by repeatedly dividing by 2.
You cannot divide a character by 2, for obvious reasons.
Encodings like ASCII and UTF-8 give you a number for each character, for example ‘A‘ = 65.
Once you have 65, you can divide it by 2 repeatedly to get the binary.
That way you can make the machine understand what you are trying to say.
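A quick worked example of that character-to-number-to-binary path in Python:

print(ord("A"))                  # 65, the code point for 'A'
print(bin(ord("A")))             # '0b1000001', its binary representation
print(format(ord("A"), "08b"))   # '01000001', as an 8-bit binary string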
But,
the issue with this method is that the machine only maps each word to an arbitrary number; it will never understand the context and intent of the text.
We have a better way: converting words into numerical embeddings.
To let the machine understand the relationships between words and the grammar patterns of the text, you need a numerical representation of those words that carries meaning.
See this example:
Word Vectors and Cosine Similarity
Converting words to vectors (word embeddings) is a better way of turning a word into a numerical representation.
Stanford provides a pre-trained set of word vectors (GloVe) → https://nlp.stanford.edu/projects/glove/
You can download the dataset and run the code below to convert any word from the English language into its vector representation.
import numpy as np

def loadGlove(path):
    # read the GloVe file: each line is a word followed by the values of its vector
    model = {}
    with open(path, 'r', encoding='utf8') as file:
        for l in file:
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value
    return model

glove = loadGlove('glove.6B.50d.txt')
glove['python']  # vector embedding for the word "python"
Output →
array([ 0.5897 , -0.55043 , -1.0106 , 0.41226 , 0.57348 , 0.23464 ,
-0.35773 , -1.78 , 0.10745 , 0.74913 , 0.45013 , 1.0351 ,
0.48348 , 0.47954 , 0.51908 , -0.15053 , 0.32474 , 1.0789 ,
-0.90894 , 0.42943 , -0.56388 , 0.69961 , 0.13501 , 0.16557 ,
-0.063592, 0.35435 , 0.42819 , 0.1536 , -0.47018 , -1.0935 ,
1.361 , -0.80821 , -0.674 , 1.2606 , 0.29554 , 1.0835 ,
0.2444 , -1.1877 , -0.60203 , -0.068315, 0.66256 , 0.45336 ,
-1.0178 , 0.68267 , -0.20788 , -0.73393 , 1.2597 , 0.15425 ,
-0.93256 , -0.15025 ])
The array you see is the numerical representation of the word “python”.
Here comes a simple question: How does the computer know that words are similar?
The answer is cosine similarity.
Cosine Similarity
It gives you a score between -1 and 1 for how similar two word vectors are (it is not a probability).
You can implement it using the scikit-learn library.
from sklearn.metrics.pairwise import cosine_similarity

# reshape to 2D row vectors because cosine_similarity expects matrices
cosine_similarity(glove['cat'].reshape(1,-1), glove['dog'].reshape(1,-1))
Output →
array([[0.92180053]])
"cat" and "dog" are pretty close to each other.
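For the curious, here is what cosine_similarity computes under the hood, reusing the glove dictionary loaded above: the dot product of the two vectors divided by the product of their lengths.

import numpy as np

def cosine(a, b):
    # cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(glove['cat'], glove['dog']))   # ~0.9218, the same value scikit-learn returned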
Words in 2D vector space
When you plot the numeric representations of these words in a 2D vector space, you can easily see that similar words sit close to each other.
2d vector space for similarity of words
Using these vector embeddings, a transformer can pre-process the text as numerical representations through its encoder and extract the context of words, the relationships between words, parts of speech, and so on.
After understanding the input, the model uses this knowledge to generate unique and context-aware responses.
How are LLMs trained?
Large Language Models are trained on massive amounts of text data using transformer-based neural networks, which are made up of many layers and connections. Here's a simple breakdown.
The network has "nodes" connected across layers. Each connection has a weight (importance), and each node has a bias (adjustment).
Together with embeddings (how words are represented as vectors), these form the parameters of the model. LLMs have billions of these parameters.
The model looks at text, one part at a time, and predicts the next word or token in the sequence.
It adjusts its parameters (weights and biases) to improve predictions during each training iteration, using feedback to learn better patterns.
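To make that loop concrete, here is a toy sketch (nothing like a real LLM in scale or architecture): a tiny next-word model whose single weight matrix is adjusted by gradient descent after each prediction.

import numpy as np

text = "the sky is blue the sky is clear".split()
vocab = sorted(set(text))
stoi = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

W = np.zeros((V, V))   # the model's "parameters": one row of next-word scores per word

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for epoch in range(200):
    for cur, nxt in zip(text[:-1], text[1:]):
        i, j = stoi[cur], stoi[nxt]
        probs = softmax(W[i])    # predicted distribution over the next word
        grad = probs.copy()
        grad[j] -= 1.0           # gradient of the cross-entropy loss for the true next word
        W[i] -= lr * grad        # adjust the parameters using this feedback

print(vocab[int(np.argmax(softmax(W[stoi["is"]])))])   # "blue" or "clear" after training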
Once trained, LLMs can handle different tasks by adapting in the following ways:
Zero-shot Learning
The model performs tasks it wasn’t specifically trained for, based only on the instructions (prompts) given to it. Accuracy may vary.
Few-shot Learning
Adding a few examples improves its understanding and performance for specific tasks.
Fine-tuning
The model is further trained with more data tailored to a specific task, making it highly accurate for that application.
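Here is a small, made-up sketch of the difference between a zero-shot and a few-shot prompt for the same task; the example reviews are invented, and the resulting strings could be sent to any chat or completion API.

# zero-shot: only an instruction, no examples
zero_shot = (
    "Classify the sentiment of this review as Positive or Negative.\n"
    "Review: The battery dies in an hour.\nSentiment:"
)

# few-shot: the same instruction plus a couple of worked examples
few_shot = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: Absolutely loved the camera quality.\nSentiment: Positive\n"
    "Review: The screen cracked within a week.\nSentiment: Negative\n"
    "Review: The battery dies in an hour.\nSentiment:"
)

print(zero_shot)
print(few_shot)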
Challenges of LLMs
amount of data
computational resources
risk of bias
model robustness
interpretability and debugging
environmental impact
Applications of LLMs Beyond ChatGPT
Look at the landscape of AI companies solving problems in different categories.
Source - Sequoia
There are a lot of other tools that are used for specific applications.
Source - Gartner
How Secure and Private Are LLMs?
The security and privacy of Large Language Models depend on how they are built, used, and managed.
Data Privacy & Security Risks
LLMs are trained on publicly available data like websites and books.
If sensitive data is included in the training data, it might show up in responses.
User inputs during interactions may be logged or analyzed, creating privacy risks if sensitive information is shared.
LLMs can sometimes generate sensitive or proprietary information by mistake.
They can be tricked using malicious inputs (like prompt injections) to behave unexpectedly.
Improving Security and Privacy
Training datasets are filtered to avoid personal or sensitive data.
Running LLMs locally or in private clouds ensures data stays within an organization.
Techniques like adding noise to training data (differential privacy) make it harder to trace back information.
Strict logging policies and anonymizing user inputs enhance privacy.
Compliance with Regulations & User Responsibility
LLM providers must follow privacy laws like GDPR or HIPAA.
Don’t share confidential or sensitive information with public LLMs.
Use platforms that clearly explain their data handling policies.
Final Thought
Large Language Models will not change your life; they are a technology that you will learn and move on.
They are assistants to us.
LLMs are not magic; there is a technical side to them.
They are also not the answer to every problem in your organization.
In some business scenarios, machine learning will work best.
You need to understand when to use LLMs.
LLMs are secure when implemented with the right safeguards, but users and organizations must follow data handling practices.
Transparency, robust security measures, and privacy-conscious deployment are essential for safe and ethical use.
The AI world is moving fast; there is no first-mover advantage; it is the fast-mover advantage.
Learn fast, build fast, win fast, and move fast.
Happy AI