Machines Looking at Human Words ft. Stanford
How do Word2Vec and cosine similarity work?
In today’s edition, you will understand two things: what a machine actually sees when it looks at words, and how it finds the similarity between them in text processing.
You will also learn how to implement word2vec and cosine similarity.
Let’s dive in.
You know that machines only process numbers.
We humans talk in natural language.
Each word in our language will be converted into numbers so that machines understand it.
Why is that?
Machines only know binary (0 and 1).
Converting a number to binary is straightforward: you repeatedly divide it by 2.
You cannot divide a character by 2, so a character first needs a numeric code.
Encodings like ASCII and UTF-8 assign a number to each character, for example ‘A‘ = 65.
Once you have 65, you can divide it by 2 repeatedly to get the binary form.
That way you can make the machine understand what you are trying to say.
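A minimal sketch of this character-to-binary path in Python:

```python
# Character -> number -> binary: the path the machine takes
code = ord('A')    # the encoding gives each character a number
bits = bin(code)   # that number converts directly to binary
print(code, bits)  # 65 0b1000001
```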
But,
The issue is that these codes are arbitrary identifiers for characters; they carry no meaning.
What we want the machine to understand is context.
There is a better way: converting words into numerical embeddings.
Word2Vec and Cosine Similarity
Word2Vec (word to vector) is a better way of converting a word into a numerical representation, one that captures its meaning.
Stanford provides a set of pre-trained word vectors called GloVe (a closely related embedding method) → https://nlp.stanford.edu/projects/glove/
You can download the vectors and run the code below to convert any English word into its vector representation.
import numpy as np

def load_glove(path):
    # Load GloVe vectors into a dict mapping each word to its numpy array
    model = {}
    with open(path, 'r', encoding='utf8') as file:
        for line in file:
            parts = line.split()
            word = parts[0]
            vector = np.array([float(val) for val in parts[1:]])
            model[word] = vector
    return model

glove = load_glove('glove.6B.50d.txt')
glove['python']  # vector embedding for the word "python"
Output →
array([ 0.5897 , -0.55043 , -1.0106 , 0.41226 , 0.57348 , 0.23464 ,
-0.35773 , -1.78 , 0.10745 , 0.74913 , 0.45013 , 1.0351 ,
0.48348 , 0.47954 , 0.51908 , -0.15053 , 0.32474 , 1.0789 ,
-0.90894 , 0.42943 , -0.56388 , 0.69961 , 0.13501 , 0.16557 ,
-0.063592, 0.35435 , 0.42819 , 0.1536 , -0.47018 , -1.0935 ,
1.361 , -0.80821 , -0.674 , 1.2606 , 0.29554 , 1.0835 ,
0.2444 , -1.1877 , -0.60203 , -0.068315, 0.66256 , 0.45336 ,
-1.0178 , 0.68267 , -0.20788 , -0.73393 , 1.2597 , 0.15425 ,
-0.93256 , -0.15025 ])
glove['neural']
Output →
array([ 0.92803 , 0.29096 , 0.67837 , 1.0444 , -0.72551 , 2.1995 ,
0.88767 , -0.94782 , 0.67426 , 0.24908 , 0.95722 , 0.18122 ,
0.064263, 0.64323 , -1.6301 , 0.94972 , -0.7367 , 0.17345 ,
0.67638 , 0.10026 , -0.033782, -0.76971 , 0.40519 , -0.099516,
0.79654 , 0.1103 , -0.076053, -0.090434, 0.015021, -1.137 ,
1.6803 , -0.34424 , 0.77538 , -1.8718 , -0.17148 , 0.31956 ,
0.093062, 0.004996, 0.25716 , 0.52207 , -0.52548 , -0.93144 ,
-1.0553 , 1.4401 , 0.30807 , -0.84872 , 1.9986 , 0.10788 ,
-0.23633 , -0.17978 ])
Here comes a simple question →
How does the computer know that words are similar?
The answer is cosine similarity.
Cosine Similarity
It gives you a score, between -1 and 1, for how similar two word vectors are: the closer to 1, the more similar the words.
You can implement it using the scikit-learn library.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(glove['cat'].reshape(1,-1), glove['dog'].reshape(1,-1))
Output →
array([[0.92180053]])
cat and dog are pretty close to each other.
cosine_similarity(glove['cat'].reshape(1,-1), glove['piano'].reshape(1,-1))
Output →
array([[0.19825255]])
cat and piano are not similar to each other.
cosine_similarity(glove['king'].reshape(1,-1), glove['queen'].reshape(1,-1))
Output →
array([[0.7839043]])
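Under the hood, cosine similarity is just the cosine of the angle between two vectors: cos(θ) = (a · b) / (||a|| · ||b||). A minimal sketch with numpy, using toy 3-dimensional vectors as stand-ins for the real 50-dimensional GloVe embeddings:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors (stand-ins for glove['cat'] and glove['dog'])
cat = np.array([1.0, 2.0, 0.5])
dog = np.array([0.9, 2.1, 0.4])
print(cosine_sim(cat, dog))  # close to 1.0, i.e. "similar"
```

This is exactly what scikit-learn's `cosine_similarity` computes, just for one pair at a time instead of a matrix.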
Words in 2D embedding space
When you plot the numeric representation of these words in 2D embedding space, you will easily understand that similar words will be close to each other.
[Figure: 2D embedding space showing similar words clustering close together]
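To produce such a plot yourself, you can project the 50-dimensional vectors down to two dimensions, for example with a minimal PCA via numpy's SVD. The random vectors below are stand-ins so the sketch is self-contained; with the real embeddings you would stack `glove[word]` for each word instead.

```python
import numpy as np

# Stand-in 50-d vectors (replace with glove[word] for real embeddings)
rng = np.random.default_rng(0)
words = ["cat", "dog", "piano", "king", "queen"]
vectors = np.stack([rng.normal(size=50) for _ in words])

# Minimal PCA via SVD: center the data, keep the top 2 components
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # each word's (x, y) position in 2D

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

Plotting `coords` with matplotlib then gives a scatter where, with real embeddings, similar words land near each other.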