Machines Looking at Human Words ft. Stanford

How do Word2Vec and cosine similarity work?

Himanshu Ramchandani
August 03, 2024 • Estimated Reading Time: 6 minutes

In today’s edition, you will understand two things: what a machine sees in words, and how it finds the similarity between them in text processing.

You will also see how to implement word vectors and cosine similarity.

Let’s dive in.

You know that machines only process numbers.

We humans talk in natural language.

Each word in our language has to be converted into numbers before a machine can process it.

Why is that?

Machines only know binary (0 and 1).

To write a number in binary, you repeatedly divide it by 2 and collect the remainders.

You cannot divide a character by 2, for obvious reasons.

So character encodings such as ASCII and UTF-8 assign a number to each character, for example ‘A’ = 65.

Once you have 65, you can divide it by 2 repeatedly to get its binary form, 1000001.

That way you can make the machine understand what you are trying to say.
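
The steps above can be sketched in a few lines of Python. This is only an illustration: `to_binary` is a hypothetical helper written for this example, while `ord` is Python's built-in that returns the number assigned to a character.

```python
def to_binary(number):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if number == 0:
        return "0"
    bits = []
    while number > 0:
        bits.append(str(number % 2))  # the remainder is the next bit, lowest first
        number //= 2
    return "".join(reversed(bits))

code = ord("A")           # character -> number (65)
binary = to_binary(code)  # number -> binary string
print(code, binary)       # prints: 65 1000001
```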

But,

The issue is that these numbers are arbitrary codes. They identify the characters in a word but say nothing about what the word means.

What we want the machine to capture is the context.

There is a better way: convert each word into a numerical embedding.

Word2Vec and Cosine Similarity

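Word2Vec (word to vector) is a better way of converting a word into a numerical representation: each word becomes a vector, and words used in similar contexts get similar vectors.

The code below uses GloVe, a closely related set of pretrained word vectors published by Stanford → https://nlp.stanford.edu/projects/glove/

You can download the vectors and run the code below to convert any word in the GloVe vocabulary into its vector representation. The file used here, glove.6B.50d.txt, stores 50 numbers per word.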

import numpy as np

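def loadGlove(path):
    """Load GloVe vectors from a text file into a {word: vector} dictionary."""
    model = {}

    # 'with' closes the file even if parsing fails part-way through
    with open(path, 'r', encoding='utf8') as file:
        for l in file:
            # each line holds a word followed by the numbers of its vector
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value

    return model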

glove = loadGlove('glove.6B.50d.txt')

glove['python']   # vector embedding for the word Python

Output →

array([ 0.5897  , -0.55043 , -1.0106  ,  0.41226 ,  0.57348 ,  0.23464 ,
       -0.35773 , -1.78    ,  0.10745 ,  0.74913 ,  0.45013 ,  1.0351  ,
        0.48348 ,  0.47954 ,  0.51908 , -0.15053 ,  0.32474 ,  1.0789  ,
       -0.90894 ,  0.42943 , -0.56388 ,  0.69961 ,  0.13501 ,  0.16557 ,
       -0.063592,  0.35435 ,  0.42819 ,  0.1536  , -0.47018 , -1.0935  ,
        1.361   , -0.80821 , -0.674   ,  1.2606  ,  0.29554 ,  1.0835  ,
        0.2444  , -1.1877  , -0.60203 , -0.068315,  0.66256 ,  0.45336 ,
       -1.0178  ,  0.68267 , -0.20788 , -0.73393 ,  1.2597  ,  0.15425 ,
       -0.93256 , -0.15025 ])

glove['neural']

Output →

array([ 0.92803 ,  0.29096 ,  0.67837 ,  1.0444  , -0.72551 ,  2.1995  ,
        0.88767 , -0.94782 ,  0.67426 ,  0.24908 ,  0.95722 ,  0.18122 ,
        0.064263,  0.64323 , -1.6301  ,  0.94972 , -0.7367  ,  0.17345 ,
        0.67638 ,  0.10026 , -0.033782, -0.76971 ,  0.40519 , -0.099516,
        0.79654 ,  0.1103  , -0.076053, -0.090434,  0.015021, -1.137   ,
        1.6803  , -0.34424 ,  0.77538 , -1.8718  , -0.17148 ,  0.31956 ,
        0.093062,  0.004996,  0.25716 ,  0.52207 , -0.52548 , -0.93144 ,
       -1.0553  ,  1.4401  ,  0.30807 , -0.84872 ,  1.9986  ,  0.10788 ,
       -0.23633 , -0.17978 ])
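
One caveat with the lookups above: a word that is not in the file raises a `KeyError`. The glove.6B files are described as uncased, so the vocabulary is lowercase and `glove['Python']` would fail where `glove['python']` works. Here is a minimal sketch of a safer lookup. `lookup` is a hypothetical helper, and the small `toy` dictionary with made-up numbers only stands in for the real `glove` dictionary.

```python
import numpy as np

def lookup(model, word, default=None):
    """Return the vector for `word` (lowercased), or `default` if it is missing."""
    return model.get(word.lower(), default)

# Made-up 3-dimensional vectors standing in for the loaded GloVe dictionary.
toy = {
    "python": np.array([0.5, -0.5, 1.0]),
    "neural": np.array([0.9, 0.3, 0.7]),
}

print(lookup(toy, "Python"))      # found after lowercasing
print(lookup(toy, "qwertyuiop"))  # None, the word is not in the vocabulary
```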

Here comes a simple question →

How does the computer know that words are similar?

The answer is cosine similarity.

Cosine Similarity

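It gives you a score for how similar two word vectors are: the cosine of the angle between them. The score ranges from -1 to 1, and a value close to 1 means the two vectors point in nearly the same direction.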
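
Under the hood, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal NumPy sketch of that calculation. It uses made-up vectors rather than GloVe values, and `cosine` is a hypothetical helper name.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b, a value between -1 and 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a
d = np.array([3.0, 0.0, -1.0])    # perpendicular to a (dot product is 0)

print(cosine(a, b))  # close to 1.0
print(cosine(a, c))  # close to -1.0
print(cosine(a, d))  # 0.0
```

Note that the score depends only on direction, not on length: `b` is twice as long as `a` and still scores about 1.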

You can implement it using the scikit-learn library.

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(glove['cat'].reshape(1,-1), glove['dog'].reshape(1,-1))

Output →

array([[0.92180053]])

cat and dog are pretty close to each other.

cosine_similarity(glove['cat'].reshape(1,-1), glove['piano'].reshape(1,-1))

Output →

array([[0.19825255]])

cat and piano are not similar to each other.

cosine_similarity(glove['king'].reshape(1,-1), glove['queen'].reshape(1,-1))

Output →

array([[0.7839043]])
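
The same score can rank a whole vocabulary against one query word. Here is a minimal sketch, assuming a dictionary shaped like `glove` (word to NumPy vector). `most_similar` is a hypothetical helper and the toy vectors are made up for illustration.

```python
import numpy as np

def most_similar(model, word, k=3):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    query = model[word]
    scores = []
    for other, vec in model.items():
        if other == word:
            continue  # skip the query word itself
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((other, float(sim)))
    # highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 2-dimensional vectors; real GloVe vectors have 50 or more dimensions.
toy = {
    "cat":   np.array([0.9, 0.1]),
    "dog":   np.array([0.8, 0.2]),
    "piano": np.array([0.1, 0.9]),
}

print(most_similar(toy, "cat", k=2))  # 'dog' ranks above 'piano'
```

On the full GloVe dictionary this word-by-word loop would be slow. Stacking the vectors into one matrix and scoring them in a single operation is the usual next step.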

Words in 2D embedding space

When you plot the numeric representation of these words in a 2D embedding space (after reducing each vector to two dimensions), you can see that similar words sit close to each other.

[Figure: 2D embedding space for similarity of words]
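
GloVe vectors have 50 or more dimensions, so a 2D picture like this needs a dimensionality reduction step first. The post does not say which method produced the figure. The sketch below uses PCA computed with NumPy's SVD as one common choice. `project_2d` is a hypothetical helper and the vectors are made up.

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their first two principal components (PCA via SVD)."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T     # shape: (number of words, 2)

words = ["cat", "dog", "piano", "guitar"]
# Made-up 4-dimensional vectors standing in for real embeddings.
vectors = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.2, 0.1],
    [0.1, 0.0, 0.9, 0.8],
    [0.2, 0.1, 0.8, 0.9],
])

coords = project_2d(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

The resulting `coords` can be passed to any plotting library as x and y positions. In this toy data, cat and dog land near each other and far from piano and guitar.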
