Generating text similarity scores using BERT.

Burhanuddin Rangwala
2 min read · Apr 5, 2021

Text/sentence similarity has been a popular domain in NLP for a long time, and with the release of libraries like sentence-transformers and models like BERT it has become very easy to build a text/sentence similarity generator. That said, the very complex documentation of Hugging Face can be overwhelming for someone new to NLP (it was pretty difficult for me 😄).

So this article aims to help newcomers get started with NLP and sentence transformers.

Procedure:

First, install the sentence-transformers and scikit-learn libraries using

pip install sentence-transformers
pip install scikit-learn

After you have installed these libraries, import them like this:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

After you have imported these libraries you need to create the word embeddings. Word embeddings are learned representations, generally in the form of vectors, where words that have a similar meaning or are somehow related to each other get a similar representation/vector. Previously, to create word vectors you had to train a machine learning model yourself and then use its weights to generate an embedding, but sentence-transformers gives you access to some of the very best pre-trained models, which can be used to generate the embeddings directly.

To read more about word embeddings, read this article: https://machinelearningmastery.com/what-are-word-embeddings/
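To make that idea concrete, here is a quick illustrative sketch (the example words are my own choice, not from any benchmark): encode a few words with the same pre-trained model used later in this article and check that related words end up with closer vectors.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-base-nli-mean-tokens')
vectors = model.encode(["dog", "puppy", "car"])

# "dog" should score higher against "puppy" than against "car"
print(cosine_similarity([vectors[0]], [vectors[1]]))
print(cosine_similarity([vectors[0]], [vectors[2]]))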

For now we are going to focus on the BERT (Bidirectional Encoder Representations from Transformers) model, which uses an attention mechanism that learns contextual relations between the words in a text to generate a word or sentence vector.

To read more about BERT, read this article: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')
a = "I love dogs"
b = "I hate dogs"
sentences = [a, b]
sentence_embeddings = sbert_model.encode(sentences)

In the first line of code we initialize the model. After that we create a list of sentences and encode it with the model, which gives you the sentence embeddings for the two sentences.
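If you want to sanity-check what encode returned, each sentence becomes one fixed-length vector (768 dimensions for this particular model):

print(sentence_embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence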

Once the sentence embeddings have been created, use the cosine_similarity function to get the cosine similarity between the two sentences. Cosine similarity measures the angle between the two embedding vectors: the higher the score, the more similar the sentences are.

cos_sim = cosine_similarity(sentence_embeddings)

Note that the cosine_similarity function returns a matrix containing the similarity score of every sentence against every other sentence (kind of like a confusion matrix), so for two sentences you get a 2×2 matrix.
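To see what that looks like, and what cosine similarity actually computes, you can inspect the matrix and reproduce the off-diagonal entry by hand with numpy (the variable names below come from the earlier snippets):

import numpy as np

print(cos_sim)        # 2x2 matrix: cos_sim[i][j] is the similarity of sentence i and sentence j
print(cos_sim[0][1])  # the score we care about: similarity of "I love dogs" and "I hate dogs"

# the same number computed by hand: dot product of the two vectors divided by their norms
a_vec, b_vec = sentence_embeddings
print(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))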

That’s it. Just like that, in less than 15 lines of code and 10 minutes, you have created a text similarity generator that will output the similarity between any two pieces of text.
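For reference, here is everything from this article rolled into one small script; the similarity helper function is just my own wrapper around the steps above, not part of any library:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-base-nli-mean-tokens')

def similarity(text_a, text_b):
    # encode both texts and return the off-diagonal cosine similarity score
    embeddings = model.encode([text_a, text_b])
    return cosine_similarity(embeddings)[0][1]

print(similarity("I love dogs", "I hate dogs"))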

Written by Burhanuddin Rangwala

Software dev helping startups scale to new heights. Follow me if you want to know more about software development, startups, and me.
