We will use Quora's Question Pairs dataset for a list of question pairs. The dataset can be found here.
Unzip the train file and load the dataset with pandas. We will look mostly at the question1 column.
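A minimal setup sketch (not from the original post), assuming the unzipped file is named train.csv and using all-MiniLM-L6-v2 as a stand-in for the encoder from the previous blog; both names are assumptions:
import pandas as pd
from sentence_transformers import SentenceTransformer

# assumption: the unzipped Quora train file is named train.csv
traindf = pd.read_csv('train.csv')
print(len(traindf))  # 404290 rows

# assumption: any sentence-transformers encoder works here;
# all-MiniLM-L6-v2 is just a small, common choice
model = SentenceTransformer('all-MiniLM-L6-v2')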
In the previous blog, we implemented three different ways of comparing vectors. The code for all three methods is listed below, each faster than the last.
import numpy as np
from typing import List

def get_cosine_similarity(arr1, arr2):
    # cosine similarity = dot product divided by the product of the magnitudes
    numerator = np.dot(arr1, arr2)
    mag1 = np.sqrt(np.sum(np.square(arr1)))
    mag2 = np.sqrt(np.sum(np.square(arr2)))
    return numerator / (mag1 * mag2)
def get_document_similarities(model, documents: List[str]):
    # naive version: O(n^2) comparisons, re-encoding doc2 inside the inner loop
    similarities = [[0 for _ in range(len(documents))]
                    for _ in range(len(documents))]
    for i, doc1 in enumerate(documents):
        doc1_enc = model.encode(doc1)
        for j, doc2 in enumerate(documents):
            doc2_enc = model.encode(doc2)
            sim = get_cosine_similarity(doc1_enc, doc2_enc)
            similarities[i][j] = sim
    return similarities
def get_document_similarities_faster(model, documents: List[str]):
    # faster version: computes only the lower triangle and mirrors it,
    # since cosine similarity is symmetric
    similarities = [[0 for _ in range(len(documents))]
                    for _ in range(len(documents))]
    for i, doc1 in enumerate(documents):
        doc1_enc = model.encode(doc1)
        for j, doc2 in enumerate(documents):
            if i == j:
                similarities[i][j] = 1
            elif i > j:
                doc2_enc = model.encode(doc2)
                sim = get_cosine_similarity(doc1_enc, doc2_enc)
                similarities[i][j] = sim
                similarities[j][i] = sim
            elif i < j:
                continue
    return similarities
def cosine_similarity_matrix(model, documents: List[str]):
    # encode all documents in one batch, then compute every pairwise
    # similarity with a single matrix multiplication
    document_encodings = model.encode(documents)
    numerator = np.matmul(document_encodings, document_encodings.T)
    row_sum = np.sqrt(np.sum(np.square(document_encodings), axis=1, keepdims=True))
    denominator = np.matmul(row_sum, row_sum.T)
    return numerator / denominator  # elementwise division
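As a quick sanity check (not in the original post), all three functions should produce the same matrix on a small input, up to floating-point tolerance:
docs = traindf['question1'][:5].tolist()
naive = np.array(get_document_similarities(model, docs))
fast = np.array(get_document_similarities_faster(model, docs))
matrix = cosine_similarity_matrix(model, docs)
assert np.allclose(naive, matrix, atol=1e-4)
assert np.allclose(fast, matrix, atol=1e-4)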
We will use the Quora dataset to compare timings and solidify our understanding of each algorithm's time complexity.
Below are the times for each algorithm with only 10 questions. (This code was run in a Google Colab notebook.)
# 888ms
_ = get_document_similarities(model, traindf['question1'][:10].tolist())
# 414 ms
_ = get_document_similarities_faster(model, traindf['question1'][:10].tolist())
# 16.3 ms
_ = cosine_similarity_matrix(model, traindf['question1'][:10].tolist())
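The post does not show the timing harness; one way to reproduce such a measurement is with time.perf_counter (IPython's %%timeit magic in Colab works equally well):
import time

questions = traindf['question1'][:10].tolist()
start = time.perf_counter()
_ = cosine_similarity_matrix(model, questions)
print(f'{(time.perf_counter() - start) * 1000:.1f} ms')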
As you can see, the matrix algorithm is the fastest, at 16.3 ms for 10 examples.
The question1 column has 404,290 rows to compare. Extrapolating linearly, it would take ((404_290 / 10) * 16.3) / 1000 / 60 ≈ 11 mins to compare the entire dataset.
But does this scale?
# prob. in the tens of hours
_ = get_document_similarities(model, traindf['question1'][:500].tolist())
# prob. in the hours
_ = get_document_similarities_faster(model, traindf['question1'][:500].tolist())
# 337 ms
_ = cosine_similarity_matrix(model, traindf['question1'][:500].tolist())
Extrapolating from the 500-question timing, the matrix implementation would take about 4.5 mins to finish the full dataset, given sufficient memory.
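For reference, that figure comes from the same linear extrapolation as before:
# (total rows / sample size) * sample time in seconds, converted to minutes
print((404_290 / 500) * 0.337 / 60)  # ~4.5 minutes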
Since we are doing a lot of matrix multiplication, we can leverage a GPU to run these calculations in parallel and make the algorithm faster. NumPy runs on the CPU, not the GPU, so we must port our algorithm to a library that runs code on the GPU, such as PyTorch.
Fortunately, PyTorch is designed to replicate much of the API NumPy offers. We merely need to replace the numpy calls with torch calls. The code below does that.
# replace np with torch
def cosine_similarity_matrix_with_torch_cuda(model, documents: List[str]):
    document_encodings = model.encode(documents)
    # move the encodings onto the GPU (device is defined below)
    document_encodings = torch.tensor(document_encodings).to(device)
    numerator = torch.matmul(document_encodings, document_encodings.T)
    row_sum = torch.sqrt(torch.sum(torch.square(document_encodings), dim=1, keepdim=True))
    denominator = torch.matmul(row_sum, row_sum.T)
    return numerator / denominator  # elementwise division
Instead of creating a numpy array, we convert the document encodings to a torch tensor. You can select the GPU (CUDA) runtime in Google Colab to follow along with this blog.
Note, device is defined below.
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device) #cuda
# 298 ms
_ = cosine_similarity_matrix_with_torch_cuda(model, traindf['question1'].tolist()[:500])
The PyTorch version, on a CUDA device, runs roughly 40 ms faster (298 ms vs. 337 ms) on a question set of 500 examples. The full dataset would take approximately 4 mins.
The matrix implementation does not scale either; here we are presented with a different problem:
_ = cosine_similarity_matrix_with_torch_cuda(model, traindf['question1'].tolist()[:50_000])
#RuntimeError: CUDA out of memory. Tried to allocate 9.31 GiB (GPU 0; 14.76 GiB total capacity; 12.90 GiB already allocated; 385.75 MiB free; 13.15 GiB reserved in total by PyTorch)
#If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
#See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
With a question set of 50k rows, we get an out-of-memory error: the 50,000 × 50,000 float32 similarity matrix alone requires about 9.31 GiB, matching the failed allocation in the error above.
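A quick check of that arithmetic:
n = 50_000
matrix_bytes = n * n * 4       # n x n float32 values, 4 bytes each
print(matrix_bytes / 2**30)    # ~9.31 GiB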