The Dataset

We will use Quora's Question Pairs dataset, a list of question pairs. The dataset can be found here.

Unzip the train file and load the dataset with pandas. We will mostly look at the question1 column.
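Loading it might look like the following (a minimal sketch; the file name train.csv is an assumption based on how the Kaggle download is typically named):

import pandas as pd

# Assumes the unzipped train file is named train.csv; adjust the path as needed.
traindf = pd.read_csv('train.csv')
print(traindf['question1'].head())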

Review of Previous Blog

In the previous blog, we implemented three different ways of comparing vectors. The code for all three methods is listed below, each faster than the last.

import numpy as np
from typing import List

def get_cosine_similarity(arr1, arr2):
    # Cosine similarity: dot product divided by the product of the magnitudes.
    numerator = np.dot(arr1, arr2)
    mag1 = np.sqrt(np.sum(np.square(arr1)))
    mag2 = np.sqrt(np.sum(np.square(arr2)))
    return numerator / (mag1 * mag2)

def get_document_similarities(model, documents: List[str]):
    # Naive O(n^2) approach: every (i, j) pair is encoded and compared,
    # including both (i, j) and (j, i).
    similarities = [[0 for _ in range(len(documents))]
                    for _ in range(len(documents))]
    for i, doc1 in enumerate(documents):
        doc1_enc = model.encode(doc1)
        for j, doc2 in enumerate(documents):
            doc2_enc = model.encode(doc2)
            sim = get_cosine_similarity(doc1_enc, doc2_enc)
            similarities[i][j] = sim
    return similarities

def get_document_similarities_faster(model, documents: List[str]):
    # Exploits symmetry: sim(i, j) == sim(j, i), so only the lower
    # triangle is computed and mirrored; the diagonal is always 1.
    similarities = [[0 for _ in range(len(documents))]
                    for _ in range(len(documents))]
    for i, doc1 in enumerate(documents):
        doc1_enc = model.encode(doc1)
        for j, doc2 in enumerate(documents):
            if i == j:
                similarities[i][j] = 1
            elif i > j:
                doc2_enc = model.encode(doc2)
                sim = get_cosine_similarity(doc1_enc, doc2_enc)
                similarities[i][j] = sim
                similarities[j][i] = sim
            elif i < j:
                continue
    return similarities

def cosine_similarity_matrix(model, documents: List[str]):
    # Vectorized approach: one batched encode, then a single matrix
    # multiply computes every pairwise dot product at once.
    document_encodings = model.encode(documents)
    numerator = np.matmul(document_encodings, document_encodings.T)
    row_sum = np.sqrt(np.sum(np.square(document_encodings), axis=1, keepdims=True))
    denominator = np.matmul(row_sum, row_sum.T)
    return numerator / denominator  # division is done elementwise

We will use the Quora dataset to compare timings and solidify our theoretical understanding of each algorithm's time complexity.
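Note that all three functions assume a model object with an encode method. In our case this is a sentence-transformers model; a minimal setup might look like this (the specific checkpoint, all-MiniLM-L6-v2, is an assumption; any sentence-embedding model works):

from sentence_transformers import SentenceTransformer

# Any model exposing .encode() will do; this checkpoint is just a small,
# commonly used example and not necessarily the one from the previous blog.
model = SentenceTransformer('all-MiniLM-L6-v2')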

Experiments

Experiment #1

Below are the timings of each algorithm with only 10 questions. (This code was run in a Google Colab notebook.)

# 888 ms
_ = get_document_similarities(model, traindf['question1'][:10].tolist())
# 414 ms
_ = get_document_similarities_faster(model, traindf['question1'][:10].tolist())
# 16.3 ms
_ = cosine_similarity_matrix(model, traindf['question1'][:10].tolist())
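These numbers came from Colab's cell timing; to reproduce rough timings yourself outside a notebook, a simple sketch with time.perf_counter works anywhere:

import time

start = time.perf_counter()
_ = cosine_similarity_matrix(model, traindf['question1'][:10].tolist())
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")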

As you can see, the matrix algorithm is the fastest, at 16.3 ms for 10 examples.

The question1 column has 404,290 rows of data to compare. Assuming roughly linear scaling, it would take ((404_290 / 10) * 16.3) / 1000 / 60 ≈ 11 minutes to compare the entire dataset.

But does this scale?

Experiment #2

# prob. in the tens of hours
_ = get_document_similarities(model, traindf['question1'][:500].tolist())
# prob. in the hours
_ = get_document_similarities_faster(model, traindf['question1'][:500].tolist())
# 337 ms
_ = cosine_similarity_matrix(model, traindf['question1'][:500].tolist())

Extrapolating again, the matrix implementation would take ((404_290 / 500) * 337) / 1000 / 60 ≈ 4.5 minutes to finish the entire dataset, given that we have sufficient memory.

PyTorch Matrix Multiplication

Since we are doing a lot of matrix multiplication, we can leverage a GPU to run these calculations in parallel and speed up the algorithm. NumPy runs on the CPU, not the GPU, so we must port our algorithm to a library that can execute on the GPU: PyTorch.

Fortunately, PyTorch is designed to replicate much of NumPy's API. We merely need to replace the NumPy calls with PyTorch calls. The code below does exactly that.

# replace np with torch
def cosine_similarity_matrix_with_torch_cuda(model, documents: List[str]):
    document_encodings = model.encode(documents)
    # Move the encodings onto the GPU (device is defined below).
    document_encodings = torch.tensor(document_encodings).to(device)
    numerator = torch.matmul(document_encodings, document_encodings.T)
    row_sum = torch.sqrt(torch.sum(torch.square(document_encodings), dim=1, keepdim=True))
    denominator = torch.matmul(row_sum, row_sum.T)
    return numerator / denominator  # division is done elementwise

Instead of creating a NumPy array, we convert the document encodings to a torch tensor. To follow along with this blog, select a GPU runtime in Google Colab.

Note that device is defined below.

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)  # cuda

Experiment #3

# 298 ms 
_ = cosine_similarity_matrix_with_torch_cuda(model, traindf['question1'].tolist()[:500])

The PyTorch version, on a CUDA device, runs about 40 ms faster (298 ms vs. 337 ms) on a question set of 500 examples. Extrapolated, the full dataset would take approximately 4 minutes.
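One practical note: the returned tensor lives on the GPU. To inspect it with NumPy or pandas, move it back to the CPU first, for example:

similarities = cosine_similarity_matrix_with_torch_cuda(model, traindf['question1'][:500].tolist())
similarities = similarities.cpu().numpy()  # copy back to host memory as a NumPy array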

The matrix implementation does not scale either, though here we are presented with a different problem:

_ = cosine_similarity_matrix_with_torch_cuda(model, traindf['question1'].tolist()[:50_000])
#RuntimeError: CUDA out of memory. Tried to allocate 9.31 GiB (GPU 0; 14.76 GiB total capacity; 12.90 GiB already allocated; 385.75 MiB free; 13.15 GiB reserved in total by PyTorch) 
#If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
#See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

With a question set of 50k rows, we get an out-of-memory error. This is no coincidence: the 50,000 × 50,000 float32 similarity matrix alone requires 50,000² × 4 bytes ≈ 9.31 GiB, exactly the allocation PyTorch reported failing.
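A quick sanity check of that figure (assuming float32 encodings, i.e. 4 bytes per matrix entry):

n = 50_000
matrix_bytes = n * n * 4         # the full n x n float32 similarity matrix
print(matrix_bytes / 2**30)      # ~9.31 GiB, matching the failed allocation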