Semantic Search Using Embeddings

This is a web app I created with Django that searches the Wikipedia pages of Academy Award-winning films using both keyword search and semantic search. I will focus mainly on the Semantic Search component here; more information about both my keyword search and semantic search implementations can be found in my GitHub repository.

Semantic Search uses an LLM to generate sentence embeddings and ranks results by cosine similarity with the query embedding. I used the Sentence Transformers library along with the multi-qa-mpnet-base-dot-v1 model from SBERT. The model is 420 MB, and generating embeddings for the ~1,300 articles takes roughly 6 minutes. This is without parallelizing the operations, which could be done for larger search spaces. Since BERT has a context size of 512 tokens (where a token is roughly equal to one word), I chunked the text of each article into 512-word chunks with a 50-word overlap; the overlap helps maintain context between chunks. I then performed mean pooling, using PyTorch, on the sentence embeddings.
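The chunking and scoring steps described above can be sketched in plain Python. This is only an illustration under my own assumptions (function names, pure-Python pooling and similarity helpers are mine); the actual implementation relies on Sentence Transformers and PyTorch:

```python
import math


def chunk_words(text, chunk_size=512, overlap=50):
    """Split text into chunk_size-word chunks, with `overlap` words shared
    between consecutive chunks to preserve context across boundaries."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def mean_pool(token_embeddings):
    """Average a list of token vectors into a single chunk embedding."""
    n, dim = len(token_embeddings), len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

At query time, each chunk embedding is compared against the query embedding with `cosine_similarity`, and the highest-scoring chunks determine which articles are returned.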

Semantic Search offers several improvements over Keyword Search, most notably the ability to search by meaning rather than by exact keywords. Because of this, Semantic Search more often than not provides more relevant results. However, it can perform poorly when searching for uncommon words or names, and it also tends to do worse with shorter queries. One solution is Hybrid Search (also called "Keyword-Aware Semantic Search"), which combines the scores of both methods to produce a hybrid score. Research has shown Hybrid Search to perform better than either Keyword or Semantic Search alone. I have not implemented Hybrid Search in the current version but will likely do so in the future, and I will update this page (and my GitHub) when I do.
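One common way to combine the two scores is a weighted sum over normalized scores. The sketch below is my own assumption of how such a blend could look (the min-max normalization and the alpha weight are illustrative choices, not part of this project):

```python
def min_max_normalize(scores):
    """Rescale a list of scores to [0, 1] so keyword and semantic
    scores are comparable before blending."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(keyword_scores, semantic_scores, alpha=0.5):
    """Blend per-result keyword and semantic scores into hybrid scores.
    alpha weights the semantic side; (1 - alpha) weights the keyword side."""
    kw = min_max_normalize(keyword_scores)
    sem = min_max_normalize(semantic_scores)
    return [alpha * s + (1 - alpha) * k for k, s in zip(kw, sem)]
```

Tuning alpha lets the system lean on keyword matches for rare names while still benefiting from semantic matches on natural-language queries.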

Links