Vector Embeddings Dataset

Last updated on Feb 11, 2025

Topics: Vector Embeddings, LLMs, Transformers
Skills: software development, APIs, scripting, Python
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Jayjeet Chakraborty

Several datasets are available for benchmarking vector search algorithms (also known as approximate nearest neighbor, or ANN, algorithms), but none of them represents real-world workloads, because their vectors are usually small, with only a few hundred dimensions. For vector search experiments to reflect real-world workloads, we want datasets with several thousand dimensions, like the embeddings generated by OpenAI's text-embedding models. This project aims to create a dataset of 1B embeddings from a Wikipedia dataset using open-source models. Ideally, we will start with three versions of this dataset, with 1024-, 4096-, and 8192-dimensional embeddings.
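
To give a sense of the scale involved, the following is a rough back-of-the-envelope sketch of the raw storage each version would need. It assumes dense float32 vectors (4 bytes per value) with no compression, metadata, or index overhead; the storage format is not specified in the project description, and the function names are illustrative only.

```python
# Assumption: embeddings are stored as dense float32 vectors (4 bytes per value),
# with no compression, metadata, or index overhead.
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 1_000_000_000      # 1B embeddings, as targeted by the project
DIMENSIONS = (1024, 4096, 8192)  # the three planned dataset versions


def raw_size_bytes(num_vectors: int, dim: int,
                   bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw size in bytes of a dense num_vectors x dim matrix."""
    return num_vectors * dim * bytes_per_value


def estimate_sizes() -> dict:
    """Map each dimension to its raw size in terabytes (1 TB = 10**12 bytes)."""
    return {dim: raw_size_bytes(NUM_VECTORS, dim) / 10**12 for dim in DIMENSIONS}


if __name__ == "__main__":
    for dim, terabytes in estimate_sizes().items():
        print(f"{dim:>5} dims: {terabytes:.3f} TB")
```

Under these assumptions the three versions come to roughly 4.1 TB, 16.4 TB, and 32.8 TB of raw vector data, so generation and storage would likely need to be sharded rather than handled as a single file.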