Vector Embeddings Dataset
- Topics: Vector Embeddings, LLMs, Transformers
- Skills: software development, apis, scripting, python
- Difficulty: Moderate
- Size: Medium or Large (175 or 350 hours)
- Mentors: Jayjeet Chakraborty
Several datasets are available for benchmarking vector search algorithms (also known as approximate nearest neighbor, or ANN, algorithms), but none of them represents real-world workloads, because they usually contain small vectors of only a few hundred dimensions. For vector search experiments to represent real-world workloads, we want datasets with several thousand dimensions, like those generated by OpenAI's text-embedding models. This project aims to create a dataset of 1B embeddings from a Wikipedia dataset using open-source models. Ideally, we will start with three versions of this dataset, with embedding sizes of 1024, 4096, and 8192.
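To give a sense of the scale involved, the following is a rough back-of-the-envelope sketch of the raw storage each proposed version would need. It assumes each embedding value is stored as an uncompressed 32-bit float with no metadata or index overhead; the project description does not specify a storage format, so the data type and the helper name `raw_size_bytes` are illustrative assumptions only.

```python
# Rough storage estimate for the three proposed dataset versions.
# Assumption (not from the project description): values are stored as
# uncompressed float32, with no metadata, index, or file-format overhead.

NUM_VECTORS = 1_000_000_000      # 1B embeddings, as stated in the project goal
DIMENSIONS = (1024, 4096, 8192)  # the three proposed embedding sizes
BYTES_PER_FLOAT32 = 4            # assumed element size


def raw_size_bytes(num_vectors: int, dim: int,
                   bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw size in bytes of num_vectors vectors of length dim."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    for dim in DIMENSIONS:
        size = raw_size_bytes(NUM_VECTORS, dim)
        # Report in decimal terabytes (1 TB = 10**12 bytes).
        print(f"{dim:>5} dims: {size / 1e12:.3f} TB")
```

Under these assumptions the three versions come to roughly 4.1 TB, 16.4 TB, and 32.8 TB of raw vectors. A smaller element type such as float16 would halve these figures, which is one reason the storage format may be worth deciding early.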