CommonPool

CommonPool is a dataset of 12.8 billion image-text pairs collected from Common Crawl. It is part of DataComp, a benchmark for designing multimodal datasets. See http://datacomp.ai/ and https://arxiv.org/abs/2304.14108 for details.

In addition to the largest pool of 12.8B samples, CommonPool comes in three smaller versions containing 12.8M, 128M, and 1.28B samples.
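
Each pool size is ten times the previous one. A minimal Python sketch of that arithmetic follows; the dictionary keys and the helper name fraction_of_largest are illustrative labels chosen here, not identifiers from DataComp.

```python
# Pool sizes as stated above. The keys are illustrative labels only,
# not official names from the DataComp project.
POOL_SIZES = {
    "12.8M": 12_800_000,
    "128M": 128_000_000,
    "1.28B": 1_280_000_000,
    "12.8B": 12_800_000_000,
}


def fraction_of_largest(label: str) -> float:
    """Return a pool's size as a fraction of the 12.8B pool."""
    return POOL_SIZES[label] / POOL_SIZES["12.8B"]


if __name__ == "__main__":
    for label, count in POOL_SIZES.items():
        share = fraction_of_largest(label)
        print(f"{label}: {count:,} samples ({share:.1%} of the largest pool)")
```

The smallest version is therefore one thousandth of the full pool, which may matter when estimating download time and storage.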

CommonPool can be downloaded with img2dataset by following the instructions at https://github.com/mlfoundations/datacomp/blob/main/download_upstream.py
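
As a rough sketch of what invoking that script might look like, the Python snippet below only assembles a command line and prints it; it does not download anything. The `--scale` and `--data_dir` flags, and the scale name `small`, are assumptions based on the DataComp repository's README and should be checked against the current version of the script before use. The helper name build_download_command and the output directory are hypothetical.

```python
import shlex
from typing import List


def build_download_command(scale: str, data_dir: str) -> List[str]:
    """Assemble (but do not run) a download command for one pool.

    ASSUMPTION: the --scale and --data_dir flags are taken from the
    DataComp README as recalled here; verify them against the script.
    """
    return [
        "python",
        "download_upstream.py",
        "--scale", scale,
        "--data_dir", data_dir,
    ]


if __name__ == "__main__":
    # "small" and the directory below are placeholder values.
    cmd = build_download_command("small", "./commonpool_small")
    print(shlex.join(cmd))
```

Building the command as a list keeps the arguments safe to pass to `subprocess.run` later without shell quoting problems.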
