The training dataset contains 64,392 samples, and the VisMin dataset contains 2,084 samples. The data is stored in JSON format: each entry contains the image path, the caption, and a list of negative examples.
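
To make the entry layout concrete, the following is a minimal sketch of how such a file might be read and validated. The key names (`image_path`, `caption`, `negatives`), the top-level list layout, the sample values, and the helper names `load_entries` and `write_and_load_demo` are all assumptions made for illustration. The text above only states that each entry holds an image path, a caption, and a list of negative examples, and it does not say whether the negatives are captions, images, or both.

```python
import json
import os
import tempfile

# Hypothetical entries: key names and values are illustrative assumptions,
# not the actual schema or contents of the dataset.
sample_entries = [
    {
        "image_path": "images/000001.jpg",
        "caption": "A dog sitting on a red couch.",
        "negatives": [
            "A cat sitting on a red couch.",
            "A dog sitting on a blue couch.",
        ],
    },
    {
        "image_path": "images/000002.jpg",
        "caption": "Two cups on a wooden table.",
        "negatives": ["Three cups on a wooden table."],
    },
]

# Assumed key names; adjust to match the real file.
REQUIRED_KEYS = ("image_path", "caption", "negatives")


def load_entries(json_path):
    """Load a JSON file assumed to hold a top-level list of entries."""
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)

    # Check that every entry has the three fields described in the text.
    for i, entry in enumerate(entries):
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing keys: {missing}")
        if not isinstance(entry["negatives"], list):
            raise ValueError(f"entry {i}: 'negatives' must be a list")
    return entries


def write_and_load_demo():
    """Write the sample entries to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "demo.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sample_entries, f)
        return load_entries(path)


if __name__ == "__main__":
    for entry in write_and_load_demo():
        print(entry["image_path"], "|", entry["caption"], "|", len(entry["negatives"]), "negatives")
```

Validating the keys at load time surfaces malformed entries early instead of partway through training; if the real file uses different key names or a different top-level structure, only `REQUIRED_KEYS` and the iteration in `load_entries` would need to change.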