Toloka Visual Question Answering Challenge at WSDM Cup 2023
27 7 5 2

Toloka Visual Question Answering Challenge at WSDM Cup 2023

We challenge you with a visual question answering task! Given an image and a textual question, draw the bounding box around the object correctly responding to that question.

Question Image and Answer
What do you use to hit the ball?
What do people use for cutting?
What do we use to support the immune system and get vitamin C?



Please cite the challenge results or dataset description as follows.

  author    = {Ustalov, Dmitry and Pavlichenko, Nikita and Koshelev, Sergey and Likhobaba, Daniil and Smirnova, Alisa},
  title     = {{Toloka Visual Question Answering Benchmark}},
  year      = {2023},
  eprint    = {2309.16511},
  eprinttype = {arxiv},
  eprintclass = {cs.CV},
  language  = {english},


Our dataset consists of the images associated with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform, We release the entire dataset under the CC BY license:

Licensed under the Creative Commons Attribution 4.0 License. See LICENSE-CC-BY.txt file for more details.

Zero-Shot Baselines

We provide zero-shot baselines in zeroshot_baselines folder. All notebooks are made to run in Colab


This baseline was provided to participants of WSDM Cup 2023 Challenge. First, it uses a detection model, YOLOR, to generate candidate rectangles. Then, it applies CLIP to measure the similarity between the question and a part of the image bounded by each candidate rectangle. To make a prediction, it uses the candidate with the highest similarity. This baseline method achieves IoU = 0.21 on private test subset.

Licensed under the Apache License, Version 2.0. See LICENSE-APACHE.txt file for more details.


Another zero-shot baseline, called OVSeg, utilizes SAM as a proposal generator instead of MaskFormer in the original setup. This approach achieves IoU = 0.35 on the private test subset.

Licensed under the Creative Commons Attribution 4.0 License.


Last one is primarily based on OFA, combined with bounding box correction using SAM. To solve the task, we followed a two-step zero-shot setup.

First, we address the Visual Question Answering, where the model is given a prompt {question} Name an object in the picture along with an image. The model provides the name of a clue object to the question.

In the second step, an object corresponding to the answer from the previous step is annotated using the prompt which region does the text "{answer}" describe?, resulting in IoU = 0.42.

Subsequently, with the obtained bounding boxes, SAM generates the corresponding masks for the annotated object, which are then transformed into bounding boxes. This enabled us to achieve IoU = 0.45 with this baseline.

Licensed under the Apache License, Version 2.0.

Crowdsourcing Baseline

We evaluated how well non-expert human annotators can solve our task by running a dedicated round of crowdsourcing annotations on the Toloka crowdsourcing platform. We found them to tackle this task successfully without knowing the ground truth. On all three subsets of our data, the average IoU value was 0.87 ± 0.01, which we consider as a strong human baseline for our task. Krippendorff's α coefficients for the public test was 0.68 and for the private test was 0.66, showing the decent agreement between the responses; we used 1 − IoU as the distance metric when calculating the α coefficient. We selected the bounding boxes which were the most similar to the ground truth data to indicate the upper bound of non-expert annotation quality; *_crowd_baseline.csv files contain these responses.

Licensed under the Creative Commons Attribution 4.0 License. See LICENSE-CC-BY.txt file for more details.


The final score will be evaluated on the private test dataset during Reproduction phase. We kindly ask you to create a docker image and share it with us by December 19th 23:59 AoE in this form. We put an instruction how to create a docker image in reproduction directory.

We will run your solution on a machine with one Nvidia A100 80 GB GPU, 16 CPU cores, and 200 GB of RAM. Your Docker image must perform the inference in at most 3 hours on this machine. In other words, the docker run command must finish in 3 hours.

Don't hesitate to contact us at [email protected] if you have any questions or suggestions.