WildQA: In-the-Wild Video Question Answering


Existing video understanding datasets mostly focus on human interactions, with little attention paid to "in the wild" settings, where videos are recorded outdoors. We propose WildQA, a video understanding dataset of videos recorded in outdoor settings. In addition to video question answering (Video QA), we also introduce the new task of identifying visual support for a given question and answer (Video Evidence Selection). Through evaluations using a wide range of baseline models, we show that WildQA poses new challenges to the vision and language research communities.

Example from our Dataset

Green text comes from the first annotation stage; brown text comes from the second annotation stage.

Q1: What kinds of bodies of water are there?

  • A1: There are rivers and streams.
  • A2: There is a long stream or river between the valley.
  • A3: The kinds of bodies of water there are streams.
  • A4: There are bodies of water in this location.

Q2: Where are the rivers located?

  • A1: In valleys.
  • A2: The rivers are located between what seems to be two different valleys based on the man stating that it was a valley. The first river you can see the higher elevation of ground where you see the bases of the trees look slanted. The second river looks further from the valley as you can see the valley looks distant from where the men are. Lastly, the second river seemed to be more plateaued than the first.
  • A3: The rivers are located in the mountains near the bases.

Q3: What type of environment is it?

  • A1: Mountainous, temperate forest.
  • A2: It is a varied environment with rocky areas, greenery, and a river.
  • A3: It's environment type is an intermountain forest.

