More than 14,000 questions
The OKVQA dataset is composed of questions that require outside knowledge to be answered.
25,184 densely annotated videos
The Flintstones dataset is composed of brief, densely annotated clips that describe the actions, characters, objects, and setting of a scene.
8,290 images, 154,420 pairs, 76,642 annotations
Three part-annotated datasets of images and image pairs for a part-labeling task.
Images from 11 indoor scenes from AI2-Thor, ~60 objects per scene
A dataset of synthetic occluded objects, featuring photo-realistic images and natural configurations of objects in scenes.
380 video clips (24,500 frames) with corresponding joint information
A dataset of egocentric dog videos paired with joint movement data.
7,860 videos, 68,536 temporal annotations, 157 action classes
A dataset of daily indoor activities filmed from third- and first-person perspectives, with temporal annotations for various action classes.
75,000 questions, each paired with a unique scene configuration
IQUAD V1 pairs unique scene configurations in the AI2-THOR environment with questions corresponding to those environments.
More than 5,000 images of 10,000 liquid containers in context
COQE contains images of liquid containers labeled with volume, amount of content, bounding box annotations, and 3D CAD models.
1,000 textbook lessons, 26k questions, 6k images
The Textbook Question Answering (TQA) dataset is drawn from middle school science curricula.
60k figures extracted from 20k papers
Figures from 20k papers annotated by type (scatterplot, flowchart, etc.). Over 600 figures were given further detailed annotations (e.g., axes, legends, plot data, ...
AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering.
126k images, 1.5 million annotations
imSitu is a dataset supporting situation recognition, the problem of producing a concise summary of the situation an image depicts.
Augmented NYUv2 dataset including task-based annotations.
This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities.
Interactive dataset of 10,335 scenes
Scenes from the SUN RGB-D dataset rendered synthetically with the Blender physics engine, allowing for force interaction.
This is the dataset associated with the Newtonian Image Understanding research project.