Given a scene description, the Composition, Retrieval and Fusion Network (CRAFT) predicts a layout of the mentioned entities, retrieves spatio-temporal entity segments from a video database, and fuses them to generate scene videos. Results are shown on a new FLINTSTONES dataset.
This project explores directly modeling a visually intelligent agent. Our model inputs visual information and predicts the agent's actions. We demonstrate our model on DECADE, a large-scale dataset of ego-centric videos from a dog's perspective.
This project explores Interactive Question Answering (IQA), the task of answering questions such as 'Are there any apples in the fridge?' that require an autonomous agent to interact with a dynamic visual environment.
This project addresses the problem of understanding diagrams in the absence of large labeled datasets, by transferring labels from smaller labeled datasets of diagrams (within-domain) as well as from labeled datasets of natural images (cross-domain).
MotifNet is a model for creating graph representations of images, known as scene graphs. In a scene graph, every object is a node and relationships between objects are edges. MotifNet captures higher-order, globally repeating structures in such graphs, called motifs.
SeGAN is a model for generating the occluded regions of objects, where a GAN first generates the shape and then the appearance of the invisible regions.
Charades-Ego is a dataset that guides research into unstructured video activity recognition and commonsense reasoning for daily human activities, with paired videos from first-person and third-person perspectives.
AI2-THOR is a photorealistic interactive virtual environment that enables agents to perform tasks and observe the outcome of their actions. It includes kitchen, living room, bedroom, and bathroom scenes.
A combination of Deep Reinforcement Learning and Imitation Learning to plan a sequence of actions for performing tasks.
Understanding the dynamics of liquids and their containers. The aim is to estimate the volume of containers, infer the amount of liquid, and predict pouring behavior.
LCNN improves the efficiency of convolutional neural networks by learning a dictionary and a small set of linear weights over its entries, and by using lookups for inference.
You only look once (YOLO) is a state-of-the-art, real-time object detection system.
imSitu is a dataset supporting Situation Recognition, the problem of producing a concise summary of the situation an image depicts. Events (such as feeding) are described by verbs, objects, and semantic roles defining how the objects are participating.
TQA is a machine comprehension dataset that pairs middle school science questions with the supporting textbook material (including diagrams) needed to answer them.
Deep Reinforcement Learning for the task of visual navigation. The goal is developing a more efficient and generalizable Reinforcement Learning approach.
The Bi-directional Attention Flow (BiDAF) network is a multi-stage hierarchical process that represents context at different levels of granularity and uses a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization.
G-CNN improves the precision of bounding boxes by iteratively regressing their location and scale.
This project aims to improve the efficiency of convolutional neural networks by using binary-precision operations.
Charades is a dataset that guides research into unstructured video activity recognition and commonsense reasoning for daily human activities.
Predicting movements of objects via estimation of scene geometry and its underlying physics.
FigureSeer is an end-to-end framework for parsing result figures that enables powerful search and retrieval of results in research papers.
This project aims to parse diagrams and answer the corresponding questions.
A task-oriented recognition approach, where the idea is to find a suitable set of features for each high-level task such that the computational cost is minimized.
Understanding the physics of a scene and the dynamics of objects in images.
The segment-phrase table is a large collection of bijective associations between textual phrases and their corresponding segmentations.
VisKE is a VISual Knowledge Extraction and question answering system built using the idea of scalable visual verification of relation phrases.
LEVAN is a fully-automated visual concept learning program that automatically learns everything there is to know about any visual concept.