Following the remarkable progress in ML and AI over the past several years, there is a growing sentiment that future progress will require moving beyond our reliance on static datasets to train AI systems. The domain of Embodied AI is centered around the belief that intelligent agents learn best through exploration and interaction with their environment (be it real or simulated). PRIOR believes that this promising area of research can benefit greatly from interdisciplinary collaboration at this stage of development. To foster this collaboration, our lecture series aims to bring together researchers from a variety of fields that touch on Embodied AI—computer vision, robotics, NLP, ML, neuroscience and cognitive psychology amongst others.
These lectures will be interactive and accessible to all, and we encourage the audience to ask questions and participate in discussions. Recordings will be made available after each lecture in the series concludes, subject to the speaker's consent.
Please subscribe to our mailing list to receive invitations to this lecture series containing links to attend live events.
We encourage participation in live discussions, and ask that participants remain considerate of the speaker and of one another. Researchers interested in giving a lecture, or anyone with suggestions for future topics or speakers, should get in touch with us here.
How can we train a robot that can generalize to perform thousands of tasks in thousands of environments? This question underscores the holy grail of robot learning and, more generally, machine learning research. Current AI systems are incredibly specific in that they only perform the tasks they are trained for and generalize poorly. One fundamental reason is that these systems are trained once and kept fixed at inference time, making it difficult to generalize to scenarios far from those seen in the training data. In contrast, generalization in humans stems from our ability to explore and adapt continually throughout our lifetime. In this talk, I will present our early efforts that draw ideas from human learning to build a continually adaptive robotic framework. I will focus on three key questions: (1) how to continually generate supervision for oneself (curiosity); (2) how to bootstrap learning by observing other humans (social learning); and (3) how to adapt already learned skills in real time (adaptation). I will demonstrate the potential of this framework for scaling up robot learning via case studies of robots discovering new tasks in complex kitchen environments, controlling dexterous robotic hands from monocular vision, walking on unseen terrain, and performing many diverse manipulation tasks in the wild.
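As a concrete illustration of the "curiosity" component mentioned above (the abstract does not spell out a specific formulation), a common way for an agent to generate its own supervision is to reward itself for transitions that a learned forward model predicts poorly. The sketch below assumes this standard prediction-error formulation; the network sizes and names are illustrative, not the speaker's actual implementation.

```python
# A minimal sketch of curiosity-style self-supervision: the agent rewards itself
# with the prediction error of a learned forward dynamics model. This is a common
# formulation of "curiosity" in the literature, not necessarily the speaker's exact
# method; dimensions and names here are illustrative assumptions.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state embedding from the current embedding and action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    """Curiosity reward = error of the forward model's prediction."""
    with torch.no_grad():
        pred = model(state, action)
    return ((pred - next_state) ** 2).mean(dim=-1)  # high error => novel transition

# Usage: during a rollout, add intrinsic_reward(...) to any task reward, and
# separately train the forward model on the observed (state, action, next_state).
```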
Despite numerous successes in deep robotic learning over the past decade, the generalization and versatility of robots across environments and tasks have remained a major challenge. This is because much of reinforcement and imitation learning research trains agents from scratch in a single environment or a few environments, training special-purpose policies from special-purpose datasets. In contrast, the rest of machine learning has drawn considerable success from repeatedly reusing broad datasets and recycling pre-trained models for a variety of purposes. Replicating this success in robotics is no easy feat, since robot data doesn’t simply exist in vast quantities on the internet. In this talk, I will discuss how our embodied learning algorithms need to reduce, reuse, and recycle — reducing the need for special-purpose online data collection, reusing existing data, and recycling pre-trained models for various downstream tasks. Towards this goal, I will present research that studies zero-shot robot generalization to new tasks and language commands, using a diverse dataset containing 100 distinct tasks. I will also discuss how we might develop recyclable pre-trained models for robot learning using large-scale datasets, including language-annotated videos of humans. In all cases, the evaluation will emphasize generalization, including to new objects, new scenes, and new tasks. I'll conclude by discussing some important open questions and future directions.
Recent advances in vision and language modeling have been powered by truly massive datasets, often mined from the web. Instruction-following robots will also require large amounts of data to train and evaluate. However, the images, videos, and documents found on the web do not satisfy the needs of embodied agents, and data annotation is slow and expensive. Synthetic data is a promising alternative. In this talk I’ll discuss two specific approaches. First, I’ll introduce our work on generating synthetic navigation instructions at near-human quality, which can be used to train instruction-following navigation agents at large scale. Second, I’ll show how novel view synthesis can be used to create new trajectories for visual navigation agents, without requiring a simulation environment. Together, these approaches promise to greatly expand the available high-quality training data for language-guided agents.
Soft robotics research has made considerable progress in many areas of robotics technologies based on deformable functional materials, including locomotion, manipulation, and other morphological adaptations such as self-healing, self-morphing, and mechanical growth. While these technologies open up many new robotics applications, new challenges emerge in terms of sensing, modelling, planning, and control. Because of the general complexity of systems based on flexible and continuum mechanics, and the large diversity of system-environment interactions, conventional methods are often not applicable, and new approaches based on state-of-the-art machine learning techniques are necessary. In this talk, I will introduce some of the research projects in our laboratory that make use of soft robotics and machine learning techniques to address the complexity challenges of robotics, ultimately leading to scalable embodied intelligence.
Despite 50 years of research, robots remain remarkably clumsy, limiting their reliability for warehouse order fulfillment, robot-assisted surgery, and home decluttering. The First Wave of grasping research is purely analytical, applying variations of screw theory to exact knowledge of pose, shape, and contact mechanics. The Second Wave is purely empirical: end-to-end hyperparametric function approximation (aka Deep Learning) based on human demonstrations or time-consuming self-exploration. A New Wave of research considers hybrid methods that combine analytic models with stochastic sampling and Deep Learning models. I'll present this history with new results from our lab on grasping diverse and previously-unknown objects.
Reliable operation in everyday human environments – homes, offices, and businesses – remains elusive for today’s robotic systems. A key challenge is diversity, as no two homes or businesses are exactly alike. However, despite the innumerable unique aspects of any home, there are many commonalities as well, particularly in how objects are placed and used. These commonalities can be captured in semantic representations, and then used to improve the autonomy of robotic systems by, for example, enabling robots to infer missing information in human instructions, efficiently search for objects, or manipulate objects more effectively. In this talk, I will discuss recent advances in semantic reasoning, particularly focusing on the semantics of everyday objects, household environments, and the development of robotic systems that intelligently interact with their world.
In this talk I will cover our recent publication "Open-Ended Learning Leads to Generally Capable Agents" (https://deepmind.com/blog/article/generally-capable-agents-emerge-from-open-ended-play). In this work we turn our attention to how to create embodied agents in simulation that can generalise to unseen test tasks and exhibit generally capable behaviour. I will introduce our XLand procedurally generated environment, and the open-ended learning algorithms that allow us to train agents to cover this vast environment space. This results in agents that are capable across a wide range of held-out test tasks, including hide-and-seek and capture-the-flag, and we will explore these results and the emergent behaviours and representations of the agent.
Computing with a mess: how nonstationary, heterogeneous and noisy components help the brain’s computational power. While artificial neural networks have taken inspiration from biological ones, one salient difference exists at the level of components. Biological neurons and synapses have heterogeneous transfer functions, which are non-stationary in time and highly stochastic, whereas artificial networks are generally built with homogeneous, stationary and deterministic neurons and synapses. It seems difficult to imagine how evolution built a computational machine with such messy components. In this talk I will show that each of these properties can be used to benefit computation. The non-stationarity of transfer functions can be used as a form of long-short term memory. Surprisingly, for predicting the future state of a high-dimensional but dynamically simple system, a task which is often encountered in an environment, models trained with a biologically inspired nonstationarity outperform parameter-matched RNN and LSTM networks. The heterogeneity can allow a network to better approximate a function with fewer units. However, the most intriguing observation regards the biological noise. While each individual neuron in the brain is highly variable, when observed at a population level, the noise spans a low-dimensional manifold. Based on electrophysiological recordings in the mouse visual cortex, I will show that this manifold is aligned with the directions of smooth transforms in the environment, directions which are useful to build an invariance over. I will show that after such an invariance is learned, the noise helps one-shot learning of new classes. Finally, I will show that such invariance-aligned noise can be generated in artificial neural networks. Taken together, these results paint a picture in which the diverse, constantly changing and often stochastic characteristics of biological neurons, when properly combined, can help networks perform ethologically relevant computations.
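To make the claim that non-stationary transfer functions can act as a form of memory more tangible, here is a minimal sketch of a single unit whose gain is depleted by recent activity and recovers slowly, loosely in the spirit of short-term synaptic adaptation. The time constants and update rule are assumptions chosen for illustration, not the speaker's model.

```python
# A minimal numpy sketch of how a nonstationary transfer function can carry memory.
# Each unit has a slowly recovering "resource" that is depleted by its own activity,
# so its output depends on recent history, not just the current input. The constants
# below are illustrative assumptions.
import numpy as np

def run_adaptive_unit(inputs, tau=20.0, depletion=0.2):
    """Simulate one unit with an activity-dependent, nonstationary gain."""
    resource = 1.0          # slowly recovering gain state (the "memory")
    outputs = []
    for x in inputs:
        y = resource * np.tanh(x)           # output scaled by current resource
        resource += (1.0 - resource) / tau  # slow recovery toward 1
        resource -= depletion * abs(y)      # activity depletes the resource
        resource = np.clip(resource, 0.0, 1.0)
        outputs.append(y)
    return np.array(outputs)

# A strong pulse early in the sequence suppresses responses to later inputs,
# so later outputs encode information about the past: a simple form of memory.
print(run_adaptive_unit(np.concatenate([np.ones(5) * 3, np.ones(20)])))
```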
How do infants represent and reason about objects in simple physical events, such as occlusion, containment, and support events? In this talk, I will first summarize three sets of puzzling and even seemingly contradictory findings from the past few decades of research on this question. First, young infants often succeed at tasks that require reasoning about broad, categorical information about an object, but fail at tasks that require reasoning about more fine-grained, featural information about the object. Second, when infants begin to succeed at tasks that require reasoning about particular featural information, they often do so with events from one category but still fail with events from another category. Finally, when infants begin to succeed at tasks that require reasoning about particular featural information in an event category, they often succeed at tasks that require detecting change or interaction violations, but still fail at detecting individuation violations. I will then present the two-system model of infant physical reasoning, in which two cognitive systems, the object-file system and the physical reasoning system, each with a different role, work together to guide infants’ response throughout the course of an event. I will show that this model helps integrate and explain the findings described above. I will end by briefly discussing novel predictions that the model makes and studies that tested these predictions.
Reinforcement learning has greatly accelerated our ability to train control policies for robots. But what if the robot has no observable control policy? --and is a millimeter in diameter? --and is composed solely of biological cells? The emerging field of computer-designed organisms challenges our deepest preconceptions about how to apply AI methods to embodied machines, while simultaneously offering new materials and methods for building and reasoning about the nature of planning, control, decision making, agency, and general intelligence. In this talk I will describe how our team combined evolutionary algorithms with physical simulation to “program” behavior into biobots in silico, instantiate some of the most promising designs as physical biobots, and feed back lessons learned to improve subsequent sim2real transfers. I will conclude by discussing some of the implications of this work for biologists, the artificial intelligence community, and cognitive scientists.
Despite recent progress in the capabilities of autonomous robots, especially learned robot skills, there remain significant challenges in building robust, scalable, and general-purpose systems for service robots. This talk will present our recent work to answer the following question: how can symbolic planning and reinforcement learning be combined to create general-purpose service robots that reason about high-level actions and adapt to the real world? The problem will be approached from two directions. First, I will introduce planning algorithms that adapt to the environment by learning and exchanging knowledge with other agents. These methods allow robots to plan in open-world scenarios, to plan around other robots while avoiding conflicts and realizing synergies, and to learn action costs from executions in the real world. Second, I will present reinforcement learning (RL) methods that leverage reasoning and planning in order to address the challenge of maximizing the long-term average reward in continuing service robot tasks.
Current robots are either expensive or make significant compromises on sensory richness, computational power, and communication capabilities. We propose to leverage smartphones to equip robots with extensive sensor suites, powerful computational abilities, state-of-the-art communication channels, and access to a thriving software ecosystem. We design a small electric vehicle that costs $50 and serves as a robot body for standard Android smartphones. We develop a software stack that allows smartphones to use this body for mobile operation and demonstrate that the system is sufficiently powerful to support advanced robotics workloads such as person following and real-time autonomous navigation in unstructured environments. Controlled experiments demonstrate that the presented approach is robust across different smartphones and robot bodies.
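To give a feel for the kind of workload described above, here is a minimal sketch of a person-following controller: steer toward a detected bounding box and slow down as the person fills more of the frame. The function signature and parameters are hypothetical placeholders, not the actual OpenBot software stack.

```python
# A minimal sketch of a person-following controller of the sort such a system might
# run on the phone. The bounding box is assumed to come from an on-device person
# detector; the (throttle, steering) interface is a hypothetical stand-in for the
# real robot-body API.

def follow_person(bbox, image_width, k_steer=1.0, target_box_frac=0.4):
    """Return (throttle, steering), each in [-1, 1], from a person bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels, or None if no person detected.
    """
    if bbox is None:
        return 0.0, 0.0                       # stop if nobody is detected
    x_min, _, x_max, _ = bbox
    center = (x_min + x_max) / 2.0
    # Steering proportional to the person's horizontal offset from image center.
    steering = k_steer * (center - image_width / 2.0) / (image_width / 2.0)
    # Throttle drops as the person fills more of the frame (i.e., gets closer).
    box_frac = (x_max - x_min) / image_width
    throttle = max(0.0, 1.0 - box_frac / target_box_frac)
    return min(throttle, 1.0), max(-1.0, min(steering, 1.0))

# Example: a person detected slightly right of center in a 640-pixel-wide image.
print(follow_person((300, 80, 420, 400), image_width=640))
```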
Embodied Cognition posits that the body of an agent is not only a vessel to contain the mind, but meaningfully influences the agent's brain and contributes to its intelligent behavior through morphological computation. In this talk, I'll introduce a system for studying the role of complex brains and bodies in soft robotics, demonstrate how this system may exhibit morphological computation, and describe a particular challenge that occurs when attempting to employ machine learning to optimize embodied machines and their behavior. I'll argue that simply considering and accounting for the co-dependencies suggested by embodied cognition can help us to overcome this challenge, and suggest that this approach may be helpful to the optimization of structure and function in machine learning domains outside of soft robotics.
Spatial perception —the robot’s ability to sense and understand the surrounding environment— is a key enabler for autonomous systems operating in complex environments, including self-driving cars and unmanned aerial vehicles. Recent advances in perception algorithms and systems have enabled robots to detect objects and create large-scale maps of an unknown environment, which are crucial capabilities for navigation, manipulation, and human-robot interaction. Despite these advances, researchers and practitioners are well aware of the brittleness of existing perception systems, and a large gap still separates robot and human perception. This talk discusses two efforts targeted at bridging this gap. The first effort targets high-level understanding. While humans are able to quickly grasp geometric, semantic, and physical aspects of a scene, high-level scene understanding remains a challenge for robotics. I present our work on real-time metric-semantic understanding and 3D Dynamic Scene Graphs. I introduce the first generation of Spatial Perception Engines, which extend the traditional notions of mapping and SLAM and allow a robot to build a “mental model” of the environment, including spatial concepts (e.g., humans, objects, rooms, buildings) and their relations at multiple levels of abstraction. The second effort focuses on robustness. I present recent advances in the design of certifiable perception algorithms that are robust to extreme amounts of noise and outliers and afford performance guarantees. I present fast certifiable algorithms for object pose estimation: our algorithms are “hard to break” (e.g., they are robust to 99% outliers) and succeed in localizing objects where an average human would fail. Moreover, they come with a “contract” that guarantees their input-output performance. Certifiable algorithms and real-time high-level understanding are key enablers for the next generation of autonomous systems that are trustworthy, understand and execute high-level human instructions, and operate in large dynamic environments over an extended period of time.
Recent advancements in embodied intelligence have shown exciting results in adapting to diverse and complex external environments. However, much work remains incognizant of the agents' internal hardware (i.e., the embodiment), which often plays a critical role in determining the system's overall functionality and performance. In this talk, we revisit the role of “embodiment” in embodied intelligence, specifically in the context of robotic manipulation. The key idea behind “self-adaptive manipulation” is to treat a robot's hardware as an integral part of its behavior: the learned manipulation policies should be conditioned on their hardware and also inform how the hardware should be improved. I will use two of our recent works to illustrate both aspects: AdaGrasp for learning a unified policy for using different and novel gripper hardware, and Fit2Form for generating a new gripper hardware design that optimizes for the target task.
The embodiment hypothesis is the idea that “intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity”. Imagine walking up to a home robot and asking “Hey robot – can you go check if my laptop is on my desk? And if so, bring it to me”. Or asking an egocentric AI assistant (operating on your smart glasses): “Hey – where did I last see my keys?”. In order to be successful, such an embodied agent would need a range of skills – visual perception (to recognize & map scenes and objects), language understanding (to translate questions and instructions into actions), and action (to move and find things in a changing environment). I will first give an overview of work happening at Georgia Tech and FAIR building up to this grand goal of embodied AI. Next, I will dive into a recent project where we asked if machines – specifically, navigation agents – build cognitive maps. Specifically, we train “blind” AI agents – with sensing limited to only egomotion – to perform PointGoal navigation (“go to delta-x, delta-y relative to start”) via reinforcement learning. We find that blind AI agents are surprisingly effective navigators in unseen environments (~95% success). Further still, we find that (1) these blind AI agents utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (2) this memory enables them to take shortcuts, i.e. efficiently travel through previously unexplored parts of the environment; (3) there is emergence of maps in this memory, i.e. a detailed occupancy grid of the environment can be decoded from the agent memory; and (4) the emergent maps are selective and task dependent – the agent forgets unnecessary excursions and only remembers the end points of such detours. Overall, our experiments and analysis show that blind AI agents take shortcuts and build cognitive maps purely from learning to navigate, suggesting that cognitive maps may be a natural solution to the problem of navigation and shedding light on the internal workings of AI navigation agents.
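As a hedged illustration of what "an occupancy grid can be decoded from the agent memory" might look like operationally, the sketch below fits a simple linear probe from logged hidden states to occupancy labels. The arrays are random placeholders and the probe design is an assumption for illustration, not the paper's exact decoding procedure.

```python
# A minimal sketch of probing agent memory for map information: fit a classifier
# from the agent's hidden state at each step to the occupancy (free vs. obstacle)
# of a map cell. The data below is random placeholder data standing in for logged
# hidden states and ground-truth occupancy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5000, 512))   # agent memory, one row per step
occupancy = rng.integers(0, 2, size=5000)      # 1 = cell occupied, 0 = free

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, occupancy, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy well above chance would indicate that occupancy is linearly decodable
# from the agent's memory (with this random placeholder data, expect ~50%).
print("probe accuracy:", probe.score(X_test, y_test))
```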
Current evidence for the ability of some animals to plan—imagining some future set of possibilities and picking the one assessed to have the highest value—is restricted to birds and mammals. Nonetheless, all animals have had just as long to evolve what seems to be a useful capacity. In this talk, I review some work we have done to get at the question of why planning may be useless to many animals, but useful to a select few. We use a variety of algorithms for this work, from reinforcement learning-based methods to POMDPs, and now are testing predictions using live mammals in complex reprogrammable habitats with a robot predator.