I am the Senior Director of Computer Vision at the Allen Institute for AI (AI2) in sunny Seattle.
I am also an Affiliate Associate Professor in the Computer Science & Engineering department at the University of Washington.
I enjoy building and deploying AI applications for users to interact with. My work over two decades spans computer vision, robotics and natural language processing.
At AI2, my teams spend considerable time and effort making our work publicly available and easily accessible, so that others can analyze it, replicate it, and build upon it. This includes open-sourcing and maintaining software, building live demos, and collecting and releasing benchmarks to evaluate AI systems.
I spent five years at Microsoft Bing, building very large-scale, efficient machine learning systems on the Image & Video Search Relevance team, with a very enjoyable stint building price prediction models for Bing Travel.
I got my Ph.D. at the University of Maryland, College Park, under the supervision of Prof. Larry S. Davis.
I grew up in the city of Pune in the Western part of India -- tinkering with robots, watching cricket and rowing at my college boat club.
Program Chair (PC): ICCV 2025 (upcoming)
Molmo is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size.
While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molmo goes further: by learning to point at what it perceives, it enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting in and interacting with their environments.
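As an illustration of the pointing capability, Molmo expresses points as inline tags in its generated text. Below is a minimal sketch of recovering pixel coordinates from such output; the exact XML-like `<point>` format and the percentage-coordinate convention are assumptions based on the Molmo report, not an official API.

```python
import re

# Parse Molmo-style point annotations such as
# '<point x="61.2" y="40.2" alt="dog">dog</point>' out of generated text.
# Format and percentage-coordinate convention are assumed, not official.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>(.*?)</point>')

def parse_points(text: str, width: int, height: int):
    """Return (label, pixel_x, pixel_y) triples for each point tag."""
    results = []
    for x_pct, y_pct, label in POINT_RE.findall(text):
        # Coordinates are treated as percentages of the image dimensions.
        results.append((label, float(x_pct) / 100 * width, float(y_pct) / 100 * height))
    return results

print(parse_points('<point x="61.2" y="40.2" alt="dog">dog</point>', 640, 480))
```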
Unified-IO is the first AI model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing (NLP). It achieved this broad unification by homogenizing every task's inputs and outputs into sequences of discrete tokens using universal compressors.
Unified-IO 2 scales this up to more modalities, more tasks, and larger models. Our 7B-parameter model is trained from scratch on 1B image-text pairs, 1T text tokens, 180M video clips, 130M interleaved image-and-text documents, 3M 3D assets, and 1M robot trajectories. It achieves state-of-the-art performance on the GRIT benchmark and strong results across more than 30 computer vision benchmarks.
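A purely conceptual sketch of the unification idea follows; the vocabulary sizes, the hash-based "encoder", and all function names below are made up for illustration, with a simple hash standing in for the learned image tokenizer. The point is that every modality ends up as ids in one flat token sequence.

```python
import numpy as np

# Conceptual sketch only: every input and output, whatever the modality,
# becomes one sequence of discrete tokens. All constants are hypothetical.
TEXT_VOCAB = 32_000          # hypothetical text vocabulary size
IMAGE_CODEBOOK = 16_384      # hypothetical image codebook size

def image_to_tokens(image: np.ndarray) -> np.ndarray:
    # Stand-in for a learned VQ-style encoder: hash fixed-size pixel chunks
    # to code ids, shifted past the text vocabulary so id ranges don't clash.
    chunks = image.reshape(-1, 16 * 16 * 3)
    return (chunks.sum(axis=1).astype(np.int64) % IMAGE_CODEBOOK) + TEXT_VOCAB

def build_sequence(text_ids, image: np.ndarray) -> np.ndarray:
    # One flat token sequence; a decoder over the joint vocabulary can then
    # emit either modality as output.
    return np.concatenate([np.asarray(text_ids), image_to_tokens(image)])

seq = build_sequence([17, 942, 8], np.zeros((256, 256, 3), dtype=np.uint8))
print(seq.shape)  # (259,) -> 3 text tokens + 256 image tokens
```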
ProcTHOR is a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents.
Models trained on ProcTHOR using only RGB images, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
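A sketch of loading houses from the publicly released ProcTHOR-10K dataset, assuming the `prior` and `ai2thor` packages; exact keyword arguments may differ across ai2thor versions.

```python
# Sketch, assuming `pip install prior ai2thor` and the public ProcTHOR-10K
# release; treat the exact call signatures as assumptions.
import prior
from ai2thor.controller import Controller

dataset = prior.load_dataset("procthor-10k")   # 10k procedurally generated houses
house = dataset["train"][0]                    # each house is a JSON scene spec

controller = Controller(scene=house)           # load the sampled house
event = controller.step(action="RotateRight")  # standard embodied-agent action
print(event.metadata["agent"]["position"])
```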
Visual programming is a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. It avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to get both the solution and a comprehensive, interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or Python functions to produce intermediate outputs that may be consumed by subsequent parts of the program.
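A toy interpreter in this spirit is sketched below; the module names and program syntax are illustrative, not the paper's exact DSL. Each generated line names its output, so the full execution trace doubles as an interpretable rationale.

```python
import re

# Toy visual-programming interpreter: each program line calls one module and
# binds the result to a name that later lines can reuse.
def COUNT(objects):                      # stand-in for an off-the-shelf module
    return len(objects)

def FILTER(objects, label):
    return [o for o in objects if o["label"] == label]

MODULES = {"COUNT": COUNT, "FILTER": FILTER}
LINE_RE = re.compile(r"(\w+)\s*=\s*(\w+)\((.*)\)")

def execute(program: str, env: dict) -> dict:
    """Run the generated program one line at a time against an environment."""
    for line in program.strip().splitlines():
        out, module, arg_str = LINE_RE.match(line).groups()
        kwargs = {}
        for arg in arg_str.split(","):
            k, v = (s.strip() for s in arg.split("="))
            kwargs[k] = env[v] if v in env else v.strip("'\"")
        env[out] = MODULES[module](**kwargs)
    return env

detections = [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]
env = execute("DOGS=FILTER(objects=DETS, label='dog')\nANS=COUNT(objects=DOGS)",
              {"DETS": detections})
print(env["ANS"])  # -> 2
```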
Objaverse and Objaverse-XL are the largest public resources for 3D objects, with 1M and 10M high-quality assets, respectively. They comprise objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. In less than a year, these datasets have become the de facto resources for training foundation models for 3D computer vision.
We demonstrate the power of the Objaverse asset libraries by training Zero123 for novel view synthesis on over 100 million multi-view rendered images, achieving strong zero-shot generalization.
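A sketch of downloading assets with the `objaverse` pip package; the call names follow its public README, but treat the exact signatures and return types as assumptions.

```python
# Sketch using the `objaverse` package (pip install objaverse); check its
# README for current signatures.
import objaverse

uids = objaverse.load_uids()                        # Objaverse 1.0 object UIDs
annotations = objaverse.load_annotations(uids[:5])  # per-object metadata
paths = objaverse.load_objects(uids=uids[:5])       # downloads; {uid: local .glb path}

for uid, path in paths.items():
    print(uid, "->", path)
```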
Satlas is a platform for visualizing and downloading global geospatial data products generated by our AI models using satellite images. Currently, it includes marine infrastructure (offshore wind turbines and platforms), renewable energy infrastructure (onshore wind turbines and solar farms), and tree cover.
Satlas also contains global high-resolution imagery generated by our super-resolution AI models, which take freely available low-resolution imagery from the Sentinel-2 satellites as input and produce high-fidelity imagery for the entire planet.
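A minimal sketch of consuming a downloaded Satlas data product, assuming a GeoJSON export; the filename is hypothetical and GeoJSON is assumed as the download format.

```python
# Minimal sketch: read a downloaded geospatial product. The filename is
# hypothetical; GeoJSON point features with [lon, lat] coordinates assumed.
import json

with open("wind_turbines.geojson") as f:
    turbines = json.load(f)["features"]

for feat in turbines[:3]:
    lon, lat = feat["geometry"]["coordinates"]
    print(f"turbine at lon={lon:.4f}, lat={lat:.4f}")
```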
AI2-THOR is a simulated environment consisting of near photo-realistic 3D indoor scenes in which AI agents can navigate and interact with objects to perform tasks. It is extensively used in the community to train robot policies using reinforcement learning and imitation learning for tasks such as visual navigation, object manipulation, instruction following, and more.
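A minimal interaction loop with the `ai2thor` Python package is sketched below; the scene name is one of the built-in iTHOR kitchens.

```python
# Sketch using the ai2thor package (pip install ai2thor): step an agent
# through a built-in scene and read back ground-truth metadata.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")    # a kitchen scene
event = controller.step(action="MoveAhead")    # discrete agent action
print(event.metadata["agent"]["position"])     # ground-truth agent state
print(len(event.metadata["objects"]))          # objects available to interact with
controller.stop()
```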
SPOC is an embodied navigation and manipulation agent trained by imitating shortest-path experts in simulation. SPOC uses no human demonstrations, no reinforcement learning, and no depth sensors, and makes no assumptions about the target environment.
A key factor behind this surprising result is the scale and diversity of our training data, made possible by our recent work on procedurally generating simulations via ProcTHOR and HoloDeck and on massively scaling up 3D assets via our openly available Objaverse resource.
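A behavior-cloning sketch of the training signal follows, assuming batches of (observation features, expert action) pairs from shortest-path trajectories; the tiny MLP is a stand-in for illustration, not SPOC's actual transformer policy.

```python
# Imitation-learning sketch: maximize the likelihood of the shortest-path
# expert's action at each step. Dimensions and architecture are hypothetical.
import torch
import torch.nn as nn

NUM_ACTIONS = 20  # hypothetical discrete action space

policy = nn.Sequential(  # stand-in for the real policy network
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_step(obs_features: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One behavior-cloning update on a batch of expert-labeled steps."""
    logits = policy(obs_features)
    loss = loss_fn(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(bc_step(torch.randn(8, 512), torch.randint(0, NUM_ACTIONS, (8,))))
```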
RoboTHOR is a framework to study simulation-to-real transfer for robotics. It consists of simulated environments paired with physical counterparts in the real world.
The physical environments are built using modular and movable components, allowing us to host scenes with vastly different layouts within a single physical space.
BiDAF was an extremely popular, state-of-the-art neural model for machine comprehension before the emergence of the BERT architecture.
It is a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
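A PyTorch sketch of the attention flow layer, following the paper's similarity function S(h, u) = w · [h; u; h∘u]; the dimensions in the usage line are illustrative.

```python
# Bi-directional attention flow: context-to-query and query-to-context
# attention computed from a shared similarity matrix, with no summarization
# of the context into a single vector.
import torch
import torch.nn as nn

class AttentionFlow(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(3 * d, 1, bias=False)  # trainable similarity weights

    def forward(self, H, U):                       # H: (B,T,d) context, U: (B,J,d) query
        T, J = H.size(1), U.size(1)
        Hx = H.unsqueeze(2).expand(-1, -1, J, -1)  # (B,T,J,d)
        Ux = U.unsqueeze(1).expand(-1, T, -1, -1)  # (B,T,J,d)
        S = self.w(torch.cat([Hx, Ux, Hx * Ux], dim=-1)).squeeze(-1)  # (B,T,J)
        # Context-to-query: each context word attends over all query words.
        U_tilde = torch.softmax(S, dim=2) @ U                         # (B,T,d)
        # Query-to-context: weight context words by their best query match.
        b = torch.softmax(S.max(dim=2).values, dim=1)                 # (B,T)
        h_tilde = (b.unsqueeze(-1) * H).sum(dim=1, keepdim=True).expand(-1, T, -1)
        # Query-aware representation of every context word.
        return torch.cat([H, U_tilde, H * U_tilde, H * h_tilde], dim=-1)  # (B,T,4d)

G = AttentionFlow(d=64)(torch.randn(2, 30, 64), torch.randn(2, 8, 64))
print(G.shape)  # torch.Size([2, 30, 256])
```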
Molmo: A family of open state-of-the-art multimodal AI models
Learning Generalizable Visual Representations via Interactive Gameplay
X-LXMERT: Teaching vision-and-language transformer models to paint
AllenAct: An open source framework for research in Embodied AI
AI & Creativity
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
Iconary: An AI powered drawing and guessing game
Craft: Scripts to Compositions to Videos