Aniruddha (Ani) Kembhavi

I am the Senior Director of Computer Vision at the Allen Institute for AI (AI2) in sunny Seattle.
I am also an Affiliate Associate Professor at the Computer Science & Engineering department at the University of Washington.

I enjoy building and deploying AI applications for users to interact with. My work over two decades spans computer vision, robotics and natural language processing.

At AI2, my teams spend a considerable amount of time and effort to make our work publicly available and easily accessible, so that others may analyze our work, replicate it and build upon it. This includes open sourcing and maintaining software, building live demos as well as collecting and releasing benchmarks to evaluate AI systems.

I spent five years at Microsoft Bing, building very large-scale, efficient machine learning systems on the Image & Video Search Relevance team, with a very enjoyable stint building price prediction models for Bing Travel.

I got my Ph.D. at the University of Maryland, College Park under the supervision of Prof. Larry S. Davis.

I grew up in the city of Pune in the Western part of India -- tinkering with robots, watching cricket and rowing at my college boat club.


Email · Google Scholar

Recent Awards
Best Paper Award for CVPR 2023 : Visual Programming: Compositional visual reasoning without training

Outstanding Paper Award for NeurIPS 2022 : ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

Outstanding Paper Award for CoRL 2024 : PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

Best Paper Award on Mobile Manipulation for IROS 2024 : Harmonic Mobile Manipulation

Best Paper Award for ICRA 2024 : Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Allen Institute Test of Time Award 2020 : BiDAF: Bidirectional Attention Flow for Machine Comprehension

NVIDIA Pioneer Award 2018 : IQA: Visual Question Answering in Interactive Environments

Recent Talks
Jun 2024 : 5 invited talks and 2 panel discussions at CVPR 2024 workshops

Mar 2024 : Gerald M. Masson Distinguished Lecture at Johns Hopkins University

Feb 2024 : University of Washington Robotics Colloquium

Oct 2023 : Robotics and Computer Vision Panel, 50th Anniversary Celebration for CS @ UMD

Aug 2023 : AMLD Generative AI Workshop at EPFL, Lausanne, Switzerland

Aug 2023 : Summer School on AI by IIIT Hyderabad

Aug 2023 : AI + Architecture at Indian Institute for Interior Design, Hubli, India

Jul 2023 : Talk at University of Maryland, College Park

Jul 2023 : Talk at Adobe Research

Jul 2023 : Talk at Amazon

Jun 2023 : Best Paper Award talk at CVPR 2023

Service
Program Chair (PC) : ICCV 2025 (upcoming)

Senior Area Chair (SAC) : CVPR 2024

Area Chair (AC) : Several past CVPR, ICLR, NeurIPS and EMNLP conferences

Selected Works
Here are some selected works from my teams at the Allen Institute for AI.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Molmo is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size.

While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molmo goes further: by learning to point at what it perceives, it enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting in and interacting with their environments.

UNIFIED-IO: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Unified-IO is the first AI model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing (NLP). It achieved this broad unification by homogenizing every task's input and output into sequences of tokens using universal compressors.

Unified-IO 2 scales this to support more modalities, more tasks and larger models. Our 7B parameter model is trained from scratch on 1B image-text pairs, 1T text tokens, 180M video clips, 130M interleaved image & text, 3M 3D assets, and 1M robot trajectories. It achieves state-of-the-art performance on the GRIT benchmark and strong results across more than 30 benchmarks in computer vision.
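
The core trick is rendering every modality into one shared discrete token space. A minimal sketch of that idea in Python (the vocabulary size and the `codebook.encode` helper are hypothetical stand-ins, not Unified-IO's actual tokenizers):

    TEXT_VOCAB_SIZE = 32_000  # assumed subword vocabulary size

    def tokenize_text(text, vocab):
        # Text already lives in a discrete vocabulary.
        return [vocab[w] for w in text.split()]

    def tokenize_image(image, codebook):
        # A learned compressor (e.g. a VQ-style codebook; codebook.encode is
        # a hypothetical stand-in) maps image patches to discrete codes,
        # offset past the text vocabulary so both share one token space.
        return [TEXT_VOCAB_SIZE + code for code in codebook.encode(image)]

    def build_sequence(task_prompt, image, vocab, codebook):
        # Every task, from classification to image synthesis, becomes one
        # flat token sequence a single autoregressive model can consume
        # and emit.
        return tokenize_text(task_prompt, vocab) + tokenize_image(image, codebook)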

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

ProcTHOR is a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents.

Models trained on ProcTHOR using only RGB images, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
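
At release, the accompanying 10K-house dataset could be loaded and dropped straight into AI2-THOR, roughly as follows (package and dataset names per the ProcTHOR release; details may have changed since):

    import prior                                  # AI2's dataset-hosting library
    from ai2thor.controller import Controller

    # 10,000 procedurally generated, fully interactive houses.
    dataset = prior.load_dataset("procthor-10k")
    house = dataset["train"][0]

    # A sampled house is itself an AI2-THOR scene an agent can act in.
    controller = Controller(scene=house)
    event = controller.step(action="RotateRight")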

Visual Programming: Compositional Neuro-Symbolic Visual Reasoning

Visual programming is a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. It avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or python functions to produce intermediate outputs that may be consumed by subsequent parts of the program.
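
A self-contained sketch of that loop, with the prompting scheme and module names as illustrative placeholders rather than the paper's exact primitives:

    def generate_program(llm, instruction, in_context_examples):
        # No task-specific training: the LLM sees a few (instruction,
        # program) pairs and emits a new python-like modular program.
        prompt = in_context_examples + f"\nInstruction: {instruction}\nProgram:"
        return llm(prompt)

    def execute(program, modules, image):
        # Each program line invokes an off-the-shelf vision model, an image
        # processing routine, or a plain python function. Every intermediate
        # stays in env, so the executed trace doubles as an interpretable
        # rationale for the final answer.
        env = {"IMAGE": image, **modules}
        exec(program, env)  # e.g. BOXES = detect(IMAGE, "mug"); ANSWER = count(BOXES)
        return env["ANSWER"], env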

Objaverse and Objaverse-XL: Universes with 10M+ Annotated 3D Objects

Objaverse and Objaverse-XL are the largest public resources of 3D objects, with 1M and 10M high-quality assets respectively. They comprise objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. In less than a year, these datasets have become the de-facto resources for training foundation models for 3D computer vision.

We demonstrate the power of the Objaverse asset libraries by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images and achieving strong zero-shot generalization abilities.

SATLAS: Open Geospatial Data Generated by AI

Satlas is a platform for visualizing and downloading global geospatial data products generated by our AI models using satellite images. Currently, it includes marine infrastructure (offshore wind turbines and platforms), renewable energy infrastructure (onshore wind turbines and solar farms), and tree cover.

Satlas also contains global high-resolution imagery generated by our super-resolution AI models, which take freely available low-resolution imagery from the Sentinel-2 satellites as input and produce high-fidelity imagery for the entire planet.
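
In outline, this is multi-frame super-resolution; a hedged sketch of the idea (the model and tile format below are hypothetical stand-ins, not the Satlas codebase):

    import numpy as np

    def super_resolve(tiles, model):
        # Each entry stacks several low-resolution Sentinel-2 revisits of one
        # location; multiple captures give the model sub-pixel cues that no
        # single image contains, which is what makes the upscaling learnable.
        high_res = {}
        for tile_id, revisits in tiles.items():
            stack = np.stack(revisits)        # (num_revisits, H, W, bands)
            high_res[tile_id] = model(stack)  # (scale*H, scale*W, 3) RGB
        return high_res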

AI2-THOR: An Interactive 3D Simulated Environment to Train Robots

AI2-THOR is a simulated environment consisting of near photo-realistic 3D indoor scenes, where AI agents can navigate and interact with objects to perform tasks. It is extensively used in the community to train robot policies using reinforcement learning and imitation learning for tasks such as visual navigation, object manipulation, instruction following, and more.
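
The Python API makes that interaction loop concrete; a minimal example (scene and object choices are illustrative, and arguments can vary across AI2-THOR versions):

    from ai2thor.controller import Controller

    controller = Controller(scene="FloorPlan1")   # a kitchen scene
    event = controller.step(action="MoveAhead")   # navigate
    event = controller.step(action="RotateRight")

    # Pick up the first pickupable object; event.metadata carries full
    # object state, and event.frame is the egocentric RGB observation a
    # learned policy would consume.
    target = next(o["objectId"] for o in event.metadata["objects"] if o["pickupable"])
    event = controller.step(action="PickupObject", objectId=target, forceAction=True)
    print(event.frame.shape)                      # e.g. (300, 300, 3)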

SPOC: Imitating Shortest Paths in Simulation For Real World Navigation and Manipulation

SPOC is an embodied navigation and manipulation agent trained by imitating shortest-path experts in simulation. SPOC uses no human demonstrations, no reinforcement learning, and no depth sensors, and it makes no assumptions about the target environment.

A key factor behind this surprising result is the scale and diversity of our training data -- made possible by our recent work on procedurally generating simulations via ProcTHOR and Holodeck and on massively scaling up 3D assets via our openly available Objaverse resource.

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

RoboTHOR is a framework to study simulation-to-real transfer for robotics. It consists of simulated environments paired with physical counterparts in the real world.

The physical environments are built using modular and movable components, allowing us to host scenes with vastly different layouts within a single physical space.

BiDAF: Bidirectional Attention Flow for Machine Comprehension

BiDAF was an extremely popular, state-of-the-art neural model for machine comprehension before the emergence of the BERT architecture.

It is a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
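
Concretely, with context vectors h_t and query vectors u_j from the encoders, the attention flow layer computes (in LaTeX, following the paper's notation):

    % Similarity between each context word t and query word j:
    S_{tj} = \mathbf{w}_{(S)}^{\top}\,[\mathbf{h}_t ;\, \mathbf{u}_j ;\, \mathbf{h}_t \circ \mathbf{u}_j]

    % Context-to-query attention: which query words matter for each t.
    \mathbf{a}_t = \operatorname{softmax}(S_{t:}), \qquad
    \tilde{\mathbf{u}}_t = \sum_j a_{tj}\,\mathbf{u}_j

    % Query-to-context attention: which context words matter for the query.
    b_t \propto \exp\big(\max_j S_{tj}\big), \qquad
    \tilde{\mathbf{h}} = \sum_t b_t\,\mathbf{h}_t

    % Query-aware representation of each context word, with no early
    % summarization of the context into a single vector.
    \mathbf{g}_t = [\mathbf{h}_t ;\, \tilde{\mathbf{u}}_t ;\, \mathbf{h}_t \circ \tilde{\mathbf{u}}_t ;\, \mathbf{h}_t \circ \tilde{\mathbf{h}}]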

Press
Molmo: A family of open state-of-the-art multimodal AI models

Learning Generalizable Visual Representations via Interactive Gameplay

X-LXMERT: Teaching vision-and-language transformer models to paint

AllenAct: An open source framework for research in Embodied AI

AI & Creativity

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

Iconary: An AI powered drawing and guessing game

Craft: Scripts to Compositions to Videos