I am the Senior Director of Computer Vision at the Allen Institute for AI (AI2) in sunny Seattle.
I am also an Affiliate Associate Professor in the Computer Science & Engineering department at the University of Washington.
I enjoy building and deploying AI applications for users to interact with. My work over two decades spans computer vision, robotics and natural language processing.
At AI2, my teams spend considerable time and effort making our work publicly available and easily accessible, so that others can analyze it, replicate it, and build upon it. This includes open-sourcing and maintaining software, building live demos, and collecting and releasing benchmarks to evaluate AI systems.
I spent five years at Microsoft Bing, building very large-scale, efficient machine learning systems on the Image & Video Search Relevance team, with a very enjoyable stint building price prediction models for Bing Travel.
I got my Ph.D. at the University of Maryland, College Park, under the supervision of Prof. Larry S. Davis.
I grew up in the city of Pune in the Western part of India -- tinkering with robots, watching cricket and rowing at my college boat club.
Program Chair (PC): ICCV 2025 (upcoming)
Molmo is a family of open state-of-the-art multimodal AI models. Our most powerful model closes the gap between open and proprietary systems across a wide range of academic benchmarks as well as human evaluations. Our smaller models outperform models 10x their size.
While current multimodal models interpret multimodal data and express it in natural language, their full potential remains untapped. Molmo goes further: by learning to point at what it perceives, it enables rich interactions with physical and virtual worlds, empowering the next generation of applications capable of acting in and interacting with their environments.
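As an illustration of the pointing capability, Molmo expresses points as inline tags in its generated text. Below is a minimal sketch of recovering pixel coordinates from such output; the exact XML-like `<point>` format and the percentage-coordinate convention are assumptions based on the Molmo report, not an official API.

```python
import re

# Parse Molmo-style point annotations such as
# '<point x="61.2" y="40.2" alt="dog">dog</point>' out of generated text.
# Format and percentage-coordinate convention are assumed, not official.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>(.*?)</point>')

def parse_points(text: str, width: int, height: int):
    """Return (label, pixel_x, pixel_y) triples for each point tag."""
    results = []
    for x_pct, y_pct, label in POINT_RE.findall(text):
        # Coordinates are treated as percentages of the image dimensions.
        results.append((label, float(x_pct) / 100 * width, float(y_pct) / 100 * height))
    return results

print(parse_points('<point x="61.2" y="40.2" alt="dog">dog</point>', 640, 480))
```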
Unified-IO is the first AI model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing (NLP). It achieved this broad unification by homogenizing every task's inputs and outputs into sequences of discrete tokens using universal compressors.
Unified-IO 2 scales this up to more modalities, more tasks, and larger models. Our 7B-parameter model is trained from scratch on 1B image-text pairs, 1T text tokens, 180M video clips, 130M interleaved image-and-text documents, 3M 3D assets, and 1M robot trajectories. It achieves state-of-the-art performance on the GRIT benchmark and strong results across more than 30 computer vision benchmarks.
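A purely conceptual sketch of the unification idea follows; the vocabulary sizes, the hash-based "encoder", and all function names below are made up for illustration, with a simple hash standing in for the learned image tokenizer. The point is that every modality ends up as ids in one flat token sequence.

```python
import numpy as np

# Conceptual sketch only: every input and output, whatever the modality,
# becomes one sequence of discrete tokens. All constants are hypothetical.
TEXT_VOCAB = 32_000          # hypothetical text vocabulary size
IMAGE_CODEBOOK = 16_384      # hypothetical image codebook size

def image_to_tokens(image: np.ndarray) -> np.ndarray:
    # Stand-in for a learned VQ-style encoder: hash fixed-size pixel chunks
    # to code ids, shifted past the text vocabulary so id ranges don't clash.
    chunks = image.reshape(-1, 16 * 16 * 3)
    return (chunks.sum(axis=1).astype(np.int64) % IMAGE_CODEBOOK) + TEXT_VOCAB

def build_sequence(text_ids, image: np.ndarray) -> np.ndarray:
    # One flat token sequence; a decoder over the joint vocabulary can then
    # emit either modality as output.
    return np.concatenate([np.asarray(text_ids), image_to_tokens(image)])

seq = build_sequence([17, 942, 8], np.zeros((256, 256, 3), dtype=np.uint8))
print(seq.shape)  # (259,) -> 3 text tokens + 256 image tokens
```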
ProcTHOR is a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents.
Models trained on ProcTHOR using only RGB images, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
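A sketch of loading houses from the publicly released ProcTHOR-10K dataset, assuming the `prior` and `ai2thor` packages; exact keyword arguments may differ across ai2thor versions.

```python
# Sketch, assuming `pip install prior ai2thor` and the public ProcTHOR-10K
# release; treat the exact call signatures as assumptions.
import prior
from ai2thor.controller import Controller

dataset = prior.load_dataset("procthor-10k")   # 10k procedurally generated houses
house = dataset["train"][0]                    # each house is a JSON scene spec

controller = Controller(scene=house)           # load the sampled house
event = controller.step(action="RotateRight")  # standard embodied-agent action
print(event.metadata["agent"]["position"])
```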
Visual programming is a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. It avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to get both the solution and a comprehensive, interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or Python functions to produce intermediate outputs that may be consumed by subsequent parts of the program.
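A toy interpreter in this spirit is sketched below; the module names and program syntax are illustrative, not the paper's exact DSL. Each generated line names its output, so the full execution trace doubles as an interpretable rationale.

```python
import re

# Toy visual-programming interpreter: each program line calls one module and
# binds the result to a name that later lines can reuse.
def COUNT(objects):                      # stand-in for an off-the-shelf module
    return len(objects)

def FILTER(objects, label):
    return [o for o in objects if o["label"] == label]

MODULES = {"COUNT": COUNT, "FILTER": FILTER}
LINE_RE = re.compile(r"(\w+)\s*=\s*(\w+)\((.*)\)")

def execute(program: str, env: dict) -> dict:
    """Run the generated program one line at a time against an environment."""
    for line in program.strip().splitlines():
        out, module, arg_str = LINE_RE.match(line).groups()
        kwargs = {}
        for arg in arg_str.split(","):
            k, v = (s.strip() for s in arg.split("="))
            kwargs[k] = env[v] if v in env else v.strip("'\"")
        env[out] = MODULES[module](**kwargs)
    return env

detections = [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]
env = execute("DOGS=FILTER(objects=DETS, label='dog')\nANS=COUNT(objects=DOGS)",
              {"DETS": detections})
print(env["ANS"])  # -> 2
```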
Objaverse and Objaverse-XL are the largest public resources for 3D objects, with 1M and 10M high-quality assets, respectively. They comprise objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. In less than a year, these datasets have become the de facto resources for training foundation models for 3D computer vision.
We demonstrate the power of the Objaverse asset libraries by training Zero123 for novel view synthesis on over 100 million multi-view rendered images, achieving strong zero-shot generalization.
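A sketch of downloading assets with the `objaverse` pip package; the call names follow its public README, but treat the exact signatures and return types as assumptions.

```python
# Sketch using the `objaverse` package (pip install objaverse); check its
# README for current signatures.
import objaverse

uids = objaverse.load_uids()                        # Objaverse 1.0 object UIDs
annotations = objaverse.load_annotations(uids[:5])  # per-object metadata
paths = objaverse.load_objects(uids=uids[:5])       # downloads; {uid: local .glb path}

for uid, path in paths.items():
    print(uid, "->", path)
```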
Satlas is a platform for visualizing and downloading global geospatial data products generated by our AI models using satellite images. Currently, it includes marine infrastructure (offshore wind turbines and platforms), renewable energy infrastructure (onshore wind turbines and solar farms), and tree cover.
Satlas also contains global high-resolution imagery generated by our super-resolution AI models, which take freely available low-resolution imagery from the Sentinel-2 satellites as input and produce high-fidelity imagery for the entire planet.
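A minimal sketch of consuming a downloaded Satlas data product, assuming a GeoJSON export; the filename is hypothetical and GeoJSON is assumed as the download format.

```python
# Minimal sketch: read a downloaded geospatial product. The filename is
# hypothetical; GeoJSON point features with [lon, lat] coordinates assumed.
import json

with open("wind_turbines.geojson") as f:
    turbines = json.load(f)["features"]

for feat in turbines[:3]:
    lon, lat = feat["geometry"]["coordinates"]
    print(f"turbine at lon={lon:.4f}, lat={lat:.4f}")
```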
AI2-THOR is a simulated environment consisting of near photo-realistic 3D indoor scenes in which AI agents can navigate and interact with objects to perform tasks. It is extensively used in the community to train robot policies using reinforcement learning and imitation learning for tasks such as visual navigation, object manipulation, instruction following, and more.
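A minimal interaction loop with the `ai2thor` Python package is sketched below; the scene name is one of the built-in iTHOR kitchens.

```python
# Sketch using the ai2thor package (pip install ai2thor): step an agent
# through a built-in scene and read back ground-truth metadata.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")    # a kitchen scene
event = controller.step(action="MoveAhead")    # discrete agent action
print(event.metadata["agent"]["position"])     # ground-truth agent state
print(len(event.metadata["objects"]))          # objects available to interact with
controller.stop()
```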
SPOC is an embodied navigation and manipulation agent trained by imitating shortest-path experts in simulation. SPOC uses no human demonstrations, no reinforcement learning, and no depth sensors, and makes no assumptions about the target environment.
A key factor behind this surprising result is the scale and diversity of our training data, made possible by our recent work on procedurally generating simulations via ProcTHOR and HoloDeck and on massively scaling up 3D assets via our openly available Objaverse resource.
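A behavior-cloning sketch of the training signal follows, assuming batches of (observation features, expert action) pairs from shortest-path trajectories; the tiny MLP is a stand-in for illustration, not SPOC's actual transformer policy.

```python
# Imitation-learning sketch: maximize the likelihood of the shortest-path
# expert's action at each step. Dimensions and architecture are hypothetical.
import torch
import torch.nn as nn

NUM_ACTIONS = 20  # hypothetical discrete action space

policy = nn.Sequential(  # stand-in for the real policy network
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def bc_step(obs_features: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One behavior-cloning update on a batch of expert-labeled steps."""
    logits = policy(obs_features)
    loss = loss_fn(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(bc_step(torch.randn(8, 512), torch.randint(0, NUM_ACTIONS, (8,))))
```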
RoboTHOR is a framework to study simulation-to-real transfer for robotics. It consists of simulated environments paired with physical counterparts in the real world.
The physical environments are built using modular and movable components, allowing us to host scenes with vastly different layouts within a single physical space.
BiDAF was an extremely popular, state-of-the-art neural model for machine comprehension before the emergence of the BERT architecture.
It is a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
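A PyTorch sketch of the attention flow layer, following the paper's similarity function S(h, u) = w · [h; u; h∘u]; the dimensions in the usage line are illustrative.

```python
# Bi-directional attention flow: context-to-query and query-to-context
# attention computed from a shared similarity matrix, with no summarization
# of the context into a single vector.
import torch
import torch.nn as nn

class AttentionFlow(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(3 * d, 1, bias=False)  # trainable similarity weights

    def forward(self, H, U):                       # H: (B,T,d) context, U: (B,J,d) query
        T, J = H.size(1), U.size(1)
        Hx = H.unsqueeze(2).expand(-1, -1, J, -1)  # (B,T,J,d)
        Ux = U.unsqueeze(1).expand(-1, T, -1, -1)  # (B,T,J,d)
        S = self.w(torch.cat([Hx, Ux, Hx * Ux], dim=-1)).squeeze(-1)  # (B,T,J)
        # Context-to-query: each context word attends over all query words.
        U_tilde = torch.softmax(S, dim=2) @ U                         # (B,T,d)
        # Query-to-context: weight context words by their best query match.
        b = torch.softmax(S.max(dim=2).values, dim=1)                 # (B,T)
        h_tilde = (b.unsqueeze(-1) * H).sum(dim=1, keepdim=True).expand(-1, T, -1)
        # Query-aware representation of every context word.
        return torch.cat([H, U_tilde, H * U_tilde, H * h_tilde], dim=-1)  # (B,T,4d)

G = AttentionFlow(d=64)(torch.randn(2, 30, 64), torch.randn(2, 8, 64))
print(G.shape)  # torch.Size([2, 30, 256])
```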
Molmo: A family of open state-of-the-art multimodal AI models
Learning Generalizable Visual Representations via Interactive Gameplay
X-LXMERT: Teaching vision-and-language transformer models to paint
AllenAct: An open source framework for research in Embodied AI
AI & Creativity
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
Iconary: An AI powered drawing and guessing game
Craft: Scripts to Compositions to Videos