Research @ Hedra

A new approach: multimodal foundation models with inverse graphics

Hedra Team

Our mission

At Hedra, our mission is to revolutionize the way we create and experience visual content by building the technology to create and edit virtual worlds. We believe that a combination of multimodal foundation models and physically-based inverse graphics is the key to creating compelling, interactive experiences that capture an artist's intent. Our internal research is guided by this belief and deeply explores foundation models and their alignment with the physical world.

The challenge

While recent advances in foundation models have revolutionized text, image, and now video generation, fundamental challenges remain in building large models that strike the right balance between ease of use and realizing a creator's intent. This is because each of these models learns to reason only in modalities that are observations of the underlying world, without being constrained by physical truth. As a result, current approaches sacrifice user control for the sake of simplicity, letting users chain together workflows of language, image, and video foundation models, none of which is grounded in the physical world. Arriving at a result that reflects what a user envisions therefore requires extensive re-prompting, with multi-minute wait times for even basic changes and little actionable feedback.

Conversely, existing computer graphics tools for world manipulation have steep learning curves for creators. While powerful, they can demand months of learning and hours of work to produce relatively simple content. These tools are also not unified in a single platform, forcing users to switch between many applications and build workflows that do not generalize beyond niche forms of content. They provide real-time feedback and physically accurate control, but cannot match the quality and ease of use of any individual foundation model in the observation space in which a user wants to create content.

Our approach

At Hedra, we are tackling this problem by working at the intersection of these two areas.

  1. World Foundation Models: To truly capture the richness of the world, we need foundation models that are multimodal rather than operating within individual observation modalities. To fully model the world, we require models that reason not only about text, image, and video, but also about audio, physical priors, animation, and agency. We focus on grounding these representations in our knowledge of how the physical world behaves, developing models that learn efficiently from data and produce seamless results that reflect the underlying state of the world.

  2. Inverse Graphics: To give users total control over the look and feel of their creations, and to manipulate the underlying state of the world, we draw inspiration from computer graphics and 3D manipulation. We reconstruct and reason about the state of the world using inverse graphics and neural rendering engines that are compatible with our grounded multimodal foundation models. We study how best to use foundation models in conjunction with dynamic 3D representations, and how to provide real-time control and feedback for users with these hybrid representations (the sketch after this list illustrates the core idea behind inverse graphics).
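To make the term concrete, here is a minimal, illustrative sketch of inverse graphics via differentiable rendering. This is not Hedra's pipeline; the toy renderer, the use of PyTorch, and all parameter names are assumptions chosen for brevity. The point is simply that when the renderer is differentiable, the underlying scene state (here, a blob's position, scale, and color) can be recovered from an observed image by gradient descent.

```python
# Illustrative sketch only: recover toy scene parameters from a rendered observation.
import torch

H, W = 64, 64
ys, xs = torch.meshgrid(
    torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
)

def render(center, scale, color):
    """Differentiable toy renderer: a soft Gaussian splat on an H x W x 3 canvas."""
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    alpha = torch.exp(-d2 / (2 * scale ** 2))   # soft coverage mask
    return alpha.unsqueeze(-1) * color          # composite over a black background

# Ground-truth scene parameters produce the "observation" we want to invert.
with torch.no_grad():
    target = render(torch.tensor([0.7, 0.3]), torch.tensor(0.1),
                    torch.tensor([0.9, 0.2, 0.1]))

# Initial guess for the underlying world state.
center = torch.tensor([0.5, 0.5], requires_grad=True)
scale = torch.tensor(0.2, requires_grad=True)
color = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
opt = torch.optim.Adam([center, scale, color], lr=0.05)

for step in range(300):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render(center, scale, color), target)
    loss.backward()                             # gradients flow through the renderer
    opt.step()

print(center.detach(), scale.detach(), color.detach())   # ~ recovered scene state
```

The same principle carries over when the toy renderer is replaced by a neural rendering engine and the handful of parameters by richer world-state representations.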

Join us

At Hedra, we've built in months what competitors have spent years perfecting, and our journey is just getting started. We’re backed by some of the best investors in the valley, and have the resources to go the distance.

If you share our passion for pushing the limits of what's possible with AI and for building a product that unleashes anyone's imagination to create whatever they desire, we'd love to hear from you. Whether you're a researcher, engineer, artist, or simply someone who believes in putting the human in AI video creation, there's a place for you on our team.