AI lipsync tool for video content creators

at sync. we're making video as fluid and editable as a word document how much time would you save if you could *record every video in a single take?* no more re-recording yourself because you didn't like what you said, or how you said it. just shoot once, revise yourself to do exactly what you want, and post. that's all. this is the future of video: *AI modified >> AI generated* we're playing at the edge of science + fiction. our team is young, hungry, uniquely experienced, and advised by some of the greatest research minds + startup operators in the world. we're driven to solve impossible problems, impossibly fast. our founders are the original team behind the open sourced wav2lip — the most prolific lip-sync model to date w/ over 10k+ GitHub stars. we [1] train state-of-the-art generative models, [2] productize + host them for scale [3] grow virally through content [4] upsell into enterprise

Active Founders

Prady Modukuru

Founder

ceo & cofounder at sync. labs | product engineer obsessed w/ networks of people + products. before startups, I helped incubate, launch, and scale AI-powered cybersecurity products at Microsoft impacting over 500M consumers + $1T worth of publicly traded companies – and became the youngest product leader in my org.

Prady Modukuru

Founder

ceo & cofounder at sync. labs | product engineer obsessed w/ networks of people + products. before startups, I helped incubate, launch, and scale AI-powered cybersecurity products at Microsoft impacting over 500M consumers + $1T worth of publicly traded companies – and became the youngest product leader in my org.

Prajwal K R

Founder & Chief Scientist

Co-founder & Chief Scientist at sync. labs. Ph.D. from University of Oxford with Prof. Andrew Zisserman. Authored multiple breakthrough research papers (incl. Wav2Lip) on understanding and generating humans in video.

Prajwal K R

Founder & Chief Scientist

Co-founder & Chief Scientist at sync. labs. Ph.D. from University of Oxford with Prof. Andrew Zisserman. Authored multiple breakthrough research papers (incl. Wav2Lip) on understanding and generating humans in video.

Rudrabha Mukhopadhyay

Founder

I am the co-founder and CTO of Sync Labs. At Sync, We’re building audio-visual models to understand, modify, and synthesize humans in video. I am one of the primary authors of Wav2Lip, one of the most prolific lip syncing models in the world published in 2020. I have done my PhD at IIIT Hyderabad on Audio-visual deep learning and have been involved in several important projects in the community.

Rudrabha Mukhopadhyay

Founder

I am the co-founder and CTO of Sync Labs. At Sync, We’re building audio-visual models to understand, modify, and synthesize humans in video. I am one of the primary authors of Wav2Lip, one of the most prolific lip syncing models in the world published in 2020. I have done my PhD at IIIT Hyderabad on Audio-visual deep learning and have been involved in several important projects in the community.

Pavan Reddy

Founder

Driving Sales/Operations and Finance/ Strategic roadmap at sync. 2x venture backed entrepreneur. IIT Madras Alumnus. Worked with IIIT Hyderabad in productising research work as per market orientation. Key strength is connecting dots/ identifying patterns across different fields to unlock the value

Pavan Reddy

Founder

Driving Sales/Operations and Finance/ Strategic roadmap at sync. 2x venture backed entrepreneur. IIT Madras Alumnus. Worked with IIIT Hyderabad in productising research work as per market orientation. Key strength is connecting dots/ identifying patterns across different fields to unlock the value

Company Launches

[sync.] we built the most natural lipsync model in the world, again.

See original launch post

tldr;

lipsync-2 is the most advanced video-to-video lipsyncing model in the world
It’s zero-shot, so you don’t need to wait for an “actor”, “clone”, or “avatar” to train before using it.
Even so, it learns and generates a speaker’s unique style of speech
It works across live-action, animated, and AI-generated humans
Thousands of developers use it to build video translation, word-level editing of video, and character re-animation workflows today (including generating realistic AI UGC)
We’re launching our YC deal, 4mo’s of our scale plan for free plus $1000 in credits 🚀

https://www.youtube.com/watch?v=j5iJ2k05ltc

What did we build?

We built lipsync-2, the first in a new generation of zero-shot lipsyncing models. It seamlessly edits any person's lip movements in a video to match any audio without having to train or be fine-tuned on that person.

Zero-shot lipsync models are versatile because they edit any arbitrary person and voice without having to train or fine-tune on every speaker. But traditionally they can lose traits unique to the person, like their speaking style, skin textures, teeth, etc.

With lipsync-2, we introduce a new capability in zero-shot lipsync: style preservation. We learn a representation of how a person speaks by watching how they speak in the input video. We train a spatiotemporal transformer that encodes the different mouth shapes in the input video into a style representation. A generative transformer synthesizes new mouth movements by conditioning on the new target speech and the learned style representation.

How can you use it?

We built a simple API that let’s you build workflows around our core lipsyncing models. You submit a video and an audio (or a script and voiceID to generate audio from), and get a response with the final output.

We see thousands of developers and businesses integrating our APIs to build generative video workflows into their products and services.

[1] Video translation

Notice how even across different languages, we preserve the speaking style of Nicolas Cage. We are the first zero-shot lipsyncing model to achieve this.

https://youtu.be/GaCoHy99zT4

We can even handle long videos with multiple speakers — we built a state-of-the-art active speaker detection pipeline that associates a unique voice with a unique face, and only applies lipsync when we detect that person is actively speaking.

https://www.youtube.com/watch?v=ZaXbiKdoBz8

It also works across animated characters, from Pixar-level animations to AI generated characters.

https://www.youtube.com/watch?v=F_6lGFl6bcA

But translation is only the beginning, with the power to edit dialogue in any video in post-production we’re on the cusp of reimagining how we create, edit, and consume videos forever.

[2] Record once and edit dialogue to use forever.

https://youtu.be/HJR4BbhZ8Uo

Imagine a world where you only ever have to hit record once. lipsync-2 is the only model that let’s you edit a dialogue while preserving the original speakers style, without needing to train or fine-tune beforehand.

[3] AI video

In an age where we can generate any video by typing a few lines of text, we don’t have to limit ourselves to what we can capture with a camera.

https://youtube.com/shorts/KnzWtu3niKQ

Our YC deal

For any YC company we’re giving away our Scale Plan for free for 4 months, plus $1000 to spend on usage.

With the scale plan you get access to up to 15 concurrent jobs processing at once and handle up to 30 minute video at a time — leveraging this maximally you have the ability to generate around ~90 minutes of video per hour every hour.

Launch an AI admaker, video translation tool, or any other content generation workflow you want and serve viral load with speed, reliability, and best-in-class quality.

Email us at yc@sync.so and we’ll get you set up.

So why does this matter?

At sync, AI lipsync is just the beginning.

We live in an extraordinary age.

A high schooler can craft a masterpiece with an iPhone. A studio can produce a movie at a tenth of the cost 10x faster. Every video can be distributed worldwide in any language, instantly. Video is becoming as malleable as text.

But we have two fundamental problems to tackle before this is a reality:

[1] Large video models are great at generating entirely new scenes and worlds, but struggle with precise control and fine grained edits. The ability to make subtle, intentional adjustments – the kind that separates good content from great content – doesn’t exist yet.

[2] If video generation is world modeling, each human is a world unto themselves. We each have idiosyncrasies that make us unique — building primitives to capture, express, and modify them with high precision is the key to breaking through the uncanny valley.

We’re excited about lipsync-2, and for what’s coming up next. Reach out to founders@sync.so if you have any questions or are curious about our roadmap.

Previous Launches

sync. – AI lip sync tool for video content creators

from the team behind original Wav2Lip library

YC Photos

Hear from the founders

How did your company get started? (i.e., How did the founders meet? How did you come up with the idea? How did you decide to be a founder?)

how did we meet?Prajwal + Rudrabha worked together at IIIT-hyderabad — and became famous by shipping the world’s first model that could sync the lips in a video to any audio in the wild in any language, no training required.They formed a company w/ Pavan and then worked w/ the university to monetize state-of-the-art research coming out of the labs and bring it to market.Prady met everyone online — first by hacking together a viral app around their open source models, then collaborating on product + research for fun, to cofounding sync. + going mega-viral.Since then we’ve hacked irl across 4 different countries, across the US coasts, and moved into a hacker house in SF together.

Selected answers from sync.'s original YC application

Describe what your company does in 50 characters or less.

lipsync video to audio in any language in one-shot

How long have each of you been working on this? How much of that has been full-time? Please explain.

Prady + Pavan have been full-time on sync since June 2023

Rudrabha has been contributing greatly while finishing his PhD + joined full-time starting October 2023

Prajwal is finishing up his PhD and is joining fulltime once he completes in May of 2024 – his supervisor is Professor Andrew Zisserman (190+ citations / foremost expert in the field we are playing in. His proximity helps us stay sota + learn from the bleeding edge.

What is your company going to make? Please describe your product and what it does or will do.

we're building generative models to modify / synthesize humans in video + hosting production APIs to let anyone plug them into their own apps / platforms / services.

today we're focused on visual dubbing – we built + launched an updated lip-synchronizing model to let anyone lip-sync a video to an audio in any language in near real-time for HD videos.

as part of the AI translation stack we're used as a post processing step to sync the lips in a video to the new dubbed audio track – this lets everyone around the world experience content like it was made for them in their native language (no more bad / misaligned dubs).

in the future we plan to build + host a suite of production ready models to modify + generate a full-human body digitally in video (ex. facial expressions, head + hand + eye movements, etc.) that can be used for anything from seamless localization of content (cross-language) to generative videos

YC Winter 2024 Application Video

YC Winter 2024 Demo Day Video