Twitter/X

@fchollet:

MaineCoon is the first video model that focuses on social interactions: facial expressions, emotions, fluid conversation, audio-lip sync, etc. Really impressive inference specs: 22B params, 47.5 FPS on a single H100. Generates in real-time at <$0.001/sec.

They achieve this with an agentic streaming inference framework with 3 different auxiliary models to manage the cache and lookahead buffer. Super cool work.

Catnip (@catnips_ai)

Most AI video today is still:
prompt → wait → watch a clip.

MaineCoon is built for something different:
prompt → talk → interact in real time.

In our vision, the character is not a fixed video clip that just waits for your input. It keeps generating voice, expression, and motion on its own.

That is why AI video starts feeling less like content — and more like someone you can actually hang out with.

The first step toward that goal is MaineCoon, a real-time interactive audio-visual model built for streaming generation, so it can interact with you.

1⃣ Up to 47.5 FPS on a single H100 GPU
2⃣ Audio-visual generation cost below $0.001 / second
3⃣ Long-duration streaming generation for 1000s+ seconds
4⃣ Continuous audio, motion, expression, and visual alignment
5⃣ SOTA performance on SocialVideo Bench

From passive video to real-time AI presence.
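
For a rough sense of scale, the quoted figures can be turned into a per-frame time budget and a cost ceiling. This is only a back-of-the-envelope sketch: the 25 FPS playback rate is an assumption (the post does not state one), the cost is read as a per-second-of-stream upper bound, and the function names are illustrative.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the post.
GENERATION_FPS = 47.5         # "up to 47.5 FPS on a single H100"
COST_PER_SECOND_USD = 0.001   # stated upper bound: "below $0.001 / second"
STREAM_SECONDS = 1000         # "1000s+ seconds" of streaming
PLAYBACK_FPS = 25.0           # ASSUMPTION: playback rate is not given in the post


def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per generated frame, in milliseconds."""
    return 1000.0 / fps


def realtime_headroom(generation_fps: float, playback_fps: float) -> float:
    """How many times faster than playback the model generates frames."""
    return generation_fps / playback_fps


def cost_ceiling_usd(seconds: float, cost_per_second: float) -> float:
    """Upper bound on cost for a stream of the given length."""
    return seconds * cost_per_second


if __name__ == "__main__":
    print(f"frame budget: {frame_budget_ms(GENERATION_FPS):.2f} ms")
    print(f"headroom vs assumed playback: {realtime_headroom(GENERATION_FPS, PLAYBACK_FPS):.2f}x")
    print(f"frames in {STREAM_SECONDS} s: {int(STREAM_SECONDS * GENERATION_FPS)}")
    print(f"cost ceiling per hour: ${cost_ceiling_usd(3600, COST_PER_SECOND_USD):.2f}")
```

At 47.5 FPS the average budget is about 21 ms per frame, and a cost below $0.001 per second works out to less than $3.60 per hour of stream.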

Want to try MaineCoon?
Learn more and apply for early access: mainecoon.tech

Share a great MaineCoon video on X and tag @catnips_ai to get 2 extra codes.

— https://nitter.net/catnips_ai/status/2068015962717315126#m