Real-Time Multimodal Digital Human

This is a real-time multimodal digital human for dialogue and music. I built the full stack: it ingests text, audio, and video in real time, uses face recognition to personalize responses and manage attention, and synchronizes lip and facial movements with the generated speech.
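
Below is a minimal sketch of how such a per-turn loop could be wired: identify the speaker by face, personalize the reply, synthesize speech, and drive the avatar's lip sync. All helper names and bodies are illustrative placeholders I'm assuming for the example, not the project's actual components.

```python
# Hypothetical per-turn pipeline sketch; the helpers below are stubs, not the real modules.
from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    audio_chunk: bytes = b""
    video_frame: bytes = b""

def recognize_face(frame: bytes, known_faces: dict) -> str | None:
    """Placeholder: return the id of the closest known face, if any."""
    return next(iter(known_faces), None)

def generate_reply(text: str, audio: bytes, profile: dict) -> str:
    """Placeholder: produce a reply personalized with the user's profile."""
    name = profile.get("name", "there")
    return f"Hello {name}, you said: {text}"

def synthesize_speech(text: str) -> bytes:
    """Placeholder: text-to-speech returning raw audio."""
    return text.encode()

def drive_lip_sync(speech_audio: bytes) -> None:
    """Placeholder: map speech audio to lip and facial motion on the avatar."""
    pass

def handle_turn(turn: Turn, known_faces: dict) -> bytes:
    user_id = recognize_face(turn.video_frame, known_faces)   # who is speaking
    profile = known_faces.get(user_id, {})                    # per-user personalization data
    reply = generate_reply(turn.text, turn.audio_chunk, profile)
    speech = synthesize_speech(reply)
    drive_lip_sync(speech)                                    # match facial motion to the speech
    return speech

if __name__ == "__main__":
    faces = {"user_1": {"name": "Alice"}}
    handle_turn(Turn(text="Play something upbeat"), faces)
```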

The music agent tracks the user's emotional state from voice and image and logs these states for experiments. When it detects an intent to play music, it queries a backend database to select tracks that match the user's emotion and request.
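
As a hedged sketch of the selection step, the snippet below logs an emotion label (assumed to be inferred upstream from voice and image) and queries a track table filtered by mood and by the spoken request. The SQLite schema, table names, and mood labels are assumptions for illustration, not the production setup.

```python
# Illustrative music-selection step: log the inferred emotion, then pick matching tracks.
import sqlite3

def log_emotion(conn: sqlite3.Connection, user_id: str, emotion: str) -> None:
    """Append the inferred emotional state for later experiments."""
    conn.execute(
        "INSERT INTO emotion_log (user_id, emotion, ts) VALUES (?, ?, datetime('now'))",
        (user_id, emotion),
    )

def select_tracks(conn: sqlite3.Connection, emotion: str, query: str, limit: int = 5) -> list[tuple]:
    """Return tracks whose mood matches the emotion and whose title or genre matches the request."""
    return conn.execute(
        "SELECT title, artist FROM tracks "
        "WHERE mood = ? AND (title LIKE ? OR genre LIKE ?) "
        "LIMIT ?",
        (emotion, f"%{query}%", f"%{query}%", limit),
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emotion_log (user_id TEXT, emotion TEXT, ts TEXT)")
    conn.execute("CREATE TABLE tracks (title TEXT, artist TEXT, genre TEXT, mood TEXT)")
    conn.execute("INSERT INTO tracks VALUES ('Sunrise', 'Demo Artist', 'pop', 'happy')")
    log_emotion(conn, "user_1", "happy")
    print(select_tracks(conn, "happy", "pop"))
```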

I implemented the web frontend in JavaScript, and the system is now in production at a provincial court.

Direct Link