MetaHuman Lip Sync Setup (OVRLipSync)¶
Known Limitation — Interview Text Input / Suggested Questions
Lip sync currently does not trigger correctly when the interview uses:
- Text-to-Speech / Speech-to-Text accessibility input
- Suggested Questions HUD selections
In these cases, the text is sent to the AI and a spoken response is returned, but the first response plays without lip sync.
Because lip sync is never started for that text-driven response path, the AI may then hear its own output and incorrectly respond a second time. That second response may play with lip sync, but this is a side effect of the limitation, not intended behavior.
This occurs because the current lip sync logic was built around the streamed audio response path and does not yet account for all text-driven interview response flows.
Overview¶
This system enables real-time lip synchronization for MetaHuman characters using the Oculus OVRLipSync plugin.
OVRLipSync analyzes incoming audio waveform data (PCM samples) and converts it into viseme weights.
These viseme weights represent the probability that a specific mouth shape is being spoken at a given moment.
The system works in the following stages:
- Audio Input
- OVRLipSync phoneme analysis
- Viseme weights returned
- Animation Blueprint processing
- ModifyCurve node applies curves
- RigLogic evaluates facial deformation
The result is real-time MetaHuman mouth animation synchronized to speech audio.
Current Limitation: Interview Text-Driven Responses¶
The current OVRLipSync implementation is designed around the streamed AI audio response pipeline, where incoming audio deltas are decoded, buffered, and continuously analyzed for viseme extraction.
However, the interview system also supports text-based input paths that do not currently trigger lip sync correctly in all cases.
Affected Cases¶
The following interview flows are currently affected:
- Text-to-Speech / Speech-to-Text accessibility mode
- Suggested Questions HUD button selections
In these cases:
- Text is sent to the AI
- A verbal response is returned
- The first spoken response plays without lip sync
- The AI may then hear that response and incorrectly respond again
- That second response may appear with lip sync
Cause¶
This happens because the current logic does not explicitly start, for text-driven interview responses, the same lip sync processing flow used for streamed conversational audio.
In other words, the implementation correctly handles the standard streamed audio path, but text-originated interview response paths do not currently initialize lip sync state the same way.
Impact¶
As a result:
- mouth animation may be missing on the first response
- the AI may incorrectly treat its own output as a new input
- duplicate responses may occur in these specific interview flows
Recommendation for Future Fix¶
A future fix should ensure that interview text-originated responses:
- initialize lip sync processing the same way as streamed conversational responses
- correctly coordinate microphone capture control before playback begins
- avoid allowing the AI to hear and respond to its own first text-driven output
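One way to satisfy all three points is to route every response source through a single session entry point, so text-driven responses get the same microphone and lip sync handling as streamed audio. The sketch below illustrates the idea; the type and function names are hypothetical, not from the project.

```cpp
#include <cassert>

// Hypothetical single entry point for every response source (sketch).
// Both streamed and text-driven responses would call the same functions.
struct ResponseSession
{
    bool bMicCapturing = true;   // microphone capture state
    bool bLipSyncActive = false; // lip sync processing state

    // Call for BOTH streamed and text-driven paths, before playback begins.
    void OnResponseAudioStarting()
    {
        bMicCapturing = false;   // stop capture first, so the AI cannot hear itself
        bLipSyncActive = true;   // initialize lip sync unconditionally
    }

    // Call when playback finishes, regardless of how the response originated.
    void OnResponseAudioFinished()
    {
        bLipSyncActive = false;
        bMicCapturing = true;    // resume listening only after playback ends
    }
};
```

Because the ordering lives in one place, neither path can forget to stop capture or start lip sync.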
What Are Visemes?¶
A viseme represents the visual mouth shape associated with one or more phonemes.
Example groupings:
| Phoneme | Viseme | Example |
|---|---|---|
| P/B/M | PP | pop |
| F/V | FF | five |
| TH | TH | think |
| D/T/N | DD | dog |
| A | AA | father |
OVRLipSync outputs 15 values:
| Index | Viseme |
|---|---|
| 0 | Silence |
| 1 | PP |
| 2 | FF |
| 3 | TH |
| 4 | DD |
| 5 | KK |
| 6 | CH |
| 7 | SS |
| 8 | NN |
| 9 | RR |
| 10 | AA |
| 11 | E |
| 12 | IH |
| 13 | OH |
| 14 | OU |
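For reference, the index layout above can be mirrored in a small enum. This is a documentation sketch only; the plugin ships its own enum type.

```cpp
#include <cassert>

// Index layout of the 15 OVRLipSync outputs (sketch; the plugin defines its own enum).
enum class EViseme : int
{
    Silence = 0,
    PP, FF, TH, DD, KK, CH, SS, NN, RR,
    AA, E, IH, OH, OU,
    Count // = 15 total outputs
};
```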
Requirements¶
- Unreal Engine 5.6+
- Windows only
⚠️ The Oculus LipSync plugin does not function on macOS.
Plugin Installation¶
Download:
https://github.com/Shiyatzu/OculusLipsyncPlugin-UE5
Copy only:
into:
UE5.6 Compatibility Fixes¶
Fix include¶
Change:
to
Add dependency¶
Edit:
Add:
Fix decompression type¶
Change:
to
Project Code Integration¶
Lip sync logic is implemented in:
These handle:
- audio streaming
- PCM buffering
- viseme extraction
- loudness calculation
- blueprint access
OVRLipSync Streaming Architecture¶
OpenAI responses are streamed as small base64 encoded audio packets ("audio deltas").
These packets arrive faster than the audio actually plays through the Unreal SoundWaveProcedural.
If lip sync were calculated only when packets arrive, the following problem occurs:
- streaming packets stop arriving once the full response is received
- however the SoundWave still contains buffered audio that continues playing
- lip sync processing stops early
- the character’s mouth stops moving while the audio is still playing
This is why the system separates responsibilities into two functions:
| Function | Responsibility |
|---|---|
| HandleAudioDeltaReceived | receives and stores streamed audio packets |
| ProcessLipSyncFrames | continuously analyzes buffered audio for lip sync |
Instead of processing lip sync immediately when packets arrive, decoded PCM samples are stored in a persistent buffer:
A repeating timer then calls ProcessLipSyncFrames() every 10 milliseconds to analyze audio frames while playback continues.
This ensures lip sync continues even after streaming packets stop arriving.
BeginPlay()¶
BeginPlay initializes the lip sync system.
Create OVRLipSync context¶
```cpp
LipSyncContext = MakeShared<UOVRLipSyncContextWrapper>(
    ovrLipSyncContextProvider_Enhanced,
    24000,
    4096
);
```
Parameters:
| Parameter | Meaning |
|---|---|
| Enhanced | higher accuracy phoneme detection |
| 24000 | audio sample rate |
| 4096 | internal buffer |
Initialize viseme array¶
OVRLipSync outputs 15 viseme values representing detected mouth shapes.
Create procedural sound wave¶
This sound wave is used to play streamed AI speech audio in real time.
HandleAudioDeltaReceived()¶
This function runs whenever OpenAI sends a new audio delta packet.
The audio arrives as base64 encoded PCM samples.
Responsibilities¶
This function handles stream processing and buffering, but does not perform lip sync analysis.
Decode base64¶
Converts the base64 audio packet into raw audio bytes.
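Inside Unreal this is typically done with the engine's `FBase64` utility; the standalone sketch below shows the same transformation without engine types, for illustration only.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal base64 decoder (sketch; an Unreal project would normally use FBase64::Decode).
std::vector<uint8_t> DecodeBase64(const std::string& In)
{
    auto Val = [](char C) -> int {
        if (C >= 'A' && C <= 'Z') return C - 'A';
        if (C >= 'a' && C <= 'z') return C - 'a' + 26;
        if (C >= '0' && C <= '9') return C - '0' + 52;
        if (C == '+') return 62;
        if (C == '/') return 63;
        return -1; // '=' padding, whitespace, or invalid
    };
    std::vector<uint8_t> Out;
    int Bits = 0, Acc = 0;
    for (char C : In)
    {
        const int V = Val(C);
        if (V < 0) continue;              // skip padding and whitespace
        Acc = (Acc << 6) | V;             // accumulate 6 bits per character
        Bits += 6;
        if (Bits >= 8)
        {
            Bits -= 8;
            Out.push_back(uint8_t((Acc >> Bits) & 0xFF)); // emit each full byte
        }
    }
    return Out;
}
```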
Queue audio for playback¶
This sends the decoded audio to the procedural SoundWave so it can play through the character's dialogue audio component.
Convert to PCM¶
The raw audio bytes are interpreted as 16-bit PCM samples, which is the format required by OVRLipSync.
Append to playback buffer¶
PCM samples are stored in a persistent buffer so they can be analyzed later.
This buffer accumulates streaming audio packets and acts as the source for lip sync frame processing.
Compute loudness¶
The loudness value is normalized to 0–1 and is used later by the animation blueprint to scale jaw-driven visemes.
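A normalized RMS loudness over one frame of 16-bit samples can be computed as below; this is a sketch of the standard formula, not the project's exact code.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// RMS loudness of a PCM frame, normalized to [0, 1] (sketch).
float ComputeLoudness(const std::vector<int16_t>& Samples)
{
    if (Samples.empty()) return 0.0f;
    double SumSq = 0.0;
    for (int16_t S : Samples)
    {
        const double N = S / 32768.0; // normalize sample to [-1, 1)
        SumSq += N * N;
    }
    return float(std::sqrt(SumSq / Samples.size())); // root of mean square
}
```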
Start lip sync timer¶
Once audio begins streaming, a timer is started that repeatedly calls ProcessLipSyncFrames() every 10 ms.
This timer ensures lip sync processing continues while buffered audio is still playing.
Timer Lifecycle¶
The lip sync timer is only started once per response:
This ensures:
- only one timer runs at a time
- no duplicate frame processing occurs
- lip sync remains stable and consistent
ProcessLipSyncFrames()¶
Unlike HandleAudioDeltaReceived, this function is responsible for actual lip sync analysis.
It is called repeatedly by a timer every 10 milliseconds.
Frame size¶
OVRLipSync expects fixed-size audio frames.
At the project's 24 kHz sample rate, one 10 ms frame equals 240 samples (24,000 × 0.010).
Extract frame from buffer¶
The function checks whether the playback buffer contains at least 240 samples.
If so, one frame is extracted from the buffer.
Send frame to OVRLipSync¶
OVRLipSync analyzes the audio frame and returns viseme weights.
These values represent the probability of different mouth shapes being spoken.
Update CurrentVisemes¶
The viseme values returned from ProcessFrame() are stored in the CurrentVisemes array.
This array is later accessed by the MetaHuman animation blueprint.
Remove processed samples¶
After analysis, the processed samples are removed from PlaybackPCMBuffer.
This keeps the buffer aligned with the current playback position.
Why Two Functions Are Required¶
Without separating these responsibilities, lip sync would fail during streaming playback.
HandleAudioDeltaReceived handles incoming audio packets.
ProcessLipSyncFrames handles continuous lip sync analysis.
Because audio playback continues after packets stop arriving, the timer-driven frame analysis ensures the character's mouth continues moving until the sound finishes playing.
HandleAudioFinished()¶
This function runs when the procedural sound wave finishes playing.
At this point the system:
- Stops the lip sync processing timer
- Clears any remaining PCM samples from the playback buffer
- Broadcasts that the AI response has finished speaking
Stopping the timer prevents lip sync processing from continuing after playback ends.
Enhanced Audio Completion Handling¶
In addition to stopping lip sync processing, this function performs full system cleanup to ensure a clean transition back to idle.
Additional Responsibilities¶
- Clears buffered PCM audio
- Resets viseme values to neutral
- Resets loudness
- Clears subtitles
- Restarts microphone capture
- Updates response state
- Broadcasts talking stopped
Result¶
This ensures:
- the face returns to a neutral pose
- no residual animation persists
- the system is ready for the next interaction
Talking State Detection (C++)¶
The system determines whether the AI is actively speaking based on the presence of streamed audio data.
A delay is applied to ensure stable completion detection:
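The idea can be sketched as a small debounced detector: the talking flag drops only after a stable gap with no new audio. The delay value below is illustrative, not the project's actual constant.

```cpp
// Talking-state detector with a completion delay (sketch; delay value is assumed).
struct TalkingDetector
{
    double SilenceDelay = 0.5;    // seconds of no audio before "talking" ends (assumed)
    double LastAudioTime = -1.0;  // timestamp of the most recent audio data
    bool bTalking = false;

    // Called whenever streamed audio data arrives.
    void OnAudioData(double Now)
    {
        LastAudioTime = Now;
        bTalking = true;
    }

    // Called periodically; drops to idle only after a stable silence gap.
    void Tick(double Now)
    {
        if (bTalking && Now - LastAudioTime >= SilenceDelay)
            bTalking = false;
    }
};
```

Short streaming gaps inside a response are shorter than the delay, so they never toggle the state.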
Purpose¶
This prevents:
- early cutoff due to streaming gaps
- jitter between talking and idle states
- incorrect animation transitions
Animation Blueprint¶
Animation Blueprint used:
Microphone Capture Control¶
To prevent feedback loops and unintended responses, microphone capture is explicitly controlled during standard AI playback.
When AI Starts Speaking¶
```cpp
AudioInputSubsystem->StopCapturing();
```
When AI Finishes Speaking¶
```cpp
AudioInputSubsystem->StartCapturing();
```
Importance¶
This is intended to prevent:
- the AI from hearing its own output
- recursive responses
- duplicate or unintended dialogue triggers
Current Limitation¶
This protection currently works for the primary streamed AI playback path, but it does not yet fully cover all interview text-driven response flows, including:
- Text-to-Speech / Speech-to-Text accessibility input
- Suggested Questions HUD selections
Because those paths do not currently initialize lip sync and playback control in the same way, the AI may still hear its own response and answer again.
AnimGraph Setup¶
Add a ModifyCurve node immediately before:
Settings:
| Setting | Value |
|---|---|
| Apply Mode | Add |
| Alpha | 1.0 |
| Curve Map | LipSyncCurves |
This node applies the curve values generated in the EventGraph.
Event Graph¶
Two events control the runtime behavior.
EventBlueprintInitializeAnimation¶
Purpose: cache a reference to the IntervieweeActor.
Steps:
- Event fires on animation initialization.
- Get Owning Actor
- Get Parent Actor
- Cast To IntervieweeActor
- Store the reference in a blueprint variable for reuse
EventBlueprintUpdateAnimation¶
Runs every animation frame and drives the facial animation curves used by the ModifyCurve node.
Step 1 — Validate Actor Reference¶
The blueprint first verifies that the cached IntervieweeActor reference is valid.
If the reference is invalid, the blueprint attempts to cast the owning actor again.
Step 2 — Retrieve Viseme Array¶
The animation blueprint retrieves the current viseme weights from C++.
This returns the 15-element viseme array generated by OVRLipSync.
Each element corresponds to a viseme index.
Index 0 represents silence and is ignored.
Step 3 — Retrieve Loudness¶
Loudness is retrieved separately from the IntervieweeActor.
The returned value is stored in:
This value represents the normalized RMS loudness of the audio frame (0-1).
Loudness is used to amplify visemes that primarily drive jaw opening.
Step 4 — Store Viseme Values¶
Each viseme index is stored in local animation blueprint variables.
- VisemePP
- VisemeFF
- VisemeTH
- VisemeDD
- VisemeKK
- VisemeCH
- VisemeSS
- VisemeNN
- VisemeRR
- VisemeAA
- VisemeE
- VisemeIH
- VisemeOH
- VisemeOU
Step 5 — Apply Multipliers¶
The raw viseme values returned by OVRLipSync are very small, so multipliers are applied.
Step 6 — Loudness Driven Visemes¶
The following visemes are also multiplied by loudness because they primarily control jaw opening:
Formula used:
Step 7 — Clamp¶
All values are clamped to prevent curve values exceeding valid ranges.
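Steps 5–7 can be combined into one shaping function. The exact loudness formula is elided in the source, so the direct multiplication used here is an assumption for illustration.

```cpp
#include <algorithm>
#include <cmath>

// Scale a raw viseme weight, optionally boost jaw-driven visemes by loudness,
// then clamp to [0, 1]. The loudness term is an ASSUMED form, not the project's.
float ShapeViseme(float Raw, float Multiplier, float Loudness, bool bJawDriven)
{
    float V = Raw * Multiplier;       // Step 5: per-viseme multiplier
    if (bJawDriven)
        V *= Loudness;                // Step 6 (assumed): scale jaw visemes by loudness
    return std::clamp(V, 0.0f, 1.0f); // Step 7: keep curve values in a valid range
}
```

For example, a raw AA weight of 0.02 with the 15.0 multiplier yields 0.3 before clamping.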
Step 8 — Apply FInterpTo Smoothing¶
The clamped values are smoothed using FInterpTo.
This prevents jitter between animation frames and produces smoother lip motion.
Dynamic Interpolation Based on Talking State¶
Interpolation speed is not constant and is now driven by the character's talking state (IsTalking).
When IsTalking = true:
- Higher interpolation speeds are used
- Visemes respond quickly to incoming audio
- Produces responsive lip sync
When IsTalking = false:
- Lower interpolation speeds are used
- Visemes gradually return to neutral
- Prevents the mouth from freezing in a partially open shape after speech ends
This is implemented using Select Float nodes that switch interpolation speed values based on IsTalking.
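The smoothing and the Select Float switching can be expressed outside the engine as below. `FInterpTo` here reproduces the behavior of Unreal's `FMath::FInterpTo`; `SelectSpeed` stands in for the Select Float nodes.

```cpp
#include <algorithm>
#include <cmath>

// Standalone equivalent of Unreal's FMath::FInterpTo: move Current toward Target
// by a fraction of the remaining distance, proportional to DeltaTime * Speed.
float FInterpTo(float Current, float Target, float DeltaTime, float Speed)
{
    if (Speed <= 0.0f) return Target;              // non-positive speed snaps to target
    const float Dist = Target - Current;
    if (Dist * Dist < 1.e-8f) return Target;       // already close enough
    const float Alpha = std::clamp(DeltaTime * Speed, 0.0f, 1.0f);
    return Current + Dist * Alpha;
}

// Interpolation speed switched on talking state, as the Select Float nodes do.
float SelectSpeed(bool bIsTalking, float TalkingSpeed, float IdleSpeed)
{
    return bIsTalking ? TalkingSpeed : IdleSpeed;
}
```

At a 50 fps animation tick (DeltaTime 0.02) and a talking speed of 17, each frame closes 34% of the remaining gap, which is why higher speeds feel more responsive.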
Post-Speech Viseme Decay (Return to Neutral)¶
A common issue in real-time lip sync systems is that the mouth can remain partially open after audio playback ends.
To solve this, an additional decay system is applied when the character is no longer speaking.
Implementation¶
- The IsTalking boolean is used to determine whether speech is active
- When IsTalking becomes false:
  - Target viseme values effectively decay toward 0
  - Interpolation speeds are reduced
- This creates a smooth return to the neutral facial pose
Blueprint Logic¶
This behavior is implemented using:
- Select Float nodes for dynamic interpolation speed
- Conditional targets that reduce viseme intensity when not talking
- FInterpTo smoothing applied every frame
Result¶
- No frozen mouth shapes after speech
- Natural relaxation of facial animation
- Improved realism between dialogue segments
Step 9 — Clear Curve Map¶
Before writing new values the map used by the ModifyCurve node is cleared.
Step 10 — Add Curve Values To Map¶
Smoothed viseme values are added to the map:
Each viseme may drive one or more MetaHuman facial curves.
Curve Mappings¶
Based on the blueprint implementation.
| Viseme | Curves |
|---|---|
| PP | CTRL_expressions_mouthLipsTogetherUL / UR / DL / DR |
| FF | CTRL_expressions_mouthLowerLipBiteL / R |
| TH | CTRL_expressions_tongueTipUp |
| DD | CTRL_expressions_jawOpen |
| KK | CTRL_expressions_jawOpen |
| CH | CTRL_expressions_mouthFunnelUL / UR / DL / DR |
| SS | CTRL_expressions_mouthLowerLipTowardsTeethL / R + CTRL_expressions_mouthUpperLipTowardsTeethL / R |
| NN | CTRL_expressions_jawOpen |
| RR | CTRL_expressions_mouthFunnelUL / UR / DL / DR |
| AA | CTRL_expressions_jawOpen |
| E | CTRL_expressions_mouthStretchL / R |
| IH | CTRL_expressions_jawOpen |
| OH | CTRL_expressions_mouthLipFunnelUL / UR / DL / DR |
| OU | CTRL_expressions_mouthLipPurseUL / UR / DL / DR |
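Building the curve map from these mappings can be sketched as below. Only two visemes (PP and AA) are shown, using curve names from the table; the full map would repeat the pattern for every row.

```cpp
#include <map>
#include <string>
#include <vector>

// Build the name->value map consumed by ModifyCurve (sketch; PP and AA only).
std::map<std::string, float> BuildCurveMap(float PP, float AA)
{
    static const std::vector<std::string> PPCurves = {
        "CTRL_expressions_mouthLipsTogetherUL", "CTRL_expressions_mouthLipsTogetherUR",
        "CTRL_expressions_mouthLipsTogetherDL", "CTRL_expressions_mouthLipsTogetherDR"};

    std::map<std::string, float> Curves;
    for (const auto& Name : PPCurves)
        Curves[Name] = PP;                       // one viseme drives four lip curves
    Curves["CTRL_expressions_jawOpen"] = AA;     // AA drives jaw opening
    return Curves;
}
```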
Viseme Multipliers¶
| Viseme | Multiplier |
|---|---|
| PP | 2.5 |
| FF | 6.0 |
| TH | 4.0 |
| DD | 4.0 |
| KK | 2.6 |
| CH | 2.4 |
| SS | 1.8 |
| NN | 2.5 |
| RR | 2.0 |
| AA | 15.0 |
| E | 2.3 |
| IH | 4.0 |
| OH | 2.0 |
| OU | 2.5 |
FInterpTo Smoothing Speeds (Talking State)¶
| Viseme | Speed |
|---|---|
| PP | 21 |
| FF | 19 |
| TH | 19 |
| DD | 21 |
| KK | 21 |
| CH | 19 |
| SS | 23 |
| NN | 19 |
| RR | 17 |
| AA | 17 |
| E | 17 |
| IH | 17 |
| OH | 19 |
| OU | 19 |
Idle (Not Talking) Speeds¶
When IsTalking is false, lower interpolation speeds are used to allow visemes to decay smoothly:
Example:
- Talking speeds: 17–23
- Idle speeds: 0–5 (or near zero depending on desired decay rate)
These values are selected dynamically in the Animation Blueprint using Select Float nodes.
Important Notes About Curve Names¶
Only curves beginning with CTRL_expressions_ should be used.
Examples:
CTRL_expressions_jawOpen
CTRL_expressions_mouthStretchL
CTRL_expressions_mouthStretchR
CTRL_expressions_mouthFunnelUL
CTRL_expressions_mouthFunnelUR
Using curves that do not follow this naming convention will not correctly drive MetaHuman facial deformation.
Finding Curve Names¶
To locate valid curves:
- Open the AnimGraph
- Right-click the ModifyCurve node
- Select Add Curve Pin
- Search for curves beginning with CTRL_expressions_
When adding curves to the LipSyncCurves map, the names must match the ModifyCurve pin names exactly.