Audio streaming
The push/drain pattern every bitHuman SDK shares — push 16-bit PCM in, drain lip-synced 25 FPS frames out — with the canonical minimal Python loop and the audio/frame formats.
The push/drain pattern
Every SDK and the runtime use the same shape — audio in, video out:
- Push 16-bit PCM audio chunks as they arrive (mic, TTS, WebRTC).
- Drain lip-synced video frames at 25 FPS.
That’s the entire surface area. The same two calls drive both Essence and Expression, across Python, Swift, Kotlin, and the CLI.
You feed PCM in as fast as it arrives and drain visual frames out on a fixed 25 FPS clock — the engine buffers between the two so your audio source and your render loop never have to stay in lockstep.
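The decoupling described above can be pictured as a byte buffer sitting between the two sides. The following is a minimal illustrative sketch, not the engine's actual implementation: the class name `PushDrainBuffer`, the 16 kHz mono assumption, and the one-slice-per-tick policy are all hypothetical and chosen only to show why the push rate and the drain rate can differ.

```python
from typing import Optional


class PushDrainBuffer:
    """Toy model of a buffer between an audio producer and a 25 FPS consumer.

    Hypothetical illustration only; the real engine's buffering is internal.
    """

    def __init__(self, sample_rate: int = 16_000, fps: int = 25) -> None:
        # One video frame covers 1/fps seconds of int16 mono audio (2 bytes per sample).
        self.bytes_per_frame = (sample_rate // fps) * 2
        self._buf = bytearray()

    def push(self, pcm: bytes) -> None:
        """Accept a chunk of any size, as fast as the source produces it."""
        self._buf.extend(pcm)

    def drain(self) -> Optional[bytes]:
        """Called once per frame tick; returns one frame's worth of audio or None."""
        if len(self._buf) < self.bytes_per_frame:
            return None  # not enough audio buffered for this tick
        frame_audio = bytes(self._buf[: self.bytes_per_frame])
        del self._buf[: self.bytes_per_frame]
        return frame_audio


buf = PushDrainBuffer()
for _ in range(8):          # eight 10 ms chunks = 80 ms of audio
    buf.push(bytes(320))    # 160 samples * 2 bytes at 16 kHz
drained = [buf.drain() for _ in range(3)]  # three 40 ms frame ticks
```

Here eight small pushes are consumed by two frame ticks, and the third tick finds the buffer empty. Neither side needs to know the other's chunk size or timing.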
The minimal Python loop
This is the canonical, copy-pasteable loop. Other pages link here rather than repeating it.
import asyncio, os

import numpy as np
import soundfile as sf
from bithuman import AsyncBithuman


# bithuman 2.3 is library-only; the old bithuman.audio helpers were
# removed. Inline what we need: load a WAV, downmix to mono, convert
# float32 → int16 PCM. (The SDK resamples to 16 kHz internally, so the
# loader can hand back any sample rate.)
def load_audio(path: str) -> tuple[np.ndarray, int]:
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    return audio, sr


def float32_to_int16(arr: np.ndarray) -> np.ndarray:
    return (np.clip(arr, -1.0, 1.0) * 32767.0).astype(np.int16)


async def main():
    rt = await AsyncBithuman.create(
        model_path="avatar.imx",
        api_secret=os.environ["BITHUMAN_API_SECRET"],
    )

    pcm, sr = load_audio("speech.wav")
    pcm = float32_to_int16(pcm)

    chunk = sr // 100  # 10 ms chunks
    for i in range(0, len(pcm), chunk):
        await rt.push_audio(pcm[i:i + chunk].tobytes(), sr, last_chunk=False)
    await rt.flush()

    async for frame in rt.run():
        if frame.has_image:
            image = frame.bgr_image  # numpy (H, W, 3) uint8
        if frame.end_of_speech:
            break

    await rt.stop()


asyncio.run(main())
The on-device SDK always renders a local .imx, so create() needs model_path; you can also pass agent_code for billing attribution. Resolving an avatar purely by code (no local file) is the cloud/REST path — see Avatars & .imx.
Debian/Ubuntu
create() failing with "Problem with the SSL CA cert" is fixed in 2.3.4: the SDK auto-discovers your distro’s CA bundle on Linux, no configuration needed. If you are on ≤ 2.3.3, either upgrade (recommended) or symlink once: sudo mkdir -p /etc/pki/tls/certs && sudo ln -s /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt. Note that CURL_CA_BUNDLE / SSL_CERT_FILE override auto-discovery when set, so a stale value breaks auth even on 2.3.4. Details in Python SDK troubleshooting.
Audio format
| Property | Value |
|---|---|
| Encoding | 16-bit signed PCM (int16) |
| Channels | Mono |
| Sample rate | Any (the SDK auto-resamples) |
| Chunk size | Anything; 10–40 ms is typical |
Push raw int16 PCM bytes plus the sample rate — the SDK resamples internally. The load_audio / float32_to_int16 helpers are inlined in the loop above; the old bithuman.audio module was removed in the 2.3 slim wheel.
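As a sketch of how the chunk sizes in the table translate into byte counts, the helper below slices a mono int16 PCM buffer into fixed-duration pieces. The function name `iter_pcm_chunks` and its `chunk_ms` default are hypothetical, not part of the SDK.

```python
from typing import Iterator


def iter_pcm_chunks(pcm: bytes, sample_rate: int, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive slices of mono int16 PCM, each about chunk_ms long."""
    bytes_per_sample = 2  # int16
    samples_per_chunk = sample_rate * chunk_ms // 1000
    step = samples_per_chunk * bytes_per_sample
    if step <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    # Stepping by a whole number of samples keeps every slice sample-aligned.
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


one_second = bytes(16_000 * 2)  # 1 s of silence at 16 kHz, mono, int16
chunks = list(iter_pcm_chunks(one_second, 16_000, chunk_ms=20))
```

Each yielded slice could then be handed to rt.push_audio(chunk, sample_rate, last_chunk=False) exactly as in the minimal loop above. The final slice may be shorter than the rest when the buffer length is not a multiple of the chunk size.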
Frame format
Each yielded frame exposes:
| Field | Type | What it is |
|---|---|---|
| bgr_image | numpy.ndarray (H, W, 3) uint8 | The rendered video frame, BGR channel order |
| audio_chunk | AudioChunk | Audio aligned with the frame. An object exposing .array (numpy samples), .bytes (raw PCM), and .duration (seconds), not raw bytes. |
| has_image | bool | False for filler frames during silence |
| end_of_speech | bool | True on the last frame of a turn |
Frames arrive at 25 FPS regardless of audio chunk size.
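The fixed rate makes the timing arithmetic simple: each frame covers 1/25 s = 40 ms, however the audio was chunked on the way in. The helper below is a rough estimate only, assuming one frame per 40 ms of pushed audio rounded up; the name `estimated_frames` is hypothetical, and the count the runtime actually emits may differ (for example because of filler frames).

```python
FPS = 25
FRAME_INTERVAL_S = 1 / FPS  # 0.04 s, i.e. 40 ms per frame


def estimated_frames(num_samples: int, sample_rate: int, fps: int = FPS) -> int:
    """Estimate how many frames are needed to cover the audio, rounding up."""
    return -(-num_samples * fps // sample_rate)  # integer ceiling division


one_second_16k = estimated_frames(16_000, 16_000)
three_seconds_48k = estimated_frames(48_000 * 3, 48_000)
```

Under that assumption, one second of audio maps to 25 frames at any sample rate, because the estimate depends only on duration.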
When the avatar isn’t speaking
During silence the runtime emits filler frames (has_image=False) so your render loop keeps its 25 FPS cadence. Skip them, or render a static idle frame.
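A minimal sketch of the "render a static idle frame" option follows. `FakeFrame` is a hypothetical stand-in that mirrors only the has_image, bgr_image, and end_of_speech fields from the table above, and plain strings stand in for the numpy images so the example runs without the SDK.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Optional


@dataclass
class FakeFrame:
    """Hypothetical stand-in for an SDK frame; only the fields used here."""
    has_image: bool
    bgr_image: Optional[Any] = None
    end_of_speech: bool = False


def images_to_show(frames: Iterable[FakeFrame], idle_image: Any) -> Iterator[Any]:
    """Yield exactly one image per frame tick, substituting idle_image for fillers."""
    for frame in frames:
        yield frame.bgr_image if frame.has_image else idle_image


ticks = [FakeFrame(True, "speaking-1"), FakeFrame(False), FakeFrame(True, "speaking-2")]
shown = list(images_to_show(ticks, idle_image="idle"))
```

Because one image is produced per tick either way, the render loop keeps a steady cadence. To skip filler frames instead, the generator would simply not yield when has_image is False.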
Mapping to other SDKs
The push/drain shape is identical everywhere — only the language idioms change:
- Python: await rt.push_audio(...) / async for frame in rt.run(). See the Python SDK.
- Swift: push PCM into the chat session, receive frames on the render callback. See the Swift SDK.
- Kotlin: same push/drain over the AAR binding (Beta). See the Kotlin SDK.
All SDKs that target the same engine ABI produce byte-equivalent frames from the same audio — see Architecture for the compatibility matrix.
Where to go next
- Agent lifecycle — generate an agent, then stream it.
- Quickstart — your first avatar in ~2 minutes.
- Browser rendering — run the same lip-sync pipeline client-side in WASM.