Audio streaming

The push/drain pattern every bitHuman SDK shares — push 16-bit PCM in, drain lip-synced 25 FPS frames out — with the canonical minimal Python loop and the audio/frame formats.

The push/drain pattern

Every SDK and the runtime use the same shape — audio in, video out:

  1. Push 16-bit PCM audio chunks as they arrive (mic, TTS, WebRTC).
  2. Drain lip-synced video frames at 25 FPS.

That’s the entire surface area. The same two calls drive both Essence and Expression, across Python, Swift, Kotlin, and the CLI.

push audio → engine ticks → pull frame → render

You feed PCM in as fast as it arrives and drain visual frames out on a fixed 25 FPS clock — the engine buffers between the two so your audio source and your render loop never have to stay in lockstep.
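To make the fixed clock concrete, the arithmetic below works out how much audio sits behind one frame. It is a rough sketch, not SDK code: the function names are illustrative, the 16 kHz figure is the internal rate mentioned in the loop comments further down this page, and the frame count is only an estimate (the engine may emit extra lead-in or trailing frames).

```python
FPS = 25  # frame rate stated on this page


def frame_interval_ms(fps: int = FPS) -> float:
    """Milliseconds between two consecutive frames."""
    return 1000.0 / fps


def samples_per_frame(sample_rate: int, fps: int = FPS) -> int:
    """Audio samples that span one video frame (integer division)."""
    return sample_rate // fps


def frames_for_audio(num_samples: int, sample_rate: int, fps: int = FPS) -> int:
    """Estimated frames needed to cover the audio, rounded up."""
    per_frame = samples_per_frame(sample_rate, fps)
    return -(-num_samples // per_frame)  # ceiling division


print(frame_interval_ms())                # 40.0
print(samples_per_frame(16_000))          # 640
print(frames_for_audio(48_000, 16_000))   # 75 (3 s of audio)
```

So a 10 ms chunk covers a quarter of a frame and a 40 ms chunk covers exactly one, which is why the chunk size you push does not need to match the frame cadence.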

The minimal Python loop

This is the canonical, copy-pasteable loop. Other pages link here rather than repeating it.

import asyncio, os
import numpy as np
import soundfile as sf
from bithuman import AsyncBithuman

# bithuman 2.3 is library-only — the old bithuman.audio helpers were
# removed. Inline what we need: load a WAV, downmix to mono, convert
# float32 → int16 PCM. (The SDK resamples to 16 kHz internally, so the
# loader can hand back any sample rate.)
def load_audio(path: str) -> tuple[np.ndarray, int]:
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    return audio, sr

def float32_to_int16(arr: np.ndarray) -> np.ndarray:
    return (np.clip(arr, -1.0, 1.0) * 32767.0).astype(np.int16)

async def main():
    rt = await AsyncBithuman.create(
        model_path="avatar.imx",
        api_secret=os.environ["BITHUMAN_API_SECRET"],
    )

    pcm, sr = load_audio("speech.wav")
    pcm = float32_to_int16(pcm)
    chunk = sr // 100                       # 10 ms chunks
    for i in range(0, len(pcm), chunk):
        await rt.push_audio(pcm[i:i + chunk].tobytes(), sr, last_chunk=False)
    await rt.flush()

    async for frame in rt.run():
        if frame.has_image:
            image = frame.bgr_image         # numpy (H, W, 3) uint8
        if frame.end_of_speech:
            break
    await rt.stop()

asyncio.run(main())

The on-device SDK always renders a local .imx, so create() needs model_path; you can also pass agent_code for billing attribution. Resolving an avatar purely by code (no local file) is the cloud/REST path — see Avatars & .imx.

On Debian/Ubuntu, create() failing with "Problem with the SSL CA cert" is fixed in 2.3.4: the SDK auto-discovers your distro’s CA bundle on Linux, with no configuration needed. If you are still on ≤ 2.3.3, either upgrade (recommended) or create the symlink once: sudo mkdir -p /etc/pki/tls/certs && sudo ln -s /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt. Note that CURL_CA_BUNDLE and SSL_CERT_FILE override auto-discovery when set, so a stale value breaks auth even on 2.3.4. Details in Python SDK troubleshooting.
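The minimal loop pushes the whole file before it starts draining, which suits a pre-recorded WAV. For a live source (mic, TTS), one possible arrangement is to push and drain concurrently. The sketch below uses only the calls shown in the loop (push_audio, flush, run) and assumes, without having verified it against the SDK, that run() may be consumed while push_audio is still being called. The helper names push_all, drain_turn, and stream_turn are hypothetical.

```python
import asyncio
from typing import Any, AsyncIterator


async def push_all(rt: Any, chunks: AsyncIterator[bytes], sample_rate: int) -> None:
    """Forward PCM chunks to the runtime as they arrive, then flush."""
    async for chunk in chunks:
        await rt.push_audio(chunk, sample_rate, last_chunk=False)
    await rt.flush()


async def drain_turn(rt: Any) -> int:
    """Consume frames until end_of_speech; return how many carried an image."""
    shown = 0
    async for frame in rt.run():
        if frame.has_image:
            shown += 1  # hand frame.bgr_image to your renderer here
        if frame.end_of_speech:
            break
    return shown


async def stream_turn(rt: Any, chunks: AsyncIterator[bytes], sample_rate: int) -> int:
    """Run the pusher and the drainer side by side for one turn."""
    _, shown = await asyncio.gather(
        push_all(rt, chunks, sample_rate),
        drain_turn(rt),
    )
    return shown
```

With the rt created in the loop above, a call would look like await stream_turn(rt, mic_chunks(), 16_000), where mic_chunks is a hypothetical async generator yielding int16 PCM bytes.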

Audio format

| Property | Value |
| --- | --- |
| Encoding | 16-bit signed PCM (int16) |
| Channels | Mono |
| Sample rate | Any (the SDK auto-resamples) |
| Chunk size | Anything; 10–40 ms is typical |

Push raw int16 PCM bytes plus the sample rate — the SDK resamples internally. The load_audio / float32_to_int16 helpers are inlined in the loop above; the old bithuman.audio module was removed in the 2.3 slim wheel.
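If your source hands you one large PCM buffer, a small slicer can cut it into chunks in the typical range. This is a minimal sketch in plain Python: iter_pcm_chunks and BYTES_PER_SAMPLE are illustrative names, not SDK helpers, and the input is assumed to already be mono int16 bytes.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # int16, mono


def iter_pcm_chunks(pcm: bytes, sample_rate: int, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive chunk_ms-long slices of raw int16 mono PCM."""
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("int16 PCM must contain an even number of bytes")
    samples = sample_rate * chunk_ms // 1000
    if samples <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    step = samples * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]  # the final chunk may be shorter


one_second = bytes(16_000 * BYTES_PER_SAMPLE)  # 1 s of silence at 16 kHz
chunks = list(iter_pcm_chunks(one_second, 16_000, chunk_ms=20))
print(len(chunks), len(chunks[0]))  # 50 640
```

Each yielded chunk can be passed straight to push_audio together with the sample rate.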

Frame format

Each yielded frame exposes:

| Field | Type | What it is |
| --- | --- | --- |
| bgr_image | numpy.ndarray (H, W, 3) uint8 | The rendered video frame, BGR channel order |
| audio_chunk | AudioChunk | Audio aligned with the frame. An object exposing .array (numpy samples), .bytes (raw PCM), and .duration (seconds), not raw bytes. |
| has_image | bool | False for filler frames during silence |
| end_of_speech | bool | True on the last frame of a turn |

Frames arrive at 25 FPS regardless of audio chunk size.
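Because bgr_image uses BGR channel order, display or encoding code that expects RGB needs the channel axis reversed first. The sketch below does that with plain NumPy slicing; bgr_to_rgb is an illustrative helper, not part of the SDK, and the tiny array stands in for a real frame.

```python
import numpy as np


def bgr_to_rgb(bgr: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of an (H, W, 3) image and return a contiguous copy."""
    if bgr.ndim != 3 or bgr.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) array")
    return np.ascontiguousarray(bgr[:, :, ::-1])


frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[..., 0] = 255            # pure blue in BGR order
rgb = bgr_to_rgb(frame)
print(rgb[0, 0].tolist())      # [0, 0, 255]
```

The contiguous copy is a design choice: some image libraries reject the negative-stride view that the slice alone produces.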

When the avatar isn’t speaking

During silence the runtime emits filler frames (has_image=False) so your render loop keeps its 25 FPS cadence. Skip them, or render a static idle frame.
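The second option, rendering a static idle frame, can be sketched as a small generator that picks what to show on each tick. FakeFrame is a hypothetical stand-in that carries only the two fields used here, and strings stand in for the numpy images a real frame would hold.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FakeFrame:
    """Hypothetical stand-in for an SDK frame; only the fields used below."""
    has_image: bool
    bgr_image: Optional[str] = None  # a real frame carries a numpy array


def images_to_show(frames: Iterable[FakeFrame], idle: str) -> Iterator[Optional[str]]:
    """Yield one image per tick: the frame's own image, or idle on filler frames."""
    for frame in frames:
        yield frame.bgr_image if frame.has_image else idle


ticks = [
    FakeFrame(False),
    FakeFrame(True, "talk-1"),
    FakeFrame(True, "talk-2"),
    FakeFrame(False),
]
print(list(images_to_show(ticks, idle="idle")))
# ['idle', 'talk-1', 'talk-2', 'idle']
```

To skip filler frames instead, replace the yield with a plain filter on has_image.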

Mapping to other SDKs

The push/drain shape is identical everywhere — only the language idioms change:

  • Python: await rt.push_audio(...) / async for frame in rt.run(). See the Python SDK.
  • Swift: push PCM into the chat session, receive frames on the render callback. See the Swift SDK.
  • Kotlin: same push/drain over the AAR binding (Beta). See the Kotlin SDK.

All SDKs that target the same engine ABI produce byte-equivalent frames from the same audio — see Architecture for the compatibility matrix.

Where to go next