Essence vs Expression

The two bitHuman avatar models — what each does, where each runs, and which one to pick.

The engines

bitHuman’s avatar runtime is a family of rendering engines plus the conversation and voice stack that feeds them. When you package an avatar, you choose between two render engines: Essence and Expression.

Rendering engines — two product families, each with tiers:

  • Essence — the avatar family (a packaged .imx identity with real-time lip-sync):
    • Essence 1 — the default. Pre-built identity, runs on virtually any CPU.
    • Essence 2 Quality — the high-fidelity premium renderer (cloud GPU).
    • Essence 2 Mobile — the efficient distilled renderer (Apple-Silicon Neural Engine primary, with elastic GPU overflow).
  • Expression — the expressive family (animation driven from a portrait at runtime):
    • Expression 1 — dynamic facial animation from any portrait image (Apple Silicon or NVIDIA GPU).
    • Expression 2 — the real-time generative engine: fully-generated motion rather than patching a pre-rendered base.

Both families share one .imx format, the same SDK methods, and the same push audio → drain frames shape; the tier is selected per session and is transparent to your integration. (A separate self-hosted Flash GPU tier is metered per the pricing table.)
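The push audio → drain frames shape can be pictured with a toy renderer. This is a minimal sketch and not the bitHuman SDK: the class ToyRenderer, its method names, and the 16 kHz sample rate are assumptions made for illustration; only the 25 FPS figure comes from this page.

```python
from collections import deque
from dataclasses import dataclass, field

FPS = 25                                # frame rate quoted on this page for Essence
SAMPLE_RATE = 16_000                    # assumption: 16 kHz mono PCM input
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS  # 640 audio samples per video frame


@dataclass
class ToyRenderer:
    """Hypothetical stand-in for a renderer; not the bitHuman SDK."""
    _pending: list = field(default_factory=list)   # audio not yet consumed
    _frames: deque = field(default_factory=deque)  # frames waiting to be drained
    _next_index: int = 0

    def push_audio(self, samples):
        """Buffer audio and emit one placeholder frame per full 40 ms window."""
        self._pending.extend(samples)
        while len(self._pending) >= SAMPLES_PER_FRAME:
            window = self._pending[:SAMPLES_PER_FRAME]
            del self._pending[:SAMPLES_PER_FRAME]
            self._frames.append({"index": self._next_index, "audio": window})
            self._next_index += 1

    def drain_frames(self):
        """Yield every frame produced so far, oldest first."""
        while self._frames:
            yield self._frames.popleft()


renderer = ToyRenderer()
renderer.push_audio([0] * 1000)   # one full window, 360 samples left over
first = list(renderer.drain_frames())
renderer.push_audio([0] * 1000)   # 360 + 1000 = 1360: two more windows, 80 left
second = list(renderer.drain_frames())
```

The caller never asks for a specific frame. It pushes whatever audio it has and drains whatever frames are ready, which is why the same calling code can sit in front of different engines.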

Conversation + voice stack — drives a managed agent and feeds the renderers:

  • Converse — the STT → LLM → TTS turn loop that drives a managed agent’s dialogue. It produces the audio that the renderers lip-sync.
  • Voice — the speech engine (the voice/TTS stack behind audio-only chat and the voices you select for an agent).
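As a rough illustration of the STT → LLM → TTS turn loop, the sketch below chains three stages into one turn. The function run_turn and the stub stages are hypothetical and stand in for real speech and language models; they are not Converse or Voice APIs.

```python
from typing import Callable


def run_turn(user_audio: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: transcribe, generate a reply, synthesize speech."""
    transcript = stt(user_audio)
    reply_text = llm(transcript)
    return tts(reply_text)  # this audio is what a renderer would lip-sync


# Stub stages so the sketch runs without any model (illustrative only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"hello", fake_stt, fake_llm, fake_tts)
```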

The rest of this page focuses on Essence vs Expression — the two you choose between when packaging an avatar.

At a glance

bitHuman’s two avatar models share the same .imx file format, the same SDK methods, and the same push audio → drain frames shape. Essence is the default: it runs on virtually every CPU and powers the showcase avatars that bithuman pull downloads. Expression is the heavier, higher-fidelity option for specific on-device Apple Silicon or GPU-server use cases.

| | Essence (default) | Expression |
|---|---|---|
| What it does | Pre-built avatar identity packaged in an .imx file. Real-time lip-sync. | Dynamic facial animation from any portrait image at runtime. |
| Avatar source | .imx you build once from a photo or video. | Any face image, provided at runtime with no build step. |
| Custom gestures | Yes (wave, nod, laugh, etc.) | No |
| Idle animation | Pre-recorded natural movement | AI-generated micro-movements |
| Compute needed | Any modern CPU | Apple Silicon M3+ (demo apps) or NVIDIA GPU |
| Memory footprint | Low (~200–500 MB) | Higher (~2–6 GB) |
| Best for | Kiosks, mobile, edge, 24/7 deployments, high concurrency | Close-up native consumer apps, custom faces per session |
| Pricing | 1 credit/min self-hosted · 2 credits/min cloud | 2 credits/min self-hosted · 4 credits/min cloud |

Both models are available through every integration surface: SDKs, REST API, LiveKit plugin, CLI, on-device, and embed widget. The same .imx file works across all of them.
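The pricing figures in the table above translate into a simple usage estimate. The sketch below is a back-of-the-envelope calculation only: the helper estimate_credits is hypothetical, the session numbers are made up, and it assumes plain per-minute billing with no rounding rules, minimums, or discounts. It does not cover the separate Flash GPU tier.

```python
# Credits per minute, copied from the pricing figures on this page.
CREDITS_PER_MINUTE = {
    ("essence", "self-hosted"): 1,
    ("essence", "cloud"): 2,
    ("expression", "self-hosted"): 2,
    ("expression", "cloud"): 4,
}


def estimate_credits(model, hosting, minutes_per_session, sessions):
    """Total credits, assuming simple linear per-minute metering."""
    rate = CREDITS_PER_MINUTE[(model, hosting)]
    return rate * minutes_per_session * sessions


# Example workload (illustrative numbers): 1,000 sessions of 5 minutes each.
monthly = {
    key: estimate_credits(*key, minutes_per_session=5, sessions=1000)
    for key in CREDITS_PER_MINUTE
}
```

Under these assumptions, the same workload costs four times as many credits on cloud Expression as on self-hosted Essence.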

Where each model runs

| Surface | Essence | Expression |
|---|---|---|
| iOS / iPadOS | iPhone 16 Pro+, iPad Pro M4+ | iPad Pro M4+ (iPhone 16 Pro+ preview) |
| macOS arm64 | Any Apple Silicon | M3+ |
| macOS Intel | Pending (2.3 ships arm64 only) | |
| Android | arm64-v8a, Android 10+ | |
| Linux x86_64 / aarch64 | Any modern CPU | via NVIDIA GPU (Docker) |
| Windows | Pending (use WSL2 today) | |
| Raspberry Pi 4B+ | Supported | |
| bitHuman Cloud | Managed | Managed |
| Self-hosted CPU | Python SDK / LiveKit plugin | |
| Self-hosted GPU | | Docker container |

Native macOS-Intel and Windows wheels are pending for the 2.3 line; the architecture page tracks per-platform shipping status. On iPhone, Essence delivers a fast, real-time on-device avatar; Expression’s heavier renderer targets iPad Pro and Mac (iPhone is in preview).

Essence

Essence packages a complete avatar identity (face, body, gestures) into an .imx file. At runtime, the SDK plays back pre-rendered base motion and patches the mouth region in real time to match incoming audio.

Runtime characteristics

  • ~200–500 MB resident, 1–2 CPU cores, real-time at 25 FPS.
  • Runs on macOS arm64, Linux x86_64 / aarch64, iOS, iPadOS, Android, Raspberry Pi 4B+, and in the browser via WASM.
  • No idle timeout — sessions can run 24/7. Reliable for unattended kiosks and lobby displays.
  • Supports custom gestures (wave, nod, laugh) triggered by keywords or API.
  • Predictable, consistent behavior. Lower per-stream cost — the right pick for high-concurrency self-hosted deployments.
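The per-stream figures above give a rough upper bound on how many Essence streams a host can carry. This is a sizing sketch, not a benchmark: the function max_streams and the example host are hypothetical, and it assumes resource use scales linearly with no headroom reserved for the OS, networking, or the conversation stack.

```python
def max_streams(host_cores, host_memory_mb, cores_per_stream, memory_mb_per_stream):
    """Streams that fit on a host, limited by whichever resource runs out first."""
    by_cpu = host_cores // cores_per_stream
    by_memory = host_memory_mb // memory_mb_per_stream
    return min(by_cpu, by_memory)


# Example host (illustrative): 16 cores, 32 GB RAM.
HOST_CORES = 16
HOST_MEMORY_MB = 32 * 1024

# Upper end of the quoted range: 2 cores and 500 MB per stream.
conservative = max_streams(HOST_CORES, HOST_MEMORY_MB, 2, 500)
# Lower end of the quoted range: 1 core and 200 MB per stream.
optimistic = max_streams(HOST_CORES, HOST_MEMORY_MB, 1, 200)
```

On this example host both estimates are CPU-bound (8 and 16 streams), so adding memory would not raise the ceiling.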

Try it from the showcase

The CLI ships a curated set of ready-to-run Essence .imx avatars:

bithuman list                          # browse the showcase
bithuman pull modern-court-jester      # downloads to ~/.cache/bithuman/showcase/<slug>.imx
bithuman run modern-court-jester.imx   # live browser-served avatar

Expression

Expression generates real-time facial animation directly from a portrait image. The face can change between sessions or even mid-session — no avatar build step is required.

Runtime characteristics

  • ~2–6 GB resident; needs Apple Silicon M3+ (Mac) / M4+ (iPad Pro) or an NVIDIA GPU (8 GB+ VRAM).
  • Works with any face image, such as a photo or a video frame, and supports drag-and-drop swaps.
  • AI-driven expressions adapt to speech content and emotional context.
  • Higher visual fidelity for close-up conversational interactions.
  • On-device demo apps target macOS M3+ and iPad Pro M4+ today; iPhone Expression and macOS-Intel are on the way.
  • On Apple Silicon the Swift SDK auto-spawns a bithuman-expression-daemon subprocess to drive the model.
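The hardware minimums above can be expressed as a pre-flight check. The sketch below is illustrative only: the Device type and meets_expression_minimum are hypothetical names, not SDK calls, and the function treats every other device (including iPhone, where Expression is still in preview) as not meeting the minimum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical device description used only for this sketch."""
    kind: str                                  # "mac", "ipad_pro", or "nvidia"
    apple_m_generation: Optional[int] = None   # 3 for M3, 4 for M4, ...
    vram_gb: Optional[float] = None            # NVIDIA GPU memory


def meets_expression_minimum(device: Device) -> bool:
    """Compare a device against the Expression minimums quoted on this page."""
    if device.kind == "mac":
        return (device.apple_m_generation or 0) >= 3   # Apple Silicon M3+
    if device.kind == "ipad_pro":
        return (device.apple_m_generation or 0) >= 4   # iPad Pro M4+
    if device.kind == "nvidia":
        return (device.vram_gb or 0) >= 8              # 8 GB+ VRAM
    return False  # anything else is outside the documented minimums


checks = {
    "mac_m2": meets_expression_minimum(Device("mac", apple_m_generation=2)),
    "mac_m3": meets_expression_minimum(Device("mac", apple_m_generation=3)),
    "ipad_m2": meets_expression_minimum(Device("ipad_pro", apple_m_generation=2)),
    "gpu_8gb": meets_expression_minimum(Device("nvidia", vram_gb=8)),
}
```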

Which should I use?

24/7 kiosk or always-on display

Essence. No idle timeout, runs on CPU, predictable for unattended deployments.

iPhone app

Essence. Expression on iPhone is still in preview; iPad Pro and Mac are its on-device homes.

Android app

Essence via the Kotlin SDK (Beta).

Native Mac or iPad app with close-up dynamic faces

Expression on-device via the Swift SDK or the Mac/iPad reference apps.

Need custom gestures (wave, nod, laugh)

Essence. Gestures are triggered by keyword or API; Expression does not support custom gestures.

Quickest setup with any face photo

Expression via the cloud plugin. Pass the image at session start — no build step.

Voice agent on LiveKit with maximum concurrency

Essence. Lower per-stream cost makes it the right pick for high-concurrency deployments.

Edge hardware (Raspberry Pi, low-power laptop)

Essence. Runs on 1–2 CPU cores at 25 FPS.

Highest visual quality for offline video generation

Expression with quality="high". Best for offline batch jobs rather than real-time streaming.
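The rules of thumb above can be condensed into a small decision helper. This is one possible encoding, not an official selector: pick_model and its parameter names are hypothetical, and the tie-breaking order (cases the page routes to Essence win over cases it routes to Expression) is an assumption of this sketch.

```python
# Platforms this page routes to Essence.
ESSENCE_PLATFORMS = {"iphone", "android", "raspberry-pi"}


def pick_model(*, platform=None, always_on=False, needs_gestures=False,
               high_concurrency=False, runtime_face=False,
               offline_high_quality=False):
    """Return "essence" or "expression" following the guidance on this page."""
    # Cases the page routes to Essence take priority (assumed tie-break).
    if (platform in ESSENCE_PLATFORMS or always_on
            or needs_gestures or high_concurrency):
        return "essence"
    # Cases the page routes to Expression.
    if runtime_face or offline_high_quality:
        return "expression"
    return "essence"  # Essence is the documented default


kiosk = pick_model(always_on=True)
iphone_with_custom_face = pick_model(platform="iphone", runtime_face=True)
mac_close_up = pick_model(platform="mac", runtime_face=True)
batch_video = pick_model(offline_high_quality=True)
```

Requests that conflict, such as custom gestures together with a face supplied at runtime, fall back to Essence here because Expression does not support gestures.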

Where to go next