Essence vs Expression
The two bitHuman avatar models — what each does, where each runs, and which one to pick.
The engines
bitHuman’s avatar runtime is a family of rendering engines plus the conversation and voice stack that feeds them. The two render engines you choose between when packaging an avatar — and the focus of the rest of this page — are Essence and Expression.
Rendering engines — two product families, each with tiers:
- Essence — the avatar family (a packaged .imx identity with real-time lip-sync):
  - Essence 1 — the default. Pre-built identity, runs on virtually any CPU.
  - Essence 2 Quality — the high-fidelity premium renderer (cloud GPU).
  - Essence 2 Mobile — the efficient distilled renderer (Apple-Silicon Neural Engine primary, with elastic GPU overflow).
- Expression — the expressive family (animation driven from a portrait at runtime):
  - Expression 1 — dynamic facial animation from any portrait image (Apple Silicon or NVIDIA GPU).
  - Expression 2 — the real-time generative engine: fully-generated motion rather than patching a pre-rendered base.
Each family shares one .imx format, SDK methods, and the push audio → drain frames
shape; the tier is selected per session and is transparent to your integration. (A
separate self-hosted Flash GPU tier is metered per the pricing table.)
Conversation + voice stack — drives a managed agent and feeds the renderers:
- Converse — the STT → LLM → TTS turn loop that drives a managed agent’s dialogue. It produces the audio that the renderers lip-sync.
- Voice — the speech engine (the voice/TTS stack behind audio-only chat and the voices you select for an agent).
The rest of this page focuses on Essence vs Expression — the two you choose between when packaging an avatar.
At a glance
bitHuman’s two avatar models share the same .imx file format, the same SDK methods, and the same push audio → drain frames shape. Essence is the default: it runs on virtually any CPU, and the showcase avatars that bithuman pull downloads are Essence models. Expression is the heavier, high-fidelity option for on-device Apple Silicon or GPU-server deployments.
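The push audio → drain frames shape can be pictured with a small stand-in. The sketch below does not call the bitHuman SDK: `StubRuntime`, `push_audio`, and `drain_frames` are hypothetical names chosen for illustration, and the 16 kHz mono 16-bit audio format is an assumption. Only the 25 FPS figure comes from this page.

```python
from collections import deque

FPS = 25                                   # frame rate quoted on this page for Essence
SAMPLE_RATE = 16_000                       # assumption: 16 kHz mono, 16-bit PCM
BYTES_PER_FRAME = SAMPLE_RATE * 2 // FPS   # audio bytes that cover one video frame


class StubRuntime:
    """Hypothetical stand-in for an avatar runtime (not the bitHuman SDK)."""

    def __init__(self):
        self._buffer = bytearray()
        self._frames = deque()

    def push_audio(self, pcm: bytes) -> None:
        """Buffer audio; each full frame's worth of audio produces one 'frame'."""
        self._buffer.extend(pcm)
        while len(self._buffer) >= BYTES_PER_FRAME:
            chunk = bytes(self._buffer[:BYTES_PER_FRAME])
            del self._buffer[:BYTES_PER_FRAME]
            self._frames.append(chunk)     # a real renderer would emit an image here

    def drain_frames(self):
        """Yield every frame that is ready, then stop."""
        while self._frames:
            yield self._frames.popleft()


runtime = StubRuntime()
runtime.push_audio(bytes(SAMPLE_RATE * 2))   # one second of silence
frames = list(runtime.drain_frames())
print(len(frames))                           # 25
```

The point of the shape is that the caller only pushes audio in and pulls frames out, so swapping the model behind the runtime does not change the calling code.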
| | Essence (default) | Expression |
|---|---|---|
| What it does | Pre-built avatar identity packaged in an .imx file. Real-time lip-sync. | Dynamic facial animation from any portrait image at runtime. |
| Avatar source | .imx you build once from a photo or video. | Any face image — provide at runtime, no build step. |
| Custom gestures | Yes (wave, nod, laugh, etc.) | No |
| Idle animation | Pre-recorded natural movement | AI-generated micro-movements |
| Compute needed | Any modern CPU | Apple Silicon M3+ (demo apps) or NVIDIA GPU |
| Memory footprint | Low (~200–500 MB) | Higher (~2–6 GB) |
| Best for | Kiosks, mobile, edge, 24/7 deployments, high concurrency | Close-up native consumer apps, custom faces per session |
| Pricing | 1 credit/min self-hosted · 2 credits/min cloud | 2 credits/min self-hosted · 4 credits/min cloud |
Both ship to every surface — SDKs, REST API, LiveKit plugin, CLI, on-device, embed widget. The same .imx file works everywhere.
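As a rough worked example of the Pricing row, the sketch below multiplies the per-minute rate by session length and stream count. The rates are copied from the table; the function name `estimate_credits` and the assumption that billing is linear with no rounding are illustrative only, so treat the pricing page as the authority.

```python
# Credits per minute, copied from the Pricing row of the comparison table.
CREDITS_PER_MINUTE = {
    ("essence", "self-hosted"): 1,
    ("essence", "cloud"): 2,
    ("expression", "self-hosted"): 2,
    ("expression", "cloud"): 4,
}


def estimate_credits(model: str, hosting: str, minutes: float, streams: int = 1) -> float:
    """Estimate credits as rate x minutes x concurrent streams.

    Assumption: usage is billed linearly, with no rounding or minimums.
    """
    rate = CREDITS_PER_MINUTE[(model, hosting)]
    return rate * minutes * streams


print(estimate_credits("essence", "self-hosted", minutes=60))            # 60
print(estimate_credits("expression", "cloud", minutes=30, streams=2))    # 240
```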
Where each model runs
| Surface | Essence | Expression |
|---|---|---|
| iOS / iPadOS | iPhone 16 Pro+, iPad Pro M4+ | iPad Pro M4+ (iPhone 16 Pro+ preview) |
| macOS arm64 | Any Apple Silicon | M3+ |
| macOS Intel | Pending (2.3 ships arm64 only) | — |
| Android | arm64-v8a, Android 10+ | — |
| Linux x86_64 / aarch64 | Any modern CPU | via NVIDIA GPU (Docker) |
| Windows | Pending (use WSL2 today) | — |
| Raspberry Pi 4B+ | Supported | — |
| bitHuman Cloud | Managed | Managed |
| Self-hosted CPU | Python SDK / LiveKit plugin | — |
| Self-hosted GPU | — | Docker container |
Native macOS-Intel and Windows wheels are pending for the 2.3 line; the architecture page tracks per-platform shipping status. On iPhone, Essence delivers a fast, real-time on-device avatar; Expression’s heavier renderer targets iPad Pro and Mac (iPhone is in preview).
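If you need to gate features by platform in your own code, the matrix above can be carried as plain data. This is a sketch of one way to do that; `SUPPORT` and `available_models` are hypothetical names, and the values are transcribed from the table rather than queried from any bitHuman API.

```python
# Transcribed from the "Where each model runs" table. None stands for a dash
# (not available); notes starting with "Pending" are not shipped yet.
SUPPORT = {
    "iOS / iPadOS": {"essence": "iPhone 16 Pro+, iPad Pro M4+",
                     "expression": "iPad Pro M4+ (iPhone 16 Pro+ preview)"},
    "macOS arm64": {"essence": "Any Apple Silicon", "expression": "M3+"},
    "macOS Intel": {"essence": "Pending (2.3 ships arm64 only)", "expression": None},
    "Android": {"essence": "arm64-v8a, Android 10+", "expression": None},
    "Linux x86_64 / aarch64": {"essence": "Any modern CPU",
                               "expression": "via NVIDIA GPU (Docker)"},
    "Windows": {"essence": "Pending (use WSL2 today)", "expression": None},
    "Raspberry Pi 4B+": {"essence": "Supported", "expression": None},
    "bitHuman Cloud": {"essence": "Managed", "expression": "Managed"},
    "Self-hosted CPU": {"essence": "Python SDK / LiveKit plugin", "expression": None},
    "Self-hosted GPU": {"essence": None, "expression": "Docker container"},
}


def available_models(surface: str) -> list[str]:
    """Return the models the table lists as shipping on a surface today."""
    row = SUPPORT[surface]
    return [model for model, note in row.items()
            if note is not None and not note.startswith("Pending")]


print(available_models("Android"))          # ['essence']
print(available_models("bitHuman Cloud"))   # ['essence', 'expression']
print(available_models("Windows"))          # []
```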
Essence
Essence packages a complete avatar identity (face, body, gestures) into an .imx file. At runtime, the SDK plays back pre-rendered base motion and patches the mouth region in real time to match incoming audio.
Runtime characteristics
- ~200–500 MB resident, 1–2 CPU cores, real-time at 25 FPS.
- Runs on macOS arm64, Linux x86_64 / aarch64, iOS, iPadOS, Android, Raspberry Pi 4B+, and in the browser via WASM.
- No idle timeout — sessions can run 24/7. Reliable for unattended kiosks and lobby displays.
- Supports custom gestures (wave, nod, laugh) triggered by keywords or API.
- Predictable, consistent behavior. Lower per-stream cost — the right pick for high-concurrency self-hosted deployments.
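For capacity planning, the per-stream figures above give a first-order estimate of how many Essence streams a host can carry. The sketch below uses the conservative end of each range (2 cores and 500 MB per stream) and assumes streams scale linearly with nothing reserved for the OS; both assumptions and the function name `max_essence_streams` are illustrative, so measure on real hardware before sizing a deployment.

```python
def max_essence_streams(cpu_cores: int, ram_mb: int,
                        cores_per_stream: float = 2.0,
                        mb_per_stream: int = 500) -> int:
    """Estimate concurrent Essence streams from whichever resource runs out first.

    Defaults are the upper ends of the ranges quoted on this page
    (1-2 CPU cores, ~200-500 MB resident per stream).
    """
    by_cpu = int(cpu_cores // cores_per_stream)
    by_ram = ram_mb // mb_per_stream
    return max(0, min(by_cpu, by_ram))


print(max_essence_streams(cpu_cores=16, ram_mb=8192))   # 8 (CPU-bound)
print(max_essence_streams(cpu_cores=4, ram_mb=700))     # 1 (RAM-bound)
```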
Try it from the showcase
The CLI ships a curated set of ready-to-run Essence .imx avatars:
bithuman list # browse the showcase
bithuman pull modern-court-jester # downloads to ~/.cache/bithuman/showcase/<slug>.imx
bithuman run modern-court-jester.imx # live browser-served avatar
How to ship it
- Python SDK — self-host on macOS arm64 + Linux x86_64 / aarch64.
- Swift SDK — native Mac, iPad, iPhone apps.
- Kotlin SDK — native Android apps (Beta).
- bitHuman CLI — no code, terminal or browser.
- REST API — backend integration in any language.
- Cloud LiveKit plugin — managed, no infrastructure.
- Embed widget — drop-in iframe for websites.
Expression
Expression generates real-time facial animation directly from a portrait image. The face can change between sessions or even mid-session — no avatar build step is required.
Runtime characteristics
- ~2–6 GB resident; needs Apple Silicon M3+ (Mac) / M4+ (iPad Pro) or an NVIDIA GPU (8 GB+ VRAM).
- Works with any face image — drag-and-drop swap, photo, video frame, anything.
- AI-driven expressions adapt to speech content and emotional context.
- Higher visual fidelity for close-up conversational interactions.
- On-device demo apps target macOS M3+ and iPad Pro M4+ today; iPhone Expression and macOS-Intel are on the way.
- On Apple Silicon the Swift SDK auto-spawns a bithuman-expression-daemon subprocess to drive the model.
How to ship it
- Cloud LiveKit plugin — bitHuman hosts the GPU worker (set model="expression").
- Self-hosted GPU — your own NVIDIA GPU via the Docker container.
- On-device macOS / iPadOS — Apple Silicon M3+, via the Swift SDK.
- bitHuman CLI — bithuman run with an Expression .imx.
- REST API — same endpoint as Essence; the model is selected per agent.
Which should I use?
24/7 kiosk or always-on display
Essence. No idle timeout, runs on CPU, predictable for unattended deployments.
iPhone app
Essence. iPad and Mac are the on-device homes for Expression; on iPhone, Expression is still in preview.
Android app
Essence via the Kotlin SDK (Beta).
Native Mac or iPad app with close-up dynamic faces
Expression on-device via the Swift SDK or the Mac/iPad reference apps.
Need custom gestures (wave, nod, laugh)
Essence. It is the only model with custom gestures, which can be triggered by keyword or through the API.
Quickest setup with any face photo
Expression via the cloud plugin. Pass the image at session start — no build step.
Voice agent on LiveKit with maximum concurrency
Essence. Lower per-stream cost makes it the right pick for high-concurrency deployments.
Edge hardware (Raspberry Pi, low-power laptop)
Essence. Runs on 1–2 CPU cores at 25 FPS.
Highest visual quality for offline video generation
Expression with quality="high". Best for offline batch jobs rather than real-time streaming.
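The guide above can be condensed into a small decision helper. This is a sketch, not part of any bitHuman SDK: `Requirements`, `choose_model`, the platform labels, and in particular the order in which the rules are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    platform: str = "server"          # hypothetical labels: iphone, android, mac, ipad, edge, server
    always_on: bool = False           # 24/7 kiosk or always-on display
    custom_gestures: bool = False     # wave, nod, laugh
    high_concurrency: bool = False    # many streams per host
    face_per_session: bool = False    # any portrait supplied at runtime, no build step
    close_up_faces: bool = False      # close-up dynamic faces in a native app


def choose_model(req: Requirements) -> str:
    """Map the guide's scenarios to a model name ("essence" or "expression")."""
    # Scenarios where the guide points to Essence are checked first.
    if req.always_on or req.custom_gestures or req.high_concurrency:
        return "essence"
    if req.platform in {"iphone", "android", "edge"}:
        return "essence"
    # Scenarios where the guide points to Expression.
    if req.face_per_session:
        return "expression"
    if req.platform in {"mac", "ipad"} and req.close_up_faces:
        return "expression"
    return "essence"                  # the page's stated default


print(choose_model(Requirements(always_on=True)))                       # essence
print(choose_model(Requirements(face_per_session=True)))                # expression
print(choose_model(Requirements(platform="mac", close_up_faces=True)))  # expression
```

Checking the Essence constraints first reflects that they are hard requirements (hardware, gestures, uptime), while the Expression cases are preferences about fidelity and setup speed.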
Where to go next
- Quickstart — get your first avatar running in ~2 minutes.
- Architecture — engine layering and the full per-platform device matrix.
- Pricing — credits, tiers, and what’s metered.
- Avatars and the .imx format — how avatars are packaged.