WASM Models
BYOK API supports running AI models entirely in the browser using WebLLM. No API keys, no inference-time network requests, complete privacy.
How it works
The @byokapi/wasm package provides a WasmLanguageModel that implements the AI SDK v6 LanguageModelV3 interface. Under the hood, it uses WebLLM to run quantized LLMs in the browser via WebAssembly and WebGPU.
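Under the hood, a wrapper like this has to translate AI SDK prompts into the OpenAI-style chat messages that WebLLM's engine expects. A simplified, hypothetical sketch of that translation step (the real WasmLanguageModel implements the full LanguageModelV3 contract; the names here are illustrative, not the package's actual internals):

```ts
type ChatMessage = { role: "system" | "user" | "assistant"; content: string }

// Flatten a system instruction plus a user prompt into the
// OpenAI-style message array WebLLM's chat API consumes.
function toChatMessages(system: string | undefined, prompt: string): ChatMessage[] {
  const messages: ChatMessage[] = []
  if (system) messages.push({ role: "system", content: system })
  messages.push({ role: "user", content: prompt })
  return messages
}
```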
Supported models
BYOK API bundles several pre-configured models:
| Model | Size | Description |
|---|---|---|
| SmolLM2-360M | ~250 MB | Tiny model, fast inference |
| SmolLM2-1.7B | ~1.1 GB | Small but capable |
| Llama-3.2-1B | ~700 MB | Meta's compact model |
| Llama-3.2-3B | ~2 GB | Good balance of size and quality |
| Phi-3.5-mini | ~2.4 GB | Microsoft's efficient model |
| Qwen2.5-1.5B | ~1 GB | Alibaba's multilingual model |
Usage
In the bridge
The bridge can serve WASM models alongside API-backed models. Users select a WASM model in the bridge dashboard, and the bridge handles loading and inference.
Standalone
You can also use WasmLanguageModel directly without the bridge:
```ts
import { WasmLanguageModel } from "@byokapi/wasm"
import { generateText } from "ai"

const model = new WasmLanguageModel("SmolLM2-360M-Instruct-q4f16_1-MLC")

const { text } = await generateText({
  model,
  prompt: "What is the meaning of life?",
})
```
Requirements
- WebGPU support — required for model inference. Chrome 113+ and Edge 113+ support it natively.
- Sufficient VRAM — models need GPU memory. Smaller models (360M, 1B) work on most devices; larger models need dedicated GPUs.
- Storage — models are downloaded and cached in the browser. First load takes time depending on model size.
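Before starting a large download, a page can check whether the environment actually has a usable WebGPU adapter. A minimal feature-detection sketch (`navigator.gpu` and `requestAdapter()` are the standard WebGPU API; the adapter resolves to `null` when no compatible GPU is available — the `nav` parameter just makes the check testable outside a browser):

```ts
// Returns true if the given navigator-like object exposes a usable
// WebGPU adapter. In the browser, call hasWebGPU(navigator).
async function hasWebGPU(nav: {
  gpu?: { requestAdapter(): Promise<unknown | null> }
}): Promise<boolean> {
  if (!nav.gpu) return false // WebGPU not exposed at all
  const adapter = await nav.gpu.requestAdapter()
  return adapter !== null // null means no compatible GPU
}
```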
Trade-offs
Advantages:
- Complete privacy — no data leaves the device
- No API costs
- Works offline after initial download
- No rate limits
Limitations:
- Slower than cloud APIs, especially on integrated GPUs
- Limited model selection (quantized models only)
- Requires WebGPU-capable browser
- Large initial download
- Quality is lower than full-size cloud models