WASM Models
BYOK API supports running AI models entirely in the browser using WebLLM. No API keys, no inference-time network requests, complete privacy.
How it works
The @byokapi/wasm package provides a WasmLanguageModel that implements the AI SDK v6 LanguageModelV3 interface. Under the hood, it uses WebLLM to run quantized LLMs in the browser via WebAssembly and WebGPU.
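Under the hood, a wrapper like this has to translate AI SDK prompts into the OpenAI-style chat messages that WebLLM's engine expects. A simplified, hypothetical sketch of that translation step (the real WasmLanguageModel implements the full LanguageModelV3 contract; the names here are illustrative, not the package's actual internals):

```ts
type ChatMessage = { role: "system" | "user" | "assistant"; content: string }

// Flatten a system instruction plus a user prompt into the
// OpenAI-style message array WebLLM's chat API consumes.
function toChatMessages(system: string | undefined, prompt: string): ChatMessage[] {
  const messages: ChatMessage[] = []
  if (system) messages.push({ role: "system", content: system })
  messages.push({ role: "user", content: prompt })
  return messages
}
```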
Supported models
BYOK API bundles several pre-configured models:
| Model | Size | Description |
|---|---|---|
| SmolLM2-360M | ~250 MB | Tiny model, fast inference |
| SmolLM2-1.7B | ~1.1 GB | Small but capable |
| Llama-3.2-1B | ~700 MB | Meta's compact model |
| Llama-3.2-3B | ~2 GB | Good balance of size and quality |
| Phi-3.5-mini | ~2.4 GB | Microsoft's efficient model |
| Qwen2.5-1.5B | ~1 GB | Alibaba's multilingual model |
Usage
In the bridge
The bridge can serve WASM models alongside API-backed models. Users select a WASM model in the bridge dashboard, and the bridge handles loading and inference.
Standalone
You can also use WasmLanguageModel directly without the bridge:
```ts
import { WasmLanguageModel } from "@byokapi/wasm"
import { generateText } from "ai"

const model = new WasmLanguageModel("SmolLM2-360M-Instruct-q4f16_1-MLC")

const { text } = await generateText({
  model,
  prompt: "What is the meaning of life?",
})
```
Requirements
- WebGPU support — required for model inference. Chrome 113+ and Edge 113+ support it natively.
- Sufficient VRAM — models need GPU memory. Smaller models (360M, 1B) work on most devices; larger models need dedicated GPUs.
- Storage — models are downloaded and cached in the browser. First load takes time depending on model size.
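Before starting a large download, a page can check whether the environment actually has a usable WebGPU adapter. A minimal feature-detection sketch (`navigator.gpu` and `requestAdapter()` are the standard WebGPU API; the adapter resolves to `null` when no compatible GPU is available — the `nav` parameter just makes the check testable outside a browser):

```ts
// Returns true if the given navigator-like object exposes a usable
// WebGPU adapter. In the browser, call hasWebGPU(navigator).
async function hasWebGPU(nav: {
  gpu?: { requestAdapter(): Promise<unknown | null> }
}): Promise<boolean> {
  if (!nav.gpu) return false // WebGPU not exposed at all
  const adapter = await nav.gpu.requestAdapter()
  return adapter !== null // null means no compatible GPU
}
```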
Trade-offs
Advantages:
- Complete privacy — no data leaves the device
- No API costs
- Works offline after initial download
- No rate limits
Limitations:
- Slower than cloud APIs, especially on integrated GPUs
- Limited model selection (quantized models only)
- Requires WebGPU-capable browser
- Large initial download
- Quality is lower than full-size cloud models