BYOK API

WASM Models

BYOK API supports running AI models entirely in the browser using WebLLM: no API keys, no network requests after the initial model download, and complete privacy.

How it works

The @byokapi/wasm package provides a WasmLanguageModel that implements the AI SDK v6 LanguageModelV3 interface. Under the hood, it uses WebLLM to run quantized LLMs in the browser via WebAssembly and WebGPU.
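To illustrate the adapter idea, here is a conceptual sketch. The interface and class names below (`MinimalLanguageModel`, `SketchWasmModel`, `EngineLike`) are invented stand-ins, not the real AI SDK `LanguageModelV3` contract or WebLLM API; they only show how a prompt-in/text-out call is delegated to an in-browser engine:

```typescript
// Illustrative sketch only: a made-up minimal interface standing in for the
// real LanguageModelV3 contract, to show how a browser engine is adapted.
interface MinimalLanguageModel {
  readonly modelId: string
  doGenerate(prompt: string): Promise<{ text: string }>
}

// Stand-in for the WebLLM engine; the real engine runs a quantized model
// via WebAssembly + WebGPU and exposes a chat-completion-style call.
type EngineLike = { complete(prompt: string): Promise<string> }

class SketchWasmModel implements MinimalLanguageModel {
  constructor(
    readonly modelId: string,
    private engine: EngineLike,
  ) {}

  async doGenerate(prompt: string): Promise<{ text: string }> {
    // Delegate to the in-browser engine; no network request is made here.
    return { text: await this.engine.complete(prompt) }
  }
}
```

The real `WasmLanguageModel` does the same thing at its core: it owns a model ID, drives the engine, and translates between the SDK's call shape and WebLLM's.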

Supported models

BYOK API bundles several pre-configured models:

Model          Size      Description
SmolLM2-360M   ~250 MB   Tiny model, fast inference
SmolLM2-1.7B   ~1.1 GB   Small but capable
Llama-3.2-1B   ~700 MB   Meta's compact model
Llama-3.2-3B   ~2 GB     Good balance of size and quality
Phi-3.5-mini   ~2.4 GB   Microsoft's efficient model
Qwen2.5-1.5B   ~1 GB     Alibaba's multilingual model
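As a rough guide to first-load time, download time scales linearly with model size. The function and bandwidth figure below are illustrative, not measured:

```typescript
// Rough first-load estimate: seconds = megabytes * 8 bits / megabits-per-second.
// Ignores cache lookups, shader compilation, and connection overhead.
function estimateDownloadSeconds(sizeMb: number, bandwidthMbps: number): number {
  return (sizeMb * 8) / bandwidthMbps
}

// e.g. the ~2 GB Llama-3.2-3B weights on an assumed 50 Mbps connection:
// 2048 MB * 8 / 50 ≈ 328 s, i.e. roughly five and a half minutes.
```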

Usage

In the bridge

The bridge can serve WASM models alongside API-backed models. Users select a WASM model in the bridge dashboard, and the bridge handles loading and inference.

Standalone

You can also use WasmLanguageModel directly without the bridge:

import { WasmLanguageModel } from "@byokapi/wasm"
import { generateText } from "ai"

const model = new WasmLanguageModel("SmolLM2-360M-Instruct-q4f16_1-MLC")

const { text } = await generateText({
  model,
  prompt: "What is the meaning of life?",
})

Requirements

  • WebGPU support — required for model inference. Chrome 113+ and Edge 113+ support it natively.
  • Sufficient VRAM — models need GPU memory. Smaller models (360M, 1B) work on most devices; larger models need dedicated GPUs.
  • Storage — models are downloaded and cached in the browser. First load takes time depending on model size.
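Before constructing a model, it is worth feature-detecting WebGPU: `navigator.gpu` is absent in unsupported browsers, and even where it exists, `requestAdapter()` can resolve to `null` when no suitable GPU is available. A minimal check might look like the sketch below (the parameter default exists only to make the function testable outside a browser):

```typescript
// Minimal WebGPU feature detection. `navigator.gpu` exists only in
// WebGPU-capable browsers, and requestAdapter() resolves to null when
// no suitable GPU is available.
type GpuLike = { gpu?: { requestAdapter(): Promise<unknown | null> } }

async function hasWebGpu(
  nav: GpuLike = globalThis.navigator as unknown as GpuLike,
): Promise<boolean> {
  if (!nav?.gpu) return false
  return (await nav.gpu.requestAdapter()) !== null
}
```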

Trade-offs

Advantages:

  • Complete privacy — no data leaves the device
  • No API costs
  • Works offline after initial download
  • No rate limits

Limitations:

  • Slower than cloud APIs, especially on low-end or integrated GPUs
  • Limited model selection (quantized models only)
  • Requires WebGPU-capable browser
  • Large initial download
  • Quality is lower than full-size cloud models
