# Qwen 3.5 Local
Running the Qwen 3.5 model family locally on Apple Silicon using MLX — Apple’s framework for Metal-accelerated ML inference with zero-copy unified memory.
## Models I Run

All Qwen 3.5 models are natively multimodal (text + vision). The table below shows only the models I actively use — chosen so that no model is both slower and worse than another.
| Model | Type | Active Params | RAM Usage | Speed | Use Case |
|---|---|---|---|---|---|
| 35B-A3B Q4 | MoE | 3B | ~22 GB | ~70-90 t/s | Daily driver — coding, debugging, tool-calling |
| 27B Q8 | Dense | 27B | ~30 GB | ~25-30 t/s | Hardest problems — deep code review, algorithms |
| 9B BF16 | Dense | 9B | ~18 GB | ~45-55 t/s | Full-precision reasoning — nuanced code review |
| 4B BF16 | Dense | 4B | ~8 GB | ~100+ t/s | Instant autocomplete — docstrings, quick chat |
## Why These Specific Quantizations?

- 35B-A3B Q4 — MoE architecture means only 3B params are active per token, so it’s fast despite 35B total knowledge. Q4 fits comfortably with headroom.
- 27B Q8 — Largest dense model that fits. Q8 preserves ~99% of BF16 quality.
- 9B & 4B BF16 — Small enough to run at full precision. No reason to quantize.
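The bytes-per-weight arithmetic behind these choices can be sketched as follows. This is a rough rule of thumb (BF16 ≈ 2 bytes/param, Q8 ≈ 1, Q4 ≈ 0.5), not a measurement — real usage adds quantization metadata, the KV cache, and runtime overhead, which is why the table's numbers sit above these estimates:

```python
# Rough weight-memory estimate per quantization level.
# BF16 = 2 bytes/param, Q8 ~ 1 byte, Q4 ~ 0.5 bytes; ignores
# quantization metadata and the KV cache, so treat as a lower bound.

BYTES_PER_PARAM = {"bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_ram_gb(total_params_billions: float, quant: str) -> float:
    """Approximate weight memory in GB for a given quantization."""
    return total_params_billions * BYTES_PER_PARAM[quant]

# Note: the MoE model is quantized over all 35B params, even though only
# 3B are active per token — quantization shrinks storage, MoE shrinks compute.
print(f"35B @ Q4  : ~{weight_ram_gb(35, 'q4'):.1f} GB weights")   # vs ~22 GB observed
print(f"27B @ Q8  : ~{weight_ram_gb(27, 'q8'):.1f} GB weights")   # vs ~30 GB observed
print(f"9B  @ BF16: ~{weight_ram_gb(9, 'bf16'):.1f} GB weights")  # vs ~18 GB observed
```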
## Setup

1. Create a dedicated venv

   ```shell
   mkdir -p ~/zen/local-llm/qwen && cd ~/zen/local-llm/qwen
   uv venv --python 3.14
   source .venv/bin/activate
   uv pip install mlx-lm huggingface-hub
   ```
2. Download models

   ```shell
   hf download mlx-community/Qwen3.5-35B-A3B-4bit
   hf download mlx-community/Qwen3.5-27B-8bit
   hf download mlx-community/Qwen3.5-9B-bf16
   hf download mlx-community/Qwen3.5-4B-bf16
   ```
3. Generate text

   ```shell
   mlx_lm.generate \
     --model mlx-community/Qwen3.5-4B-bf16 \
     --prompt "Write a Python function to merge two sorted lists" \
     --max-tokens 2048
   ```
4. Serve as OpenAI-compatible API

   Start the server:

   ```shell
   mlx_lm.server \
     --model mlx-community/Qwen3.5-35B-A3B-4bit \
     --port 8080 \
     --max-tokens 16384
   ```

   Test the API:

   ```shell
   curl http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "mlx-community/Qwen3.5-35B-A3B-4bit",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 512
     }'
   ```
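Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (the URL and model name match the server command above; `build_payload` and `chat` are illustrative helper names, not part of mlx-lm):

```python
# Minimal OpenAI-compatible chat client for the local mlx_lm.server,
# using only the Python standard library.
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1/chat/completions"
MODEL = "mlx-community/Qwen3.5-35B-A3B-4bit"

def build_payload(user_message: str, max_tokens: int = 512) -> dict:
    """Assemble an OpenAI-style single-turn chat completions request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

def chat(user_message: str) -> str:
    """POST a chat request to the local server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running: chat("Hello!") returns the model's greeting.
```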
## Tips

- Close Chrome/Electron apps before heavy inference — they compete for Metal GPU resources.
- Set `--max-tokens 16384` — Qwen 3.5 uses internal chain-of-thought (“thinking mode”); too-low limits cut off its reasoning.
- Monitor the GPU — Activity Monitor → GPU tab to verify Metal utilization is high.
- Nix users — `mlx` links against macOS Metal frameworks at runtime. Use a standard `uv venv` outside of Nix-managed Python. If Python is managed via Nix, shell into `nix-shell -p python314` first, then create the venv.
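A quick way to sanity-check the whole stack (MLX importable, Metal device visible, a small model loadable — useful after the Nix note above) is a probe like this. It is a sketch: the guarded import makes it a no-op on machines without `mlx`, and the model name is the smallest one from the table:

```python
# Smoke test: verify MLX sees the Metal GPU and a small model generates.
# Guarded import so this file is a harmless no-op where mlx isn't installed.
try:
    import mlx.core as mx
    from mlx_lm import load, generate
    HAVE_MLX = True
except ImportError:
    HAVE_MLX = False

MODEL = "mlx-community/Qwen3.5-4B-bf16"  # smallest model from the table above

if HAVE_MLX:
    # Should report a GPU (Metal) device on Apple Silicon.
    print("default device:", mx.default_device())

    # Loads from the HF cache populated by `hf download` above.
    model, tokenizer = load(MODEL)
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Say hello in five words."}],
        add_generation_prompt=True,
        tokenize=False,
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=64))
```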