
Qwen 3.5 Local

Running the Qwen 3.5 model family locally on Apple Silicon using MLX — Apple’s framework for Metal-accelerated ML inference with zero-copy unified memory.


All Qwen 3.5 models are natively multimodal (text + vision). The table below shows only the models I actively use — chosen so that no model is both slower and worse than another.

| Model | Type | Active Params | RAM Usage | Speed | Use Case |
| --- | --- | --- | --- | --- | --- |
| 35B-A3B Q4 | MoE | 3B | ~22 GB | ~70-90 t/s | Daily driver — coding, debugging, tool-calling |
| 27B Q8 | Dense | 27B | ~30 GB | ~25-30 t/s | Hardest problems — deep code review, algorithms |
| 9B BF16 | Dense | 9B | ~18 GB | ~45-55 t/s | Full-precision reasoning — nuanced code review |
| 4B BF16 | Dense | 4B | ~8 GB | ~100+ t/s | Instant autocomplete — docstrings, quick chat |
  • 35B-A3B Q4 — The MoE architecture activates only 3B parameters per token, so it’s fast despite its 35B total parameters. Q4 fits comfortably with headroom.
  • 27B Q8 — The largest dense model that fits. Q8 preserves ~99% of BF16 quality.
  • 9B & 4B BF16 — Small enough to run at full precision; no reason to quantize.
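To sanity-check the RAM column, a back-of-the-envelope estimate of weight memory (the bits-per-weight figures are my assumptions, not measurements — Q4 lands near ~4.5 bits effective once group scales are counted, Q8 near ~8.5, BF16 is exactly 16):

```python
def weight_gb(total_params_billions, bits_per_weight):
    # 1e9 params x (bits/8) bytes each = params_billions x bits/8 gigabytes.
    # This is a floor: real usage adds MLX metadata and a KV cache that
    # grows with context length.
    return total_params_billions * bits_per_weight / 8

for name, params_b, bits in [
    ("35B-A3B Q4", 35, 4.5),
    ("27B Q8", 27, 8.5),
    ("9B BF16", 9, 16),
    ("4B BF16", 4, 16),
]:
    print(f"{name}: ~{weight_gb(params_b, bits):.0f} GB of weights")
```

The outputs land a couple of GB under the table’s RAM figures, which is roughly the runtime overhead you should expect on top of the weights.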

  1. Create a dedicated venv

    mkdir -p ~/zen/local-llm/qwen && cd ~/zen/local-llm/qwen
    uv venv --python 3.14
    source .venv/bin/activate
    uv pip install mlx-lm huggingface-hub
  2. Download models

    hf download mlx-community/Qwen3.5-35B-A3B-4bit
    hf download mlx-community/Qwen3.5-27B-8bit
    hf download mlx-community/Qwen3.5-9B-bf16
    hf download mlx-community/Qwen3.5-4B-bf16
  3. Generate text

    mlx_lm.generate \
    --model mlx-community/Qwen3.5-4B-bf16 \
    --prompt "Write a Python function to merge two sorted lists" \
    --max-tokens 2048
  4. Serve as OpenAI-compatible API

Start the server:
    mlx_lm.server \
    --model mlx-community/Qwen3.5-35B-A3B-4bit \
    --port 8080 \
    --max-tokens 16384
Test the API:
    curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "mlx-community/Qwen3.5-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512
    }'
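The same request can be issued from Python with only the standard library. A minimal sketch, assuming the server command above is running on port 8080 (the send step is commented out so the snippet also works offline):

```python
import json
import urllib.request

def build_chat_request(prompt,
                       model="mlx-community/Qwen3.5-35B-A3B-4bit",
                       base_url="http://localhost:8080",
                       max_tokens=512):
    """Build an OpenAI-style chat-completions request for mlx_lm.server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Hello!")
# With the server running, uncomment to send:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at http://localhost:8080/v1 should also work unchanged.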

  • Close Chrome/Electron apps before heavy inference — they compete with the model for Metal GPU resources.
  • Set --max-tokens 16384 — Qwen 3.5 uses internal chain-of-thought (“thinking mode”); too-low limits cut off reasoning.
  • Monitor GPU — Activity Monitor → GPU tab to verify Metal utilization is high.
  • Nix users — mlx links against the macOS Metal frameworks at runtime, so create a standard uv venv outside of Nix-managed Python. If your Python comes from Nix, enter nix-shell -p python314 first, then create the venv.
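If you consume raw completions, the chain-of-thought arrives inline with the answer. A sketch for hiding it, assuming Qwen 3.5 wraps its reasoning in the same <think>…</think> tags that Qwen 3 uses (verify against your model’s actual output before relying on this):

```python
import re

def strip_thinking(text):
    # Remove any <think>...</think> spans, then trim surrounding whitespace.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

reply = "<think>The user greeted me; respond briefly.</think>\nHello! How can I help?"
print(strip_thinking(reply))  # -> Hello! How can I help?
```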