Home


A hybrid low-latency energy-efficient AI engine for mobile devices & wearables.
┌─────────────────┐
│ Cactus Engine │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘ Chat, vision, STT, RAG, tool call, cloud handoff
│
┌─────────────────┐
│ Cactus Graph │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘ Custom models, optimised for RAM & quantisation
│
┌─────────────────┐
│ Cactus Kernels │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘ Custom attention, KV-cache quant, chunked prefill
Quick Demo
- Step 1:
brew install cactus-compute/cactus/cactus
- Step 2:
cactus transcribe or cactus run
Cactus Engine
#include cactus.h
cactus_model_t model = cactus_init(
"path/to/weight/folder",
"path to txt or dir of txts for auto-rag",
);
const char* messages = R"([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Henry Ndubuaku"}
])";
const char* options = R"({
"max_tokens": 50,
"stop_sequences": ["<|im_end|>"]
})";
char response[4096];
int result = cactus_complete(
model, // model handle
messages, // JSON chat messages
response, // response buffer
sizeof(response), // buffer size
options, // generation options
nullptr, // tools JSON
nullptr, // streaming callback
nullptr // user data
);
Example response from Gemma3-270m
{
"success": true, // generation succeeded
"error": null, // error details if failed
"cloud_handoff": false, // true if cloud model used
"response": "Hi there!",
"function_calls": [], // parsed tool calls
"confidence": 0.8193, // model confidence
"time_to_first_token_ms": 45.23,
"total_time_ms": 163.67,
"prefill_tps": 1621.89,
"decode_tps": 168.42,
"ram_usage_mb": 245.67,
"prefill_tokens": 28,
"decode_tokens": 50,
"total_tokens": 78
}
Cactus Graph
#include cactus.h
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
API & SDK References
| Reference |
Language |
Description |
| Engine API |
C |
Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff |
| Graph API |
C++ |
Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Python SDK |
Python |
Mac, Linux |
| Swift SDK |
Swift |
iOS, macOS, tvOS, watchOS, Android |
| Kotlin SDK |
Kotlin |
Android, iOS (via KMP) |
| Flutter SDK |
Dart |
iOS, macOS, Android |
| Rust SDK |
Rust |
Mac, Linux |
| React Native |
JavaScript |
iOS, Android |
Benchmarks (CPU-only, no GPU)
- All weights INT4 quantised
- LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
- LFM-VL: 256px input, values are latency / decode tps
- Parakeet: 20s audio input, values are latency / decode tps
- Missing latency = no NPU support yet
| Device |
LFM 1.2B |
LFMVL 1.6B |
Parakeet 1.1B |
RAM |
| Mac M4 Pro |
582/100 |
0.2s/98 |
0.1s/900k+ |
76MB |
| iPad/Mac M3 |
350/60 |
0.3s/69 |
0.3s/800k+ |
70MB |
| iPhone 17 Pro |
327/48 |
0.3s/48 |
0.3s/300k+ |
108MB |
| iPhone 13 Mini |
148/34 |
0.3s/35 |
0.7s/90k+ |
1GB |
| Galaxy S25 Ultra |
255/37 |
-/34 |
-/250k+ |
1.5GB |
| Pixel 6a |
70/15 |
-/15 |
-/17k+ |
1GB |
| Galaxy A17 5G |
32/10 |
-/11 |
-/40k+ |
727MB |
| CMF Phone 2 Pro |
- |
- |
- |
- |
| Raspberry Pi 5 |
69/11 |
13.3s/11 |
4.5s/180k+ |
869MB |
Supported Transcription Model
- STT: 20s audio input on Macbook Air M3 chip
- Benchmark dataset: internal evals with production users
| Model |
Params |
End2End ms |
Latency ms |
Decode toks/sec |
NPU |
RTF |
WER |
| UsefulSensors/moonshine-base |
61M |
361.35 |
182 |
262 |
yes |
0.0180 |
0.1395 |
| openai/whisper-tiny |
39M |
232.03 |
137.38 |
581 |
yes |
0.0116 |
0.1860 |
| openai/whisper-base |
74M |
329.37 |
178.65 |
358 |
yes |
0.0164 |
0.1628 |
| openai/whisper-small |
244M |
856.79 |
332.63 |
108 |
yes |
0.0428 |
0.0930 |
| openai/whisper-medium |
769M |
2085.87 |
923.33 |
49 |
yes |
0.1041 |
0.0930 |
| nvidia/parakeet-ctc-0.6b |
600M |
201.77 |
201.44 |
5214285 |
yes |
0.0101 |
0.0930 |
| nvidia/parakeet-tdt-0.6b-v3 |
600M |
718.91 |
718.82 |
3583333 |
no |
0.0359 |
0.0465 |
| nvidia/parakeet-ctc-1.1b |
1.1B |
279.03 |
278.92 |
4562500 |
yes |
0.0139 |
0.1628 |
| snakers4/silero-vad |
- |
- |
- |
- |
- |
- |
- |
Supported LLMs
- Gemma weights are often gated on HuggingFace, needs tokens
- Run
hf auth login and input your huggingface token
| Model |
Features |
| google/gemma-3-270m-it |
completion |
| google/functiongemma-270m-it |
tools |
| google/gemma-3-1b-it |
completion, gated |
| google/gemma-3n-E2B-it |
completion, tools |
| google/gemma-3n-E4B-it |
completion, tools |
| Qwen/Qwen3-0.6B |
completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B |
embed |
| Qwen/Qwen3.5-0.8B |
vision, completion, tools, embed |
| Qwen/Qwen3-1.7B |
completion, tools, embed |
| Qwen/Qwen3.5-2B |
vision, completion, tools, embed |
| LiquidAI/LFM2-350M |
completion, tools, embed |
| LiquidAI/LFM2-700M |
completion, tools, embed |
| LiquidAI/LFM2-8B-A1B |
completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking |
completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct |
completion, tools, embed |
| LiquidAI/LFM2-2.6B |
completion, tools, embed |
| LiquidAI/LFM2-VL-450M |
vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B |
vision, txt & img embed, Apple NPU |
| nomic-ai/nomic-embed-text-v2-moe |
embed |
Roadmap
| Date |
Status |
Milestone |
| Sep 2025 |
Done |
Released v1 |
| Oct 2025 |
Done |
Chunked prefill, KVCache Quant (2x prefill) |
| Nov 2025 |
Done |
Cactus Attention (10 & 1k prefill = same decode) |
| Dec 2025 |
Done |
Team grows to +6 Research Engineers |
| Jan 2026 |
Done |
Apple NPU/RAM, 5-11x faster iOS/Mac |
| Feb 2026 |
Done |
Hybrid inference, INT4, lossless Quant (1.5x) |
| Mar 2026 |
Coming |
Qualcomm/Google NPUs, 5-11x faster Android |
| Apr 2026 |
Coming |
Mediatek/Exynos NPUs, Cactus@ICLR |
| May 2026 |
Coming |
Kernel→C++, Graph/Engine→Rust, Mac GPU & VR |
| Jun 2026 |
Coming |
Torch/JAX model transpilers |
| Jul 2026 |
Coming |
Wearables optimisations, Cactus@ICML |
| Aug 2026 |
Coming |
Orchestration |
| Sep 2026 |
Coming |
Full Cactus paper, chip manufacturer partners |
Using this repo
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ Step 0: if on Linux (Ubuntu/Debian) │
│ sudo apt-get install python3 python3-venv python3-pip cmake │
│ build-essential libcurl4-openssl-dev │
│ │
│ Step 1: clone and setup │
│ git clone https://github.com/cactus-compute/cactus && cd cactus │
│ source ./setup │
│ │
│ Step 2: use the commands │
│──────────────────────────────────────────────────────────────────────────────│
│ │
│ cactus auth manage Cloud API key │
│ --status show key status │
│ --clear remove saved key │
│ │
│ cactus run <model> opens playground (auto downloads) │
│ --precision INT4|INT8|FP16 quantization (default: INT4) │
│ --token <token> HF token (gated models) │
│ --reconvert force reconversion from source │
│ │
│ cactus transcribe [model] live mic transcription (parakeet-1.1b) │
│ --file <audio.wav> transcribe file instead of mic │
│ --precision INT4|INT8|FP16 quantization (default: INT4) │
│ --token <token> HF token (gated models) │
│ --reconvert force reconversion from source │
│ │
│ cactus download <model> downloads model to ./weights │
│ --precision INT4|INT8|FP16 quantization (default: INT4) │
│ --token <token> HuggingFace API token │
│ --reconvert force reconversion from source │
│ │
│ cactus convert <model> [dir] convert model, supports LoRA merge │
│ --precision INT4|INT8|FP16 quantization (default: INT4) │
│ --lora <path> LoRA adapter to merge │
│ --token <token> HuggingFace API token │
│ │
│ cactus build build for ARM → build/libcactus.a │
│ --apple Apple (iOS/macOS) │
│ --android Android │
│ --flutter Flutter (all platforms) │
│ --python shared lib for Python FFI │
│ │
│ cactus test run unit tests and benchmarks │
│ --model <model> default: LFM2-VL-450M │
│ --transcribe_model <model> default: moonshine-base │
│ --benchmark use larger models │
│ --precision INT4|INT8|FP16 regenerate weights with precision │
│ --reconvert force reconversion from source │
│ --no-rebuild skip building library │
│ --llm / --stt / --performance run specific test suite │
│ --ios run on connected iPhone │
│ --android run on connected Android │
│ │
│ cactus clean remove all build artifacts │
│ cactus --help show all commands and flags │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Maintaining Organisations
- Cactus Compute, Inc. (YC S25)
- UCLA's BruinAI
- Char (YC S25)
- Yale's AI Society
- National Unoversity of Singapore's AI Society
- UC Irvine's AI@UCI
- Imperial College's AI Society
- University of Pennsylvania's AI@Penn
- University of Michigan Ann-Arbor MSAIL
- University of Colorado Boulder's AI Club
Citation
If you use Cactus in your research, please cite it as follows:
@software{cactus,
title = {Cactus: AI Inference Engine for Phones & Wearables},
author = {Ndubuaku, Henry and Cactus Team},
url = {https://github.com/cactus-compute/cactus},
year = {2025}
}
N/B: Scroll all the way up and click the shields link for resources!