Home

┌─────────────────┐ Energy-efficient inference engine for running AI on mobile devices
│ Cactus Engine │ ←── OpenAI compatible APIs for C/C++, Swift, Kotlin, Flutter & React-Native
└─────────────────┘ Supports tool call, auto RAG, NPU, INT4, and cloud handoff for complex tasks
│
┌─────────────────┐ Zero-copy computation graph, think PyTorch for mobile devices
│ Cactus Graph │ ←── You can implement custom models directly using this
└─────────────────┘ Highly optimised for RAM & lossless weight quantisation
│
┌─────────────────┐ Low-level ARM-specific SIMD kernels (Apple, Snapdragon, Google, Exynos, MediaTek & Raspberry Pi)
│ Cactus Kernels │ ←── Optimised Matrix Multiplication & n
└─────────────────┘ Custom attention kernels with KV-Cache Quantisation, chunked prefill, streaming LLM, etc.
Cactus Engine¶
#include cactus.h
cactus_model_t model = cactus_init(
"path/to/weight/folder",
"path to txt or dir of txts for auto-rag",
);
const char* messages = R"([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Henry Ndubuaku"}
])";
const char* options = R"({
"max_tokens": 50,
"stop_sequences": ["<|im_end|>"]
})";
char response[4096];
int result = cactus_complete(
model, // model handle from cactus_init
messages, // JSON array of chat messages
response, // buffer to store response JSON
sizeof(response), // size of response buffer
options, // optional: generation options (nullptr for defaults)
nullptr, // optional: tools JSON for function calling
nullptr, // optional: streaming callback fn(token, id, user_data)
nullptr // optional: user data passed to callback
);
{
"success": true, // when successfully generated
"error": null, // returns specific errors if success = false
"cloud_handoff": false, // true when response is generated with cloud model
"response": "Hi there!", // null when error is not null
"function_calls": [], // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
"confidence": 0.8193, // how confident the model is with its locally generated response
"time_to_first_token_ms": 45.23, // latency (time to first token)
"total_time_ms": 163.67, // total execution time
"prefill_tps": 1621.89, // prefill tokens per second
"decode_tps": 168.42, // decode tokens per second
"ram_usage_mb": 245.67, // current process RAM usage in MB
"prefill_tokens": 28,
"decode_tokens": 50,
"total_tokens": 78
}
Cactus Graph¶
#include cactus.h
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
Benchmark (missing latency = no NPU support yet)¶
High-End Devices
| Device | LFM2.5-1.2B-INT4
(1k-Prefill/100-Decode) | LFM2.5-VL-1.6B-INT4
(256px-Latency & Decode) | Parakeet-1.1B-INT4
(30s-audio-Latency & Decode)
|--------|--------|--------|----------|
| Mac M4 Pro (Highend) | 582tps/100tps (76MB RAM) | 0.2s/98tps (87MB RAM) | 0.1s/900k+tps (1GB RAM) |
| iPad/Mac M3 (Budget) | 350tps/60tps (70MB RAM) | 0.3s/69tps (80MB RAM) | 0.3s/800k+tps (102MB RAM) |
| iPhone 17 Pro (Highend) | 327tps/48tps (108MB RAM)| 0.3s/48tps (156MB RAM) | 0.3s/300k+tps (177MB RAM)|
| iPhone 13 Mini (Budget) | 148tps/34tps (1GB RAM) | 0.3s/35tps (1.2GBMB RAM) | 0.7s/90k+tps (1GB RAM) |
| Galaxy S25 Ultra (Qualcomm 8 Elite) | 255tps/37tps (1.5GB RAM) | -/34tps (2GB RAM) | -/250k+tps (1.8GBG RAM) |
| Pixel 6a (Budget Google Tensor) | 70tps/15tps (1GB RAM)| -/15tps (1.5GB RAM) | - /17k+tps (1GB RAM)|
| Galaxy A17 5G (Budget Exxynox) | 32tps/10tps (727MB RAM) | -/11tps (727MB RAM) | -/40k+tps (809MB RAM) |
| CMF Phone 2 Pro (Budget Mediatek) | - | - | - |
| Raspberry Pi 5 (IoT) | 69tps/11tps (869MB RAM) | 13.3s/11tps (2.1GB RAM) | 4.5s/180k+tps (1.9GB RAM) |
Supported Models¶
| Model | Features |
|---|---|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | completion, tools |
| LiquidAI/LFM2-350M | completion, tools, embed |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| google/gemma-3-1b-it | completion |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| UsefulSensors/moonshine-base | transcription, speech embed |
| openai/whisper-small | transcription, speech embed, Apple NPU |
| openai/whisper-medium | transcribe, speech embed, Apple NPU |
| nvidia/parakeet-ctc-0.6b | transcribe, speech embed, Apple NPU |
| nvidia/parakeet-ctc-1.1b | transcribe, speech embed, Apple NPU |
| snakers4/silero-vad | vad |
| nomic-ai/nomic-embed-text-v2-moe | embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
Using this repo on Mac¶
Using this repo on Linux (Ubuntu/Debian)¶
sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
| Command | Description |
|---|---|
cactus auth |
Setup Cactus cloud fallback (optional) (--status, --clear) |
cactus run [model] |
Opens playground (auto downloads model) |
cactus download [model] |
Downloads model to ./weights |
cactus convert [model] [dir] |
Converts model, supports LoRA merging (--lora <path>) |
cactus build |
Builds for ARM (--apple or --android) |
cactus test |
Runs tests (--ios / --android, --model [name/path], --transcribe_model [name/path], --only [test_name], --precision) |
cactus transcribe [model] |
Transcribe audio file (--file) or live microphone |
cactus clean |
Removes build artifacts |
cactus --help |
Shows all commands and flags (always run this) |
- Reproduce reported benchmarks with
cactus test --benchmark - Plug in any mobule device and add the
--iosor--androidflag. - Mobile devices must be in developer mode.
Using in your apps¶
- Python for Mac
- Rust SDK
- React Native SDK
- Swift Multiplatform SDK
- Kotlin Multiplatform SDK
- Flutter SDK
Try demo apps¶
Maintaining Organisations¶
Developed by Cactus Compute, Inc. (YC S25), with maintenance from:
- UCLA's BruinAI
- Char (YC S25)
- Yale's AI Society
- National Unoversity of Singapore's AI Society
- UC Irvine's AI@UCI
- Imperial College's AI Society
- University of Pennsylvania's AI@Penn
- University of Michigan Ann-Arbor MSAIL
- University of Colorado Boulder's AI Club
Roadmap¶
- Jul 2026: Got funding from YC & Oxford, launched, started building
- Sep 2025: Launched Cactus Kernel, Graph & Engine, raised more funding
- Oct 2025: Chunked prefill, streamingLLMs, KVCache Quantisation (2x faster prefill)
- Nov 2025: Novel Cactus Attention algorithm (10 & 1k prefill yields same decode speed)
- Dec 2025: Cactus team expands from original authour to +6 Research Engineers ()
- Jan 2026: Apple NPU/RAM optimisations, grew maintainers (reduce iOS/Mac latency 5-11x)
- Feb 2026: Hybrid inference with GCP, INT4, lossless Quantisation (1.5x speed)
- Mar 2026: Qualcomm NPU, Google NPU, optimise Android (5-11x less Qualcomm/Pixel latency)
- Apr 2026: Mediatek NPU, Exynox NPU, Cactus@ICLR (improve all Android latency 5-11x)
- May 2026: Kernel=C++, Graph=Rust, Engine=Rust, GPU support for Macs & VR Headsets
- Jun 2026: Transpilers for porting custom models from Torch/Jax
- Jul 2026: Aggressive optimisations to run directly on wearables, Cactus@ICML
- Aug 2026: Orchestration, orchestration, orchestration.
- Sep 2026: 1yr post-release, release the full Cactus paper.
Contributing to Cactus¶
- C++ Standard: Use C++20 features where appropriate.
- Formatting: Follow the existing code style in the project, one header per folder.
- Comments: Avoid comments, make your code read like plain english.
- AI-Generated Code: Do not bindly PR AI slop, this codebase is very complex, they miss details.
- Update docs: Please update docs when necessary, be intuitive and straightforward.
- Keep It Simple: Do not go beyond the scope of the GH issue, avoid bloated PRs, keep codes lean.
- Benchmark Your Changes: Test performance impact, Cactus is performance-critical.
- Test everything: A PR that fails to build is the biggest red flag, means it was not tested.
Citation¶
If you use Cactus in your research, please cite it as follows:
@software{cactus,
title = {Cactus: AI Inference Engine for Phones & Wearables},
author = {Ndubuaku, Henry and Cactus Team},
url = {https://github.com/cactus-compute/cactus},
year = {2025}
}