
┌─────────────────┐     Energy-efficient inference engine for running AI on mobile devices 
│  Cactus Engine  │ ←── OpenAI compatible APIs for C/C++, Swift, Kotlin, Flutter & React-Native
└─────────────────┘     Supports tool call, auto RAG, NPU, INT4, and cloud handoff for complex tasks
         │
┌─────────────────┐     Zero-copy computation graph, think PyTorch for mobile devices
│  Cactus Graph   │ ←── You can implement custom models directly using this
└─────────────────┘     Highly optimised for RAM & lossless weight quantisation 
         │
┌─────────────────┐     Low-level ARM-specific SIMD kernels (Apple, Snapdragon, Google, Exynos, MediaTek & Raspberry Pi)
│ Cactus Kernels  │ ←── Optimised Matrix Multiplication & n
└─────────────────┘     Custom attention kernels with KV-Cache Quantisation, chunked prefill, streaming LLM, etc.

Cactus Engine¶

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",                    // folder containing the model weights
    "path to txt or dir of txts for auto-rag"   // text corpus used for auto RAG
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,                            // model handle from cactus_init
    messages,                         // JSON array of chat messages
    response,                         // buffer to store response JSON
    sizeof(response),                 // size of response buffer
    options,                          // optional: generation options (nullptr for defaults)
    nullptr,                          // optional: tools JSON for function calling 
    nullptr,                          // optional: streaming callback fn(token, id, user_data)
    nullptr                           // optional: user data passed to callback
);
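
The streaming callback slot above is left as `nullptr`. A minimal sketch of a callback that accumulates tokens is shown here. The parameter types are an assumption inferred from the `fn(token, id, user_data)` comment, and `StreamState` and `on_token` are hypothetical names, so check `cactus.h` for the exact signature before using it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical state handed to the callback through the user-data pointer.
struct StreamState {
    std::string text;     // tokens concatenated in arrival order
    int token_count = 0;  // number of times the callback fired
};

// Assumed callback shape: (token text, token id, user data).
// The real typedef in cactus.h may use different parameter types.
void on_token(const char* token, uint32_t token_id, void* user_data) {
    assert(user_data != nullptr);
    auto* state = static_cast<StreamState*>(user_data);
    (void)token_id;  // the id is not needed in this sketch
    if (token != nullptr) {
        state->text += token;
    }
    state->token_count += 1;
}
```

If the types match, `on_token` and a pointer to a `StreamState` would replace the last two `nullptr` arguments of `cactus_complete`.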

Example response from Gemma3-270m

{
    "success": true,                 // when successfully generated
    "error": null,                   // returns specific errors if success = false
    "cloud_handoff": false,          // true when response is generated with cloud model
    "response": "Hi there!",         // null when error is not null
    "function_calls": [],            // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
    "confidence": 0.8193,            // how confident the model is with its locally generated response
    "time_to_first_token_ms": 45.23, // latency (time to first token)
    "total_time_ms": 163.67,         // total execution time
    "prefill_tps": 1621.89,          // prefill tokens per second
    "decode_tps": 168.42,            // decode tokens per second
    "ram_usage_mb": 245.67,          // current process RAM usage in MB
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
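
To read a single metric such as `confidence` out of the response buffer without pulling in a JSON library, something like the following could work. This is a rough sketch, not part of Cactus: `number_field` is a hypothetical helper, it is not a real JSON parser, and it assumes the key appears once at the top level with a plain numeric value.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Reads a numeric field from a flat JSON object.
// Returns std::nullopt when the key is missing or its value is not a number.
std::optional<double> number_field(const std::string& json, const std::string& key) {
    assert(!key.empty());
    const std::string quoted = "\"" + key + "\"";
    const std::size_t key_pos = json.find(quoted);
    if (key_pos == std::string::npos) return std::nullopt;

    const std::size_t colon = json.find(':', key_pos + quoted.size());
    if (colon == std::string::npos) return std::nullopt;

    const char* start = json.c_str() + colon + 1;
    char* end = nullptr;
    const double value = std::strtod(start, &end);  // skips leading whitespace
    if (end == start) return std::nullopt;          // e.g. null, true, or a string
    return value;
}
```

For anything beyond quick inspection of the metrics, a proper JSON library is the safer choice.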

Cactus Graph¶

#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset();
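
When checking graph output, a plain CPU reference for the first `matmul(a, b, false)` step can be handy. The sketch below does not use Cactus at all. `matmul_reference` is a hypothetical helper that assumes row-major layout, and results from the graph may differ slightly because of FP16/INT8 precision.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major reference matmul: (m x k) times (k x n) gives (m x n).
std::vector<float> matmul_reference(const std::vector<float>& lhs,
                                    const std::vector<float>& rhs,
                                    std::size_t m, std::size_t k, std::size_t n) {
    assert(lhs.size() == m * k);
    assert(rhs.size() == k * n);
    std::vector<float> out(m * n, 0.0f);
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t inner = 0; inner < k; ++inner) {
            for (std::size_t col = 0; col < n; ++col) {
                out[row * n + col] += lhs[row * k + inner] * rhs[inner * n + col];
            }
        }
    }
    return out;
}
```

With the `a_data` and `b_data` values above and shapes {2, 3} and {3, 4}, the result has 8 elements and its first entry is 1.1*1 + 2.3*5 + 3.4*9 = 43.2.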

Benchmark (missing latency = no NPU support yet)¶

| Device | LFM2.5-1.2B-INT4 (1k-Prefill/100-Decode) | LFM2.5-VL-1.6B-INT4 (256px-Latency & Decode) | Parakeet-1.1B-INT4 (30s-audio-Latency & Decode) |
|--------|--------|--------|--------|
| Mac M4 Pro (High-end) | 582tps/100tps (76MB RAM) | 0.2s/98tps (87MB RAM) | 0.1s/900k+tps (1GB RAM) |
| iPad/Mac M3 (Budget) | 350tps/60tps (70MB RAM) | 0.3s/69tps (80MB RAM) | 0.3s/800k+tps (102MB RAM) |
| iPhone 17 Pro (High-end) | 327tps/48tps (108MB RAM) | 0.3s/48tps (156MB RAM) | 0.3s/300k+tps (177MB RAM) |
| iPhone 13 Mini (Budget) | 148tps/34tps (1GB RAM) | 0.3s/35tps (1.2GB RAM) | 0.7s/90k+tps (1GB RAM) |
| Galaxy S25 Ultra (Qualcomm 8 Elite) | 255tps/37tps (1.5GB RAM) | -/34tps (2GB RAM) | -/250k+tps (1.8GB RAM) |
| Pixel 6a (Budget Google Tensor) | 70tps/15tps (1GB RAM) | -/15tps (1.5GB RAM) | -/17k+tps (1GB RAM) |
| Galaxy A17 5G (Budget Exynos) | 32tps/10tps (727MB RAM) | -/11tps (727MB RAM) | -/40k+tps (809MB RAM) |
| CMF Phone 2 Pro (Budget MediaTek) | - | - | - |
| Raspberry Pi 5 (IoT) | 69tps/11tps (869MB RAM) | 13.3s/11tps (2.1GB RAM) | 4.5s/180k+tps (1.9GB RAM) |
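
To turn a prefill/decode throughput pair into a rough end-to-end time, divide each token count by its rate and add the results. The helper below is a back-of-envelope sketch, not a Cactus API. `estimate_seconds` is a hypothetical name, and it assumes the two numbers in each cell are prefill and decode tokens per second and that fixed overheads (model load, tokenisation) are negligible.

```cpp
#include <cassert>

// Estimated wall time in seconds for one request, ignoring fixed overheads.
double estimate_seconds(double prefill_tokens, double prefill_tps,
                        double decode_tokens, double decode_tps) {
    assert(prefill_tps > 0.0 && decode_tps > 0.0);
    return prefill_tokens / prefill_tps + decode_tokens / decode_tps;
}
```

For the Mac M4 Pro figures (582tps prefill, 100tps decode) with 1,000 prompt tokens and 100 generated tokens, this gives 1000/582 + 100/100, or roughly 2.7 seconds.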

Supported Models¶

Model	Features
google/gemma-3-270m-it	completion
google/functiongemma-270m-it	completion, tools
LiquidAI/LFM2-350M	completion, tools, embed
Qwen/Qwen3-0.6B	completion, tools, embed
LiquidAI/LFM2-700M	completion, tools, embed
LiquidAI/LFM2-8B-A1B	completion, tools, embed
google/gemma-3-1b-it	completion
LiquidAI/LFM2.5-1.2B-Thinking	completion, tools, embed
LiquidAI/LFM2.5-1.2B-Instruct	completion, tools, embed
Qwen/Qwen3-1.7B	completion, tools, embed
LiquidAI/LFM2-2.6B	completion, tools, embed
LiquidAI/LFM2-VL-450M	vision, txt & img embed, Apple NPU
LiquidAI/LFM2.5-VL-1.6B	vision, txt & img embed, Apple NPU
UsefulSensors/moonshine-base	transcription, speech embed
openai/whisper-small	transcription, speech embed, Apple NPU
openai/whisper-medium	transcription, speech embed, Apple NPU
nvidia/parakeet-ctc-0.6b	transcription, speech embed, Apple NPU
nvidia/parakeet-ctc-1.1b	transcription, speech embed, Apple NPU
snakers4/silero-vad	vad
nomic-ai/nomic-embed-text-v2-moe	embed
Qwen/Qwen3-Embedding-0.6B	embed

Using this repo on Mac¶

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup

Using this repo on Linux (Ubuntu/Debian)¶

sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup

Command	Description
`cactus auth`	Set up Cactus cloud fallback (optional) (`--status`, `--clear`)
`cactus run [model]`	Opens playground (auto downloads model)
`cactus download [model]`	Downloads model to `./weights`
`cactus convert [model] [dir]`	Converts model, supports LoRA merging (`--lora <path>`)
`cactus build`	Builds for ARM (`--apple` or `--android`)
`cactus test`	Runs tests (`--ios` / `--android`, `--model [name/path]`, `--transcribe_model [name/path]`, `--only [test_name]`, `--precision`)
`cactus transcribe [model]`	Transcribe audio file (`--file`) or live microphone
`cactus clean`	Removes build artifacts
`cactus --help`	Shows all commands and flags (always run this)

Reproduce the reported benchmarks with `cactus test --benchmark`.
Plug in any mobile device and add the `--ios` or `--android` flag.
Mobile devices must be in developer mode.

Using in your apps¶

Try demo apps¶

Maintaining Organisations¶

Developed by Cactus Compute, Inc. (YC S25), with maintenance from:

Roadmap¶

Jul 2025: Got funding from YC & Oxford, launched, started building
Sep 2025: Launched Cactus Kernel, Graph & Engine, raised more funding
Oct 2025: Chunked prefill, streamingLLMs, KVCache Quantisation (2x faster prefill)
Nov 2025: Novel Cactus Attention algorithm (10 & 1k prefill yields same decode speed)
Dec 2025: Cactus team expands from original author to +6 Research Engineers
Jan 2026: Apple NPU/RAM optimisations, grew maintainers (reduce iOS/Mac latency 5-11x)
Feb 2026: Hybrid inference with GCP, INT4, lossless Quantisation (1.5x speed)

Mar 2026: Qualcomm NPU, Google NPU, optimise Android (5-11x less Qualcomm/Pixel latency)
Apr 2026: MediaTek NPU, Exynos NPU, Cactus@ICLR (improve all Android latency 5-11x)
May 2026: Kernel=C++, Graph=Rust, Engine=Rust, GPU support for Macs & VR Headsets
Jun 2026: Transpilers for porting custom models from Torch/Jax
Jul 2026: Aggressive optimisations to run directly on wearables, Cactus@ICML
Aug 2026: Orchestration, orchestration, orchestration.
Sep 2026: 1yr post-release, release the full Cactus paper.

Contributing to Cactus¶

C++ Standard: Use C++20 features where appropriate.
Formatting: Follow the existing code style in the project, one header per folder.
Comments: Avoid comments; make your code read like plain English.
AI-Generated Code: Do not blindly PR AI slop; this codebase is very complex and AI tools miss details.
Update docs: Please update docs when necessary, be intuitive and straightforward.
Keep It Simple: Do not go beyond the scope of the GH issue; avoid bloated PRs and keep code lean.
Benchmark Your Changes: Test performance impact, Cactus is performance-critical.
Test everything: A PR that fails to build is the biggest red flag; it means it was not tested.

Citation¶

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones \& Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

Join The Community¶

Reddit Channel