Skip to content

Home

Logo

┌─────────────────┐     Energy-efficient inference engine for running AI on mobile devices 
│  Cactus Engine  │ ←── OpenAI compatible APIs for C/C++, Swift, Kotlin, Flutter & React-Native
└─────────────────┘     Supports tool call, auto RAG, NPU, INT4, and cloud handoff for complex tasks
┌─────────────────┐     Zero-copy computation graph, think PyTorch for mobile devices
│  Cactus Graph   │ ←── You can implement custom models directly using this
└─────────────────┘     Highly optimised for RAM & lossless weight quantisation 
┌─────────────────┐     Low-level ARM-specific SIMD kernels (Apple, Snapdragon, Google, Exynos, MediaTek & Raspberry Pi)
│ Cactus Kernels  │ ←── Optimised Matrix Multiplication & n
└─────────────────┘     Custom attention kernels with KV-Cache Quantisation, chunked prefill, streaming LLM, etc.

Cactus Engine

#include cactus.h

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,                            // model handle from cactus_init
    messages,                         // JSON array of chat messages
    response,                         // buffer to store response JSON
    sizeof(response),                 // size of response buffer
    options,                          // optional: generation options (nullptr for defaults)
    nullptr,                          // optional: tools JSON for function calling 
    nullptr,                          // optional: streaming callback fn(token, id, user_data)
    nullptr                           // optional: user data passed to callback
);
Example response from Gemma3-270m
{
    "success": true,                 // when successfully generated locally
    "error": null,                   // returns specific errors if success = false
    "cloud_handoff": false,          // true when model is unconfident, simply route to cloud
    "response": "Hi there!",         // null when error is not null or cloud_handoff = true
    "function_calls": [],            // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
    "confidence": 0.8193,            // how confident the model is with its response
    "time_to_first_token_ms": 45.23, // latency (time to first token)
    "total_time_ms": 163.67,         // total execution time
    "prefill_tps": 1621.89,          // prefill tokens per second
    "decode_tps": 168.42,            // decode tokens per second
    "ram_usage_mb": 245.67,          // current process RAM usage in MB
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}

Cactus Graph

#include cactus.h

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset(); 

Benchmark

High-End Devices | Device | LFM2.5-1.2B-INT4
(1k-Prefill/100-Decode) | LFM2.5-VL-1.6B-INT4
(256px-Latency & Decode) | Whisper-Small-244-INT8
(30s-audio-Latency & Decode) |--------|--------|--------|----------| | Mac M4 Pro | 582tps/100tps (76MB RAM) | 0.2s/98tps (87MB RAM) | 0.1s/119tps (73MB RAM) | | iPad/Mac M4 | 379tps/66tps (30MB RAM) | 0.2s/64tps (53MB RAM) | 0.2s/100tp (122MB RAM) | | iPad/Mac M2 | 315tps/60tps (181MB RAM) | 0.3s/58tps (426MB RAM) | 0.3s/86tps (160MB RAM) | | iPhone 17 Pro | 327tps/48tps (108MB RAM)| 0.3s/48tps (156MB RAM) | 0.3s/114tps (177MB RAM)| | Galaxy S25 Ultra | 255tps/37tps (1.2GB RAM) | 2.6s/34tps (2GB RAM) | 2.3s/90tps (363MB RAM) |

Budget Devices | Device | LFM2-350m-INT4
(1k-Prefill/100-Decode) | LFM2-VL-450m-INT4
(256px-Latency & Decode) | Moonshine-Base-67m-INT4
(30s-audio-Latency & Decode) |--------|--------|--------|----------| | iPhone 13 Mini (Apple A15) | 516tps/65tps (29MB RAM) | 0.3s/69tps (20MB RAM) | 0.7s/413tps (10MB RAM) | | Pixel 6a (Google Tensor G1) | 218tps/44tps (395MB RAM)| 2.5s/36tps (631MB RAM) | 1.5s/189tps (111MB RAM)| | Galaxy A17 5G (Exxynox 1330) | 87tps/24tps (395MB RAM) | 4.1s/20tps (619MB RAM) | 4.2s/90tps (103MB RAM) | | CMF Phone 2 Pro (Mediatek Dimensity 7300) | 146tps/21tps (394MB RAM) | 2.4s/22tps (632MB RAM) | 1.9s/119tps (112MB RAM) | | Raspberry Pi 5 | - | - | - |

## Supported Models

Model Features
google/gemma-3-270m-it completion
google/functiongemma-270m-it completion, tools
LiquidAI/LFM2-350M completion, tools, embed
Qwen/Qwen3-0.6B completion, tools, embed
LiquidAI/LFM2-700M completion, tools, embed
LiquidAI/LFM2-8B-A1B completion, tools, embed
google/gemma-3-1b-it completion
LiquidAI/LFM2.5-1.2B-Thinking completion, tools, embed
LiquidAI/LFM2.5-1.2B-Instruct completion, tools, embed
Qwen/Qwen3-1.7B completion, tools, embed
LiquidAI/LFM2-2.6B completion, tools, embed
LiquidAI/LFM2-VL-450M vision, txt & img embed, Apple NPU
LiquidAI/LFM2.5-VL-1.6B vision, txt & img embed, Apple NPU
UsefulSensors/moonshine-base transcription, speech embed
openai/whisper-small transcription, speech embed, Apple NPU
openai/whisper-medium transcribe, speech embed, Apple NPU
snakers4/silero-vad vad
nomic-ai/nomic-embed-text-v2-moe embed
Qwen/Qwen3-Embedding-0.6B embed

Using this repo on Mac

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup

Using this repo on Linux (Ubuntu/Debian)

sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
Command Description
cactus auth Setup Cactus cloud fallback (optional) (--status, --clear)
cactus run [model] Opens playground (auto downloads model)
cactus download [model] Downloads model to ./weights
cactus convert [model] [dir] Converts model, supports LoRA merging (--lora <path>)
cactus build Builds for ARM (--apple or --android)
cactus test Runs tests (--ios / --android, --model [name/path], --transcribe_model [name/path], --only [test_name], --precision)
cactus transcribe [model] Transcribe audio file (--file) or live microphone
cactus clean Removes build artifacts
cactus --help Shows all commands and flags (always run this)

Using in your apps

Try demo apps

Maintaining Organisations

Developed by Cactus Compute, Inc. (YC S25), with maintenance from:

  1. UCLA's BruinAI
  2. Char (YC S25)
  3. Yale's AI Society
  4. National Unoversity of Singapore's AI Society
  5. UC Irvine's AI@UCI
  6. Imperial College's AI Society
  7. University of Pennsylvania's AI@Penn
  8. University of Michigan Ann-Arbor MSAIL
  9. University of Colorado Boulder's AI Club

Contributing to Cactus

  • C++ Standard: Use C++20 features where appropriate.
  • Formatting: Follow the existing code style in the project, one header per folder.
  • Comments: Avoid comments, make your code read like plain english.
  • AI-Generated Code: Do not bindly PR AI slop, this codebase is very complex, they miss details.
  • Update docs: Please update docs when necessary, be intuitive and straightforward.
  • Keep It Simple: Do not go beyond the scope of the GH issue, avoid bloated PRs, keep codes lean.
  • Benchmark Your Changes: Test performance impact, Cactus is performance-critical.
  • Test everything: A PR that fails to build is the biggest red flag, means it was not tested.

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones & Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

Join The Community