Cactus Python Package¶

Python bindings for the Cactus Engine, implemented via FFI. They are installed automatically when you run source ./setup.

Model bundles: Pre-built runtime bundles for all supported models are available at huggingface.co/Cactus-Compute.

Getting Started¶

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
cactus build --python

# Download pre-built bundles (defaults to the generic CPU variant)
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small --platform apple   # CoreML/NPU variant

# Optional: set your Cactus Cloud API key for automatic cloud fallback
cactus auth

Quick Example¶

from cactus import ensure_model
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

# Downloads the pre-built bundle from HuggingFace if not already present
bundle = ensure_model("LiquidAI/LFM2-VL-450M")

model = cactus_init(str(bundle), None, False)
messages = json.dumps([{"role": "user", "content": "What is 2+2?"}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])
cactus_destroy(model)

API Reference¶

All functions are module-level and mirror the C FFI directly. Handles are plain int values (C pointers).

Model Downloads¶

Download pre-built bundles programmatically (no CLI needed):

from cactus import ensure_model, get_bundle_dir

# ensure_model downloads the pre-built bundle if missing, returns its Path
bundle = ensure_model("openai/whisper-tiny")

# Or resolve the expected on-disk location explicitly
bundle_dir = get_bundle_dir("openai/whisper-tiny", bits=4, platform=None)
# -> Path("transpiled/whisper-tiny-cq4")  (or `-cq4-apple` with platform="apple")

Init / Lifecycle¶

model = cactus_init(model_path: str, corpus_dir: str | None, cache_index: bool) -> int
cactus_destroy(model: int)
cactus_reset(model: int)   # clear KV cache
cactus_stop(model: int)    # abort ongoing generation
cactus_get_last_error() -> str | None
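
Because handles are plain integers that must be released explicitly, it can help to tie init and destroy together so the handle is freed even when an exception occurs. The following is a minimal sketch, not part of the package: model_session is a hypothetical helper, and fake_init / fake_destroy are stand-ins so the snippet runs without the native library. In real use you would pass cactus_init and cactus_destroy instead.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def model_session(init_fn: Callable[..., int],
                  destroy_fn: Callable[[int], None],
                  *init_args) -> Iterator[int]:
    """Yield a handle from init_fn and always release it with destroy_fn."""
    handle = init_fn(*init_args)
    try:
        yield handle
    finally:
        destroy_fn(handle)  # runs even if the body raises


# Stand-ins (hypothetical) so the sketch runs without the native library.
released = []


def fake_init(model_path, corpus_dir, cache_index):
    return 1234  # pretend handle


def fake_destroy(handle):
    released.append(handle)


with model_session(fake_init, fake_destroy, "path/to/bundle", None, False) as model:
    print("handle:", model)

print("released:", released)
```

With the real bindings the call would look like model_session(cactus_init, cactus_destroy, str(bundle), None, False).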

Completion¶

Returns a dict with success, error, cloud_handoff, response, function_calls, segments, confidence, timing stats (time_to_first_token_ms, total_time_ms, prefill_tps, decode_tps, ram_usage_mb), and token counts (prefill_tokens, decode_tokens, total_tokens). An optional thinking field is present only when the model emits chain-of-thought content; when present, it is placed before function_calls. For completion, segments is always []; it is populated only in transcription responses.

result = cactus_complete(
    model: int,
    messages_json: str,              # JSON array of {role, content}
    options_json: str | None,        # optional inference options
    tools_json: str | None,          # optional tool definitions
    callback: Callable[[str, int], None] | None,  # streaming token callback
    pcm_data: list[int] | None = None              # optional raw audio bytes
) -> dict

# With options and streaming
options = json.dumps({"max_tokens": 256, "temperature": 0.7})
def on_token(token, token_id): print(token, end="", flush=True)

result = cactus_complete(model, messages_json, options, None, on_token)
if result["cloud_handoff"]:
    # response already contains cloud result
    pass
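
The streaming callback receives a token string and a token id, as in on_token above. If you also want to keep the streamed text, a callable object can be passed in place of a plain function. This is a sketch under that assumption; TokenCollector is a hypothetical name, and the loop below only simulates a stream so the snippet runs on its own.

```python
class TokenCollector:
    """Streaming callback that optionally echoes tokens and keeps them for later use."""

    def __init__(self, echo: bool = True) -> None:
        self.tokens = []      # token strings in arrival order
        self.token_ids = []   # matching token ids
        self.echo = echo

    def __call__(self, token: str, token_id: int) -> None:
        self.tokens.append(token)
        self.token_ids.append(token_id)
        if self.echo:
            print(token, end="", flush=True)

    @property
    def text(self) -> str:
        return "".join(self.tokens)


collector = TokenCollector(echo=False)
# Simulated stream; with the real bindings you would pass `collector`
# as the callback argument of cactus_complete.
for tok, tok_id in [("The", 101), (" answer", 102), (" is 4", 103)]:
    collector(tok, tok_id)

print(collector.text)
```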

Response format:

{
    "success": true,
    "error": null,
    "cloud_handoff": false,
    "response": "4",
    "function_calls": [],
    "segments": [],
    "confidence": 0.92,
    "time_to_first_token_ms": 45.2,
    "total_time_ms": 163.7,
    "prefill_tps": 619.5,
    "decode_tps": 168.4,
    "ram_usage_mb": 512.3,
    "prefill_tokens": 28,
    "decode_tokens": 12,
    "total_tokens": 40
}
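
The fields above can be consumed with a couple of small helpers. The sketch below is illustrative only: unwrap_completion and describe_timing are hypothetical names, and the sample dict is the response shown above rather than live output.

```python
import json


def unwrap_completion(result: dict) -> str:
    """Return the response text, or raise if the result reports a failure."""
    if not result.get("success", False):
        raise RuntimeError(result.get("error") or "completion failed")
    return result["response"]


def describe_timing(result: dict) -> str:
    """One-line summary built from the timing and token-count fields."""
    source = "cloud" if result.get("cloud_handoff") else "local"
    return (f"{source}: {result['total_tokens']} tokens in "
            f"{result['total_time_ms']:.1f} ms "
            f"(first token after {result['time_to_first_token_ms']:.1f} ms, "
            f"decode {result['decode_tps']:.1f} tok/s)")


# The documented example response, parsed from JSON.
sample = json.loads("""
{
    "success": true, "error": null, "cloud_handoff": false,
    "response": "4", "function_calls": [], "segments": [],
    "confidence": 0.92,
    "time_to_first_token_ms": 45.2, "total_time_ms": 163.7,
    "prefill_tps": 619.5, "decode_tps": 168.4, "ram_usage_mb": 512.3,
    "prefill_tokens": 28, "decode_tokens": 12, "total_tokens": 40
}
""")

print(unwrap_completion(sample))
print(describe_timing(sample))
```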

Prefill¶

Pre-processes input text and populates the KV cache without generating output tokens. This reduces latency for subsequent calls to cactus_complete.

cactus_prefill(
    model: int,
    messages_json: str,              # JSON array of {role, content}
    options_json: str | None,        # optional inference options
    tools_json: str | None,          # optional tool definitions
    pcm_data: list[int] | None = None              # optional raw audio bytes
) -> None

tools = json.dumps([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, State, Country"}
            },
            "required": ["location"]
        }
    }
}])

messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"}
])
cactus_prefill(model, messages, None, tools)

completion_messages = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "<|tool_call_start|>get_weather(location=\"Paris\")<|tool_call_end|>"},
    {"role": "tool", "content": "{\"name\": \"get_weather\", \"content\": \"Sunny, 72°F\"}"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
    {"role": "user", "content": "What about SF?"}
])
result = cactus_complete(model, completion_messages, None, tools, None)

Response format:

{
    "success": true,
    "error": null,
    "prefill_tokens": 25,
    "prefill_tps": 166.1,
    "total_time_ms": 150.5,
    "ram_usage_mb": 245.67
}
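
In the example above, the messages passed to cactus_complete repeat the prefilled conversation and append one new user turn. A small helper can build both payloads from a single history so they cannot drift apart. This is a sketch only: prefill_and_completion_payloads is a hypothetical name, and whether the cache is reused only for an exactly matching prefix is an assumption rather than something stated here.

```python
import json


def prefill_and_completion_payloads(history: list, next_user_text: str) -> tuple:
    """Return (prefill_json, completion_json) that share `history` as a prefix."""
    prefill_json = json.dumps(history)
    completion_json = json.dumps(
        history + [{"role": "user", "content": next_user_text}]
    )
    return prefill_json, completion_json


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "It's sunny and 72°F in Paris!"},
]
prefill_json, completion_json = prefill_and_completion_payloads(history, "What about SF?")

# With the real bindings:
#   cactus_prefill(model, prefill_json, None, tools)
#   result = cactus_complete(model, completion_json, None, tools, None)
print(len(json.loads(prefill_json)), len(json.loads(completion_json)))
```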

Transcription¶

Returns a dict with the response field (transcribed text), the segments array (timestamped segments as {"start": <sec>, "end": <sec>, "text": "<str>"}), and other metadata. Segment granularity depends on the model. Whisper: phrase-level, from timestamp tokens. Parakeet TDT: word-level, from frame timing. Parakeet CTC and Moonshine: one segment per transcription window (consecutive VAD speech regions of up to 30 s).

result = cactus_transcribe(
    model: int,
    audio_path: str | None,
    prompt: str | None,
    options_json: str | None,
    callback: Callable[[str, int], None] | None,
    pcm_data: list[int] | bytes | None
) -> dict

Custom vocabulary biases the decoder toward domain-specific words (supported for Whisper and Moonshine models). Pass custom_vocabulary and vocabulary_boost in options_json:

options = json.dumps({
    "custom_vocabulary": ["Omeprazole", "HIPAA", "Cactus"],
    "vocabulary_boost": 3.0
})
result = cactus_transcribe(model, "medical_notes.wav", None, options, None, None)

result = cactus_transcribe(model, "/path/to/audio.wav", None, None, None, None)
print(result["response"])
for seg in result["segments"]:
    print(f"[{seg['start']:.3f}s - {seg['end']:.3f}s] {seg['text']}")

Embeddings¶

embedding = cactus_embed(model: int, text: str, normalize: bool) -> list[float]
embedding = cactus_image_embed(model: int, image_path: str) -> list[float]
embedding = cactus_audio_embed(model: int, audio_path: str) -> list[float]
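
Embeddings are returned as plain lists of floats, so they can be compared without extra dependencies. The sketch below computes cosine similarity in pure Python; cosine_similarity is a hypothetical helper and the vectors are toy values, not real model output.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for cactus_embed output.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```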

Tokenization¶

tokens = cactus_tokenize(model: int, text: str) -> list[int]
result = cactus_score_window(model: int, tokens: list[int], start: int, end: int, context: int) -> dict

RAG¶

result = cactus_rag_query(model: int, query: str, top_k: int) -> dict

Returns a dict with a chunks array. Each chunk has score (float), source (str, from document metadata), and content (str):

{
    "chunks": [
        {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."}
    ]
}
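
A common next step is to fold the retrieved chunks into a prompt for cactus_complete. The sketch below concatenates chunks up to a character budget; build_context is a hypothetical helper, and it assumes the chunks arrive in the order you want to keep them, which is not stated here.

```python
def build_context(rag_result: dict, max_chars: int = 2000) -> str:
    """Join chunk contents, tagged with their source, until max_chars is reached."""
    parts = []
    used = 0
    for chunk in rag_result.get("chunks", []):
        entry = f"[{chunk['source']}] {chunk['content']}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the budget
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


# The documented example result.
rag_result = {
    "chunks": [
        {"score": 0.0142, "source": "doc.txt", "content": "relevant passage..."}
    ]
}
context = build_context(rag_result)
print(context)
```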

Vector Index¶

index = cactus_index_init(index_dir: str, embedding_dim: int) -> int
cactus_index_add(index: int, ids: list[int], documents: list[str],
                 metadatas: list[str] | None, embeddings: list[list[float]])
cactus_index_delete(index: int, ids: list[int])
result = cactus_index_get(index: int, ids: list[int]) -> dict
result = cactus_index_query(index: int, embedding: list[float], options_json: str | None) -> dict
cactus_index_compact(index: int)
cactus_index_destroy(index: int)

cactus_index_query returns {"results":[{"id":<int>,"score":<float>}, ...]}. cactus_index_get returns {"results":[{"document":"...","metadata":<str|null>,"embedding":[...]}, ...]}.
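
Because cactus_index_query returns only ids and scores, retrieving the text takes a follow-up cactus_index_get with those ids. The sketch below pairs the two results; pair_hits_with_documents is a hypothetical helper, and it assumes cactus_index_get returns entries in the same order as the ids passed in, which is not confirmed here. The dicts are hand-written stand-ins in the documented shapes.

```python
def pair_hits_with_documents(query_result: dict, get_result: dict) -> list:
    """Zip query hits with fetched documents (assumes matching order)."""
    hits = query_result["results"]
    docs = get_result["results"]
    if len(hits) != len(docs):
        raise ValueError("expected one document per hit")
    return [
        {"id": hit["id"], "score": hit["score"],
         "document": doc["document"], "metadata": doc["metadata"]}
        for hit, doc in zip(hits, docs)
    ]


# Stand-in results in the documented shapes.
query_result = {"results": [{"id": 7, "score": 0.91}, {"id": 3, "score": 0.55}]}
ids = [hit["id"] for hit in query_result["results"]]  # ids to pass to cactus_index_get
get_result = {"results": [
    {"document": "first doc", "metadata": None, "embedding": [0.1, 0.2]},
    {"document": "second doc", "metadata": "note", "embedding": [0.3, 0.4]},
]}
paired = pair_hits_with_documents(query_result, get_result)
print(paired)
```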

Logging¶

cactus_log_set_level(level: int)  # 0=DEBUG 1=INFO 2=WARN (default) 3=ERROR 4=NONE
cactus_log_set_callback(callback: Callable[[int, str, str], None] | None)
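
One way to use the callback is to forward native log lines into Python's standard logging module. The sketch below maps the numeric levels listed above onto logging levels. make_log_callback is a hypothetical helper, and the meaning of the two string arguments (taken here as a component name followed by the message) is an assumption, since only their types are given.

```python
import logging

# Cactus level -> Python logging level (4=NONE has no equivalent and is not mapped).
_LEVELS = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING, 3: logging.ERROR}


def make_log_callback(logger: logging.Logger):
    """Build a callback with the (int, str, str) shape expected by cactus_log_set_callback."""
    def callback(level: int, component: str, message: str) -> None:
        # Assumption: second argument is a component/tag, third is the message text.
        logger.log(_LEVELS.get(level, logging.INFO), "[%s] %s", component, message)
    return callback


class ListHandler(logging.Handler):
    """Collects formatted records in memory so the example is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("cactus.example")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

on_log = make_log_callback(logger)
on_log(2, "engine", "simulated warning")  # simulated native log line
print(handler.lines)
```

With the real bindings, the callback would be registered with cactus_log_set_callback(on_log).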

Telemetry¶

cactus_set_telemetry_environment(framework: str, cache_location: str | None, version: str | None)
cactus_set_app_id(app_id: str)
cactus_telemetry_flush()
cactus_telemetry_shutdown()

Error handling applies across the API. Functions that return a value raise RuntimeError on failure. cactus_prefill, cactus_index_add, cactus_index_delete, and cactus_index_compact also raise RuntimeError on failure, even though they return nothing. The remaining void functions never raise: cactus_destroy, cactus_reset, cactus_stop, cactus_index_destroy, and the logging and telemetry functions.
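
Since failures surface as RuntimeError, callers that prefer a result-or-error style can wrap the calls. The sketch below is one such wrapper; try_call is a hypothetical helper, and the two stub functions only simulate a succeeding and a failing binding so the snippet runs without the native library.

```python
from typing import Any, Callable, Optional, Tuple


def try_call(fn: Callable[..., Any], *args: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Call fn(*args); return (value, None) on success or (None, message) on RuntimeError."""
    try:
        return fn(*args), None
    except RuntimeError as exc:
        return None, str(exc)


# Stubs (hypothetical) standing in for bindings such as cactus_tokenize.
def fake_tokenize(model: int, text: str) -> list:
    return [1, 2, 3]


def fake_failing_call(model: int) -> dict:
    raise RuntimeError("simulated failure")


ok_value, ok_error = try_call(fake_tokenize, 0, "hello")
bad_value, bad_error = try_call(fake_failing_call, 0)
print(ok_value, ok_error)
print(bad_value, bad_error)
```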

Vision (VLM)¶

Pass image paths in an images field on the message for vision-language models (LFM2-VL, LFM2.5-VL, Gemma4, Qwen3.5):

messages = json.dumps([{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

Audio (Multimodal)¶

Pass audio file paths in an audio field on the message for models with native audio understanding (Gemma4):

messages = json.dumps([{
    "role": "user",
    "content": "Transcribe the audio.",
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
print(result["response"])

# Combined vision + audio
messages = json.dumps([{
    "role": "user",
    "content": "Describe the image and transcribe the audio.",
    "images": ["path/to/image.png"],
    "audio": ["path/to/audio.wav"]
}])
result = cactus_complete(model, messages, None, None, None)
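
The vision and audio examples differ only in which optional keys the message carries, so a small builder keeps the JSON consistent. This is a sketch; user_message is a hypothetical helper that produces the same message shape as the examples above.

```python
import json
from typing import Optional, Sequence


def user_message(text: str,
                 images: Optional[Sequence[str]] = None,
                 audio: Optional[Sequence[str]] = None) -> dict:
    """Build a user message, adding "images" / "audio" keys only when provided."""
    message = {"role": "user", "content": text}
    if images:
        message["images"] = list(images)
    if audio:
        message["audio"] = list(audio)
    return message


messages = json.dumps([
    user_message(
        "Describe the image and transcribe the audio.",
        images=["path/to/image.png"],
        audio=["path/to/audio.wav"],
    )
])
print(messages)
```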

Compute Graph¶

The Graph API provides a tensor computation graph for building and executing dataflow pipelines on the Cactus kernel layer:

from cactus.bindings.cactus import Graph
import numpy as np

g = Graph()
a = g.input((2, 2))
b = g.input((2, 2))
y = ((a - b) * (a + b)).abs().pow(2.0).view((4,))

g.set_input(a, np.array([[2, 4], [6, 8]], dtype=np.float16))
g.set_input(b, np.array([[1, 2], [3, 4]], dtype=np.float16))
g.execute()

print(y.numpy())  # [9. 144. 729. 2304.]

Supported ops: +, -, *, /, abs, pow, view, flatten, concat, cat, relu, sigmoid, tanh, gelu, softmax.
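
As a cross-check that does not need the native library, the expression from the graph example, ((a - b) * (a + b)).abs().pow(2.0) flattened to four elements, can be evaluated element-wise in plain Python. This is only a reference calculation for the same inputs, not a use of the Graph API.

```python
a = [[2, 4], [6, 8]]
b = [[1, 2], [3, 4]]

# Element-wise |(a - b) * (a + b)| ** 2, flattened in row-major order.
expected = [
    float(abs((x - y) * (x + y)) ** 2)
    for row_a, row_b in zip(a, b)
    for x, y in zip(row_a, row_b)
]
print(expected)  # [9.0, 144.0, 729.0, 2304.0]
```

All four values are exactly representable in float16, so the graph result should match them without rounding.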

Testing¶

Run the full test suite:

python python/test.py        # compact output
python python/test.py -v     # verbose

Tests live in python/tests/ and cover bindings, CLI, server, graph, model, transpile, and component-partition behavior. To extend the suite, add a new test_*.py file.