Ridiculously Fast On-Device Transcription: Reviewing Parakeet CTC 1.1B with Cactus

By Satyajit Kumar and Henry Ndubuaku

Parakeet CTC 1.1B is NVIDIA’s non-autoregressive English speech-to-text model built on FastConformer. At only 1.1 billion parameters, it is small enough to run entirely on-device while still delivering state-of-the-art transcription quality. It uses Limited Context Attention in the encoder and a lightweight CTC projection head instead of an autoregressive decoder, which makes the decoding stage extremely efficient. Using Cactus, we achieve decode speeds of up to 6 million tokens/second with sub-200 ms end-to-end latency on Apple Silicon, which is fast enough for real-time, always-on transcription without a cloud round-trip.

Architecture Details

Parakeet CTC 1.1B is built on NVIDIA's FastConformer encoder and optimized for non-autoregressive ASR. At a high level:

  1. Audio front-end (mel + subsampling): Input audio is converted to log-mel features, then an 8x depthwise-separable convolutional subsampler reduces sequence length before the encoder stack.
  2. FastConformer encoder blocks: The encoder combines Conformer layers with Limited Context Attention (LCA) for local efficiency and periodic Global Tokens (GT) so long-range context is still preserved.
  3. CTC projection head: Instead of an autoregressive decoder, Parakeet projects encoder states directly to token logits and uses CTC decoding (blank/repeat collapse), making inference highly parallel and low latency.

This architecture is why Parakeet works well for both real-time and batch transcription: most compute is in the encoder pass, and decoding stays lightweight.
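
To make the blank/repeat collapse in step 3 concrete, here is a minimal greedy CTC decoding sketch. It is not Cactus's or NeMo's implementation: the blank index, the toy vocabulary, and the function name `ctc_greedy_collapse` are assumptions chosen for illustration. It takes the per-frame argmax IDs, merges consecutive repeats, then drops blanks.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: index of the CTC blank; the real model defines its own


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Collapse per-frame argmax IDs into a token sequence.

    Step 1 merges runs of identical labels; step 2 removes blanks.
    """
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank_id]


# Toy vocabulary for illustration only (not Parakeet's tokenizer).
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}

# One ID per encoder frame; the blank between the two 3s keeps "ll" as two tokens.
frames = [1, 1, 0, 2, 2, 0, 3, 0, 3, 3, 4, 0, 0]

tokens = ctc_greedy_collapse(frames)
text = "".join(vocab[t] for t in tokens)
print(text)  # hello
```

Because each frame is handled independently before the collapse, there is no token-by-token loop that depends on earlier outputs, which is what keeps the decoding stage cheap compared with an autoregressive decoder.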

Model Architecture Diagram

                                      ┌───────────────────────┐
                                      │     CTC Collapse      │
                                      │ remove blanks / merge │
                                      │ repeated labels       │
                                      └───────────┬───────────┘
                                      ┌───────────┴───────────┐
                                      │   CTC Projection Head │
                                      │  Conv1D / Linear → V  │
                                      └───────────┬───────────┘
                                      ┌───────────┴───────────┐
                                      │         Norm          │
                                      └───────────┬───────────┘
                         ┌────────────────────────⊕───────────────────────┐
                         │                        │                       │
                         │          FastConformer Encoder Stack           │
                         │                  × Num Layers                  │
                         │                                                │
                         │   ┌────────────────────────────────────────┐   │
                         │   │           FastConformer Block          │   │
                         │   │                                        │   │
                         │   │             ┌──────────────┐           │   │
                         │   │             │     FFN      │           │   │
                         │   │             │ Linear       │           │   │
                         │   │             │ SwiGLU/Act   │           │   │
                         │   │             │ Linear       │           │   │
                         │   │             └──────┬───────┘           │   │
                         │   │                    │                   │   │
                         │   │                    ⊕                   │   │
                         │   │                    │                   │   │
                         │   │             ┌──────┴───────┐           │   │
                         │   │             │  Conv Module │           │   │
                         │   │             │ Pointwise    │           │   │
                         │   │             │ Depthwise    │           │   │
                         │   │             │ Pointwise    │           │   │
                         │   │             └──────┬───────┘           │   │
                         │   │                    │                   │   │
                         │   │                    ⊕                   │   │
                         │   │                    │                   │   │
                         │   │    ┌───────────────┴──────────────┐    │   │
                         │   │    │   Limited Context Attention  │    │   │
                         │   │    │     local / sliding window   │    │   │
                         │   │    │                              │    │   │
                         │   │    │      Q        K        V     │    │   │
                         │   │    │      ↑        ↑        ↑     │    │   │
                         │   │    │ ┌────┴────────┴────────┴───┐ │    │   │
                         │   │    │ │           Linear         │ │    │   │
                         │   │    │ └─────────────┬────────────┘ │    │   │
                         │   │    └───────────────┼──────────────┘    │   │
                         │   │                    │                   │   │
                         │   │                    ⊕                   │   │
                         │   │                    │                   │   │
                         │   │    ┌───────────────┴──────────────┐    │   │
                         │   │    │            FFN               │    │   │
                         │   │    │    Linear → Act → Linear     │    │   │
                         │   │    └──────────────────────────────┘    │   │
                         │   │                                        │   │
                         │   └────────────────────────────────────────┘   │
                         └────────────────────────┬───────────────────────┘
                                      ┌───────────┴───────────┐
                                      │  Conv Subsampling /   │
                                      │ Sequence Reduction    │
                                      │  (time downsample)    │
                                      └───────────┬───────────┘
                                      ┌───────────┴───────────┐
                                      │   Mel-Spectrogram /   │
                                      │   Acoustic Features   │
                                      └───────────┬───────────┘
                                      ┌───────────┴───────────┐
                                      │     16 kHz Audio      │
                                      │      Waveform In      │
                                      └───────────────────────┘
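
The "local / sliding window" attention in the diagram can be pictured as a boolean mask over frame pairs. The sketch below is a simplified illustration, not Parakeet's actual configuration: the window size, the choice of global positions, and the function name `limited_context_mask` are all hypothetical, and it treats selected frame positions as global rather than modelling separate learned tokens.

```python
def limited_context_mask(num_frames, window, global_positions=()):
    """Return mask[i][j] == True where frame i may attend to frame j.

    A pair is allowed if the frames are within `window` steps of each other,
    or if either frame is one of the global positions.
    """
    global_set = set(global_positions)
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            is_local = abs(i - j) <= window
            is_global = i in global_set or j in global_set
            row.append(is_local or is_global)
        mask.append(row)
    return mask


# Illustrative sizes only: 8 frames, a window of 1, and frame 0 acting as global.
mask = limited_context_mask(num_frames=8, window=1, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

allowed_pairs = sum(sum(row) for row in mask)
print(f"{allowed_pairs} of {len(mask) ** 2} pairs attended")
```

With a fixed window, the number of attended pairs grows roughly linearly with sequence length instead of quadratically, while the global positions give every frame a path to distant context.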

Getting Started with Parakeet-CTC-1.1B on Cactus

Quick Start (Homebrew)

The fastest way to try Parakeet is two commands, with sub-200 ms latency:

brew install cactus-compute/cactus/cactus
cactus transcribe nvidia/parakeet-ctc-1.1b

That's it. Cactus downloads the 1.1B model, quantizes it, and starts a live transcription session from your microphone. To transcribe a file instead:

cactus transcribe nvidia/parakeet-ctc-1.1b --file /path/to/your/file.wav
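
Before handing a file to the model, it can help to check its format and get a feel for how short the encoder sequence becomes after 8x subsampling. The sketch below uses only Python's standard `wave` module. It assumes a 10 ms hop between mel frames and that 16 kHz mono input is what the model expects, as the diagram suggests; Cactus may well resample for you, and the frame counts are rough estimates that ignore padding. The helper name `describe_wav` is hypothetical.

```python
import os
import tempfile
import wave

FEATURE_HOP_SECONDS = 0.010  # assumption: 10 ms hop between mel frames
SUBSAMPLING_FACTOR = 8       # the 8x subsampler described in the architecture section


def describe_wav(path):
    """Read a PCM WAV header and estimate the encoder sequence length."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        num_samples = wav.getnframes()
    hop_samples = round(sample_rate * FEATURE_HOP_SECONDS)
    mel_frames = -(-num_samples // hop_samples)             # ceiling division
    encoder_frames = -(-mel_frames // SUBSAMPLING_FACTOR)   # ceiling division
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": num_samples / sample_rate,
        "mel_frames": mel_frames,
        "encoder_frames": encoder_frames,
        "is_16k_mono": sample_rate == 16000 and channels == 1,
    }


# Demo on a generated file: 2 seconds of 16-bit silence at 16 kHz mono.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 32000)
    info = describe_wav(demo_path)

print(info)
```

Under these assumptions, two seconds of audio become about 200 mel frames and only about 25 encoder frames, one per 80 ms of speech.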

Building from Source

If you need the Python, Rust, or C libraries for integration, build from source:

1. Prerequisites

  • macOS with Apple Silicon and 16GB+ RAM (M1 or later recommended)
  • Python 3.10+
  • CMake (brew install cmake)
  • Git

2. Clone and Build

git clone https://github.com/cactus-compute/cactus.git
cd cactus

# Build the Cactus engine (shared library for Python FFI)
cactus build --python

3. Download the Model

Cactus handles downloading and converting HuggingFace models to its optimized binary format with INT4/INT8 quantization, all in one command:

cactus download nvidia/parakeet-ctc-1.1b

4. Use the Python Binding

For integrating Parakeet into your own applications, use the Python FFI bindings directly:

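from cactus import cactus_init, cactus_transcribe, cactus_destroy

# Load the model from the local weights directory.
model = cactus_init("weights/parakeet-ctc-1.1b", None, False)

# Transcribe a WAV file; the result carries the text and decode statistics.
result = cactus_transcribe(model, "/path/to/audio.wav")

print("\n\nFinal transcript:")
print(result["response"])
print(f"Decode speed: {result['decode_tps']:.1f} tokens/sec")

# Release the model once transcription is finished.
cactus_destroy(model)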
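
In a longer-running application you will probably want the model released even if transcription raises, and a wall-clock number to compare against the latency figures quoted earlier. The sketch below shows one way to do that. The helper `transcribe_once` is hypothetical, and it takes the init, transcribe, and destroy functions as parameters so it can be run here with stand-ins; in practice you would pass the three Cactus functions imported above.

```python
import time


def transcribe_once(init_fn, transcribe_fn, destroy_fn, weights_dir, audio_path):
    """Load a model, transcribe one file, and always release the model.

    Returns the transcription result and the elapsed transcribe time in ms.
    """
    model = init_fn(weights_dir, None, False)
    try:
        start = time.perf_counter()
        result = transcribe_fn(model, audio_path)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    finally:
        destroy_fn(model)  # runs even if transcription raised
    return result, elapsed_ms


# Stand-ins so the sketch runs without Cactus installed. Swap in the real
# cactus_init, cactus_transcribe, and cactus_destroy when using the library.
released = []


def fake_init(weights_dir, _unused_a, _unused_b):
    return {"path": weights_dir}


def fake_transcribe(model, audio_path):
    return {"response": "hello world", "decode_tps": 0.0}


def fake_destroy(model):
    released.append(model)


result, elapsed_ms = transcribe_once(
    fake_init, fake_transcribe, fake_destroy,
    "weights/parakeet-ctc-1.1b", "/path/to/audio.wav",
)
print(result["response"], f"({elapsed_ms:.1f} ms)")
```

Note that the timer covers only the transcribe call, not model loading, so it is not a full end-to-end latency measurement.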

5. Use the C API

The C API is the base layer all other bindings build on. Link against libcactus_engine and include the FFI header:

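#include "cactus_engine.h"
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    /* Load the model; a NULL handle is assumed to signal failure. */
    cactus_model_t model = cactus_init("weights/parakeet-ctc-1.1b", NULL, false);
    if (!model) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }

    /* The transcript is written into this caller-provided buffer. */
    char response[16384];
    int rc = cactus_transcribe(
        model, "audio.wav", NULL,
        response, sizeof(response),
        NULL, NULL, NULL, NULL, 0
    );

    /* A negative return code is treated as an error. */
    if (rc >= 0) printf("Transcript: %s\n", response);
    else fprintf(stderr, "Transcription failed (rc=%d)\n", rc);

    cactus_destroy(model);
    return rc >= 0 ? 0 : 1;
}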

6. Use the Rust Binding

Copy cactus.rs into your project (see the README), link against the libcactus_engine.a produced by cactus build, and call the FFI bindings directly:

use std::ffi::CString;
use std::os::raw::c_char;
use std::ptr;

fn main() {
    let model_path = CString::new("weights/parakeet-ctc-1.1b").unwrap();
    let audio_path = CString::new("audio.wav").unwrap();

    let model = unsafe {
        cactus_sys::cactus_init(model_path.as_ptr(), ptr::null(), false)
    };

    let mut buf = vec![0u8; 16384];
    let rc = unsafe {
        cactus_sys::cactus_transcribe(
            model,
            audio_path.as_ptr(),
            ptr::null(),
            buf.as_mut_ptr() as *mut c_char,
            buf.len(),
            ptr::null(), None, ptr::null_mut(),
            ptr::null(), 0,
        )
    };

    if rc >= 0 {
        let response = unsafe { std::ffi::CStr::from_ptr(buf.as_ptr() as *const c_char).to_string_lossy() };
        println!("Transcript: {}", response);
    }

    unsafe { cactus_sys::cactus_destroy(model) };
}

7. Use the Swift Binding

The Swift binding exposes top-level functions that map directly to the C FFI:

import Foundation

let model = try cactusInit("weights/parakeet-ctc-1.1b", nil, false)
let resultJson = try cactusTranscribe(model, "/path/to/audio.wav", nil, nil, nil, nil)
print(resultJson)

cactusDestroy(model)

8. Use the Kotlin Binding

The Kotlin binding exposes top-level functions that map directly to the C FFI:

import com.cactus.*

val model = cactusInit("weights/parakeet-ctc-1.1b", null, false)
val resultJson = cactusTranscribe(model, "/path/to/audio.wav", null, null, null, null)
println(resultJson)

cactusDestroy(model)

9. Use the Flutter Binding

The Flutter binding brings Cactus transcription to iOS, macOS, and Android:

import 'cactus.dart';

final model = cactusInit('weights/parakeet-ctc-1.1b', null, false);
final resultJson = cactusTranscribe(model, '/path/to/audio.wav', null, null, null, null);
print(resultJson);

cactusDestroy(model);
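
The Swift, Kotlin, and Flutter calls above return the result as a JSON string rather than a parsed object. The sketch below shows the parsing step in Python, to stay consistent with the earlier Python example; the same idea applies with each platform's own JSON decoder. It assumes the JSON carries the same `response` and `decode_tps` keys that the Python example reads, which this post does not confirm for the other bindings. The sample payload and the helper name `summarize_result` are hypothetical.

```python
import json


def summarize_result(result_json):
    """Extract the transcript and decode speed from a result JSON string."""
    data = json.loads(result_json)
    transcript = data.get("response", "")
    decode_tps = data.get("decode_tps")  # assumed key; may be absent
    if decode_tps is None:
        return transcript
    return f"{transcript} [{decode_tps:.1f} tokens/sec]"


# Hypothetical payload for illustration; real output may carry more fields.
sample = '{"response": "hello world", "decode_tps": 1234.5}'
print(summarize_result(sample))
```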

See Also