Title: Native R 'torch' Implementation of 'OpenAI' 'Whisper'
Version: 0.1.0
Description: Speech-to-text transcription using a native R 'torch' implementation of the 'OpenAI' 'Whisper' model <https://github.com/openai/whisper>. Supports multiple model sizes from tiny (39M parameters) to large-v3 (1.5B parameters) with integrated download from 'HuggingFace' <https://huggingface.co/> via the 'hfhub' package. Provides automatic speech recognition with optional language detection and translation to English. Audio preprocessing, mel spectrogram computation, and transformer-based encoder-decoder inference are all implemented in R using the 'torch' package.
License: MIT + file LICENSE
Encoding: UTF-8
URL: https://github.com/cornball-ai/whisper
BugReports: https://github.com/cornball-ai/whisper/issues
Imports: torch, av, jsonlite, hfhub, safetensors, stats, utils
Suggests: tinytest
NeedsCompilation: no
Packaged: 2026-02-04 01:07:15 UTC; troy
Author: Troy Hernandez [aut, cre], cornball.ai [cph], OpenAI [cph] (Whisper model architecture and mel filterbank data (MIT license))
Maintainer: Troy Hernandez <troy@cornball.ai>
Repository: CRAN
Date/Publication: 2026-02-06 20:00:02 UTC
Audio Preprocessing for Whisper
Description
Convert audio files to mel spectrograms for Whisper input. This page also documents the Whisper audio constants, such as WHISPER_SAMPLE_RATE.
Usage
WHISPER_SAMPLE_RATE
Format
An object of class integer of length 1.
Apply BPE Merges
Description
Apply BPE Merges
Usage
apply_bpe(tokens, merge_ranks)
Arguments
tokens: Character vector of tokens
merge_ranks: Named vector of merge rankings
Value
Character vector after BPE merges
Get Audio Duration
Description
Get Audio Duration
Usage
audio_duration(file)
Arguments
file: Path to audio file
Value
Duration in seconds
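A minimal usage sketch, using the sample clip bundled with the package:

```r
# Duration of the bundled JFK sample clip, in seconds
audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
audio_duration(audio_file)
```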
Convert Audio to Mel Spectrogram
Description
Main preprocessing function that converts audio to the mel spectrogram format expected by Whisper.
Usage
audio_to_mel(file, n_mels = 80L, device = "auto", dtype = "auto")
Arguments
file: Path to audio file, or numeric vector of audio samples
n_mels: Number of mel bins (80 for most models, 128 for large-v3)
device: torch device for output tensor
dtype: torch dtype for output tensor
Value
torch tensor of shape (1, n_mels, 3000) for 30s audio
Examples
# Convert audio file to mel spectrogram
audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
mel <- audio_to_mel(audio_file)
dim(mel)
Convert Byte to BPE Token
Description
GPT-2/Whisper uses a specific byte-to-unicode mapping.
Usage
byte_to_token(byte)
Arguments
byte: Integer byte value (0-255)
Value
Character token
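For illustration, a sketch of the GPT-2 byte-to-unicode idea (not necessarily the package's internal implementation): printable bytes map to their own character, while whitespace and control bytes are shifted into code points at U+0100 and above so that every byte gets a visible token character.

```r
# Hypothetical sketch of the GPT-2 byte-to-unicode mapping idea.
# Printable bytes (33-126, 161-172, 174-255) map to themselves;
# remaining bytes are remapped to code points starting at 256.
byte_to_token_sketch <- function(byte) {
  printable <- c(33:126, 161:172, 174:255)
  if (byte %in% printable) {
    intToUtf8(byte)
  } else {
    # Position within the non-printable set determines the remapped code point
    offset <- sum(!(0:255 %in% printable) & 0:255 < byte)
    intToUtf8(256L + offset)
  }
}
byte_to_token_sketch(65L)  # "A" maps to itself
```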
Clean Transcribed Text
Description
Clean Transcribed Text
Usage
clean_text(text)
Arguments
text: Raw decoded text
Value
Cleaned text
Compute STFT Magnitude
Description
Compute STFT Magnitude
Usage
compute_stft(audio, n_fft = WHISPER_N_FFT, hop_length = WHISPER_HOP_LENGTH)
Arguments
audio: Numeric vector of audio samples
n_fft: FFT window size
hop_length: Hop length between frames
Value
Complex STFT matrix
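A quick sketch on a synthetic signal, assuming Whisper's usual n_fft = 400 and hop_length = 160 at 16 kHz (the values referenced by the package constants):

```r
# One second of a 440 Hz sine at 16 kHz
sr <- 16000
t <- seq(0, 1, length.out = sr)
audio <- sin(2 * pi * 440 * t)
stft <- compute_stft(audio, n_fft = 400L, hop_length = 160L)
# One axis should cover the one-sided spectrum: n_fft / 2 + 1 = 201 bins
dim(stft)
```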
Copy Weight if Exists
Description
Copy Weight if Exists
Usage
copy_if_exists(param, weights, name)
Arguments
param: Target parameter
weights: Weight dictionary
name: Weight name
Create Decoder from Config
Description
Create Decoder from Config
Usage
create_decoder(config)
Arguments
config: Model configuration from whisper_config()
Value
WhisperDecoder module
Create Encoder from Config
Description
Create Encoder from Config
Usage
create_encoder(config)
Arguments
config: Model configuration from whisper_config()
Value
WhisperEncoder module
Create Mel Filterbank (Fallback)
Description
Create a mel filterbank matrix for converting STFT to mel spectrogram. Used when pre-computed filterbank is not available.
Usage
create_mel_filterbank_fallback(
n_fft = WHISPER_N_FFT,
n_mels = 80L,
sample_rate = WHISPER_SAMPLE_RATE
)
Arguments
n_fft: FFT size
n_mels: Number of mel bins
sample_rate: Audio sample rate
Value
Mel filterbank matrix (n_mels x n_freqs)
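A sketch of the expected shape, assuming the standard one-sided spectrum (n_fft / 2 + 1 frequency bins):

```r
fb <- create_mel_filterbank_fallback(n_fft = 400L, n_mels = 80L,
                                     sample_rate = 16000L)
dim(fb)  # expected: 80 x 201 (n_mels x n_freqs)
```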
Decode BPE Bytes Back to Text
Description
Decode BPE Bytes Back to Text
Usage
decode_bpe_bytes(text)
Arguments
text: Text with BPE byte tokens
Value
Decoded text
Decode Timestamp Token
Description
Decode Timestamp Token
Usage
decode_timestamp(token_id, model = "tiny")
Arguments
token_id: Token ID
model: Model name for correct token IDs
Value
Time in seconds
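Whisper timestamp tokens encode time in fixed 0.02 s steps above the first timestamp token. A sketch of the arithmetic; the base ID used below is illustrative, since the exact value varies by model:

```r
# Sketch of the timestamp arithmetic (0.02 s per token step).
# timestamp_begin is the ID of the <|0.00|> token; its exact value
# depends on the model and is an assumption here.
decode_timestamp_sketch <- function(token_id, timestamp_begin) {
  (token_id - timestamp_begin) * 0.02
}
decode_timestamp_sketch(50364 + 150, 50364)  # 150 steps -> 3 seconds
```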
Download Tokenizer Files from HuggingFace
Description
Download Tokenizer Files from HuggingFace
Usage
download_tokenizer_files(model)
Arguments
model: Model name
Download Model from HuggingFace
Description
Download Whisper model weights and tokenizer files from HuggingFace. In interactive sessions, asks for user consent before downloading.
Usage
download_whisper_model(model = "tiny", force = FALSE)
Arguments
model: Model name: "tiny", "base", "small", "medium", "large-v3"
force: Re-download even if the model already exists locally
Value
Path to model directory (invisibly)
Examples
if (interactive()) {
# Download tiny model (smallest, ~150MB)
download_whisper_model("tiny")
# Download larger model for better accuracy
download_whisper_model("small")
}
Ensure Tokenizer Files are Downloaded
Description
Ensure Tokenizer Files are Downloaded
Usage
ensure_tokenizer_files(model)
Arguments
model: Model name
Value
Path to vocab directory (directory containing vocab.json)
Extract Segments with Timestamps
Description
Extract Segments with Timestamps
Usage
extract_segments(tokens, tokenizer, time_offset = 0)
Arguments
tokens: Token IDs
tokenizer: Tokenizer
time_offset: Offset in seconds for chunk processing
Value
Data frame with start, end, text
Get Initial Decoder Tokens
Description
Build the initial token sequence for decoder input.
Usage
get_initial_tokens(
language = "en",
task = "transcribe",
model = "tiny",
timestamps = FALSE
)
Arguments
language: Two-letter language code, or NULL for auto-detection
task: "transcribe" or "translate"
model: Model name for correct special token IDs
timestamps: Whether to include timestamps (internal use)
Value
Integer vector of initial token IDs
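The sequence mirrors Whisper's prompt format: start-of-transcript, language, and task tokens, plus a no-timestamps token when timestamps are disabled. A minimal call sketch:

```r
# Initial decoder prompt for English transcription with the tiny model
get_initial_tokens(language = "en", task = "transcribe", model = "tiny")
```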
Get Model Cache Path
Description
Get Model Cache Path
Usage
get_model_path(model)
Arguments
model: Model name
Value
Path to model directory in hfhub cache
Get Path to Model Weights
Description
Get Path to Model Weights
Usage
get_weights_path(model)
Arguments
model: Model name
Value
Path to safetensors file
Greedy Decoding
Description
Greedy Decoding
Usage
greedy_decode(
model,
encoder_output,
initial_tokens,
tokenizer,
max_length = 448L,
device
)
Arguments
model: WhisperModel
encoder_output: Encoder hidden states
initial_tokens: Initial token tensor
tokenizer: Tokenizer
max_length: Maximum output length
device: Device
Value
Integer vector of generated tokens
Convert Hz to Mel Scale
Description
Convert Hz to Mel Scale
Usage
hz_to_mel(hz)
Arguments
hz: Frequency in Hz
Value
Frequency in mel scale
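For reference, the widely used HTK mel formula; whether the package uses the HTK or the Slaney variant is not stated here, so treat this as an illustration rather than the package's exact definition:

```r
# HTK-style mel conversion; the package may use a different variant
hz_to_mel_htk <- function(hz) 2595 * log10(1 + hz / 700)
mel_to_hz_htk <- function(mel) 700 * (10^(mel / 2595) - 1)
hz_to_mel_htk(1000)                 # close to 1000 mel
mel_to_hz_htk(hz_to_mel_htk(440))   # round-trips to 440 Hz
```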
Check if Token is Timestamp
Description
Check if Token is Timestamp
Usage
is_timestamp_token(token_id, model = "tiny")
Arguments
token_id: Token ID
model: Model name for correct token IDs
Value
TRUE if timestamp token
List Downloaded Models
Description
List Downloaded Models
Usage
list_downloaded_models()
Value
Character vector of downloaded model names
Examples
list_downloaded_models()
List Available Models
Description
List Available Models
Usage
list_whisper_models()
Value
Character vector of model names
Examples
list_whisper_models()
Load Added Tokens from HuggingFace
Description
Load Added Tokens from HuggingFace
Usage
load_added_tokens(repo)
Arguments
repo: HuggingFace repo ID
Value
Named list of token -> ID mappings, or NULL if not found
Load and Preprocess Audio
Description
Load audio from file, convert to mono, resample to 16kHz.
Usage
load_audio(file)
Arguments
file: Path to audio file (WAV, MP3, etc.)
Value
Numeric vector of audio samples normalized to the [-1, 1] range
Examples
# Load included sample audio
audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
samples <- load_audio(audio_file)
length(samples)
range(samples)
Load Decoder Weights
Description
Load Decoder Weights
Usage
load_decoder_weights(decoder, weights)
Arguments
decoder: WhisperDecoder module
weights: Named list of tensors
Load Encoder Weights
Description
Load Encoder Weights
Usage
load_encoder_weights(encoder, weights)
Arguments
encoder: WhisperEncoder module
weights: Named list of tensors
Load Pre-computed Mel Filterbank
Description
Load the official Whisper mel filterbank from bundled CSV file.
Usage
load_mel_filterbank(n_mels = 80L)
Arguments
n_mels: Number of mel bins (80 or 128)
Value
Mel filterbank matrix (n_mels x n_freqs)
Load Whisper Model
Description
Load a Whisper model with weights from HuggingFace.
Usage
load_whisper_model(
model = "tiny",
device = "auto",
dtype = "auto",
download = FALSE,
verbose = TRUE
)
Arguments
model: Model name: "tiny", "base", "small", "medium", "large-v3"
device: Device to load model on ("auto", "cpu", "cuda")
dtype: Data type ("auto", "float16", "float32")
download: If TRUE and model not present, prompt to download
verbose: Print loading messages
Value
WhisperModel module
Examples
# Load tiny model (requires prior download)
if (model_exists("tiny")) {
model <- load_whisper_model("tiny")
}
Load Weights from Safetensors
Description
Load Weights from Safetensors
Usage
load_whisper_weights(model, weights_path, verbose = TRUE)
Arguments
model: WhisperModel module
weights_path: Path to safetensors file
verbose: Print loading messages
Convert Mel Scale to Hz
Description
Convert Mel Scale to Hz
Usage
mel_to_hz(mel)
Arguments
mel: Frequency in mel scale
Value
Frequency in Hz
Check if Model is Downloaded
Description
Check if Model is Downloaded
Usage
model_exists(model)
Arguments
model: Model name
Value
TRUE if model weights exist locally
Examples
model_exists("tiny")
model_exists("large-v3")
Pad or Trim Audio to Fixed Length
Description
Pad or Trim Audio to Fixed Length
Usage
pad_or_trim(audio, length = WHISPER_N_SAMPLES)
Arguments
audio: Numeric vector of audio samples
length: Target length in samples (default: 30s at 16kHz)
Value
Numeric vector of specified length
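A sketch of the behavior at the default 30 s target (480000 samples at 16 kHz), assuming the usual Whisper convention of zero-padding short inputs:

```r
# Short input is padded up to the target length
short <- rnorm(16000)            # 1 s of audio
length(pad_or_trim(short))       # 480000
# Long input is trimmed down to the target length
long <- rnorm(40 * 16000)        # 40 s of audio
length(pad_or_trim(long))        # 480000
```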
Parse Device Argument
Description
Parse Device Argument
Usage
parse_device(device = "auto")
Arguments
device: Character or torch device. "auto" uses GPU if available.
Value
torch device object
Parse Dtype Argument
Description
Parse Dtype Argument
Usage
parse_dtype(dtype = "auto", device = whisper_device())
Arguments
dtype: Character or torch dtype. "auto" uses float16 on GPU, float32 on CPU.
device: torch device (used for auto selection)
Value
torch dtype
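A sketch of how the two helpers compose; the concrete results depend on whether a GPU is visible:

```r
if (torch::torch_is_installed()) {
  dev <- parse_device("auto")      # cuda if available, otherwise cpu
  dt  <- parse_dtype("auto", dev)  # float16 on GPU, float32 on CPU
  dev$type
  dt
}
```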
Split Long Audio into Chunks
Description
Split audio longer than 30 seconds into overlapping chunks.
Usage
split_audio(file, chunk_length = 30, overlap = 1)
Arguments
file: Path to audio file
chunk_length: Chunk length in seconds
overlap: Overlap between chunks in seconds
Value
List of audio chunks (numeric vectors)
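With chunk_length = 30 and overlap = 1, consecutive chunks advance by 29 s, so an n-second file yields roughly ceiling((n - overlap) / (chunk_length - overlap)) chunks. A usage sketch:

```r
audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
chunks <- split_audio(audio_file, chunk_length = 30, overlap = 1)
length(chunks)          # clips under 30 s produce a single chunk
sapply(chunks, length)  # samples per chunk (at 16 kHz)
```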
Decode Token IDs to Text
Description
Decode Token IDs to Text
Usage
tokenizer_decode(ids, id_to_token, special_tokens)
Arguments
ids: Integer vector of token IDs
id_to_token: Mapping from ID to token
special_tokens: Special token info
Value
Character string
Encode Text to Token IDs
Description
Encode Text to Token IDs
Usage
tokenizer_encode(text, vocab, merge_ranks)
Arguments
text: Character string to encode
vocab: Vocabulary mapping (token -> id)
merge_ranks: Merge ranking for BPE
Value
Integer vector of token IDs
Whisper Transcription
Description
Main transcription API for Whisper: transcribe speech from an audio file.
Usage
transcribe(
file,
model = "tiny",
language = "en",
task = "transcribe",
device = "auto",
dtype = "auto",
verbose = TRUE
)
Arguments
file: Path to audio file (WAV, MP3, etc.)
model: Model name: "tiny", "base", "small", "medium", "large-v3"
language: Language code (e.g., "en", "es"). NULL for auto-detection.
task: "transcribe" or "translate" (translate to English)
device: Device: "auto", "cpu", "cuda"
dtype: Data type: "auto", "float16", "float32"
verbose: Print progress messages
Value
List with text, language, and metadata
Examples
# Transcribe included sample (JFK "ask not" speech)
if (model_exists("tiny")) {
audio_file <- system.file("audio", "jfk.mp3", package = "whisper")
result <- transcribe(audio_file, model = "tiny")
result$text
# Translate Spanish audio to English
spanish_file <- system.file("audio", "allende.mp3", package = "whisper")
result <- transcribe(spanish_file, model = "tiny",
language = "es", task = "translate")
result$text
}
Transcribe Single Chunk
Description
Transcribe Single Chunk
Usage
transcribe_chunk(
file,
model,
tokenizer,
config,
language = "en",
task = "transcribe",
device,
dtype,
verbose = TRUE
)
Arguments
file: Audio file or mel spectrogram
model: WhisperModel
tokenizer: Tokenizer
config: Model config
language: Language code
task: Task type
device: Device
dtype: Dtype
verbose: Verbose output
Value
Transcription result
Transcribe Long Audio
Description
Process audio longer than 30 seconds in chunks.
Usage
transcribe_long(
file,
model,
tokenizer,
config,
language,
task,
device,
dtype,
verbose
)
Arguments
file: Audio file
model: WhisperModel
tokenizer: Tokenizer
config: Model config
language: Language code
task: Task type
device: Device
dtype: Dtype
verbose: Verbose output
Value
Combined transcription result
Whisper Encoder
Description
Transformer encoder for processing mel spectrograms. This page documents the multi-head self-attention module.
Usage
whisper_attention(n_state, n_head)
Arguments
n_state: Hidden dimension
n_head: Number of attention heads
Whisper Model Configurations
Description
Get configuration for a Whisper model variant.
Usage
whisper_config(model = "tiny")
Arguments
model: Character. Model name: "tiny", "base", "small", "medium", "large-v3"
Value
List with model configuration parameters
Examples
# Get tiny model configuration
cfg <- whisper_config("tiny")
cfg$n_mels
cfg$n_audio_layer
# Compare model sizes
whisper_config("tiny")$n_text_layer
whisper_config("large-v3")$n_text_layer
Text Decoder
Description
Full Whisper decoder: token embedding + positional embedding + transformer layers.
Usage
whisper_decoder(n_vocab, n_ctx, n_state, n_head, n_layer)
Arguments
n_vocab: Vocabulary size
n_ctx: Maximum context length
n_state: Hidden dimension
n_head: Number of attention heads
n_layer: Number of transformer layers
Whisper Decoder
Description
Transformer decoder with cross-attention to encoder outputs. This page documents the pre-norm decoder layer, which combines self-attention and cross-attention.
Usage
whisper_decoder_layer(n_state, n_head)
Arguments
n_state: Hidden dimension
n_head: Number of attention heads
Device and Dtype Management
Description
Utilities for managing torch devices and data types. whisper_device() returns the default device: CUDA if available, otherwise CPU.
Usage
whisper_device()
Value
torch device object
Examples
if (torch::torch_is_installed()) {
device <- whisper_device()
device$type
}
Get Default Dtype
Description
Returns float16 on CUDA, float32 on CPU.
Usage
whisper_dtype(device = whisper_device())
Arguments
device: torch device
Value
torch dtype
Examples
if (torch::torch_is_installed()) {
dtype <- whisper_dtype()
dtype
}
Audio Encoder
Description
Full Whisper encoder: Conv stem + positional encoding + transformer layers.
Usage
whisper_encoder(n_mels, n_ctx, n_state, n_head, n_layer)
Arguments
n_mels: Number of mel spectrogram bins
n_ctx: Maximum context length (1500 for 30s audio)
n_state: Hidden dimension
n_head: Number of attention heads
n_layer: Number of transformer layers
Encoder Layer
Description
Pre-norm transformer encoder layer.
Usage
whisper_encoder_layer(n_state, n_head)
Arguments
n_state: Hidden dimension
n_head: Number of attention heads
Get Language Token ID
Description
Get Language Token ID
Usage
whisper_lang_token(lang = "en", model = "tiny")
Arguments
lang: Two-letter language code (e.g., "en", "es", "fr")
model: Model name for correct token IDs
Value
Token ID for the language
Whisper Model
Description
Full Whisper model module combining encoder and decoder.
Usage
whisper_model(config)
Arguments
config: Model configuration
Special Token IDs
Description
Get special token IDs for a Whisper model. Token IDs differ between model variants (e.g., large-v3 has extra language tokens).
Usage
whisper_special_tokens(model = "tiny")
Arguments
model: Model name (default: "tiny")
Value
Named list of special token IDs
Whisper BPE Tokenizer
Description
Byte-pair encoding tokenizer for Whisper models. whisper_tokenizer() loads or creates a tokenizer from HuggingFace vocab files.
Usage
whisper_tokenizer(model = "tiny")
Arguments
model: Model name for vocab lookup
Value
Tokenizer object (list with encode/decode functions)
Examples
# Load tokenizer (requires prior model download)
if (model_exists("tiny")) {
tok <- whisper_tokenizer("tiny")
tok$encode("Hello world")
tok$decode(c(50258, 50259, 50359, 50363))
}