Analyzing Audio
Audio is where Taters earns its name. The goal is simple: make it easy to get from messy containers and long recordings to clean, analysis-ready artifacts you can iterate on. The core tools cover three phases:
- extract and standardize audio (WAVs at predictable locations)
- structure the speech (diarization and transcripts)
- turn waveforms into features (embeddings, per-speaker splits)
Everything follows the same philosophy as the text stack: predictable outputs, friendly defaults, and a "do not overwrite unless asked" rule.
Extract audio from video
Many recordings arrive as multi-track containers (Zoom, OBS, ProRes). This utility lists every audio stream and writes one WAV per stream with sensible names that include stream index and tags like language/title. It is handy both for audits and for preparing inputs to downstream steps.
What it does
- probes audio streams with ffprobe
- writes one PCM WAV per stream at your chosen sample rate and bit depth
- uses predictable filenames: <stem>_a<index>[_<lang>][_<title>].wav
- falls back to a default output directory if you do not pass one
When to use
- you have a video/container and want clean WAVs for each embedded track
- you intend to diarize and embed only one stream (e.g., the mixed program feed)
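The probing step described above can be sketched as follows. This is a minimal illustration rather than the Taters implementation: the helper names `ffprobe_audio_cmd` and `parse_audio_streams` are hypothetical, and the sample JSON is invented to mimic what ffprobe prints with `-of json`.

```python
import json


def ffprobe_audio_cmd(input_path):
    """Build an ffprobe command that lists audio streams as JSON (hypothetical helper)."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",                       # audio streams only
        "-show_entries", "stream=index:stream_tags=language,title",
        "-of", "json",
        str(input_path),
    ]


def parse_audio_streams(ffprobe_json):
    """Return one dict per audio stream, numbered in container order."""
    streams = json.loads(ffprobe_json).get("streams", [])
    result = []
    for audio_index, stream in enumerate(streams):
        tags = stream.get("tags", {})
        result.append({
            "audio_index": audio_index,               # the N in -map 0:a:<N>
            "lang": tags.get("language"),
            "title": tags.get("title"),
        })
    return result


# Invented sample output for a two-track recording.
sample = (
    '{"streams": ['
    '{"index": 1, "tags": {"language": "eng"}}, '
    '{"index": 2, "tags": {"language": "eng", "title": "Mic 2"}}]}'
)
streams = parse_audio_streams(sample)
```

In a real run, the command would be executed with `subprocess.run(cmd, capture_output=True, text=True, check=True)` and its stdout passed to the parser.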
API: split audio streams to WAV
Extract each audio stream in a container to its own WAV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str \| PathLike | Video or audio container readable by FFmpeg. | required |
output_dir | str \| PathLike \| None | Destination directory. If None, defaults to | None |
sample_rate | int | Target sample rate for the output WAVs (Hz). | 48000 |
bit_depth | (16, 24, 32) | Output PCM bit depth (little-endian). | 16,24,32 |
overwrite | bool | If True, overwrite existing files. If False and a target exists, raises :class: | True |
Returns:
Type | Description |
---|---|
list[str] | Absolute paths to the created WAVs. |
Behavior
- Output file names are constructed from the input base name and stream metadata, with safe slugs: <stem>_a<index>[_<lang>][_<title>].wav
- Uses -map 0:a:<N> to select the N-th audio stream in the container.
- Runs FFmpeg with -nostdin and quiet loglevel to avoid TTY lockups.
Examples:
>>> split_audio_streams_to_wav("session.mp4")
['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
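The naming pattern and FFmpeg flags listed under Behavior can be sketched like this. It is an illustration under stated assumptions, not the library's code: `slug`, `stream_wav_name`, and `ffmpeg_split_cmd` are hypothetical helpers, the slug rule (lowercase, non-alphanumerics collapsed to dashes) is a guess at what "safe slugs" means, and the bit-depth-to-codec mapping is assumed.

```python
import re
from pathlib import Path

# Assumed mapping from bit depth to FFmpeg's little-endian PCM codecs.
PCM_CODECS = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}


def slug(text):
    """Reduce a tag to lowercase letters, digits, and dashes (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def stream_wav_name(input_path, index, lang=None, title=None):
    """Build <stem>_a<index>[_<lang>][_<title>].wav for one audio stream."""
    parts = [f"{Path(input_path).stem}_a{index}"]
    for tag in (lang, title):
        if tag and slug(tag):
            parts.append(slug(tag))
    return "_".join(parts) + ".wav"


def ffmpeg_split_cmd(input_path, out_path, index,
                     sample_rate=48000, bit_depth=16, overwrite=True):
    """Build the FFmpeg command that extracts the index-th audio stream."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-y" if overwrite else "-n",      # -y overwrites, -n refuses to
        "-i", str(input_path),
        "-map", f"0:a:{index}",           # N-th audio stream of input 0
        "-ar", str(sample_rate),
        "-c:a", PCM_CODECS[bit_depth],
        str(out_path),
    ]


name = stream_wav_name("session.mp4", 1, lang="eng", title="Mic 2")
cmd = ffmpeg_split_cmd("session.mp4", name, 1)
```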
Source code in src\taters\audio\extract_wav_from_video.py
Convert any audio to WAV
Standardize any FFmpeg-readable media into a linear PCM WAV at the sample rate, bit depth, and channel layout you specify. Defaults are sensible for ASR and most modeling pipelines (16 kHz, 16-bit, mono).
What it does
- converts audio or extracts audio from video into a single WAV
- preserves channel layout if you request it
- uses ffmpeg with quiet, pipeline-safe flags and clear error reporting
- predictable default output path if omitted
When to use
- you need consistent, model-friendly WAVs from heterogeneous sources
- you want a one-liner from notebooks or the CLI
API: convert audio to WAV
Convert any FFmpeg-readable audio/video file to a linear PCM WAV.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str \| Path | Source media file (audio or video container). FFmpeg must be able to read it. | required |
output_path | str \| Path \| None | Target WAV path. If None, defaults to | None |
sample_rate | int | Desired sample rate (Hz). | 16000 |
bit_depth | (16, 24, 32) | Output PCM bit depth; maps to | 16,24,32 |
channels | int \| None | If provided, set number of output channels (e.g., 1=mono, 2=stereo). If None, keep original channel count. | 1 |
overwrite_existing | bool | Overwrite | False |
Returns:
Type | Description |
---|---|
Path | Path to the written WAV file. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If |
RuntimeError | If FFmpeg/FFprobe are missing or the conversion fails. |
Notes
- Video inputs are supported: the audio stream is extracted and converted.
- For multi-channel sources, if channels is None, the channel layout is preserved.
- We run FFmpeg with -nostdin to avoid TTY issues in pipelines.
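A rough sketch of how the parameters above could translate into an FFmpeg command line. The helper `convert_cmd` is hypothetical, and the mapping from bit depth to the `pcm_s16le`/`pcm_s24le`/`pcm_s32le` codecs is an assumption, since the parameter description is cut off at "maps to".

```python
def convert_cmd(input_path, output_path, sample_rate=16000, bit_depth=16,
                channels=1, overwrite_existing=False):
    """Build an FFmpeg command for a linear PCM WAV (hypothetical helper)."""
    codecs = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}  # assumed mapping
    if bit_depth not in codecs:
        raise ValueError(f"bit_depth must be one of {sorted(codecs)}")
    cmd = [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-y" if overwrite_existing else "-n",
        "-i", str(input_path),
        "-vn",                            # drop any video stream
        "-ar", str(sample_rate),
        "-c:a", codecs[bit_depth],
    ]
    if channels is not None:
        cmd += ["-ac", str(channels)]     # omitted when channels is None, keeping the source count
    cmd.append(str(output_path))
    return cmd


mono_cmd = convert_cmd("interview.mp4", "interview.wav")
keep_layout_cmd = convert_cmd("interview.mp4", "interview.wav", channels=None)
```

Leaving `-ac` out entirely is what keeps the source channel count when `channels` is None. Running the command through `subprocess.run(cmd, check=True, capture_output=True)` would surface FFmpeg's stderr for error reporting.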
Source code in src\taters\audio\convert_to_wav.py
Diarize and transcribe
This is a thin CLI shim that forwards to the vendored Whisper diarization wrapper. It handles device selection and writes transcripts alongside subtitles with helpful defaults. Use it to get a timestamped CSV (start_time,end_time,speaker,text) plus SRT/TXT that other Taters tools can consume immediately.
What it does
- delegates to the underlying whisper diarization wrapper
- writes transcripts to a predictable folder by input stem
- accepts device hints (cuda/cpu/auto)
When to use
- you are preparing per-segment text for embeddings or dictionary coding
- you plan to split a long recording into per-speaker WAVs
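To show what downstream code can do with the transcript CSV, here is a small sketch that totals speaking time per speaker. The sample rows, the speaker labels, and the assumption that timestamps are in milliseconds are all invented for illustration. Only the column names come from the description above.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the start_time,end_time,speaker,text layout.
sample_csv = """start_time,end_time,speaker,text
0,1500,Speaker 0,Hello there.
1500,4000,Speaker 1,Hi. How are you?
4000,5200,Speaker 0,Fine thanks.
"""


def talk_time_ms(csv_text):
    """Sum segment durations per speaker (timestamps assumed to be ms)."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["speaker"]] += float(row["end_time"]) - float(row["start_time"])
    return dict(totals)


totals = talk_time_ms(sample_csv)
```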
API: diarize with third-party wrapper (CLI entry)
Important note: this function is also exposed more conveniently as taters.audio.diarize_with_thirdparty.
Run the vendored Whisper diarization scripts and normalize their outputs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
audio_path | str \| Path | Input audio (WAV recommended). | required |
out_dir | str \| Path | Output directory for transcript artifacts. If it does not exist, it will be created. | None |
repo_dir | str \| Path \| None | Optional explicit location of the diarization repo. If None, the vendored copy is used. | None |
whisper_model | str | Whisper ASR model to use (e.g., "small", "base", "large-v3"). | "medium.en" |
language | str \| None | Language hint for Whisper (e.g., "en"); if None, autodetection is used. | None |
device | ('cpu', 'cuda') | Runtime device. If "cpu", environment variables are set to hide GPUs. | "cpu","cuda" |
batch_size | int | Whisper batch size; 0 disables batching. | 0 |
no_stem | bool | Pass through to demucs/whisper scripts to disable vocal/instrument stems. | False |
suppress_numerals | bool | Heuristic to reduce spurious numeral tokens. | False |
parallel | bool | Use parallel diarization script if available. | False |
timeout | int \| None | Subprocess timeout in seconds; None means no timeout. | None |
use_custom | bool | Prefer the customized script if present (adds CSV emission and minor cleanup). | True |
keep_temp | bool | If False (default), temporary folders created by demucs/whisper are removed. | False |
num_speakers | int \| None | Force a fixed number of speakers, if the downstream diarizer supports it. | None |
Returns:
Type | Description |
---|---|
DiarizationOutputFiles | Paths to |
Notes
- The function copies the input WAV to a per-file work directory before running, to ensure relative paths inside the third-party scripts resolve correctly.
- If device="cpu", CUDA is disabled in the child environment.
- On success, the local WAV copy is deleted and temporary folders are tidied up.
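One common way to disable CUDA for a child process is to blank out `CUDA_VISIBLE_DEVICES` in the environment handed to it. The sketch below assumes that mechanism. The source does not say which variables Taters sets, and `child_env` is a hypothetical helper.

```python
import os


def child_env(device, base_env=None):
    """Return a copy of the environment, hiding GPUs when device is "cpu"."""
    env = dict(os.environ if base_env is None else base_env)
    if device == "cpu":
        # Assumption: an empty CUDA_VISIBLE_DEVICES hides every GPU from
        # CUDA-based libraries started in the child process.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


cpu_env = child_env("cpu", {"PATH": "/usr/bin", "CUDA_VISIBLE_DEVICES": "0"})
gpu_env = child_env("cuda", {"PATH": "/usr/bin", "CUDA_VISIBLE_DEVICES": "0"})
```

The resulting dict would be passed along with the timeout, for example `subprocess.run(cmd, env=cpu_env, timeout=timeout, check=True)`. The parent's own environment is left untouched because the function works on a copy.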
See Also
taters.audio.split_wav_by_speaker.make_speaker_wavs_from_csv : Build per-speaker WAVs from the diarization CSV.
Source code in src\taters\audio\diarizer\whisper_diar_wrapper.py
Whisper encoder embeddings
Export Whisper encoder embeddings as features. Two modes:
- Transcript-driven: one vector per transcript row (e.g., per diarized segment)
- General-audio: segment the WAV with fixed windows or non-silent spans; optionally mean-pool to a single vector for the whole file
Outputs land under ./features/whisper-embeddings/ by default with a stable <stem>_embeddings.csv name. The wrapper runs extraction in a subprocess by default to avoid CUDA/Torch collisions elsewhere in your pipeline.
What it does
- computes D-dimensional vectors using Faster-Whisper/CTranslate2 backends
- segmenting strategies for raw audio; segment-level when given a transcript
- optional single-row mean pooling for whole-file summaries
- isolates heavy GPU state in a child process by default
When to use
- you want robust, speech-centric features for clustering, retrieval, or as inputs to downstream models
- you have transcripts already and want segment-level representations aligned to text
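The fixed-window strategy for general audio can be sketched as below. This is an illustration, not the worker's actual code: `window_spans` is a hypothetical helper, and the defaults assume the documented "30.0, 15.0" means 30-second windows with a 15-second hop.

```python
def window_spans(duration_s, window_s=30.0, hop_s=15.0, min_seg_s=1.0):
    """Return (start, end) spans in seconds covering the file with overlap."""
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)   # clamp the last window
        if end - start >= min_seg_s:              # skip too-short leftovers
            spans.append((start, end))
        if end >= duration_s:
            break
        start += hop_s
    return spans


spans = window_spans(70.0)
```

With a hop smaller than the window, consecutive spans overlap. A 70-second file yields four spans here, the last one clamped to the end of the file.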
API: extract Whisper embeddings
Export Whisper encoder embeddings to a CSV file, using a subprocess by default.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source_wav | str \| Path | Path to the input WAV. Must be readable by | required |
transcript_csv | str \| Path \| None | If provided, enables transcript-driven mode. The CSV is expected to contain timestamp columns and (optionally) a speaker column. A row is emitted per transcript segment. | None |
time_unit | ('auto', 'ms', 's', 'samples') | How to interpret timestamps in | "auto","ms","s","samples" |
strategy | ('windows', 'nonsilent') | General-audio mode only. "windows" uses fixed-size windows with overlap; "nonsilent" uses an energy-based splitter (librosa.effects.split). | "windows","nonsilent" |
window_s | float | General-audio mode only. Window length (seconds). | 30.0 |
hop_s | float | General-audio mode only. Hop between window starts (seconds). | 15.0 |
min_seg_s | float | General-audio mode only. Skip segments shorter than this many seconds. | 1.0 |
top_db | float | General-audio mode only ("nonsilent"). Threshold (dB) below reference to consider as silence. Smaller → more segments; larger → fewer. | 30.0 |
aggregate | ('none', 'mean') | General-audio mode only. If "mean", a single pooled row is written covering the entire file; otherwise one row per segment. | "none","mean" |
output_dir | str \| Path \| None | Directory for the output CSV. If None, defaults to | None |
model_name | str | Model identifier passed through to the worker (e.g., "tiny", "base", "small", "large-v3" or a local CTranslate2 model directory). | "base" |
device | ('auto', 'cuda', 'cpu') | Runtime device. If "cpu", environment variables are set to disable CUDA in the child process. | "auto","cuda","cpu" |
compute_type | str | CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to the worker module. | "float16" |
run_in_subprocess | bool | If True (recommended), runs extraction in a separate Python process to isolate Torch/CUDA state from the parent process. | True |
extra_env | dict \| None | Additional environment variables to inject into the child process. | None |
verbose | bool | If True, print the launched command and the child's stdout. | True |
extractor_module | str | Dotted module path whose | "chopshop.audio.extract_whisper_embeddings_subproc" |
Returns:
Type | Description |
---|---|
Path | Path to the written embeddings CSV. Pattern: |
Notes
- The subprocess writes and exits. The parent returns once the file exists.
- If transcript_csv is supplied, the worker runs in transcript mode; otherwise general-audio mode is used with the given segmentation strategy.
- Failures in the child process are re-raised with the captured stdout/stderr to ease debugging.
Examples:
Transcript per-segment embeddings:
>>> extract_whisper_embeddings(
... source_wav="audio/session.wav",
... transcript_csv="transcripts/session.csv",
... time_unit="ms",
... model_name="small",
... device="cuda",
... )
Whole-file mean embedding:
>>> extract_whisper_embeddings(
... source_wav="audio/session.wav",
... strategy="nonsilent",
... aggregate="mean",
... output_dir="features/whisper-embeddings",
... )
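For intuition about what aggregate="mean" produces, the sketch below mean-pools a list of per-segment vectors into one whole-file vector. It uses an unweighted mean over plain Python lists. Whether Taters weights segments by duration is not stated, so treat that as an assumption, and `mean_pool` as a hypothetical helper.

```python
def mean_pool(vectors):
    """Average equal-length vectors element-wise into a single vector."""
    if not vectors:
        raise ValueError("need at least one segment vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Two invented 3-dimensional segment embeddings.
segment_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
pooled = mean_pool(segment_vectors)
```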
Source code in src\taters\audio\extract_whisper_embeddings.py
Split WAV by speaker
Given a diarization transcript, create one WAV per speaker by concatenating that speaker's segments. You can insert small silences to avoid clicks at joins and resample/downmix on the fly. Filenames are readable and stable.
What it does
- reads a timestamped CSV with start_time,end_time,speaker
- builds one output WAV per unique speaker
- optional silence padding, resampling, mono mixdown
- skips ultra-short segments; clamps times to audio bounds
When to use
- you want per-speaker audio for targeted feature extraction or human coding
- you plan to model speakers separately or compute speaker-level aggregates
API: make per-speaker WAVs from a transcript
Concatenate speaker-specific segments into per-speaker WAV files.
If merge_consecutive=True (default), adjacent transcript rows with the same speaker are merged into a single, longer segment spanning from the first start to the last end, including any silence between those turns. If you need the strict per-row behavior, set merge_consecutive=False.
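The merging rule can be sketched as follows. `merge_consecutive_turns` is a hypothetical helper, not the library function. It assumes rows are already sorted by start time and are given as (start, end, speaker) tuples.

```python
def merge_consecutive_turns(rows):
    """Merge adjacent rows that share a speaker into one spanning segment."""
    merged = []
    for start, end, speaker in rows:
        if merged and merged[-1][2] == speaker:
            prev_start, prev_end, _ = merged[-1]
            # Span from the first start to the last end, silence included.
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


# Invented rows (ms): A speaks twice in a row with a 200 ms gap, then B, then A.
rows = [(0, 1000, "A"), (1200, 2000, "A"), (2000, 3000, "B"), (3500, 4000, "A")]
merged = merge_consecutive_turns(rows)
```

The two leading turns for A collapse into one span that includes the gap between them. The later A turn stays separate because B speaks in between.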
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source_wav | str \| Path | Path to the source WAV. | required |
transcript_csv_path | str \| Path | CSV with timing and speaker columns (e.g., diarization output). | required |
output_dir | str \| Path \| None | Where to write the per-speaker files. If None, defaults to | None |
start_col | str | Name of the start-time column in the transcript CSV. | 'start_time' |
end_col | str | Name of the end-time column in the transcript CSV. | 'end_time' |
speaker_col | str | Name of the speaker column in the transcript CSV. | 'speaker' |
time_unit | ('ms', 's') | Units for start/end columns. | "ms","s" |
silence_ms | int | If | 1000 |
pre_silence_ms | int \| None | Explicit padding (ms) before each segment; overrides | None |
post_silence_ms | int \| None | Explicit padding (ms) after each segment; overrides | None |
sr | int \| None | Resample output to this rate. If None, keep original rate. | 16000 |
mono | bool | Downmix to mono if True. | True |
min_dur_ms | int | Skip segments shorter than this duration (ms). | 50 |
merge_consecutive | bool | Merge back-to-back turns for the same speaker into one segment span (including any inter-turn silence). If False, emit one clip per row. | True |
Returns:
Type | Description |
---|---|
dict[str, Path] | Mapping from friendly speaker label → output WAV path. |
Behavior
- Input speaker labels are sanitized for filenames but a more readable label (without path-hostile characters) is preserved for naming.
- Segments are sorted by start time per speaker before concatenation.
- If a speaker ends up with zero valid segments, no file is written.
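A simplified sketch of the concatenation step for one speaker, using a plain list of mono samples instead of a real WAV. The helper `speaker_track` is hypothetical, and it assumes `silence_ms` is applied as symmetric padding before and after each kept segment. The source description of that parameter is truncated, so that reading is a guess.

```python
def speaker_track(samples, sr, segments_ms, silence_ms=1000, min_dur_ms=50):
    """Concatenate one speaker's segments with silence padding around each."""
    total_ms = len(samples) * 1000 / sr
    pad = [0] * int(sr * silence_ms / 1000)
    out = []
    for start_ms, end_ms in sorted(segments_ms):     # sort by start time
        start_ms = max(0, start_ms)                  # clamp to audio bounds
        end_ms = min(end_ms, total_ms)
        if end_ms - start_ms < min_dur_ms:           # skip ultra-short segments
            continue
        first = int(start_ms * sr / 1000)
        last = int(end_ms * sr / 1000)
        out.extend(pad + samples[first:last] + pad)
    return out


# One second of fake audio at 1000 Hz, so sample i sits at millisecond i.
samples = list(range(1000))
track = speaker_track(
    samples, sr=1000,
    segments_ms=[(100, 200), (900, 1500), (300, 320)],
    silence_ms=10, min_dur_ms=50,
)
```

In this example the 20 ms segment is dropped, the segment that runs past the end of the file is clamped to 1000 ms, and each kept segment is wrapped in 10 samples of silence.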
Examples:
>>> make_speaker_wavs_from_csv(
... source_wav="audio/session.wav",
... transcript_csv_path="transcripts/session.csv",
... time_unit="ms",
... silence_ms=0, # no padding
... sr=16000,
... mono=True,
... )
Source code in src\taters\audio\split_wav_by_speaker.py
Practical notes
- Paths: if you do not pass explicit outputs, tools write to predictable folders next to your project root (for example, ./audio, ./features/whisper-embeddings).
- Overwrite behavior: by default, functions will not overwrite existing files; pass the relevant flag to force a rebuild.
- Device selection: where supported, device="auto" picks sensibly; set cuda or cpu explicitly when you need control.
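The "do not overwrite unless asked" rule can be pictured with a small guard like the one below. `resolve_output` is a hypothetical helper, and raising `FileExistsError` is an assumption. The source does not show which exception Taters raises.

```python
from pathlib import Path


def resolve_output(path, overwrite=False):
    """Return the output path, refusing to clobber an existing file by default."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists; pass overwrite=True to rebuild")
    path.parent.mkdir(parents=True, exist_ok=True)   # create the folder on demand
    return path
```

A caller would check the target with this guard before launching an expensive FFmpeg or embedding job, so a rerun of the same pipeline does not silently redo or replace earlier work.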