Analyzing Audio

Audio is where Taters earns its name. The goal is simple: make it easy to get from messy containers and long recordings to clean, analysis-ready artifacts you can iterate on. The core tools cover three phases:

  • extract and standardize audio (WAVs at predictable locations)
  • structure the speech (diarization and transcripts)
  • turn waveforms into features (embeddings, per-speaker splits)

Everything follows the same philosophy as the text stack: predictable outputs, friendly defaults, and a "do not overwrite unless asked" rule.


Extract audio from video

Many recordings arrive as multi-track containers (Zoom, OBS, ProRes). This utility lists every audio stream and writes one WAV per stream with sensible names that include stream index and tags like language/title. It is handy both for audits and for preparing inputs to downstream steps.

What it does

  • probes audio streams with ffprobe
  • writes one PCM WAV per stream at your chosen sample rate and bit depth
  • predictable filenames: <stem>_a<index>[_<lang>][_<title>].wav
  • default output directory if you do not pass one

When to use

  • you have a video/container and want clean WAVs for each embedded track
  • you intend to diarize and embed only one stream (e.g., the mixed program feed)
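
For orientation, here is a minimal sketch; the import path mirrors the source file shown below and may differ from the package's public re-exports.

from taters.audio.extract_wav_from_video import split_audio_streams_to_wav

# Write one 48 kHz / 16-bit WAV per audio stream into ./audio/
wav_paths = split_audio_streams_to_wav(
    "session.mp4",          # any FFmpeg-readable container
    output_dir=None,        # None -> ./audio in the current working directory
    sample_rate=48000,
    bit_depth=16,
    overwrite=True,
)
for p in wav_paths:
    print(p)                # e.g. .../audio/session_a0_eng.wav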

API: split audio streams to WAV

Extract each audio stream in a container to its own WAV file.

Parameters:

  • input_path (str | os.PathLike, required): Video or audio container readable by FFmpeg.
  • output_dir (str | os.PathLike | None, default None): Destination directory. If None, defaults to ./audio in the current working directory (predictable write location).
  • sample_rate (int, default 48000): Target sample rate for the output WAVs (Hz).
  • bit_depth ({16, 24, 32}, default 16): Output PCM bit depth (little-endian).
  • overwrite (bool, default True): If True, overwrite existing files. If False and a target exists, raises FileExistsError.

Returns:

  • list[str]: Absolute paths to the created WAVs.

Behavior
  • Output file names are constructed from the input base name and stream metadata: <stem>_a<index>[_<lang>][_<title>].wav with safe slugs.
  • Uses -map 0:a:<N> to select the N-th audio stream in the container.
  • Runs FFmpeg with -nostdin and quiet loglevel to avoid TTY lockups.

Examples:

>>> split_audio_streams_to_wav("session.mp4")
['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
Source code in src\taters\audio\extract_wav_from_video.py
def split_audio_streams_to_wav(
    input_path: str | os.PathLike,
    output_dir: str | os.PathLike | None = None,     # <-- now optional
    sample_rate: int = 48000,
    bit_depth: int = 16,
    overwrite: bool = True,
) -> List[str]:
    """
    Extract each audio stream in a container to its own WAV file.

    Parameters
    ----------
    input_path : str | os.PathLike
        Video or audio container readable by FFmpeg.
    output_dir : str | os.PathLike | None, optional
        Destination directory. If None, defaults to ``./audio`` in the current
        working directory (predictable write location).
    sample_rate : int, default 48000
        Target sample rate for the output WAVs (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth (little-endian).
    overwrite : bool, default True
        If True, overwrite existing files. If False and a target exists,
        raises :class:`FileExistsError`.

    Returns
    -------
    list[str]
        Absolute paths to the created WAVs.

    Behavior
    --------
    - Output file names are constructed from the input base name and stream
      metadata: ``<stem>_a<index>[_<lang>][_<title>].wav`` with safe slugs.
    - Uses ``-map 0:a:<N>`` to select the N-th audio stream in the container.
    - Runs FFmpeg with ``-nostdin`` and quiet loglevel to avoid TTY lockups.

    Examples
    --------
    >>> split_audio_streams_to_wav("session.mp4")
    ['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
    """

    _check_binaries()

    in_path = Path(input_path)
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    # Default predictable location when none is provided
    if output_dir is None:
        out_dir = Path.cwd() / "audio"
    else:
        out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Extracting audio streams from {in_path} to {out_dir} at {sample_rate} Hz, bit depth: {bit_depth}")

    streams = _probe_audio_streams(in_path)
    if not streams:
        raise ValueError("No audio streams found in input.")

    pcm_fmt_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_fmt_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    pcm_codec = pcm_fmt_map[bit_depth]

    created_files: List[str] = []
    base = in_path.stem

    for s in streams:
        idx = s.get("index")
        tags = s.get("tags", {}) or {}
        lang = tags.get("language")
        title = tags.get("title")

        print(f"Extracting audio stream:\n"
              f"index: {idx}\n"
              f"tags: {tags}\n"
              f"language: {lang}\n"
              f"title: {title}\n")

        out_name = _build_wav_name(base, idx, lang, title)
        out_path = out_dir / out_name

        ffmpeg_cmd = [
            "ffmpeg",
            "-nostdin",
            "-hide_banner",
            "-loglevel", "error",
            "-y" if overwrite else "-n",
            "-i", str(in_path),
            "-map", f"0:a:{streams.index(s)}",  # Nth audio stream
            "-acodec", pcm_codec,
            "-ar", str(sample_rate),
            str(out_path),
        ]

        result = subprocess.run(ffmpeg_cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if result.returncode != 0:
            if not overwrite and out_path.exists():
                raise FileExistsError(f"Target exists (use overwrite=True): {out_path}")
            raise RuntimeError(f"ffmpeg failed for stream {idx}: {result.stderr.strip()}")

        created_files.append(str(out_path))

    return created_files

Convert any audio to WAV

Standardize any FFmpeg-readable media into a linear PCM WAV at the sample rate, bit depth, and channel layout you specify. Defaults are sensible for ASR and most modeling pipelines (16 kHz, 16-bit, mono).

What it does

  • converts audio or extracts audio from video into a single WAV
  • preserves channel layout if you request it
  • uses ffmpeg with quiet, pipeline-safe flags and clear error reporting
  • predictable default output path if omitted

When to use

  • you need consistent, model-friendly WAVs from heterogeneous sources
  • you want a one-liner from notebooks or the CLI
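
A minimal sketch (again assuming the module-path import shown under the source listing):

from taters.audio.convert_to_wav import convert_audio_to_wav

# ASR-friendly defaults: 16 kHz, 16-bit, mono, written to ./audio/<stem>.wav
wav_path = convert_audio_to_wav("interviews/ep01.mkv")

# Keep stereo at 48 kHz and pick the destination explicitly
wav_hi = convert_audio_to_wav(
    "interviews/ep01.mkv",
    output_path="audio/ep01_48k.wav",
    sample_rate=48000,
    channels=2,
    overwrite_existing=True,
)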

API: convert audio to WAV

Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

Parameters:

  • input_path (str | Path, required): Source media file (audio or video container). FFmpeg must be able to read it.
  • output_path (str | Path | None, default None): Target WAV path. If None, defaults to <cwd>/audio/<input_stem>.wav.
  • output_dir (str | Path | None, default None): Destination directory used when output_path is not given; provide at most one of output_path or output_dir.
  • sample_rate (int, default 16000): Desired sample rate (Hz).
  • bit_depth ({16, 24, 32}, default 16): Output PCM bit depth; maps to the pcm_s{bit_depth}le codec.
  • channels (int, default 1): Number of output channels (1 = mono, 2 = stereo).
  • overwrite_existing (bool, default False): Overwrite the target WAV if it already exists.

Returns:

  • Path: Path to the written WAV file.

Raises:

  • FileNotFoundError: If input_path does not exist.
  • RuntimeError: If FFmpeg/FFprobe are missing or the conversion fails.

Notes
  • Video inputs are supported: the audio stream is extracted and converted.
  • Stereo output is available with channels=2; the default (channels=1) downmixes to mono.
  • We run FFmpeg with -nostdin to avoid TTY issues in pipelines.
Source code in src\taters\audio\convert_to_wav.py
def convert_audio_to_wav(
    input_path: Union[str, Path],
    *,
    output_path: Optional[Union[str, Path]] = None,
    output_dir: Optional[Union[str, Path]] = None,
    sample_rate: int = 16000,          # common for ASR
    bit_depth: int = 16,               # 16/24/32 signed PCM
    channels: int = 1,                 # 1=mono, 2=stereo
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default
) -> Path:
    """
    Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

    Parameters
    ----------
    input_path : str | Path
        Source media file (audio or video container). FFmpeg must be able to read it.
    output_path : str | Path | None, optional
        Target WAV path. If None, defaults to
        ``<cwd>/audio/<input_stem>.wav``.
    output_dir : str | Path | None, optional
        Destination directory used when `output_path` is not given; provide at
        most one of `output_path` or `output_dir`. Defaults to ``<cwd>/audio``.
    sample_rate : int, default 16000
        Desired sample rate (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth; maps to ``pcm_s{bit_depth}le`` codec.
    channels : int, default 1
        Number of output channels (1=mono, 2=stereo).
    overwrite_existing : bool, default False
        Overwrite `output_path` if it already exists.

    Returns
    -------
    Path
        Path to the written WAV file.

    Raises
    ------
    FileNotFoundError
        If `input_path` does not exist.
    RuntimeError
        If FFmpeg/FFprobe are missing or the conversion fails.

    Notes
    -----
    - Video inputs are supported: the audio stream is extracted and converted.
    - Stereo output is available with ``channels=2``; the default (``channels=1``) downmixes to mono.
    - We run FFmpeg with ``-nostdin`` to avoid TTY issues in pipelines.
    """

    _check_ffmpeg()

    in_path = Path(input_path).resolve()
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    if output_path and output_dir:
        raise ValueError("Provide at most one of output_path or output_dir.")

    if output_path:
        out_path = Path(output_path).resolve()
    else:
        base = in_path.stem + ".wav"
        out_dir = Path(output_dir).resolve() if output_dir else Path.cwd() / "audio"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / base

    if not overwrite_existing and Path(out_path).is_file():
        print("WAV file already exists; returning existing file.")
        return out_path

    pcm_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 (mono) or 2 (stereo).")

    cmd = [
        "ffmpeg",
        "-nostdin",
        "-hide_banner", "-loglevel", "error",
        "-y" if overwrite_existing else "-n",
        "-i", str(in_path),
        "-vn",                        # ignore video
        "-acodec", pcm_map[bit_depth],
        "-ar", str(sample_rate),
        "-ac", str(channels),
        str(out_path),
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
    if result.returncode != 0:
        if not overwrite_existing and out_path.exists():
            raise FileExistsError(f"Target exists (use overwrite=True): {out_path}")
        raise RuntimeError(f"ffmpeg failed: {result.stderr.strip()}")

    return out_path

Diarize and transcribe

This is a thin CLI shim that forwards to the vendored Whisper diarization wrapper. It handles device selection and writes transcripts alongside subtitles with helpful defaults. Use it to get a timestamped CSV (start_time,end_time,speaker,text) plus SRT/TXT that other Taters tools can consume immediately.

What it does

  • delegates to the underlying whisper diarization wrapper
  • writes transcripts to a predictable folder by input stem
  • accepts device hints (cuda/cpu/auto)

When to use

  • you are preparing per-segment text for embeddings or dictionary coding
  • you plan to split a long recording into per-speaker WAVs
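
A minimal sketch that calls the wrapper directly; the shorter taters.audio.diarize_with_thirdparty entry point mentioned below forwards to the same function, and the exact import path is an assumption based on the source location.

from taters.audio.diarizer.whisper_diar_wrapper import run_whisper_diarization_repo

result = run_whisper_diarization_repo(
    "audio/session.wav",
    out_dir="transcripts",      # None would also default to ./transcripts
    whisper_model="base.en",
    device="cuda",              # or "cpu"; GPUs are hidden when "cpu"
    num_speakers=None,          # let the diarizer estimate the speaker count
)
print(result.work_dir)          # per-file folder holding the .txt/.srt/.csv outputs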

API: diarize with third-party wrapper (CLI entry)

Important note: This function is also exposed more easily via taters.audio.diarize_with_thirdparty

Run the vendored Whisper diarization scripts and normalize their outputs.

Parameters:

  • audio_path (str | Path, required): Input audio (WAV recommended).
  • out_dir (str | Path | None, default None): Output directory for transcript artifacts. If None, defaults to ./transcripts; created if it does not exist.
  • overwrite_existing (bool, default False): If False and a diarization CSV already exists for this input, the existing outputs are returned instead of re-running.
  • repo_dir (str | Path | None, default None): Optional explicit location of the diarization repo. If None, the vendored copy is used.
  • whisper_model (str, default "base.en"): Whisper ASR model to use (e.g., "small", "base", "large-v3").
  • language (str | None, default None): Language hint for Whisper (e.g., "en"); if None, autodetection is used.
  • device ({"cpu", "cuda"} | None, default None): Runtime device. If "cpu", environment variables are set to hide GPUs.
  • batch_size (int, default 0): Whisper batch size; 0 disables batching.
  • no_stem (bool, default False): Pass-through to the demucs/whisper scripts to disable vocal/instrument stems.
  • suppress_numerals (bool, default False): Heuristic to reduce spurious numeral tokens.
  • parallel (bool, default False): Use the parallel diarization script if available.
  • timeout (int | None, default None): Subprocess timeout in seconds; None means no timeout.
  • use_custom (bool, default True): Prefer the customized script if present (adds CSV emission and minor cleanup).
  • keep_temp (bool, default False): If False (default), temporary folders created by demucs/whisper are removed.
  • num_speakers (int | None, default None): Force a fixed number of speakers, if the downstream diarizer supports it.

Returns:

  • DiarizationOutputFiles: Paths to .txt, .srt, and .csv (if produced) in a per-file working directory, plus an (empty) speaker_wavs mapping for API compatibility.

Notes
  • The function copies the input WAV to a per-file work directory before running, to ensure relative paths inside the third-party scripts resolve correctly.
  • If device="cpu", CUDA is disabled in the child environment.
  • On success, the local WAV copy is deleted and temporary folders are tidied up.
See Also

taters.audio.split_wav_by_speaker.make_speaker_wavs_from_csv : Build per-speaker WAVs from the diarization CSV.

Source code in src\taters\audio\diarizer\whisper_diar_wrapper.py
def run_whisper_diarization_repo(
    audio_path: str | Path,
    out_dir: Optional[str] | Optional[Path] | None = None,
    *,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default
    repo_dir: str | Path | None = None,      # ← now Optional
    whisper_model: str = "base.en",
    language: Optional[str] = None,
    device: Optional[str] = None,            # "cuda" / "cpu"
    batch_size: int = 0,
    no_stem: bool = False,
    suppress_numerals: bool = False,
    parallel: bool = False,
    timeout: Optional[int] = None,
    use_custom: bool = True,
    keep_temp: bool = False,
    num_speakers: Optional[int] = None,
) -> DiarizationOutputFiles:
    """
    Run the vendored Whisper diarization scripts and normalize their outputs.

    Parameters
    ----------
    audio_path : str | Path
        Input audio (WAV recommended).
    out_dir : str | Path | None, optional
        Output directory for transcript artifacts. If None, defaults to
        ``./transcripts`` in the current working directory; created if it does
        not exist.
    overwrite_existing : bool, default False
        If False and a diarization CSV already exists for this input, the
        existing outputs are returned instead of re-running.
    repo_dir : str | Path | None
        Optional explicit location of the diarization repo. If None, the
        vendored copy is used.
    whisper_model : str, default "base.en"
        Whisper ASR model to use (e.g., "small", "base", "large-v3").
    language : str | None
        Language hint for Whisper (e.g., "en"); if None, autodetection is used.
    device : {"cpu","cuda"} | None
        Runtime device. If "cpu", environment variables are set to hide GPUs.
    batch_size : int, default 0
        Whisper batch size; 0 disables batching.
    no_stem : bool, default False
        Pass through to demucs/whisper scripts to disable vocal/instrument stems.
    suppress_numerals : bool, default False
        Heuristic to reduce spurious numeral tokens.
    parallel : bool, default False
        Use parallel diarization script if available.
    timeout : int | None
        Subprocess timeout in seconds; None means no timeout.
    use_custom : bool, default True
        Prefer the customized script if present (adds CSV emission and minor cleanup).
    keep_temp : bool, default False
        If False (default), temporary folders created by demucs/whisper are removed.
    num_speakers : int | None
        Force a fixed number of speakers, if the downstream diarizer supports it.

    Returns
    -------
    DiarizationOutputFiles
        Paths to ``.txt``, ``.srt``, and ``.csv`` (if produced) in a per-file working
        directory, plus an (empty) ``speaker_wavs`` mapping for API compatibility.

    Notes
    -----
    - The function copies the input WAV to a per-file work directory before running,
      to ensure relative paths inside the third-party scripts resolve correctly.
    - If `device="cpu"`, CUDA is disabled in the child environment.
    - On success, the local WAV copy is deleted and temporary folders are tidied up.

    See Also
    --------
    taters.audio.split_wav_by_speaker.make_speaker_wavs_from_csv :
        Build per-speaker WAVs from the diarization CSV.
    """


    # Decide device if user passed "auto" (or None)
    resolved_device = _resolve_device(device)
    print(f"Resolved device for whisper extraction: {resolved_device}")

    audio_path = Path(audio_path).resolve()
    # default transcripts folder next to current working dir
    out_dir = Path(out_dir).resolve() if out_dir is not None else (Path.cwd() / "transcripts")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Isolated working folder
    work_dir = out_dir / f"{audio_path.stem}"
    work_dir.mkdir(parents=True, exist_ok=True)

    local_audio = work_dir / audio_path.name

    # CSV default: <work_dir>/<stem>.csv
    csv_path = work_dir / f"{local_audio.stem}.csv"
    if not overwrite_existing and Path(csv_path).is_file():
        print("Diarized transcript output file already exists; returning existing file.")
        raw = _guess_outputs_from_stem(work_dir, local_audio.stem)
        return DiarizationOutputFiles(work_dir=work_dir, raw_files=raw, speaker_wavs={})


    # Copy input audio next to outputs so the CLI can use simple relative paths
    if not local_audio.exists():
        shutil.copy2(audio_path, local_audio)

    # Resolve path to vendored repo (or use user-supplied path)
    with ExitStack() as stack:
        if repo_dir is None:
            repo_trav = _resolve_vendored_repo_dir()
            repo_dir_path = stack.enter_context(as_file(repo_trav))  # real FS path
        else:
            repo_dir_path = Path(repo_dir).resolve()

        # Validate the script exists inside the repo
        script_name = ("diarize_custom.py" if (use_custom and (repo_dir_path / "diarize_custom.py").exists())
                       else ("diarize_parallel.py" if parallel else "diarize.py"))
        script_path = (repo_dir_path / script_name)
        if not script_path.exists():
            raise FileNotFoundError(f"Expected script not found: {script_path}")

        # Run the repo script (cwd = work_dir so temp_outputs land there)
        _run_repo_script(
            repo_dir=repo_dir_path,
            audio_path=local_audio,
            work_dir=work_dir,
            whisper_model=whisper_model,
            language=language,
            device=resolved_device,
            batch_size=batch_size,
            no_stem=no_stem,
            suppress_numerals=suppress_numerals,
            parallel=parallel,
            timeout=timeout,
            use_custom=use_custom,
            csv_out=csv_path,
            num_speakers=num_speakers,
        )

    # Tidy temp dirs
    _cleanup_temps(work_dir, keep_temp)

    # Collect outputs (.txt/.srt/.csv)
    raw = _guess_outputs_from_stem(work_dir, local_audio.stem)

    # Remove the copied WAV now that we're done
    try:
        if local_audio.exists():
            local_audio.unlink()
    except Exception:
        pass

    return DiarizationOutputFiles(work_dir=work_dir, raw_files=raw, speaker_wavs={})

Whisper encoder embeddings

Export Whisper encoder embeddings as features. Two modes:

  • Transcript-driven: one vector per transcript row (e.g., per diarized segment)
  • General-audio: segment the WAV with fixed windows or non-silent spans; optionally mean-pool to a single vector for the whole file

Outputs land under ./features/whisper-embeddings/ by default with a stable <stem>_embeddings.csv name. The wrapper runs extraction in a subprocess by default to avoid CUDA/Torch collisions elsewhere in your pipeline.

What it does

  • computes D-dimensional vectors using Faster-Whisper/CTranslate2 backends
  • segmentation strategies for raw audio; segment-level vectors when given a transcript
  • optional single-row mean pooling for whole-file summaries
  • isolates heavy GPU state in a child process by default

When to use

  • you want robust, speech-centric features for clustering, retrieval, or as inputs to downstream models
  • you have transcripts already and want segment-level representations aligned to text
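
A minimal sketch of general-audio mode (fixed windows, one row per window); the import path follows the source file below and is an assumption.

from taters.audio.extract_whisper_embeddings import extract_whisper_embeddings

csv_path = extract_whisper_embeddings(
    source_wav="audio/session.wav",
    strategy="windows",
    window_s=30.0,
    hop_s=15.0,
    model_name="base",
    device="auto",              # extraction runs in a subprocess by default
)
print(csv_path)                 # ./features/whisper-embeddings/session_embeddings.csv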

API: extract Whisper embeddings

Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

Parameters:

  • source_wav (str | Path, required): Path to the input WAV. Must be readable by librosa.
  • transcript_csv (str | Path | None, default None): If provided, enables transcript-driven mode. The CSV is expected to contain timestamp columns and (optionally) a speaker column. A row is emitted per transcript segment.
  • time_unit ({"auto", "ms", "s", "samples"}, default "auto"): How to interpret timestamps in transcript_csv. In "auto", the worker heuristically infers the unit from the maximum end time versus the audio duration.
  • strategy ({"windows", "nonsilent"}, default "windows"): General-audio mode only. "windows" uses fixed-size windows with overlap; "nonsilent" uses an energy-based splitter (librosa.effects.split).
  • window_s (float, default 30.0): General-audio mode only. Window length (seconds).
  • hop_s (float, default 15.0): General-audio mode only. Hop between windows (seconds).
  • min_seg_s (float, default 1.0): General-audio mode only. Skip segments shorter than this many seconds.
  • top_db (float, default 30.0): General-audio mode only ("nonsilent"). Threshold (dB) below reference to consider as silence. Smaller → more segments; larger → fewer.
  • aggregate ({"none", "mean"}, default "none"): General-audio mode only. If "mean", a single pooled row is written covering the entire file; otherwise one row per segment.
  • output_dir (str | Path | None, default None): Directory for the output CSV. If None, defaults to ./features/whisper-embeddings.
  • overwrite_existing (bool, default False): If False and the output CSV already exists, the existing file is returned without re-running extraction.
  • model_name (str, default "base"): Model identifier passed through to the worker (e.g., "tiny", "base", "small", "large-v3", or a local CTranslate2 model directory).
  • device ({"auto", "cuda", "cpu"}, default "auto"): Runtime device. If "cpu", environment variables are set to disable CUDA in the child process.
  • compute_type (str, default "float16"): CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to the worker module.
  • run_in_subprocess (bool, default True): If True (recommended), runs extraction in a separate Python process to isolate Torch/CUDA state from the parent process.
  • extra_env (dict | None, default None): Additional environment variables to inject into the child process.
  • verbose (bool, default True): If True, print the launched command and the child's stdout.
  • extractor_module (str, default "taters.audio.extract_whisper_embeddings_subproc"): Dotted module path whose __main__ implements the extractor CLI.

Returns:

  • Path: Path to the written embeddings CSV. Pattern: <output_dir>/<source_stem>_embeddings.csv.

Notes
  • The subprocess writes and exits. The parent returns once the file exists.
  • If transcript_csv is supplied, the worker runs in transcript mode; otherwise general-audio mode is used with the given segmentation strategy.
  • Failures in the child process are re-raised with the captured stdout/stderr to ease debugging.

Examples:

Transcript per-segment embeddings:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     model_name="small",
...     device="cuda",
... )

Whole-file mean embedding:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     strategy="nonsilent",
...     aggregate="mean",
...     output_dir="features/whisper-embeddings",
... )
Source code in src\taters\audio\extract_whisper_embeddings.py
def extract_whisper_embeddings(
    *,
    # required
    source_wav: Union[str, Path],

    # optional transcript-driven mode
    transcript_csv: Optional[Union[str, Path]] = None,
    time_unit: Literal["auto", "ms", "s", "samples"] = "auto",

    # general-audio mode (used when transcript_csv is None)
    strategy: Literal["windows", "nonsilent"] = "windows",
    window_s: float = 30.0,
    hop_s: float = 15.0,
    min_seg_s: float = 1.0,
    top_db: float = 30.0,
    aggregate: Literal["none", "mean"] = "none",

    # outputs
    output_dir: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # model/runtime
    model_name: str = "base",
    device: Literal["auto", "cuda", "cpu"] = "auto",
    compute_type: str = "float16",

    # execution strategy
    run_in_subprocess: bool = True,
    extra_env: Optional[dict] = None,
    verbose: bool = True,

    # where the extractor lives (python -m <module>)
    extractor_module: str = "taters.audio.extract_whisper_embeddings_subproc",
) -> Path:
    """
    Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

    Parameters
    ----------
    source_wav : str | Path
        Path to the input WAV. Must be readable by `librosa`.
    transcript_csv : str | Path | None, optional
        If provided, enables transcript-driven mode. The CSV is expected to contain
        timestamp columns and (optionally) a speaker column. A row is emitted per
        transcript segment.
    time_unit : {"auto","ms","s","samples"}, default "auto"
        How to interpret timestamps in `transcript_csv`. In "auto", the worker
        heuristically infers the unit from max end time vs audio duration.
    strategy : {"windows","nonsilent"}, default "windows"
        General-audio mode only. "windows" uses fixed sized windows with overlap;
        "nonsilent" uses an energy-based splitter (librosa.effects.split).
    window_s, hop_s : float, default 30.0, 15.0
        General-audio mode only. Window length and hop (seconds).
    min_seg_s : float, default 1.0
        General-audio mode only. Skip segments shorter than this many seconds.
    top_db : float, default 30.0
        General-audio mode only ("nonsilent"). Threshold (dB) below reference to
        consider as silence. Smaller → more segments; larger → fewer.
    aggregate : {"none","mean"}, default "none"
        General-audio mode only. If "mean", a single pooled row is written covering
        the entire file; otherwise one row per segment.
    output_dir : str | Path | None, optional
        Directory for the output CSV. If None, defaults to
        ``./features/whisper-embeddings``.
    overwrite_existing : bool, default False
        If False and the output CSV already exists, the existing file is
        returned without re-running extraction.
    model_name : str, default "base"
        Model identifier passed through to the worker (e.g., "tiny", "base",
        "small", "large-v3" or a local CTranslate2 model directory).
    device : {"auto","cuda","cpu"}, default "auto"
        Runtime device. If "cpu", environment variables are set to disable CUDA
        in the child process.
    compute_type : str, default "float16"
        CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to
        the worker module.
    run_in_subprocess : bool, default True
        If True (recommended), runs extraction in a separate Python process to
        isolate Torch/CUDA state from the parent process.
    extra_env : dict | None, optional
        Additional environment variables to inject into the child process.
    verbose : bool, default True
        If True, print the launched command and the child's stdout.
    extractor_module : str, default "chopshop.audio.extract_whisper_embeddings_subproc"
        Dotted module path whose ``__main__`` implements the extractor CLI.

    Returns
    -------
    Path
        Path to the written embeddings CSV. Pattern:
        ``<output_dir>/<source_stem>_embeddings.csv``.

    Notes
    -----
    - The subprocess writes and exits. The parent returns once the file exists.
    - If `transcript_csv` is supplied, the worker runs in transcript mode; otherwise
      general-audio mode is used with the given segmentation strategy.
    - Failures in the child process are re-raised with the captured stdout/stderr
      to ease debugging.

    Examples
    --------
    Transcript per-segment embeddings:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     model_name="small",
    ...     device="cuda",
    ... )

    Whole-file mean embedding:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     strategy="nonsilent",
    ...     aggregate="mean",
    ...     output_dir="features/whisper-embeddings",
    ... )
    """

    source_wav = Path(source_wav).resolve()
    # default to ./features/whisper-embeddings when not provided
    out_dir_final = (
        Path(output_dir).resolve()
        if output_dir
        else (Path.cwd() / "features" / "whisper-embeddings")
    )

    out_dir_final.mkdir(parents=True, exist_ok=True)
    output_csv = out_dir_final / f"{source_wav.stem}_embeddings.csv"

    if not overwrite_existing and Path(output_csv).is_file():
        print("Whisper embedding feature output file already exists; returning existing file.")
        return output_csv

    if not run_in_subprocess:
        # ---- In-process path (only when you’re sure no Torch/CUDA conflicts) ----
        from ..audio.extract_whisper_embeddings import (  # type: ignore
            export_segment_embeddings_csv,
            export_audio_embeddings_csv,
            EmbedConfig,
        )
        cfg = EmbedConfig(model_name=model_name, device=device, compute_type=compute_type, time_unit=time_unit)
        if transcript_csv is not None:
            transcript_csv = Path(transcript_csv).resolve()
            return Path(
                export_segment_embeddings_csv(
                    transcript_csv=transcript_csv,
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                )
            )
        else:
            return Path(
                export_audio_embeddings_csv(
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                    strategy=strategy,
                    window_s=window_s,
                    hop_s=hop_s,
                    min_seg_s=min_seg_s,
                    top_db=top_db,
                    aggregate=aggregate,
                )
            )

    # ---- Subprocess path (recommended) ----
    env = os.environ.copy()
    # Keep Transformers from importing heavy backends in the child
    env.setdefault("TRANSFORMERS_NO_TORCH", "1")
    env.setdefault("TRANSFORMERS_NO_TF", "1")
    env.setdefault("TRANSFORMERS_NO_FLAX", "1")

    if extra_env:
        env.update({k: str(v) for k, v in extra_env.items()})

    if device == "cpu":
        # Make sure the child won’t try CUDA
        env.update({"CUDA_VISIBLE_DEVICES": "", "USE_CUDA": "0", "FORCE_CPU": "1"})
    else:
        # Best-effort: prepend cuDNN wheel's lib dir if available
        try:
            import nvidia.cudnn, pathlib  # type: ignore
            cudnn_lib = str(pathlib.Path(nvidia.cudnn.__file__).with_name("lib"))
            env["LD_LIBRARY_PATH"] = cudnn_lib + ":" + env.get("LD_LIBRARY_PATH", "")
        except Exception:
            pass

    cmd = [
        sys.executable, "-m", extractor_module,
        "--source_wav", str(source_wav),
        "--output_dir", str(out_dir_final),
        "--model_name", model_name,
        "--device", device,
        "--compute_type", compute_type,
    ]

    if transcript_csv is not None:
        transcript_csv = Path(transcript_csv).resolve()
        cmd += ["--transcript_csv", str(transcript_csv), "--time_unit", time_unit]
    else:
        cmd += [
            "--strategy", strategy,
            "--window_s", str(window_s),
            "--hop_s", str(hop_s),
            "--min_seg_s", str(min_seg_s),
            "--top_db", str(top_db),
            "--aggregate", aggregate,
        ]

    if verbose:
        print("Launching embedding subprocess:")
        print(" ", shlex.join(cmd))

    try:
        res = subprocess.run(cmd, check=True, env=env, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if verbose and res.stdout:
            print(res.stdout.strip())
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"Embedding subprocess failed with code {e.returncode}\n"
            f"CMD: {shlex.join(cmd)}\n"
            f"STDOUT:\n{(e.stdout or '').strip()}\n\n"
            f"STDERR:\n{(e.stderr or '').strip()}"
        ) from e

    if not output_csv.exists():
        raise FileNotFoundError(f"Expected embeddings CSV not found: {output_csv}")

    if verbose:
        print(f"Embeddings CSV written to: {output_csv}")

    return output_csv


Split WAV by speaker

Given a diarization transcript, create one WAV per speaker by concatenating that speaker's segments. You can insert small silences to avoid clicks at joins and resample/downmix on the fly. Filenames are readable and stable.

What it does

  • reads a timestamped CSV with start_time,end_time,speaker
  • builds one output WAV per unique speaker
  • optional silence padding, resampling, mono mixdown
  • skips ultra-short segments; clamps times to audio bounds

When to use

  • you want per-speaker audio for targeted feature extraction or human coding
  • you plan to model speakers separately or compute speaker-level aggregates
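
A minimal sketch; the transcript path assumes the per-file folder layout produced by the diarizer, and the import path follows the source file below.

from taters.audio.split_wav_by_speaker import make_speaker_wavs_from_csv

speaker_wavs = make_speaker_wavs_from_csv(
    source_wav="audio/session.wav",
    transcript_csv_path="transcripts/session/session.csv",  # diarization CSV
    time_unit="ms",
    silence_ms=250,     # small pad at joins to avoid clicks
    sr=16000,
    mono=True,
)
for speaker, path in speaker_wavs.items():
    print(speaker, "->", path)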

API: make per-speaker WAVs from a transcript

Concatenate speaker-specific segments into per-speaker WAV files.

If merge_consecutive=True (default), adjacent transcript rows with the same speaker are merged into a single, longer segment spanning from the first start to the last end — including any silence between those turns. If you need the strict per-row behavior, set merge_consecutive=False.

Parameters:

  • source_wav (str | Path, required): Path to the source WAV.
  • transcript_csv_path (str | Path, required): CSV with timing and speaker columns (e.g., diarization output).
  • output_dir (str | Path | None, default None): Where to write the per-speaker files. If None, defaults to ./audio_split/<source_stem>/.
  • overwrite_existing (bool, default False): If False and a per-speaker WAV already exists, keep it and record its path.
  • start_col (str, default "start_time"): Name of the start-time column in the transcript CSV.
  • end_col (str, default "end_time"): Name of the end-time column in the transcript CSV.
  • speaker_col (str, default "speaker"): Name of the speaker column in the transcript CSV.
  • time_unit ({"ms", "s"}, default "ms"): Units for the start/end columns.
  • silence_ms (int, default 1000): If pre_silence_ms/post_silence_ms are None, use this value for both sides.
  • pre_silence_ms (int | None, default None): Explicit padding (ms) before each segment; overrides silence_ms.
  • post_silence_ms (int | None, default None): Explicit padding (ms) after each segment; overrides silence_ms.
  • sr (int | None, default 16000): Resample output to this rate. If None, keep the original rate.
  • mono (bool, default True): Downmix to mono if True.
  • min_dur_ms (int, default 50): Skip segments shorter than this duration (ms).
  • merge_consecutive (bool, default True): Merge back-to-back turns for the same speaker into one segment span (including any inter-turn silence). If False, emit one clip per row.

Returns:

  • dict[str, Path]: Mapping from friendly speaker label → output WAV path.

Behavior
  • Input speaker labels are sanitized for filenames but a more readable label (without path-hostile characters) is preserved for naming.
  • Segments are sorted by start time per speaker before concatenation.
  • If a speaker ends up with zero valid segments, no file is written.

Examples:

>>> make_speaker_wavs_from_csv(
...     source_wav="audio/session.wav",
...     transcript_csv_path="transcripts/session.csv",
...     time_unit="ms",
...     silence_ms=0,  # no padding
...     sr=16000,
...     mono=True,
... )
Source code in src\taters\audio\split_wav_by_speaker.py
def make_speaker_wavs_from_csv(
    source_wav: Union[str, Path],
    transcript_csv_path: Union[str, Path],
    output_dir: Union[str, Path, None] = None,
    *,
    overwrite_existing: bool = False,
    start_col: str = "start_time",
    end_col: str = "end_time",
    speaker_col: str = "speaker",
    time_unit: str = "ms",             # "ms" or "s"
    silence_ms: int = 1000,
    pre_silence_ms: Optional[int] = None,
    post_silence_ms: Optional[int] = None,
    sr: Optional[int] = 16000,
    mono: bool = True,
    min_dur_ms: int = 50,
    merge_consecutive: bool = True,    # NEW: merge back-to-back turns by same speaker
) -> Dict[str, Path]:
    """
    Concatenate speaker-specific segments into per-speaker WAV files.

    If `merge_consecutive=True` (default), adjacent transcript rows with the same
    speaker are merged into a single, longer segment spanning from the first
    start to the last end — including any silence between those turns. If you
    need the strict per-row behavior, set `merge_consecutive=False`.

    Parameters
    ----------
    source_wav : str | Path
        Path to the source WAV.
    transcript_csv_path : str | Path
        CSV with timing and speaker columns (e.g., diarization output).
    output_dir : str | Path | None, optional
        Where to write the per-speaker files. If None, defaults to
        ``./audio_split/<source_stem>/``.
    overwrite_existing : bool, default False
        If False and a per-speaker WAV already exists, it is kept and its path
        is returned.
    start_col, end_col, speaker_col : str
        Column names in the transcript CSV.
    time_unit : {"ms","s"}, default "ms"
        Units for start/end columns.
    silence_ms : int, default 1000
        If `pre_silence_ms`/`post_silence_ms` are None, use this for both sides.
    pre_silence_ms, post_silence_ms : int | None
        Explicit padding (ms) before/after each segment; overrides `silence_ms`.
    sr : int | None, default 16000
        Resample output to this rate. If None, keep original rate.
    mono : bool, default True
        Downmix to mono if True.
    min_dur_ms : int, default 50
        Skip segments shorter than this duration (ms).
    merge_consecutive : bool, default True
        Merge back-to-back turns for the same speaker into one segment span
        (including any inter-turn silence). If False, emit one clip per row.

    Returns
    -------
    dict[str, Path]
        Mapping from friendly speaker label → output WAV path.

    Behavior
    --------
    - Input speaker labels are sanitized for filenames but a more readable label
      (without path-hostile characters) is preserved for naming.
    - Segments are sorted by start time per speaker before concatenation.
    - If a speaker ends up with zero valid segments, no file is written.

    Examples
    --------
    >>> make_speaker_wavs_from_csv(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv_path="transcripts/session.csv",
    ...     time_unit="ms",
    ...     silence_ms=0,  # no padding
    ...     sr=16000,
    ...     mono=True,
    ... )
    """
    if time_unit not in ("ms", "s"):
        raise ValueError("time_unit must be 'ms' or 's'")

    def _friendly_filename_label(name: str) -> str:
        s = (name or "").strip()
        s = s.replace("/", "_").replace("\\", "_")
        s = re.sub(r'[<>:"|?*]', "", s)
        s = re.sub(r"\s+", " ", s)
        return s or "SPEAKER_0"

    source_wav = Path(source_wav)
    transcript_csv_path = Path(transcript_csv_path)
    out_dir = Path(output_dir) if output_dir is not None else (Path.cwd() / "audio_split" / source_wav.stem)
    out_dir.mkdir(parents=True, exist_ok=True)
    base_stem = source_wav.stem

    audio = AudioSegment.from_file(source_wav)
    if sr:
        audio = audio.set_frame_rate(sr)
    if mono:
        audio = audio.set_channels(1)

    factor = 1000.0 if time_unit == "s" else 1.0
    audio_len_ms = len(audio)

    with transcript_csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    segs_by_spk: Dict[str, List[tuple[int, int]]] = {}
    label_for_key: Dict[str, str] = {}

    # Build segments with awareness of original row order so that we can merge
    # adjacent turns for the same speaker when requested.
    prev_spk_key: Optional[str] = None
    for row in rows:
        try:
            start_raw = float(row[start_col])
            end_raw   = float(row[end_col])
            raw_spk   = str(row.get(speaker_col, "SPEAKER_0"))
        except Exception:
            continue

        start_ms = int(round(start_raw * factor))
        end_ms   = int(round(end_raw   * factor))
        if end_ms <= start_ms:
            continue

        start_ms = _clamp(start_ms, 0, audio_len_ms)
        end_ms   = _clamp(end_ms,   0, audio_len_ms)
        if end_ms <= start_ms:
            continue

        spk_key = _sanitize_speaker(raw_spk)
        label_for_key.setdefault(spk_key, _friendly_filename_label(raw_spk))

        if merge_consecutive and prev_spk_key == spk_key and segs_by_spk.get(spk_key):
            # Extend the last segment for this speaker to cover the new end
            s0, e0 = segs_by_spk[spk_key][-1]
            # Keep the earliest start, extend to the latest end
            s_new = min(s0, start_ms)
            e_new = max(e0, end_ms)
            segs_by_spk[spk_key][-1] = (s_new, e_new)
        else:
            # Strictly append a new segment
            segs_by_spk.setdefault(spk_key, []).append((start_ms, end_ms))

        prev_spk_key = spk_key

    # Optional: drop very short segments after merging
    for spk_key, segs in list(segs_by_spk.items()):
        segs_by_spk[spk_key] = [(s, e) for (s, e) in segs if (e - s) >= min_dur_ms]

    pre_ms  = silence_ms if pre_silence_ms  is None else pre_silence_ms
    post_ms = silence_ms if post_silence_ms is None else post_silence_ms
    pre_sil  = AudioSegment.silent(duration=max(0, pre_ms),  frame_rate=audio.frame_rate)
    post_sil = AudioSegment.silent(duration=max(0, post_ms), frame_rate=audio.frame_rate)
    if mono:
        pre_sil  = pre_sil.set_channels(1)
        post_sil = post_sil.set_channels(1)

    results: Dict[str, Path] = {}
    for spk_key, segs in segs_by_spk.items():
        if not segs:
            continue

        friendly = label_for_key.get(spk_key, spk_key)
        out_path = out_dir / f"{base_stem}_{friendly}.wav"

        if (not overwrite_existing) and out_path.is_file():
            results[friendly] = out_path
            continue

        out = AudioSegment.silent(duration=0, frame_rate=audio.frame_rate)
        if mono:
            out = out.set_channels(1)

        for (s, e) in segs:
            clip = audio[s:e]
            if len(clip) < min_dur_ms:
                continue
            out += pre_sil + clip + post_sil

        if len(out) == 0:
            continue

        out.export(out_path, format="wav", codec="pcm_s16le")
        results[friendly] = out_path

    return results


Practical notes

  • Paths: if you do not pass explicit outputs, tools write to predictable folders under the current working directory (for example, ./audio, ./features/whisper-embeddings).
  • Overwrite behavior: most functions will not overwrite existing files by default (split_audio_streams_to_wav is the exception, with overwrite=True); pass the relevant flag to force a rebuild.
  • Device selection: where supported, device="auto" picks sensibly; set cuda or cpu explicitly when you need control.