
Analyzing Audio

Audio is where Taters earns its name. The goal is simple: make it easy to get from messy containers and long recordings to clean, analysis-ready artifacts you can iterate on. The core tools cover three phases:

  • extract and standardize audio (WAVs at predictable locations)
  • structure the speech (diarization and transcripts)
  • turn waveforms into features (embeddings, per-speaker splits)

Everything follows the same philosophy as the text stack: predictable outputs, friendly defaults, and a "do not overwrite unless asked" rule.
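
If you just want the shape of a typical run, here is a minimal sketch that chains the three phases with the functions documented below. The import paths follow the source locations shown on this page (src/taters/audio/...); treat them as an assumption and adjust if your install exposes these tools through the Taters() facade instead.

from pathlib import Path

from taters.audio.convert_to_wav import convert_audio_to_wav
from taters.audio.diarizer.whisper_diar_wrapper import run_whisper_diarization_repo
from taters.audio.extract_whisper_embeddings import extract_whisper_embeddings

# 1) Standardize: any FFmpeg-readable container -> 16 kHz mono WAV under ./audio/
wav_path = convert_audio_to_wav("recordings/session.mkv")

# 2) Structure: diarize + transcribe; artifacts land in ./transcripts/<stem>/
diar = run_whisper_diarization_repo(wav_path, device="cpu")
transcript_csv = diar.work_dir / f"{Path(wav_path).stem}.csv"  # CSV name per the wrapper

# 3) Featurize: one embedding row per diarized segment, written to
#    ./features/whisper-embeddings/<stem>_embeddings.csv
emb_csv = extract_whisper_embeddings(
    source_wav=wav_path,
    transcript_csv=transcript_csv,
    time_unit="ms",
)
print(emb_csv)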


Extract audio from video

Many recordings arrive as multi-track containers (Zoom, OBS, ProRes). This utility lists every audio stream and writes one WAV per stream with sensible names that include stream index and tags like language/title. It is handy both for audits and for preparing inputs to downstream steps.

What it does

  • probes audio streams with ffprobe
  • writes one PCM WAV per stream at your chosen sample rate and bit depth
  • predictable filenames: <stem>_a<index>[_<lang>][_<title>].wav
  • default output directory if you do not pass one

When to use

  • you have a video/container and want clean WAVs for each embedded track
  • you intend to diarize and embed only one stream (e.g., the mixed program feed)
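
A minimal Python sketch; the import path mirrors the source location shown below and is an assumption on my part:

from taters.audio.extract_wav_from_video import split_audio_streams_to_wav

# One PCM WAV per embedded audio stream, written to ./audio/ by default
wavs = split_audio_streams_to_wav("zoom_recording.mkv", sample_rate=48000, bit_depth=16)
for w in wavs:
    print(w)  # e.g., .../audio/zoom_recording_a0_eng.wav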

API: split audio streams to WAV

Extract each audio stream in a container to its own WAV file.

Parameters:

  • input_path (str | PathLike, required): Video or audio container readable by FFmpeg.
  • output_dir (str | PathLike | None, default None): Destination directory. If None, defaults to ./audio in the current working directory (predictable write location).
  • sample_rate (int, default 48000): Target sample rate for the output WAVs (Hz).
  • bit_depth ({16, 24, 32}, default 16): Output PCM bit depth (little-endian).
  • overwrite (bool, default True): If True, overwrite existing files. If False and a target exists, raises FileExistsError.

Returns:

  • list[str]: Absolute paths to the created WAVs.

Behavior
  • Output file names are constructed from the input base name and stream metadata: <stem>_a<index>[_<lang>][_<title>].wav with safe slugs.
  • Uses -map 0:a:<N> to select the N-th audio stream in the container.
  • Runs FFmpeg with -nostdin and quiet loglevel to avoid TTY lockups.

Examples:

>>> split_audio_streams_to_wav("session.mp4")
['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
Source code in src\taters\audio\extract_wav_from_video.py
def split_audio_streams_to_wav(
    input_path: str | os.PathLike,
    output_dir: str | os.PathLike | None = None,     # <-- now optional
    sample_rate: int = 48000,
    bit_depth: int = 16,
    overwrite: bool = True,
) -> List[str]:
    """
    Extract each audio stream in a container to its own WAV file.

    Parameters
    ----------
    input_path : str | os.PathLike
        Video or audio container readable by FFmpeg.
    output_dir : str | os.PathLike | None, optional
        Destination directory. If None, defaults to ``./audio`` in the current
        working directory (predictable write location).
    sample_rate : int, default 48000
        Target sample rate for the output WAVs (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth (little-endian).
    overwrite : bool, default True
        If True, overwrite existing files. If False and a target exists,
        raises :class:`FileExistsError`.

    Returns
    -------
    list[str]
        Absolute paths to the created WAVs.

    Behavior
    --------
    - Output file names are constructed from the input base name and stream
      metadata: ``<stem>_a<index>[_<lang>][_<title>].wav`` with safe slugs.
    - Uses ``-map 0:a:<N>`` to select the N-th audio stream in the container.
    - Runs FFmpeg with ``-nostdin`` and quiet loglevel to avoid TTY lockups.

    Examples
    --------
    >>> split_audio_streams_to_wav("session.mp4")
    ['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
    """

    _check_binaries()

    in_path = Path(input_path)
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    # Default predictable location when none is provided
    if output_dir is None:
        out_dir = Path.cwd() / "audio"
    else:
        out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Extracting audio streams from {in_path} to {out_dir} at {sample_rate} Hz, bit depth: {bit_depth}")

    streams = _probe_audio_streams(in_path)
    if not streams:
        raise ValueError("No audio streams found in input.")

    pcm_fmt_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_fmt_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    pcm_codec = pcm_fmt_map[bit_depth]

    created_files: List[str] = []
    base = in_path.stem

    for s in streams:
        idx = s.get("index")
        tags = s.get("tags", {}) or {}
        lang = tags.get("language")
        title = tags.get("title")

        print(f"Extracting audio stream:\n"
              f"index: {idx}\n"
              f"tags: {tags}\n"
              f"language: {lang}\n"
              f"title: {title}\n")

        out_name = _build_wav_name(base, idx, lang, title)
        out_path = out_dir / out_name

        ffmpeg_cmd = [
            "ffmpeg",
            "-nostdin",
            "-hide_banner",
            "-loglevel", "error",
            "-y" if overwrite else "-n",
            "-i", str(in_path),
            "-map", f"0:a:{streams.index(s)}",  # Nth audio stream
            "-acodec", pcm_codec,
            "-ar", str(sample_rate),
            str(out_path),
        ]

        result = subprocess.run(ffmpeg_cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if result.returncode != 0:
            if not overwrite and out_path.exists():
                raise FileExistsError(f"Target exists (use overwrite=True): {out_path}")
            raise RuntimeError(f"ffmpeg failed for stream {idx}: {result.stderr.strip()}")

        created_files.append(str(out_path))

    return created_files

Convert any audio to WAV

Standardize any FFmpeg-readable media into a linear PCM WAV at the sample rate, bit depth, and channel layout you specify. Defaults are sensible for ASR and most modeling pipelines (16 kHz, 16-bit, mono).

What it does

  • converts audio or extracts audio from video into a single WAV
  • preserves channel layout if you request it
  • uses ffmpeg with quiet, pipeline-safe flags and clear error reporting
  • predictable default output path if omitted

When to use

  • you need consistent, model-friendly WAVs from heterogeneous sources
  • you want a one-liner from notebooks or the CLI
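
A minimal sketch of the call (import path assumed from the source location below):

from taters.audio.convert_to_wav import convert_audio_to_wav

# ASR-friendly defaults: 16 kHz, 16-bit, mono; lands at ./audio/<stem>.wav unless told otherwise
wav = convert_audio_to_wav(
    "interviews/p01_session.mov",
    sample_rate=16000,
    bit_depth=16,
    channels=1,
    overwrite_existing=False,  # an existing target is returned as-is instead of being rewritten
)
print(wav)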

API: convert audio to WAV

Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

Parameters:

  • input_path (str | Path, required): Source media file (audio or video container). FFmpeg must be able to read it.
  • output_path (str | Path | None, default None): Target WAV path. If None, defaults to <cwd>/audio/<input_stem>.wav.
  • sample_rate (int, default 16000): Desired sample rate (Hz).
  • bit_depth ({16, 24, 32}, default 16): Output PCM bit depth; maps to the pcm_s{bit_depth}le codec.
  • channels (int, default 1): Number of output channels: 1 (mono) or 2 (stereo).
  • overwrite_existing (bool, default False): Overwrite output_path if it already exists.

Returns:

  • Path: Path to the written WAV file.

Raises:

  • FileNotFoundError: If input_path does not exist.
  • RuntimeError: If FFmpeg/FFprobe are missing or the conversion fails.

Notes
  • Video inputs are supported: the audio stream is extracted and converted.
  • The output channel count is set explicitly (1 or 2); FFmpeg downmixes multi-channel sources as needed.
  • We run FFmpeg with -nostdin to avoid TTY issues in pipelines.
Source code in src\taters\audio\convert_to_wav.py
def convert_audio_to_wav(
    input_path: Union[str, Path],
    *,
    output_path: Optional[Union[str, Path]] = None,
    output_dir: Optional[Union[str, Path]] = None,
    sample_rate: int = 16000,          # common for ASR
    bit_depth: int = 16,               # 16/24/32 signed PCM
    channels: int = 1,                 # 1=mono, 2=stereo
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default
) -> Path:
    """
    Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

    Parameters
    ----------
    input_path : str | Path
        Source media file (audio or video container). FFmpeg must be able to read it.
    output_path : str | Path | None, optional
        Target WAV path. If None, defaults to
        ``<cwd>/audio/<input_stem>.wav``.
    sample_rate : int, default 16000
        Desired sample rate (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth; maps to ``pcm_s{bit_depth}le`` codec.
    channels : int | None, default 1
        If provided, set number of output channels (e.g., 1=mono, 2=stereo).
        If None, keep original channel count.
    overwrite_existing : bool, default False
        Overwrite `output_path` if it already exists.

    Returns
    -------
    Path
        Path to the written WAV file.

    Raises
    ------
    FileNotFoundError
        If `input_path` does not exist.
    RuntimeError
        If FFmpeg/FFprobe are missing or the conversion fails.

    Notes
    -----
    - Video inputs are supported: the audio stream is extracted and converted.
    - For multi-channel sources and `channels is None`, channel layout is preserved.
    - We run FFmpeg with ``-nostdin`` to avoid TTY issues in pipelines.
    """

    _check_ffmpeg()

    in_path = Path(input_path).resolve()
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    if output_path and output_dir:
        raise ValueError("Provide at most one of output_path or output_dir.")

    if output_path:
        out_path = Path(output_path).resolve()
    else:
        base = in_path.stem + ".wav"
        out_dir = Path(output_dir).resolve() if output_dir else Path.cwd() / "audio"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / base

    if not overwrite_existing and Path(out_path).is_file():
        print("WAV file already exists; returning existing file.")
        return out_path

    pcm_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 (mono) or 2 (stereo).")

    cmd = [
        "ffmpeg",
        "-nostdin",
        "-hide_banner", "-loglevel", "error",
        "-y" if overwrite_existing else "-n",
        "-i", str(in_path),
        "-vn",                        # ignore video
        "-acodec", pcm_map[bit_depth],
        "-ar", str(sample_rate),
        "-ac", str(channels),
        str(out_path),
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
    if result.returncode != 0:
        if not overwrite_existing and out_path.exists():
            raise FileExistsError(f"Target exists (use overwrite_existing=True): {out_path}")
        raise RuntimeError(f"ffmpeg failed: {result.stderr.strip()}")

    return out_path

Diarize and transcribe

This is a thin CLI shim that forwards to the vendored Whisper diarization wrapper. It handles device selection and writes transcripts alongside subtitles with helpful defaults. Use it to get a timestamped CSV (start_time,end_time,speaker,text) plus SRT/TXT that other Taters tools can consume immediately.

What it does

  • delegates to the underlying whisper diarization wrapper
  • writes transcripts to a predictable folder by input stem
  • accepts device hints (cuda/cpu/auto)

When to use

  • you are preparing per-segment text for embeddings or dictionary coding
  • you plan to split a long recording into per-speaker WAVs
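
A minimal sketch using the wrapper documented below (the thinner taters.audio.diarize_with_thirdparty entry point noted just below forwards to the same machinery):

from taters.audio.diarizer.whisper_diar_wrapper import run_whisper_diarization_repo

out = run_whisper_diarization_repo(
    "audio/session.wav",
    whisper_model="base.en",
    device="cuda",      # or "cpu"; None lets the wrapper resolve the device
    num_speakers=2,     # optional: force a fixed speaker count if the diarizer supports it
)
print(out.work_dir)     # per-file folder holding the .txt / .srt / .csv artifacts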

API: diarize with third-party wrapper (CLI entry)

Important note: This function is also exposed more easily via taters.audio.diarize_with_thirdparty

Run the vendored Whisper diarization scripts and normalize their outputs.

Parameters:

  • audio_path (str | Path, required): Input audio (WAV recommended).
  • out_dir (str | Path | None, default None): Output directory for transcript artifacts; created if missing. When omitted, a ./transcripts folder in the current working directory is used.
  • repo_dir (str | Path | None, default None): Optional explicit location of the diarization repo. If None, the vendored copy is used.
  • whisper_model (str, default "base.en"): Whisper ASR model to use (e.g., "small", "base", "large-v3").
  • language (str | None, default None): Language hint for Whisper (e.g., "en"); if None, autodetection is used.
  • device ({"cpu", "cuda"} | None, default None): Runtime device. If "cpu", environment variables are set to hide GPUs.
  • batch_size (int, default 0): Whisper batch size; 0 disables batching.
  • no_stem (bool, default False): Pass through to the demucs/whisper scripts to disable vocal/instrument stems.
  • suppress_numerals (bool, default False): Heuristic to reduce spurious numeral tokens.
  • parallel (bool, default False): Use the parallel diarization script if available.
  • timeout (int | None, default None): Subprocess timeout in seconds; None means no timeout.
  • use_custom (bool, default True): Prefer the customized script if present (adds CSV emission and minor cleanup).
  • keep_temp (bool, default False): If False (default), temporary folders created by demucs/whisper are removed.
  • num_speakers (int | None, default None): Force a fixed number of speakers, if the downstream diarizer supports it.

Returns:

  • DiarizationOutputFiles: Paths to the .txt, .srt, and .csv (if produced) in a per-file working directory, plus an (empty) speaker_wavs mapping for API compatibility.

Notes
  • The function copies the input WAV to a per-file work directory before running, to ensure relative paths inside the third-party scripts resolve correctly.
  • If device="cpu", CUDA is disabled in the child environment.
  • On success, the local WAV copy is deleted and temporary folders are tidied up.
See Also

taters.audio.split_wav_by_speaker.make_speaker_wavs_from_csv : Build per-speaker WAVs from the diarization CSV.
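
The diarization CSV uses the start_time,end_time,speaker,text layout described above, which makes quick sanity checks easy. A hedged pandas sketch (the millisecond unit is an assumption; confirm against your transcript before trusting the numbers):

import pandas as pd

df = pd.read_csv("transcripts/session/session.csv")  # start_time, end_time, speaker, text

# Speaking time per speaker, assuming timestamps are in milliseconds
df["duration_s"] = (df["end_time"] - df["start_time"]) / 1000.0
print(df.groupby("speaker")["duration_s"].sum().sort_values(ascending=False))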

Source code in src\taters\audio\diarizer\whisper_diar_wrapper.py
def run_whisper_diarization_repo(
    audio_path: str | Path,
    out_dir: Optional[str] | Optional[Path] | None = None,
    *,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default
    repo_dir: str | Path | None = None,      # ← now Optional
    whisper_model: str = "base.en",
    language: Optional[str] = None,
    device: Optional[str] = None,            # "cuda" / "cpu"
    batch_size: int = 0,
    no_stem: bool = False,
    suppress_numerals: bool = False,
    parallel: bool = False,
    timeout: Optional[int] = None,
    use_custom: bool = True,
    keep_temp: bool = False,
    num_speakers: Optional[int] = None,
) -> DiarizationOutputFiles:
    """
    Run the vendored Whisper diarization scripts and normalize their outputs.

    Parameters
    ----------
    audio_path : str | Path
        Input audio (WAV recommended).
    out_dir : str | Path
        Output directory for transcript artifacts. If it does not exist, it will
        be created.
    repo_dir : str | Path | None
        Optional explicit location of the diarization repo. If None, the
        vendored copy is used.
    whisper_model : str, default "medium.en"
        Whisper ASR model to use (e.g., "small", "base", "large-v3").
    language : str | None
        Language hint for Whisper (e.g., "en"); if None, autodetection is used.
    device : {"cpu","cuda"} | None
        Runtime device. If "cpu", environment variables are set to hide GPUs.
    batch_size : int, default 0
        Whisper batch size; 0 disables batching.
    no_stem : bool, default False
        Pass through to demucs/whisper scripts to disable vocal/instrument stems.
    suppress_numerals : bool, default False
        Heuristic to reduce spurious numeral tokens.
    parallel : bool, default False
        Use parallel diarization script if available.
    timeout : int | None
        Subprocess timeout in seconds; None means no timeout.
    use_custom : bool, default True
        Prefer the customized script if present (adds CSV emission and minor cleanup).
    keep_temp : bool, default False
        If False (default), temporary folders created by demucs/whisper are removed.
    num_speakers : int | None
        Force a fixed number of speakers, if the downstream diarizer supports it.

    Returns
    -------
    DiarizationOutputFiles
        Paths to ``.txt``, ``.srt``, and ``.csv`` (if produced) in a per-file working
        directory, plus an (empty) ``speaker_wavs`` mapping for API compatibility.

    Notes
    -----
    - The function copies the input WAV to a per-file work directory before running,
      to ensure relative paths inside the third-party scripts resolve correctly.
    - If `device="cpu"`, CUDA is disabled in the child environment.
    - On success, the local WAV copy is deleted and temporary folders are tidied up.

    See Also
    --------
    taters.audio.split_wav_by_speaker.make_speaker_wavs_from_csv :
        Build per-speaker WAVs from the diarization CSV.
    """


    # Decide device if user passed "auto" (or None)
    resolved_device = _resolve_device(device)
    print(f"Resolved device for whisper extraction: {resolved_device}")

    audio_path = Path(audio_path).resolve()
    # default transcripts folder next to current working dir
    out_dir = Path(out_dir).resolve() if out_dir is not None else (Path.cwd() / "transcripts")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Isolated working folder
    work_dir = out_dir / f"{audio_path.stem}"
    work_dir.mkdir(parents=True, exist_ok=True)

    local_audio = work_dir / audio_path.name

    # CSV default: <work_dir>/<stem>.csv
    csv_path = work_dir / f"{local_audio.stem}.csv"
    if not overwrite_existing and Path(csv_path).is_file():
        print("Diarized transcript output file already exists; returning existing file.")
        raw = _guess_outputs_from_stem(work_dir, local_audio.stem)
        return DiarizationOutputFiles(work_dir=work_dir, raw_files=raw, speaker_wavs={})


    # Copy input audio next to outputs so the CLI can use simple relative paths
    if not local_audio.exists():
        shutil.copy2(audio_path, local_audio)

    # Resolve path to vendored repo (or use user-supplied path)
    with ExitStack() as stack:
        if repo_dir is None:
            repo_trav = _resolve_vendored_repo_dir()
            repo_dir_path = stack.enter_context(as_file(repo_trav))  # real FS path
        else:
            repo_dir_path = Path(repo_dir).resolve()

        # Validate the script exists inside the repo
        script_name = ("diarize_custom.py" if (use_custom and (repo_dir_path / "diarize_custom.py").exists())
                       else ("diarize_parallel.py" if parallel else "diarize.py"))
        script_path = (repo_dir_path / script_name)
        if not script_path.exists():
            raise FileNotFoundError(f"Expected script not found: {script_path}")

        # Run the repo script (cwd = work_dir so temp_outputs land there)
        _run_repo_script(
            repo_dir=repo_dir_path,
            audio_path=local_audio,
            work_dir=work_dir,
            whisper_model=whisper_model,
            language=language,
            device=resolved_device,
            batch_size=batch_size,
            no_stem=no_stem,
            suppress_numerals=suppress_numerals,
            parallel=parallel,
            timeout=timeout,
            use_custom=use_custom,
            csv_out=csv_path,
            num_speakers=num_speakers,
        )

    # Tidy temp dirs
    _cleanup_temps(work_dir, keep_temp)

    # Collect outputs (.txt/.srt/.csv)
    raw = _guess_outputs_from_stem(work_dir, local_audio.stem)

    # Remove the copied WAV now that we're done
    try:
        if local_audio.exists():
            local_audio.unlink()
    except Exception:
        pass

    return DiarizationOutputFiles(work_dir=work_dir, raw_files=raw, speaker_wavs={})

Whisper encoder embeddings

Export Whisper encoder embeddings as features. Two modes:

  • Transcript-driven: one vector per transcript row (e.g., per diarized segment)
  • General-audio: segment the WAV with fixed windows or non-silent spans; optionally mean-pool to a single vector for the whole file

Outputs land under ./features/whisper-embeddings/ by default with a stable <stem>_embeddings.csv name. The wrapper runs extraction in a subprocess by default to avoid CUDA/Torch collisions elsewhere in your pipeline.

What it does

  • computes D-dimensional vectors using Faster-Whisper/CTranslate2 backends
  • applies windowed or non-silent segmentation to raw audio, or segment-level extraction when given a transcript
  • optional single-row mean pooling for whole-file summaries
  • isolates heavy GPU state in a child process by default

When to use

  • you want robust, speech-centric features for clustering, retrieval, or as inputs to downstream models
  • you have transcripts already and want segment-level representations aligned to text
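
Once the CSV exists, downstream pooling is ordinary pandas work. A sketch, with the caveat that the exact column layout depends on the mode; in particular, the presence of a speaker column in transcript mode is an assumption you should verify against the header:

import pandas as pd

emb = pd.read_csv("features/whisper-embeddings/session_embeddings.csv")
print(emb.shape)

# If a speaker column is present (check first), average each speaker's segments into
# one vector. This pools every numeric column, so drop timestamp/index columns
# beforehand if they appear in your output.
if "speaker" in emb.columns:
    per_speaker = emb.select_dtypes("number").groupby(emb["speaker"]).mean()
    print(per_speaker.index.tolist())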

API: extract Whisper embeddings

Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

Parameters:

  • source_wav (str | Path, required): Path to the input WAV. Must be readable by librosa.
  • transcript_csv (str | Path | None, default None): If provided, enables transcript-driven mode. The CSV is expected to contain timestamp columns and (optionally) a speaker column. A row is emitted per transcript segment.
  • time_unit ({"auto", "ms", "s", "samples"}, default "auto"): How to interpret timestamps in transcript_csv. In "auto", the worker heuristically infers the unit from the maximum end time versus the audio duration.
  • strategy ({"windows", "nonsilent"}, default "windows"): General-audio mode only. "windows" uses fixed-size windows with overlap; "nonsilent" uses an energy-based splitter (librosa.effects.split).
  • window_s (float, default 30.0): General-audio mode only. Window length (seconds).
  • hop_s (float, default 15.0): General-audio mode only. Hop between windows (seconds).
  • min_seg_s (float, default 1.0): General-audio mode only. Skip segments shorter than this many seconds.
  • top_db (float, default 30.0): General-audio mode only ("nonsilent"). Threshold (dB) below reference to consider as silence. Smaller → more segments; larger → fewer.
  • aggregate ({"none", "mean"}, default "none"): General-audio mode only. If "mean", a single pooled row is written covering the entire file; otherwise one row per segment.
  • output_dir (str | Path | None, default None): Directory for the output CSV. If None, defaults to ./features/whisper-embeddings.
  • model_name (str, default "base"): Model identifier passed through to the worker (e.g., "tiny", "base", "small", "large-v3", or a local CTranslate2 model directory).
  • device ({"auto", "cuda", "cpu"}, default "auto"): Runtime device. If "cpu", environment variables are set to disable CUDA in the child process.
  • compute_type (str, default "float16"): CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to the worker module.
  • run_in_subprocess (bool, default True): If True (recommended), runs extraction in a separate Python process to isolate Torch/CUDA state from the parent process.
  • extra_env (dict | None, default None): Additional environment variables to inject into the child process.
  • verbose (bool, default True): If True, print the launched command and the child's stdout.
  • extractor_module (str, default "taters.audio.extract_whisper_embeddings_subproc"): Dotted module path whose __main__ implements the extractor CLI.

Returns:

  • Path: Path to the written embeddings CSV. Pattern: <output_dir>/<source_stem>_embeddings.csv.

Notes
  • The subprocess writes and exits. The parent returns once the file exists.
  • If transcript_csv is supplied, the worker runs in transcript mode; otherwise general-audio mode is used with the given segmentation strategy.
  • Failures in the child process are re-raised with the captured stdout/stderr to ease debugging.

Examples:

Transcript per-segment embeddings:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     model_name="small",
...     device="cuda",
... )

Whole-file mean embedding:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     strategy="nonsilent",
...     aggregate="mean",
...     output_dir="features/whisper-embeddings",
... )
Source code in src\taters\audio\extract_whisper_embeddings.py
def extract_whisper_embeddings(
    *,
    # required
    source_wav: Union[str, Path],

    # optional transcript-driven mode
    transcript_csv: Optional[Union[str, Path]] = None,
    time_unit: Literal["auto", "ms", "s", "samples"] = "auto",

    # general-audio mode (used when transcript_csv is None)
    strategy: Literal["windows", "nonsilent"] = "windows",
    window_s: float = 30.0,
    hop_s: float = 15.0,
    min_seg_s: float = 1.0,
    top_db: float = 30.0,
    aggregate: Literal["none", "mean"] = "none",

    # outputs
    output_dir: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # model/runtime
    model_name: str = "base",
    device: Literal["auto", "cuda", "cpu"] = "auto",
    compute_type: str = "float16",

    # execution strategy
    run_in_subprocess: bool = True,
    extra_env: Optional[dict] = None,
    verbose: bool = True,

    # where the extractor lives (python -m <module>)
    extractor_module: str = "taters.audio.extract_whisper_embeddings_subproc",
) -> Path:
    """
    Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

    Parameters
    ----------
    source_wav : str | Path
        Path to the input WAV. Must be readable by `librosa`.
    transcript_csv : str | Path | None, optional
        If provided, enables transcript-driven mode. The CSV is expected to contain
        timestamp columns and (optionally) a speaker column. A row is emitted per
        transcript segment.
    time_unit : {"auto","ms","s","samples"}, default "auto"
        How to interpret timestamps in `transcript_csv`. In "auto", the worker
        heuristically infers the unit from max end time vs audio duration.
    strategy : {"windows","nonsilent"}, default "windows"
        General-audio mode only. "windows" uses fixed sized windows with overlap;
        "nonsilent" uses an energy-based splitter (librosa.effects.split).
    window_s, hop_s : float, default 30.0, 15.0
        General-audio mode only. Window length and hop (seconds).
    min_seg_s : float, default 1.0
        General-audio mode only. Skip segments shorter than this many seconds.
    top_db : float, default 30.0
        General-audio mode only ("nonsilent"). Threshold (dB) below reference to
        consider as silence. Smaller → more segments; larger → fewer.
    aggregate : {"none","mean"}, default "none"
        General-audio mode only. If "mean", a single pooled row is written covering
        the entire file; otherwise one row per segment.
    output_dir : str | Path | None, optional
        Directory for the output CSV. If None, defaults to
        ``./features/whisper-embeddings``.
    model_name : str, default "base"
        Model identifier passed through to the worker (e.g., "tiny", "base",
        "small", "large-v3" or a local CTranslate2 model directory).
    device : {"auto","cuda","cpu"}, default "auto"
        Runtime device. If "cpu", environment variables are set to disable CUDA
        in the child process.
    compute_type : str, default "float16"
        CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to
        the worker module.
    run_in_subprocess : bool, default True
        If True (recommended), runs extraction in a separate Python process to
        isolate Torch/CUDA state from the parent process.
    extra_env : dict | None, optional
        Additional environment variables to inject into the child process.
    verbose : bool, default True
        If True, print the launched command and the child's stdout.
    extractor_module : str, default "taters.audio.extract_whisper_embeddings_subproc"
        Dotted module path whose ``__main__`` implements the extractor CLI.

    Returns
    -------
    Path
        Path to the written embeddings CSV. Pattern:
        ``<output_dir>/<source_stem>_embeddings.csv``.

    Notes
    -----
    - The subprocess writes and exits. The parent returns once the file exists.
    - If `transcript_csv` is supplied, the worker runs in transcript mode; otherwise
      general-audio mode is used with the given segmentation strategy.
    - Failures in the child process are re-raised with the captured stdout/stderr
      to ease debugging.

    Examples
    --------
    Transcript per-segment embeddings:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     model_name="small",
    ...     device="cuda",
    ... )

    Whole-file mean embedding:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     strategy="nonsilent",
    ...     aggregate="mean",
    ...     output_dir="features/whisper-embeddings",
    ... )
    """

    source_wav = Path(source_wav).resolve()
    # default to ./features/whisper-embeddings when not provided
    out_dir_final = (
        Path(output_dir).resolve()
        if output_dir
        else (Path.cwd() / "features" / "whisper-embeddings")
    )

    out_dir_final.mkdir(parents=True, exist_ok=True)
    output_csv = out_dir_final / f"{source_wav.stem}_embeddings.csv"

    if not overwrite_existing and Path(output_csv).is_file():
        print("Whisper embedding feature output file already exists; returning existing file.")
        return output_csv

    if not run_in_subprocess:
        # ---- In-process path (only when you’re sure no Torch/CUDA conflicts) ----
        from ..audio.extract_whisper_embeddings import (  # type: ignore
            export_segment_embeddings_csv,
            export_audio_embeddings_csv,
            EmbedConfig,
        )
        cfg = EmbedConfig(model_name=model_name, device=device, compute_type=compute_type, time_unit=time_unit)
        if transcript_csv is not None:
            transcript_csv = Path(transcript_csv).resolve()
            return Path(
                export_segment_embeddings_csv(
                    transcript_csv=transcript_csv,
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                )
            )
        else:
            return Path(
                export_audio_embeddings_csv(
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                    strategy=strategy,
                    window_s=window_s,
                    hop_s=hop_s,
                    min_seg_s=min_seg_s,
                    top_db=top_db,
                    aggregate=aggregate,
                )
            )

    # ---- Subprocess path (recommended) ----
    env = os.environ.copy()
    # Keep Transformers from importing heavy backends in the child
    env.setdefault("TRANSFORMERS_NO_TORCH", "1")
    env.setdefault("TRANSFORMERS_NO_TF", "1")
    env.setdefault("TRANSFORMERS_NO_FLAX", "1")

    if extra_env:
        env.update({k: str(v) for k, v in extra_env.items()})

    if device == "cpu":
        # Make sure the child won’t try CUDA
        env.update({"CUDA_VISIBLE_DEVICES": "", "USE_CUDA": "0", "FORCE_CPU": "1"})
    else:
        # Best-effort: prepend cuDNN wheel's lib dir if available
        try:
            import nvidia.cudnn, pathlib  # type: ignore
            cudnn_lib = str(pathlib.Path(nvidia.cudnn.__file__).with_name("lib"))
            env["LD_LIBRARY_PATH"] = cudnn_lib + ":" + env.get("LD_LIBRARY_PATH", "")
        except Exception:
            pass

    cmd = [
        sys.executable, "-m", extractor_module,
        "--source_wav", str(source_wav),
        "--output_dir", str(out_dir_final),
        "--model_name", model_name,
        "--device", device,
        "--compute_type", compute_type,
    ]

    if transcript_csv is not None:
        transcript_csv = Path(transcript_csv).resolve()
        cmd += ["--transcript_csv", str(transcript_csv), "--time_unit", time_unit]
    else:
        cmd += [
            "--strategy", strategy,
            "--window_s", str(window_s),
            "--hop_s", str(hop_s),
            "--min_seg_s", str(min_seg_s),
            "--top_db", str(top_db),
            "--aggregate", aggregate,
        ]

    if verbose:
        print("Launching embedding subprocess:")
        print(" ", shlex.join(cmd))

    try:
        res = subprocess.run(cmd, check=True, env=env, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if verbose and res.stdout:
            print(res.stdout.strip())
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"Embedding subprocess failed with code {e.returncode}\n"
            f"CMD: {shlex.join(cmd)}\n"
            f"STDOUT:\n{(e.stdout or '').strip()}\n\n"
            f"STDERR:\n{(e.stderr or '').strip()}"
        ) from e

    if not output_csv.exists():
        raise FileNotFoundError(f"Expected embeddings CSV not found: {output_csv}")

    if verbose:
        print(f"Embeddings CSV written to: {output_csv}")

    return output_csv



Vocal acoustics (Parselmouth/Praat)

This extractor turns a WAV (or a transcript-guided set of turns) into framewise tracks and clean summaries of voice acoustics. Under the hood it uses Praat/Parselmouth for f0, formants (F1–F4), loudness, HNR, jitter/shimmer, and optional tremor/glottal metrics, plus MFCC stats and simple pause/silence measures. Use it when you want interpretable, physiology-adjacent voice features that play nicely with clinical and social-science workflows.

Warning: I am not an acoustics person. Treat the output from this module with skepticism for the time being.

Preprocessing

By default, the audio is standardized to reduce "garbage in → garbage out" risk:

  • resample to target_sr (default 44.1 kHz)
  • remove DC offset (channel-wise when stereo)
  • loudness-normalize to target_dbfs (default −20 dBFS)

You can turn this off (preprocess=False) or tweak the targets as needed.

Operating modes

  • Whole file: analyze a single WAV and write summary (+ framewise, by default).
  • Per turn (with transcript): segment the WAV using start_time/end_time from a CSV (optionally group/aggregate by columns like speaker). Framewise output includes segment_index and any IDs you pass through.

Outputs

Two CSVs, with predictable default names under features/acoustics/:

  • Framewise (on by default): one row per short-time frame with time_s, f0_hz, f1_hz–f4_hz, loudness_db, hnr_db (+ IDs in per-turn mode).
  • Summary (always): mean/std/range of the framewise series (optionally restricted to voiced segments via summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer variants, MFCC means/variances, CPP (if available), and optional tremor/glottal metrics.

Practical parameters (hey, that's alliteration!)

  • Voiced-only summaries: summarize_on_voiced_segments_ms=100 limits stats to voiced stretches ≥100 ms (helps avoid tiny blips).
  • Pause/VAD tuning: pause_top_db, pause_frame_length, pause_hop_length let you adjust librosa’s non-silence detection (useful for noisy rooms).
  • Pitch range: f0_min, f0_max bound the search for f0 (frame values outside the range are treated as unvoiced).
  • Cepstra: n_mfcc controls how many MFCCs are summarized.
  • Tremor/glottal: set mode="tremor" (requires a Praat script) or mode="advanced" (adds DisVoice-based glottal features).

CLI examples

Whole-file, default preprocessing and framewise+summary:

python -m taters.audio.analyze_vocal_acoustics \
  --wav "audio/session.wav" \
  --mode simple --overwrite_existing

Per-turn, aggregate by speaker:

python -m taters.audio.analyze_vocal_acoustics \
  --wav "audio/session.wav" \
  --transcript "transcripts/session.csv" \
  --time-unit ms \
  --group-by speaker \
  --mode simple \
  --overwrite_existing

Tuning pause detection (looser threshold, longer window):

python -m taters.audio.analyze_vocal_acoustics \
  --wav "audio/session.wav" \
  --pause-top-db 40 --pause-frame-length 4096 --pause-hop-length 1024

Python example

from taters import Taters
t = Taters()

res = t.audio.analyze_vocal_acoustics(
    wav_path="audio/session.wav",
    # or: transcript_csv="transcripts/session.csv", group_by=["speaker"],
    mode="simple",
    summarize_on_voiced_segments_ms=100,
    preprocess=True, target_sr=44100, target_dbfs=-20.0, remove_dc=True,
    pause_top_db=30, pause_frame_length=2048, pause_hop_length=512,
    overwrite_existing=True
)

print("Framewise CSV:", res["framewise_csv"])
print("Summary CSV:", res["summary_csv"])

Troubleshooting tips

  • Many Nones for formants/HNR: often means unvoiced frames or out-of-range pitch. Try widening f0_min/f0_max, or ensure preprocessing is on. Very narrow-band or music-heavy audio can also throw formant estimation off.
  • CPP/GNE missing: those depend on Praat commands that are not available in every environment; the extractor will warn and continue.
  • Weird pause counts: adjust pause_top_db upwards (e.g., 40) or use larger pause_frame_length to be more conservative in noisy recordings.

API: acoustic analysis

Extract acoustic features and write a summary CSV and (by default) a framewise CSV.

This function computes a battery of speech/voice features using Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal measures. It supports two operating modes:

1) Whole-file analysis: features are derived across the entire WAV. Summary statistics can be restricted to voiced segments longer than a threshold.

2) Per-turn analysis (transcript-guided): the WAV is segmented using start_time/end_time from a transcript CSV, features are computed per segment, and (optionally) per-segment rows are aggregated via group_by (e.g., one row per speaker).

Two artifacts can be written:

  • Summary CSV (always): means/SDs/ranges of framewise series; silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal. With a transcript and group_by, the summary is aggregated per group.
  • Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR). Disable with include_framewise=False or set a custom path.

Parameters:

  • wav_path (str or Path, default None): Path to a WAV file (mono or stereo, PCM). Required for both whole-file and per-turn modes. If both wav_path and transcript_csv are None, a ValueError is raised.
  • transcript_csv (str or Path, default None): Path to a transcript CSV with at least start_time, end_time (and typically speaker). Intervals with non-positive duration are skipped. When provided, per-turn analysis is performed.
  • time_unit ({"ms", "s"}, default "ms"): Units for start_time and end_time in transcript_csv.
  • group_by (sequence of str, default None): Column names from transcript_csv used to aggregate per-turn summaries into higher-level rows (e.g., ["speaker"]). If omitted, per-turn rows are written without aggregation.
  • extra_id_cols (sequence of str, default ("source", "speaker")): Identifier/metadata columns to pass through when present (and to use as grouping keys where applicable). These are not numerically aggregated.
  • out_dir (str or Path, default None): Base directory for outputs if file paths are not given. Defaults to ./features/acoustics (created if missing).
  • out_framewise_csv (str or Path, default None): Path for the framewise CSV. If omitted and include_framewise=True, defaults to <out_dir>/<stem>_framewise.csv.
  • out_summary_csv (str or Path, default None): Path for the summary CSV. If omitted, defaults to <out_dir>/<stem>_summary.csv (or an equivalent name in per-turn mode).
  • overwrite_existing (bool, default False): If False and an output already exists, returns existing paths without recomputation. If True, outputs are recomputed and overwritten.
  • include_framewise (bool, default True): If True, also write the framewise table. Set to False to write only the summary.
  • mode ({"simple", "tremor", "advanced"}, default "simple"): Feature families to compute. "simple": framewise f0, formants (F1–F4), loudness, HNR; summary stats; silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available). "tremor": everything in simple plus tremor metrics via a Praat script (requires tremor_script). "advanced": everything in tremor plus glottal features (requires DisVoice and dependencies).
  • summarize_on_voiced_segments_ms (int or None, default 100): If an integer, summary statistics for framewise series are computed only on voiced segments whose duration is at least this many milliseconds. If None, all frames are used.
  • f0_min (float, default 75.0): Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0 values are treated as unvoiced (0 in framewise; excluded from voiced summaries).
  • f0_max (float, default 500.0): Maximum fundamental frequency (Hz) for pitch tracking.
  • n_mfcc (int, default 14): Number of MFCC coefficients to summarize (means and variances).
  • tremor_script (str or Path, default None): Path to a Praat tremor script. Required when mode is "tremor" or "advanced".
  • preprocess (bool, default True): If True (whole-file mode), resample to target_sr, optionally remove DC offset (remove_dc), and normalize level toward target_dbfs with headroom protection. In per-turn mode, slices are analyzed with consistent parameters and are not re-normalized per slice.
  • target_sr (int, default 44100): Target sample rate for preprocessing (whole-file mode).
  • target_dbfs (float, default -20.0): Target loudness (dBFS) for level normalization (whole-file mode).
  • remove_dc (bool, default True): If True, attempt to remove DC offset during preprocessing (whole-file mode).
  • pause_top_db (int, default 30): Non-silence threshold for pause detection (higher → fewer speech segments). Passed to librosa.effects.split.
  • pause_frame_length (int, default 2048): Frame length (samples) for pause detection.
  • pause_hop_length (int, default 512): Hop length (samples) for pause detection.

Returns:

  • dict: Mapping with {"framewise_csv": pathlib.Path or None, "summary_csv": pathlib.Path}.

Raises:

  • ValueError: If neither wav_path nor transcript_csv is provided; if transcript_csv is provided without wav_path; if required transcript columns are missing; or if mode requires unavailable dependencies (e.g., tremor_script for "tremor", DisVoice for "advanced").
  • FileNotFoundError: If provided paths do not exist.
  • RuntimeError: If feature extraction fails due to decoding errors, invalid audio, or downstream library issues.

Notes

Framewise CSV (written when include_framewise=True): one row per short-time frame with frame_index, time_s, f0_hz, f1_hz–f4_hz, loudness_db, hnr_db. In per-turn mode, also includes segment_index, start_s, end_s, and any extra_id_cols present.

Summary CSV (always written):

  • Whole-file: one row.
  • Per-turn (no group_by): one row per interval.
  • Per-turn with group_by: one row per group (e.g., per speaker).

Columns include summary stats of framewise series (on all frames or voiced segments ≥ summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer variants, MFCC means/variances, CPP (if available), optional tremor/glottal metrics, and any extra_id_cols/group_by columns.

Performance

Per-turn analysis can be I/O intensive for long files with dense transcripts. Tremor/glottal metrics are substantially more expensive than simple mode.

Examples:

Whole-file analysis with framewise output:

>>> analyze_acoustics(
...     wav_path="session.wav",
...     out_dir="features/acoustics",
... )

Per-turn analysis aggregated by speaker:

>>> analyze_acoustics(
...     wav_path="session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     group_by=["speaker"],
...     extra_id_cols=["source", "speaker"],
...     out_summary_csv="features/acoustics/session_by_speaker.csv",
...     summarize_on_voiced_segments_ms=100,
...     mode="simple",
... )
Source code in src\taters\audio\analyze_vocal_acoustics.py
def analyze_acoustics(
    *,
    # Inputs (choose ONE of these two paths)
    wav_path: Optional[Union[str, Path]] = None,
    transcript_csv: Optional[Union[str, Path]] = None,  # if provided, we do per-turn analysis
    # Transcript options
    time_unit: Literal["ms","s"] = "ms",
    group_by: Optional[Sequence[str]] = None,       # e.g., ["speaker"]
    extra_id_cols: Sequence[str] = ("source","speaker"),
    # Output
    out_dir: Optional[Union[str, Path]] = None,
    out_framewise_csv: Optional[Union[str, Path]] = None,
    out_summary_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,
    include_framewise: bool = True,              # ← default ON now
    # Analysis options
    mode: Mode = "simple",
    summarize_on_voiced_segments_ms: Optional[int] = 100,
    f0_min: float = 75.0,
    f0_max: float = 500.0,
    n_mfcc: int = 14,
    tremor_script: Optional[Union[str, Path]] = None,
    # preprocessing controls
    preprocess: bool = True,
    target_sr: int = 44100,
    target_dbfs: float = -20.0,
    remove_dc: bool = True,
    # VAD/"pause" tuning
    pause_top_db: int = 30,
    pause_frame_length: int = 2048,
    pause_hop_length: int = 512,
) -> Dict[str, Optional[Path]]:
    """
    Extract acoustic features and write a summary CSV and (by default) a framewise CSV.

    This function computes a battery of speech/voice features using
    Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal
    measures. It supports two operating modes:

    1) Whole-file analysis
       Features are derived across the entire WAV. Summary statistics can be
       restricted to voiced segments longer than a threshold.

    2) Per-turn analysis (transcript-guided)
       The WAV is segmented using `start_time`/`end_time` from a transcript CSV,
       features are computed per segment, and (optionally) per-segment rows are
       aggregated via `group_by` (e.g., one row per speaker).

    Two artifacts can be written:
      • Summary CSV (always): means/SDs/ranges of framewise series; silence ratio;
        jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal.
        With a transcript and `group_by`, the summary is aggregated per group.
      • Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR).
        Disable with `include_framewise=False` or set a custom path.

    Parameters
    ----------
    wav_path : str or pathlib.Path, optional
        Path to a WAV file (mono or stereo, PCM). Required for both whole-file
        and per-turn modes. If both `wav_path` and `transcript_csv` are ``None``,
        a ``ValueError`` is raised.
    transcript_csv : str or pathlib.Path, optional
        Path to a transcript CSV with at least `start_time`, `end_time` (and
        typically `speaker`). Intervals with non-positive duration are skipped.
        When provided, per-turn analysis is performed.
    time_unit : {"ms", "s"}, default "ms"
        Units for `start_time` and `end_time` in `transcript_csv`.
    group_by : sequence of str, optional
        Column names from `transcript_csv` used to aggregate per-turn summaries
        into higher-level rows (e.g., `["speaker"]`). If omitted, per-turn rows
        are written without aggregation.
    extra_id_cols : sequence of str, default ("source", "speaker")
        Identifier/metadata columns to pass through when present (and to use as
        grouping keys where applicable). These are not numerically aggregated.
    out_dir : str or pathlib.Path, optional
        Base directory for outputs if file paths are not given. Defaults to
        ``./features/acoustics`` (created if missing).
    out_framewise_csv : str or pathlib.Path, optional
        Path for the framewise CSV. If omitted and `include_framewise=True`,
        defaults to ``<out_dir>/<stem>_framewise.csv``.
    out_summary_csv : str or pathlib.Path, optional
        Path for the summary CSV. If omitted, defaults to
        ``<out_dir>/<stem>_summary.csv`` (or an equivalent name in per-turn mode).
    overwrite_existing : bool, default False
        If ``False`` and an output already exists, returns existing paths without
        recomputation. If ``True``, outputs are recomputed and overwritten.
    include_framewise : bool, default True
        If ``True``, also write the framewise table. Set to ``False`` to write
        only the summary.
    mode : {"simple", "tremor", "advanced"}, default "simple"
        Feature families to compute:
          - ``"simple"``: framewise f0, formants (F1–F4), loudness, HNR; summary stats;
            silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available).
          - ``"tremor"``: everything in *simple* plus tremor metrics via a Praat script
            (requires `tremor_script`).
          - ``"advanced"``: everything in *tremor* plus glottal features (requires
            DisVoice and dependencies).
    summarize_on_voiced_segments_ms : int or None, default 100
        If an integer, summary statistics for framewise series are computed only
        on voiced segments whose duration is at least this many milliseconds.
        If ``None``, all frames are used.
    f0_min : float, default 75.0
        Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0
        values are treated as unvoiced (0 in framewise; excluded from voiced summaries).
    f0_max : float, default 500.0
        Maximum fundamental frequency (Hz) for pitch tracking.
    n_mfcc : int, default 14
        Number of MFCC coefficients to summarize (means and variances).
    tremor_script : str or pathlib.Path, optional
        Path to a Praat tremor script. Required when ``mode in {"tremor","advanced"}``.
    preprocess : bool, default True
        If ``True`` (whole-file mode), resample to `target_sr`, optionally remove
        DC offset (`remove_dc`), and normalize level toward `target_dbfs` with
        headroom protection. In per-turn mode, slices are analyzed with consistent
        parameters and are not re-normalized per slice.
    target_sr : int, default 44100
        Target sample rate for preprocessing (whole-file mode).
    target_dbfs : float, default -20.0
        Target loudness (dBFS) for level normalization (whole-file mode).
    remove_dc : bool, default True
        If ``True``, attempt to remove DC offset during preprocessing (whole-file mode).
    pause_top_db : int, default 30
        Non-silence threshold for pause detection (higher → fewer speech segments).
        Passed to ``librosa.effects.split``.
    pause_frame_length : int, default 2048
        Frame length (samples) for pause detection.
    pause_hop_length : int, default 512
        Hop length (samples) for pause detection.

    Returns
    -------
    dict
        Mapping with:
        ``{"framewise_csv": pathlib.Path or None, "summary_csv": pathlib.Path}``.

    Raises
    ------
    ValueError
        If neither `wav_path` nor `transcript_csv` is provided; if `transcript_csv`
        is provided without `wav_path`; if required transcript columns are missing; or
        if `mode` requires unavailable dependencies (e.g., `tremor_script` for
        ``"tremor"``, DisVoice for ``"advanced"``).
    FileNotFoundError
        If provided paths do not exist.
    RuntimeError
        If feature extraction fails due to decoding errors, invalid audio, or
        downstream library issues.

    Notes
    -----
    **Framewise CSV** (written when `include_framewise=True`)  
    One row per short-time frame with: `frame_index`, `time_s`, `f0_hz`,
    `f1_hz`–`f4_hz`, `loudness_db`, `hnr_db`. In per-turn mode, also includes
    `segment_index`, `start_s`, `end_s`, and any `extra_id_cols` present.

    **Summary CSV** (always written)  
    Whole-file: one row.  
    Per-turn (no `group_by`): one row per interval.  
    Per-turn with `group_by`: one row per group (e.g., per speaker).  
    Columns include summary stats of framewise series (on all frames or voiced
    segments ≥ `summarize_on_voiced_segments_ms`), silence ratio, jitter/shimmer
    variants, MFCC means/variances, CPP (if available), optional tremor/glottal
    metrics, and any `extra_id_cols`/`group_by` columns.

    Performance
    -----------
    Per-turn analysis can be I/O intensive for long files with dense transcripts.
    Tremor/glottal metrics are substantially more expensive than *simple* mode.

    Examples
    --------
    Whole-file analysis with framewise output:

    >>> analyze_acoustics(
    ...     wav_path="session.wav",
    ...     out_dir="features/acoustics",
    ... )

    Per-turn analysis aggregated by speaker:

    >>> analyze_acoustics(
    ...     wav_path="session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     group_by=["speaker"],
    ...     extra_id_cols=["source", "speaker"],
    ...     out_summary_csv="features/acoustics/session_by_speaker.csv",
    ...     summarize_on_voiced_segments_ms=100,
    ...     mode="simple",
    ... )
    """

    if wav_path is None:
        raise ValueError("wav_path is required")

    wav_path = Path(wav_path)
    if out_dir is None:
        out_dir = Path("features") / "acoustics"
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Default output paths
    stem = wav_path.stem
    if out_framewise_csv is None:
        out_framewise_csv = out_dir / f"{stem}_framewise.csv"
    else:
        out_framewise_csv = Path(out_framewise_csv)

    if out_summary_csv is None:
        suffix = "_by_" + "_".join(group_by) if (transcript_csv and group_by) else ""
        out_summary_csv = out_dir / f"{stem}_summary{suffix}.csv"
    else:
        out_summary_csv = Path(out_summary_csv)

    # Respect overwrite_existing
    if (not overwrite_existing) and out_summary_csv.exists():
        print(f"[acoustics] Summary output already exists; returning existing file: {out_summary_csv}")
        return {"framewise_csv": out_framewise_csv if out_framewise_csv.exists() else None,
                "summary_csv": out_summary_csv}

    # Run
    if transcript_csv:
        framewise_df, summary_df = _analyze_turns(
            wav_path=wav_path,
            transcript_csv=transcript_csv,
            time_unit=time_unit,
            group_by=group_by,
            extra_id_cols=extra_id_cols,
            mode=mode,
            summarize_on_voiced_segments_ms=summarize_on_voiced_segments_ms,
            include_framewise=include_framewise,
            tremor_script=tremor_script,
            f0_min=f0_min,
            f0_max=f0_max,
            n_mfcc=n_mfcc,
            preprocess=preprocess,
            target_sr=target_sr,
            target_dbfs=target_dbfs,
            remove_dc=remove_dc,
            pause_top_db=pause_top_db,
            pause_frame_length=pause_frame_length,
            pause_hop_length=pause_hop_length,
        )
    else:
        fdf, summ = _analyze_clip(
            wav_path,
            mode=mode,
            summarize_on_voiced_segments_ms=summarize_on_voiced_segments_ms,
            include_framewise=include_framewise,
            tremor_script=tremor_script,
            f0_min=f0_min,
            f0_max=f0_max,
            n_mfcc=n_mfcc,
            preprocess=preprocess,
            target_sr=target_sr,
            target_dbfs=target_dbfs,
            remove_dc=remove_dc,
            pause_top_db=pause_top_db,
            pause_frame_length=pause_frame_length,
            pause_hop_length=pause_hop_length,
        )
        # Whole-file mode: wrap the single summary dict in a one-row DataFrame
        summary_df = pd.DataFrame([summ])
        framewise_df = fdf

    # Write outputs
    frame_path_out: Optional[Path] = None
    if include_framewise and framewise_df is not None:
        framewise_df.to_csv(out_framewise_csv, index=False, encoding="utf-8-sig")
        frame_path_out = out_framewise_csv

    summary_df.to_csv(out_summary_csv, index=False, encoding="utf-8-sig")

    return {"framewise_csv": frame_path_out, "summary_csv": out_summary_csv}

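To make the return value concrete, here is a minimal sketch of consuming the result dictionary with pandas; the import path is an assumption and may differ in your install.

# Minimal sketch: run whole-file analysis and inspect the summary CSV.
# The module path below is an assumption; adjust it to your installation.
import pandas as pd
from taters.audio.analyze_acoustics import analyze_acoustics  # assumed import path

paths = analyze_acoustics(
    wav_path="session.wav",
    out_dir="features/acoustics",
)

# summary_csv is always written; framewise_csv is None unless framewise output was requested.
summary = pd.read_csv(paths["summary_csv"])
print(summary.filter(regex="f0|hnr|jitter|shimmer").T)  # column names depend on the chosen mode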


Split WAV by speaker

Given a diarization transcript, create one WAV per speaker by concatenating that speaker's segments. You can insert small silences to avoid clicks at joins and resample/downmix on the fly. Filenames are readable and stable.

What it does

  • reads a timestamped CSV with start_time,end_time,speaker
  • builds one output WAV per unique speaker
  • optional silence padding, resampling, mono mixdown
  • skips ultra-short segments; clamps times to audio bounds

When to use

  • you want per-speaker audio for targeted feature extraction or human coding
  • you plan to model speakers separately or compute speaker-level aggregates

API: make per-speaker WAVs from a transcript

Concatenate speaker-specific segments into per-speaker WAV files.

If merge_consecutive=True (default), adjacent transcript rows with the same speaker are merged into a single, longer segment spanning from the first start to the last end — including any silence between those turns. If you need the strict per-row behavior, set merge_consecutive=False.
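As an illustration (not the library's internal code), here is how the merge rule behaves on a few (start_ms, end_ms, speaker) rows: only back-to-back rows with the same speaker are collapsed, and the merged span includes the gap between them.

# Illustrative sketch of the merge_consecutive rule on (start_ms, end_ms, speaker) rows.
rows = [
    (0,    2000, "SPEAKER_00"),
    (2500, 4000, "SPEAKER_00"),   # same speaker as the previous row -> merged
    (4000, 6000, "SPEAKER_01"),
]

merged = []
prev_speaker = None
for start, end, speaker in rows:
    if merged and speaker == prev_speaker:
        s0, e0, _ = merged[-1]
        merged[-1] = (min(s0, start), max(e0, end), speaker)  # span covers the inter-turn gap
    else:
        merged.append((start, end, speaker))
    prev_speaker = speaker

print(merged)
# [(0, 4000, 'SPEAKER_00'), (4000, 6000, 'SPEAKER_01')]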

Parameters:

Name Type Description Default
source_wav str | Path

Path to the source WAV.

required
transcript_csv_path str | Path

CSV with timing and speaker columns (e.g., diarization output).

required
output_dir str | Path | None

Where to write the per-speaker files. If None, defaults to ./audio_split/<source_stem>/.

None
overwrite_existing bool

If False and a per-speaker WAV already exists, keep it and skip re-export; if True, rebuild the file.

False
start_col str

Column name for segment start times.

'start_time'
end_col str

Column name for segment end times.

'end_time'
speaker_col str

Column name for speaker labels.

'speaker'
time_unit ('ms', 's')

Units for the start/end columns.

"ms"
silence_ms int

If pre_silence_ms/post_silence_ms are None, use this for both sides.

1000
pre_silence_ms int | None

Explicit padding (ms) before/after each segment; overrides silence_ms.

None
post_silence_ms int | None

Explicit padding (ms) before/after each segment; overrides silence_ms.

None
sr int | None

Resample output to this rate. If None, keep original rate.

16000
mono bool

Downmix to mono if True.

True
min_dur_ms int

Skip segments shorter than this duration (ms).

50
merge_consecutive bool

Merge back-to-back turns for the same speaker into one segment span (including any inter-turn silence). If False, emit one clip per row.

True

Returns:

Type Description
dict[str, Path]

Mapping from friendly speaker label → output WAV path.

Behavior
  • Speaker labels are sanitized to form stable internal keys; a more readable label (with path-hostile characters removed) is used in the output filename.
  • Segments are sorted by start time per speaker before concatenation.
  • If a speaker ends up with zero valid segments, no file is written.

Examples:

>>> make_speaker_wavs_from_csv(
...     source_wav="audio/session.wav",
...     transcript_csv_path="transcripts/session.csv",
...     time_unit="ms",
...     silence_ms=0,  # no padding
...     sr=16000,
...     mono=True,
... )
Source code in src\taters\audio\split_wav_by_speaker.py
def make_speaker_wavs_from_csv(
    source_wav: Union[str, Path],
    transcript_csv_path: Union[str, Path],
    output_dir: Union[str, Path, None] = None,
    *,
    overwrite_existing: bool = False,
    start_col: str = "start_time",
    end_col: str = "end_time",
    speaker_col: str = "speaker",
    time_unit: str = "ms",             # "ms" or "s"
    silence_ms: int = 1000,
    pre_silence_ms: Optional[int] = None,
    post_silence_ms: Optional[int] = None,
    sr: Optional[int] = 16000,
    mono: bool = True,
    min_dur_ms: int = 50,
    merge_consecutive: bool = True,    # NEW: merge back-to-back turns by same speaker
) -> Dict[str, Path]:
    """
    Concatenate speaker-specific segments into per-speaker WAV files.

    If `merge_consecutive=True` (default), adjacent transcript rows with the same
    speaker are merged into a single, longer segment spanning from the first
    start to the last end — including any silence between those turns. If you
    need the strict per-row behavior, set `merge_consecutive=False`.

    Parameters
    ----------
    source_wav : str | Path
        Path to the source WAV.
    transcript_csv_path : str | Path
        CSV with timing and speaker columns (e.g., diarization output).
    output_dir : str | Path | None, optional
        Where to write the per-speaker files. If None, defaults to
        ``./audio_split/<source_stem>/``.
    overwrite_existing : bool, default False
        If False and a target per-speaker WAV already exists, keep it and skip
        re-export; if True, rebuild the file.
    start_col, end_col, speaker_col : str
        Column names in the transcript CSV.
    time_unit : {"ms","s"}, default "ms"
        Units for start/end columns.
    silence_ms : int, default 1000
        If `pre_silence_ms`/`post_silence_ms` are None, use this for both sides.
    pre_silence_ms, post_silence_ms : int | None
        Explicit padding (ms) before/after each segment; overrides `silence_ms`.
    sr : int | None, default 16000
        Resample output to this rate. If None, keep original rate.
    mono : bool, default True
        Downmix to mono if True.
    min_dur_ms : int, default 50
        Skip segments shorter than this duration (ms).
    merge_consecutive : bool, default True
        Merge back-to-back turns for the same speaker into one segment span
        (including any inter-turn silence). If False, emit one clip per row.

    Returns
    -------
    dict[str, Path]
        Mapping from friendly speaker label → output WAV path.

    Behavior
    --------
    - Speaker labels are sanitized to form stable internal keys; a more readable
      label (with path-hostile characters removed) is used in the output filename.
    - Segments are sorted by start time per speaker before concatenation.
    - If a speaker ends up with zero valid segments, no file is written.

    Examples
    --------
    >>> make_speaker_wavs_from_csv(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv_path="transcripts/session.csv",
    ...     time_unit="ms",
    ...     silence_ms=0,  # no padding
    ...     sr=16000,
    ...     mono=True,
    ... )
    """
    if time_unit not in ("ms", "s"):
        raise ValueError("time_unit must be 'ms' or 's'")

    def _friendly_filename_label(name: str) -> str:
        s = (name or "").strip()
        s = s.replace("/", "_").replace("\\", "_")
        s = re.sub(r'[<>:"|?*]', "", s)
        s = re.sub(r"\s+", " ", s)
        return s or "SPEAKER_0"

    source_wav = Path(source_wav)
    transcript_csv_path = Path(transcript_csv_path)
    out_dir = Path(output_dir) if output_dir is not None else (Path.cwd() / "audio_split" / source_wav.stem)
    out_dir.mkdir(parents=True, exist_ok=True)
    base_stem = source_wav.stem

    audio = AudioSegment.from_file(source_wav)
    if sr:
        audio = audio.set_frame_rate(sr)
    if mono:
        audio = audio.set_channels(1)

    # pydub slices AudioSegment objects by milliseconds, so convert "s" inputs to ms.
    factor = 1000.0 if time_unit == "s" else 1.0
    audio_len_ms = len(audio)

    with transcript_csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    segs_by_spk: Dict[str, List[tuple[int, int]]] = {}
    label_for_key: Dict[str, str] = {}

    # Build segments with awareness of original row order so that we can merge
    # adjacent turns for the same speaker when requested.
    prev_spk_key: Optional[str] = None
    for row in rows:
        try:
            start_raw = float(row[start_col])
            end_raw   = float(row[end_col])
            raw_spk   = str(row.get(speaker_col, "SPEAKER_0"))
        except Exception:
            continue

        start_ms = int(round(start_raw * factor))
        end_ms   = int(round(end_raw   * factor))
        if end_ms <= start_ms:
            continue

        start_ms = _clamp(start_ms, 0, audio_len_ms)
        end_ms   = _clamp(end_ms,   0, audio_len_ms)
        if end_ms <= start_ms:
            continue

        spk_key = _sanitize_speaker(raw_spk)
        label_for_key.setdefault(spk_key, _friendly_filename_label(raw_spk))

        if merge_consecutive and prev_spk_key == spk_key and segs_by_spk.get(spk_key):
            # Extend the last segment for this speaker to cover the new end
            s0, e0 = segs_by_spk[spk_key][-1]
            # Keep the earliest start, extend to the latest end
            s_new = min(s0, start_ms)
            e_new = max(e0, end_ms)
            segs_by_spk[spk_key][-1] = (s_new, e_new)
        else:
            # Strictly append a new segment
            segs_by_spk.setdefault(spk_key, []).append((start_ms, end_ms))

        prev_spk_key = spk_key

    # Optional: drop very short segments after merging
    for spk_key, segs in list(segs_by_spk.items()):
        segs_by_spk[spk_key] = [(s, e) for (s, e) in segs if (e - s) >= min_dur_ms]

    pre_ms  = silence_ms if pre_silence_ms  is None else pre_silence_ms
    post_ms = silence_ms if post_silence_ms is None else post_silence_ms
    pre_sil  = AudioSegment.silent(duration=max(0, pre_ms),  frame_rate=audio.frame_rate)
    post_sil = AudioSegment.silent(duration=max(0, post_ms), frame_rate=audio.frame_rate)
    if mono:
        pre_sil  = pre_sil.set_channels(1)
        post_sil = post_sil.set_channels(1)

    results: Dict[str, Path] = {}
    for spk_key, segs in segs_by_spk.items():
        if not segs:
            continue

        friendly = label_for_key.get(spk_key, spk_key)
        out_path = out_dir / f"{base_stem}_{friendly}.wav"

        if (not overwrite_existing) and out_path.is_file():
            results[friendly] = out_path
            continue

        out = AudioSegment.silent(duration=0, frame_rate=audio.frame_rate)
        if mono:
            out = out.set_channels(1)

        for (s, e) in segs:
            clip = audio[s:e]
            if len(clip) < min_dur_ms:
                continue
            out += pre_sil + clip + post_sil

        if len(out) == 0:
            continue

        out.export(out_path, format="wav", codec="pcm_s16le")
        results[friendly] = out_path

    return results



Practical notes

  • Paths: if you do not pass explicit outputs, tools write to predictable folders under the current working directory (for example, ./audio, ./audio_split/<stem>/, ./features/whisper-embeddings, ./features/acoustics).
  • Overwrite behavior: most functions skip work when an output already exists (e.g., overwrite_existing=False by default); pass the relevant flag to force a rebuild.
  • Device selection: where supported, device="auto" picks a sensible default; set cuda or cpu explicitly when you need control.
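
Putting the pieces together, here is a minimal end-to-end sketch that splits a diarized session into per-speaker WAVs and then runs simple acoustic analysis on each; the import paths are assumptions and may differ in your install.

# Minimal sketch: per-speaker WAVs -> per-speaker acoustic summaries.
# Import paths are assumptions; adjust them to match your installation.
from taters.audio.split_wav_by_speaker import make_speaker_wavs_from_csv  # assumed path
from taters.audio.analyze_acoustics import analyze_acoustics              # assumed path

speaker_wavs = make_speaker_wavs_from_csv(
    source_wav="audio/session.wav",
    transcript_csv_path="transcripts/session.csv",
    time_unit="ms",
    silence_ms=0,      # no padding between concatenated turns
    sr=16000,
    mono=True,
)

for speaker, wav_path in speaker_wavs.items():
    result = analyze_acoustics(
        wav_path=wav_path,
        out_dir="features/acoustics",
        mode="simple",
    )
    print(speaker, "->", result["summary_csv"])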