Analyzing Audio¶

Audio is where Taters earns its name. The goal is simple: make it easy to get from messy containers and long recordings to clean, analysis-ready artifacts you can iterate on. The core tools cover three phases:

extract and standardize audio (WAVs at predictable locations)
structure the speech (diarization and transcripts)
turn waveforms into features (embeddings, per-speaker splits)

Everything follows the same philosophy as the text stack: predictable outputs, friendly defaults, and a "do not overwrite unless asked" rule.

Extract audio from video¶

Many recordings arrive as multi-track containers (Zoom, OBS, ProRes). This utility lists every audio stream and writes one WAV per stream with sensible names that include stream index and tags like language/title. It is handy both for audits and for preparing inputs to downstream steps.

What it does

probes audio streams with ffprobe
writes one PCM WAV per stream at your chosen sample rate and bit depth
predictable filenames: <stem>_a<index>[_<lang>][_<title>].wav
default output directory if you do not pass one
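
The probing step can be sketched roughly as follows. This is an illustration only, not the Taters implementation: `ffprobe_audio_command` and `parse_audio_streams` are hypothetical helpers, and the sample JSON is made up to mirror the shape ffprobe emits with `-of json`.

```python
import json

def ffprobe_audio_command(path: str) -> list[str]:
    """Build an ffprobe call that lists audio streams as JSON (not executed here)."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",
        "-show_entries", "stream=index:stream_tags=language,title",
        "-of", "json",
        path,
    ]

def parse_audio_streams(ffprobe_json: str) -> list[dict]:
    """Return one dict per audio stream: position, container index, optional tags."""
    streams = json.loads(ffprobe_json).get("streams", [])
    parsed = []
    for audio_pos, stream in enumerate(streams):
        tags = stream.get("tags") or {}
        parsed.append({
            "audio_pos": audio_pos,           # N in `-map 0:a:N`
            "index": stream.get("index"),     # stream index inside the container
            "language": tags.get("language"),
            "title": tags.get("title"),
        })
    return parsed

# Made-up ffprobe output for a file with one video and two audio streams.
sample = (
    '{"streams": ['
    '{"index": 1, "tags": {"language": "eng"}}, '
    '{"index": 2, "tags": {"language": "eng", "title": "Guest mic"}}]}'
)
streams = parse_audio_streams(sample)
```

Keeping the audio-relative position separate from the container index matters because `-map 0:a:N` counts audio streams only.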

When to use

you have a video/container and want clean WAVs for each embedded track
you intend to diarize and embed only one stream (e.g., the mixed program feed)

API: split audio streams to WAV¶

Extract each audio stream in a container to its own WAV file.

Parameters:

Name	Type	Description	Default
`input_path`	`str \| PathLike`	Video or audio container readable by FFmpeg.	required
`output_dir`	`str \| PathLike \| None`	Destination directory. If None, defaults to `./audio` in the current working directory (predictable write location).	`None`
`sample_rate`	`int`	Target sample rate for the output WAVs (Hz).	`48000`
`bit_depth`	`(16, 24, 32)`	Output PCM bit depth (little-endian).	`16`
`overwrite`	`bool`	If True, overwrite existing files. If False and a target exists, raises :class:`FileExistsError`.	`True`

Returns:

Type	Description
`list[str]`	Absolute paths to the created WAVs.

Behavior

Output file names are constructed from the input base name and stream metadata: <stem>_a<index>[_<lang>][_<title>].wav with safe slugs.
Uses -map 0:a:<N> to select the N-th audio stream in the container.
Runs FFmpeg with -nostdin and quiet loglevel to avoid TTY lockups.

Examples:

>>> split_audio_streams_to_wav("session.mp4")
['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
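
To make the naming pattern concrete, here is a minimal sketch of how `<stem>_a<index>[_<lang>][_<title>].wav` could be assembled. The helper names `slugify` and `build_wav_name` are hypothetical, and the slug rules (lowercase, non-alphanumerics collapsed to a hyphen) are an assumption; the library's own `_build_wav_name` may differ.

```python
import re
from typing import Optional

def slugify(text: str) -> str:
    """Collapse runs of non-alphanumeric characters to '-' and lowercase the result."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-").lower()

def build_wav_name(stem: str, index: int,
                   lang: Optional[str] = None,
                   title: Optional[str] = None) -> str:
    """Assemble '<stem>_a<index>[_<lang>][_<title>].wav', skipping empty tags."""
    parts = [f"{stem}_a{index}"]
    for tag in (lang, title):
        slug = slugify(tag) if tag else ""
        if slug:
            parts.append(slug)
    return "_".join(parts) + ".wav"

name_a = build_wav_name("session", 0, "eng")
name_b = build_wav_name("session", 1, None, "Host Mic #2")
```

Here `name_a` is `session_a0_eng.wav` and `name_b` is `session_a1_host-mic-2.wav`; a tag that is missing or slugs to an empty string simply drops out of the name.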

Source code in src\taters\audio\extract_wav_from_video.py

def split_audio_streams_to_wav(
    input_path: str | os.PathLike,
    output_dir: str | os.PathLike | None = None,     # <-- now optional
    sample_rate: int = 48000,
    bit_depth: int = 16,
    overwrite: bool = True,
) -> List[str]:
    """
    Extract each audio stream in a container to its own WAV file.

    Parameters
    ----------
    input_path : str | os.PathLike
        Video or audio container readable by FFmpeg.
    output_dir : str | os.PathLike | None, optional
        Destination directory. If None, defaults to ``./audio`` in the current
        working directory (predictable write location).
    sample_rate : int, default 48000
        Target sample rate for the output WAVs (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth (little-endian).
    overwrite : bool, default True
        If True, overwrite existing files. If False and a target exists,
        raises :class:`FileExistsError`.

    Returns
    -------
    list[str]
        Absolute paths to the created WAVs.

    Behavior
    --------
    - Output file names are constructed from the input base name and stream
      metadata: ``<stem>_a<index>[_<lang>][_<title>].wav`` with safe slugs.
    - Uses ``-map 0:a:<N>`` to select the N-th audio stream in the container.
    - Runs FFmpeg with ``-nostdin`` and quiet loglevel to avoid TTY lockups.

    Examples
    --------
    >>> split_audio_streams_to_wav("session.mp4")
    ['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
    """

    _check_binaries()

    in_path = Path(input_path)
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    # Default predictable location when none is provided
    if output_dir is None:
        out_dir = Path.cwd() / "audio"
    else:
        out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Extracting audio streams from {in_path} to {out_dir} at {sample_rate} Hz, bit depth: {bit_depth}")

    streams = _probe_audio_streams(in_path)
    if not streams:
        raise ValueError("No audio streams found in input.")

    pcm_fmt_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_fmt_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    pcm_codec = pcm_fmt_map[bit_depth]

    created_files: List[str] = []
    base = in_path.stem

    for s in streams:
        idx = s.get("index")
        tags = s.get("tags", {}) or {}
        lang = tags.get("language")
        title = tags.get("title")

        print(f"Extracting audio stream:\n"
              f"index: {idx}\n"
              f"tags: {tags}\n"
              f"language: {lang}\n"
              f"title: {title}\n")

        out_name = _build_wav_name(base, idx, lang, title)
        out_path = out_dir / out_name

        ffmpeg_cmd = [
            "ffmpeg",
            "-nostdin",
            "-hide_banner",
            "-loglevel", "error",
            "-y" if overwrite else "-n",
            "-i", str(in_path),
            "-map", f"0:a:{streams.index(s)}",  # Nth audio stream
            "-acodec", pcm_codec,
            "-ar", str(sample_rate),
            str(out_path),
        ]

        result = subprocess.run(ffmpeg_cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if result.returncode != 0:
            if not overwrite and out_path.exists():
                raise FileExistsError(f"Target exists (use overwrite=True): {out_path}")
            raise RuntimeError(f"ffmpeg failed for stream {idx}: {result.stderr.strip()}")

        created_files.append(str(out_path))

    return created_files

Convert any audio to WAV¶

Standardize any FFmpeg-readable media into a linear PCM WAV at the sample rate, bit depth, and channel layout you specify. Defaults are sensible for ASR and most modeling pipelines (16 kHz, 16-bit, mono).

What it does

converts audio or extracts audio from video into a single WAV
preserves channel layout if you request it
uses ffmpeg with quiet, pipeline-safe flags and clear error reporting
predictable default output path if omitted

When to use

you need consistent, model-friendly WAVs from heterogeneous sources
you want a one-liner from notebooks or the CLI

API: convert audio to WAV¶

Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

Parameters:

Name	Type	Description	Default
`input_path`	`str \| Path`	Source media file (audio or video container). FFmpeg must be able to read it.	required
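`output_path`	`str \| Path \| None`	Target WAV path. If None, defaults to `<cwd>/audio/<input_stem>.wav`.	`None`
`output_dir`	`str \| Path \| None`	Destination directory used when `output_path` is None; the file is written as `<output_dir>/<input_stem>.wav`. Provide at most one of `output_path` or `output_dir`.	`None`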
`sample_rate`	`int`	Desired sample rate (Hz).	`16000`
`bit_depth`	`(16, 24, 32)`	Output PCM bit depth; maps to `pcm_s{bit_depth}le` codec.	`16`
`channels`	`int \| None`	If provided, set number of output channels (e.g., 1=mono, 2=stereo). If None, keep original channel count.	`1`
`overwrite_existing`	`bool`	Overwrite `output_path` if it already exists.	`False`

Returns:

Type	Description
`Path`	Path to the written WAV file.

Raises:

Type	Description
`FileNotFoundError`	If `input_path` does not exist.
`RuntimeError`	If FFmpeg/FFprobe are missing or the conversion fails.

Notes

Video inputs are supported: the audio stream is extracted and converted.
For multi-channel sources, the channel layout is preserved when channels is None.
We run FFmpeg with -nostdin to avoid TTY issues in pipelines.
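
The output-path defaults and the "do not overwrite unless asked" rule can be sketched without touching FFmpeg. `plan_wav_output` is a hypothetical helper written for this illustration; it mirrors the documented behavior (default `<cwd>/audio/<input_stem>.wav`, reuse an existing file unless `overwrite_existing` is set) but is not part of the library.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

PathLike = Union[str, Path]

def plan_wav_output(
    input_path: PathLike,
    output_path: Optional[PathLike] = None,
    output_dir: Optional[PathLike] = None,
    overwrite_existing: bool = False,
) -> Tuple[Path, bool]:
    """Return (target WAV path, whether a conversion should run)."""
    if output_path and output_dir:
        raise ValueError("Provide at most one of output_path or output_dir.")
    if output_path:
        target = Path(output_path)
    else:
        base_dir = Path(output_dir) if output_dir else Path.cwd() / "audio"
        target = base_dir / (Path(input_path).stem + ".wav")
    # An existing file is reused unless the caller explicitly asks to overwrite.
    should_convert = overwrite_existing or not target.is_file()
    return target, should_convert

target, should_convert = plan_wav_output("clips/talk.mp4", output_dir="no_such_dir_xyz")
```

Returning the decision alongside the path keeps the skip logic testable separately from the actual conversion.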

Source code in src\taters\audio\convert_to_wav.py

def convert_audio_to_wav(
    input_path: Union[str, Path],
    *,
    output_path: Optional[Union[str, Path]] = None,
    output_dir: Optional[Union[str, Path]] = None,
    sample_rate: int = 16000,          # common for ASR
    bit_depth: int = 16,               # 16/24/32 signed PCM
    channels: int = 1,                 # 1=mono, 2=stereo
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default
) -> Path:
    """
    Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

    Parameters
    ----------
    input_path : str | Path
        Source media file (audio or video container). FFmpeg must be able to read it.
    output_path : str | Path | None, optional
        Target WAV path. If None, defaults to
        ``<cwd>/audio/<input_stem>.wav``.
    sample_rate : int, default 16000
        Desired sample rate (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth; maps to ``pcm_s{bit_depth}le`` codec.
    channels : int | None, default 1
        If provided, set number of output channels (e.g., 1=mono, 2=stereo).
        If None, keep original channel count.
    overwrite_existing : bool, default False
        Overwrite `output_path` if it already exists.

    Returns
    -------
    Path
        Path to the written WAV file.

    Raises
    ------
    FileNotFoundError
        If `input_path` does not exist.
    RuntimeError
        If FFmpeg/FFprobe are missing or the conversion fails.

    Notes
    -----
    - Video inputs are supported: the audio stream is extracted and converted.
    - For multi-channel sources and `channels is None`, channel layout is preserved.
    - We run FFmpeg with ``-nostdin`` to avoid TTY issues in pipelines.
    """

    _check_ffmpeg()

    in_path = Path(input_path).resolve()
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    if output_path and output_dir:
        raise ValueError("Provide at most one of output_path or output_dir.")

    if output_path:
        out_path = Path(output_path).resolve()
    else:
        base = in_path.stem + ".wav"
        out_dir = Path(output_dir).resolve() if output_dir else Path.cwd() / "audio"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / base

    if not overwrite_existing and Path(out_path).is_file():
        print("WAV file already exists; returning existing file.")
        return out_path

    pcm_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 (mono) or 2 (stereo).")

    cmd = [
        "ffmpeg",
        "-nostdin",
        "-hide_banner", "-loglevel", "error",
        "-y" if overwrite_existing else "-n",
        "-i", str(in_path),
        "-vn",                        # ignore video
        "-acodec", pcm_map[bit_depth],
        "-ar", str(sample_rate),
        "-ac", str(channels),
        str(out_path),
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
    if result.returncode != 0:
        if not overwrite_existing and out_path.exists():
            raise FileExistsError(f"Target exists (use overwrite_existing=True): {out_path}")
        raise RuntimeError(f"ffmpeg failed: {result.stderr.strip()}")

    return out_path

Diarize and transcribe¶

This is a thin CLI shim that forwards to the vendored Whisper diarization wrapper. It handles device selection and writes transcripts alongside subtitles with helpful defaults. Use it to get a timestamped CSV (start_time,end_time,speaker,text) plus SRT/TXT that other Taters tools can consume immediately.

What it does

delegates to the underlying whisper diarization wrapper
writes transcripts to a predictable folder by input stem
accepts device hints (cuda/cpu/auto)

When to use

you are preparing per-segment text for embeddings or dictionary coding
you plan to split a long recording into per-speaker WAVs

API: diarize with third-party wrapper (CLI entry)¶

Note: this function is also exposed, more conveniently, as taters.audio.diarize_with_thirdparty.

Run the vendored Whisper diarization scripts and normalize their outputs.

Parameters:

Name	Type	Description	Default
`audio_path`	`str \| Path`	Input audio (WAV recommended).	required
`out_dir`	`str \| Path`	Output directory for transcript artifacts. If it does not exist, it will be created.	`None`
`repo_dir`	`str \| Path \| None`	Optional explicit location of the diarization repo. If None, the vendored copy is used.	`None`
`whisper_model`	`str`	Whisper ASR model to use (e.g., "small", "base", "large-v3").	`"medium.en"`
`language`	`str \| None`	Language hint for Whisper (e.g., "en"); if None, autodetection is used.	`None`
`device`	`('cpu', 'cuda')`	Runtime device. If "cpu", environment variables are set to hide GPUs.	`"cpu","cuda"`
`batch_size`	`int`	Whisper batch size; 0 disables batching.	`0`
`no_stem`	`bool`	Pass through to demucs/whisper scripts to disable vocal/instrument stems.	`False`
`suppress_numerals`	`bool`	Heuristic to reduce spurious numeral tokens.	`False`
`parallel`	`bool`	Use parallel diarization script if available.	`False`
`timeout`	`int \| None`	Subprocess timeout in seconds; None means no timeout.	`None`
`use_custom`	`bool`	Prefer the customized script if present (adds CSV emission and minor cleanup).	`True`
`keep_temp`	`bool`	If False (default), temporary folders created by demucs/whisper are removed.	`False`
`num_speakers`	`int \| None`	Force a fixed number of speakers, if the downstream diarizer supports it.	`None`

Returns:

Type	Description
`DiarizationOutputFiles`	Paths to `.txt`, `.srt`, and `.csv` (if produced) in a per-file working directory, plus an (empty) `speaker_wavs` mapping for API compatibility.

Notes

The function copies the input WAV to a per-file work directory before running, to ensure relative paths inside the third-party scripts resolve correctly.
If device="cpu", CUDA is disabled in the child environment.
On success, the local WAV copy is deleted and temporary folders are tidied up.
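
As a quick sanity check on a diarization result, the transcript CSV (start_time,end_time,speaker,text) can be read with the standard library. The sketch below assumes timestamps in milliseconds and uses made-up rows and speaker labels; `talk_time_by_speaker` is a hypothetical helper, not a Taters function.

```python
import csv
import io
from collections import defaultdict

def talk_time_by_speaker(csv_text: str, ms_per_unit: float = 1.0) -> dict:
    """Sum segment durations per speaker, in seconds.

    ms_per_unit converts the CSV's time unit to milliseconds
    (1.0 for ms, 1000.0 for seconds).
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = float(row["end_time"]) - float(row["start_time"])
        if duration > 0:  # ignore empty or inverted intervals
            totals[row["speaker"]] += duration * ms_per_unit / 1000.0
    return dict(totals)

# Illustrative transcript; real files come from the diarization step.
sample = """start_time,end_time,speaker,text
0,2500,SPEAKER_00,Hello and welcome.
2500,4000,SPEAKER_01,Thanks for having me.
4000,9000,SPEAKER_00,Let's get started.
"""
totals = talk_time_by_speaker(sample)
```

A lopsided or empty result here is usually a cheaper signal of a diarization problem than inspecting downstream features.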

Whisper encoder embeddings¶

Export Whisper encoder embeddings as features. Two modes:

Transcript-driven: one vector per transcript row (e.g., per diarized segment)
General-audio: segment the WAV with fixed windows or non-silent spans; optionally mean-pool to a single vector for the whole file

Outputs land under ./features/whisper-embeddings/ by default with a stable <stem>_embeddings.csv name. The wrapper runs extraction in a subprocess by default to avoid CUDA/Torch collisions elsewhere in your pipeline.

What it does

computes D-dimensional vectors using Faster-Whisper/CTranslate2 backends
offers segmenting strategies for raw audio, and segment-level vectors when given a transcript
optional single-row mean pooling for whole-file summaries
isolates heavy GPU state in a child process by default
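
Mean pooling itself is simple: average the per-segment vectors dimension by dimension. The sketch below assumes an unweighted mean over equal-length vectors; whether the extractor weights segments (for example by duration) is not stated here, so treat `mean_pool` as a hypothetical illustration.

```python
from typing import List, Sequence

def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors into a single vector."""
    if not vectors:
        raise ValueError("Need at least one vector to pool.")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("All vectors must have the same dimensionality.")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]

# Two toy 3-dimensional "segment embeddings".
segments = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
pooled = mean_pool(segments)
```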

When to use

you want robust, speech-centric features for clustering, retrieval, or as inputs to downstream models
you have transcripts already and want segment-level representations aligned to text

API: extract Whisper embeddings¶

Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

Parameters:

Name	Type	Description	Default
`source_wav`	`str \| Path`	Path to the input WAV. Must be readable by `librosa`.	required
`transcript_csv`	`str \| Path \| None`	If provided, enables transcript-driven mode. The CSV is expected to contain timestamp columns and (optionally) a speaker column. A row is emitted per transcript segment.	`None`
`time_unit`	`('auto', 'ms', 's', 'samples')`	How to interpret timestamps in `transcript_csv`. In "auto", the worker heuristically infers the unit from max end time vs audio duration.	`"auto"`
`strategy`	`('windows', 'nonsilent')`	General-audio mode only. "windows" uses fixed sized windows with overlap; "nonsilent" uses an energy-based splitter (librosa.effects.split).	`"windows"`
`window_s`	`float`	General-audio mode only. Window length (seconds).	`30.0`
`hop_s`	`float`	General-audio mode only. Hop between consecutive window starts (seconds).	`15.0`
`min_seg_s`	`float`	General-audio mode only. Skip segments shorter than this many seconds.	`1.0`
`top_db`	`float`	General-audio mode only ("nonsilent"). Threshold (dB) below reference to consider as silence. Smaller → more segments; larger → fewer.	`30.0`
`aggregate`	`('none', 'mean')`	General-audio mode only. If "mean", a single pooled row is written covering the entire file; otherwise one row per segment.	`"none"`
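`output_dir`	`str \| Path \| None`	Directory for the output CSV. If None, defaults to `./features/whisper-embeddings`.	`None`
`overwrite_existing`	`bool`	If False and the output CSV already exists, extraction is skipped and the existing path is returned.	`False`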
`model_name`	`str`	Model identifier passed through to the worker (e.g., "tiny", "base", "small", "large-v3" or a local CTranslate2 model directory).	`"base"`
`device`	`('auto', 'cuda', 'cpu')`	Runtime device. If "cpu", environment variables are set to disable CUDA in the child process.	`"auto"`
`compute_type`	`str`	CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to the worker module.	`"float16"`
`run_in_subprocess`	`bool`	If True (recommended), runs extraction in a separate Python process to isolate Torch/CUDA state from the parent process.	`True`
`extra_env`	`dict \| None`	Additional environment variables to inject into the child process.	`None`
`verbose`	`bool`	If True, print the launched command and the child's stdout.	`True`
`extractor_module`	`str`	Dotted module path whose `__main__` implements the extractor CLI.	`"taters.audio.extract_whisper_embeddings_subproc"`

Returns:

Type	Description
`Path`	Path to the written embeddings CSV. Pattern: `<output_dir>/<source_stem>_embeddings.csv`.

Notes

The subprocess writes and exits. The parent returns once the file exists.
If transcript_csv is supplied, the worker runs in transcript mode; otherwise general-audio mode is used with the given segmentation strategy.
Failures in the child process are re-raised with the captured stdout/stderr to ease debugging.
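
The "auto" time-unit inference compares the largest transcript end time with the audio duration. One plausible way to do that is sketched below; the rule used here (pick the unit whose implied end time lands closest to the audio length) is an assumption, and `infer_time_unit` is a hypothetical name. The worker's actual heuristic may differ.

```python
def infer_time_unit(max_end: float, duration_s: float, sample_rate: int) -> str:
    """Pick the unit under which the last transcript timestamp best matches the audio length."""
    implied_seconds = {
        "s": max_end,
        "ms": max_end / 1000.0,
        "samples": max_end / sample_rate,
    }
    return min(implied_seconds, key=lambda unit: abs(implied_seconds[unit] - duration_s))

# A 60-second file sampled at 16 kHz, with three differently scaled transcripts.
unit_ms = infer_time_unit(59_500, 60.0, 16_000)
unit_s = infer_time_unit(58.2, 60.0, 16_000)
unit_samples = infer_time_unit(950_000, 60.0, 16_000)
```

Heuristics like this can misfire on transcripts that cover only the start of a long file, which is a good reason to pass `time_unit` explicitly when you know it.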

Examples:

Transcript per-segment embeddings:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     model_name="small",
...     device="cuda",
... )

Whole-file mean embedding:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     strategy="nonsilent",
...     aggregate="mean",
...     output_dir="features/whisper-embeddings",
... )
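
For intuition about the "windows" strategy, the sketch below lays out fixed windows with overlap from `window_s`, `hop_s`, and `min_seg_s`. How the worker treats the final partial window is an assumption here (it is kept if it is at least `min_seg_s` long); `fixed_windows` is a hypothetical helper.

```python
from typing import List, Tuple

def fixed_windows(
    duration_s: float,
    window_s: float = 30.0,
    hop_s: float = 15.0,
    min_seg_s: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in seconds covering the file with overlapping windows."""
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive.")
    spans = []
    i = 0
    while i * hop_s < duration_s:
        start = i * hop_s
        end = min(start + window_s, duration_s)  # clip the last windows to the file end
        if end - start >= min_seg_s:             # drop slivers shorter than min_seg_s
            spans.append((start, end))
        i += 1
    return spans

spans = fixed_windows(70.0)
```

With the default 30 s window and 15 s hop, a 70 s file yields five spans, the last two clipped at the end of the file.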

Source code in src\taters\audio\extract_whisper_embeddings.py

def extract_whisper_embeddings(
    *,
    # required
    source_wav: Union[str, Path],

    # optional transcript-driven mode
    transcript_csv: Optional[Union[str, Path]] = None,
    time_unit: Literal["auto", "ms", "s", "samples"] = "auto",

    # general-audio mode (used when transcript_csv is None)
    strategy: Literal["windows", "nonsilent"] = "windows",
    window_s: float = 30.0,
    hop_s: float = 15.0,
    min_seg_s: float = 1.0,
    top_db: float = 30.0,
    aggregate: Literal["none", "mean"] = "none",

    # outputs
    output_dir: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # model/runtime
    model_name: str = "base",
    device: Literal["auto", "cuda", "cpu"] = "auto",
    compute_type: str = "float16",

    # execution strategy
    run_in_subprocess: bool = True,
    extra_env: Optional[dict] = None,
    verbose: bool = True,

    # where the extractor lives (python -m <module>)
    extractor_module: str = "taters.audio.extract_whisper_embeddings_subproc",
) -> Path:
    """
    Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

    Parameters
    ----------
    source_wav : str | Path
        Path to the input WAV. Must be readable by `librosa`.
    transcript_csv : str | Path | None, optional
        If provided, enables transcript-driven mode. The CSV is expected to contain
        timestamp columns and (optionally) a speaker column. A row is emitted per
        transcript segment.
    time_unit : {"auto","ms","s","samples"}, default "auto"
        How to interpret timestamps in `transcript_csv`. In "auto", the worker
        heuristically infers the unit from max end time vs audio duration.
    strategy : {"windows","nonsilent"}, default "windows"
        General-audio mode only. "windows" uses fixed sized windows with overlap;
        "nonsilent" uses an energy-based splitter (librosa.effects.split).
    window_s, hop_s : float, default 30.0, 15.0
        General-audio mode only. Window length and hop (seconds).
    min_seg_s : float, default 1.0
        General-audio mode only. Skip segments shorter than this many seconds.
    top_db : float, default 30.0
        General-audio mode only ("nonsilent"). Threshold (dB) below reference to
        consider as silence. Smaller → more segments; larger → fewer.
    aggregate : {"none","mean"}, default "none"
        General-audio mode only. If "mean", a single pooled row is written covering
        the entire file; otherwise one row per segment.
    output_dir : str | Path | None, optional
        Directory for the output CSV. If None, defaults to
        ``./features/whisper-embeddings``.
    model_name : str, default "base"
        Model identifier passed through to the worker (e.g., "tiny", "base",
        "small", "large-v3" or a local CTranslate2 model directory).
    device : {"auto","cuda","cpu"}, default "auto"
        Runtime device. If "cpu", environment variables are set to disable CUDA
        in the child process.
    compute_type : str, default "float16"
        CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to
        the worker module.
    run_in_subprocess : bool, default True
        If True (recommended), runs extraction in a separate Python process to
        isolate Torch/CUDA state from the parent process.
    extra_env : dict | None, optional
        Additional environment variables to inject into the child process.
    verbose : bool, default True
        If True, print the launched command and the child's stdout.
    extractor_module : str, default "taters.audio.extract_whisper_embeddings_subproc"
        Dotted module path whose ``__main__`` implements the extractor CLI.

    Returns
    -------
    Path
        Path to the written embeddings CSV. Pattern:
        ``<output_dir>/<source_stem>_embeddings.csv``.

    Notes
    -----
    - The subprocess writes and exits. The parent returns once the file exists.
    - If `transcript_csv` is supplied, the worker runs in transcript mode; otherwise
      general-audio mode is used with the given segmentation strategy.
    - Failures in the child process are re-raised with the captured stdout/stderr
      to ease debugging.

    Examples
    --------
    Transcript per-segment embeddings:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     model_name="small",
    ...     device="cuda",
    ... )

    Whole-file mean embedding:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     strategy="nonsilent",
    ...     aggregate="mean",
    ...     output_dir="features/whisper-embeddings",
    ... )
    """

    source_wav = Path(source_wav).resolve()
    # default to ./features/whisper-embeddings when not provided
    out_dir_final = (
        Path(output_dir).resolve()
        if output_dir
        else (Path.cwd() / "features" / "whisper-embeddings")
    )

    out_dir_final.mkdir(parents=True, exist_ok=True)
    output_csv = out_dir_final / f"{source_wav.stem}_embeddings.csv"

    if not overwrite_existing and Path(output_csv).is_file():
        print("Whisper embedding feature output file already exists; returning existing file.")
        return output_csv

    if not run_in_subprocess:
        # ---- In-process path (only when you’re sure no Torch/CUDA conflicts) ----
        from ..audio.extract_whisper_embeddings import (  # type: ignore
            export_segment_embeddings_csv,
            export_audio_embeddings_csv,
            EmbedConfig,
        )
        cfg = EmbedConfig(model_name=model_name, device=device, compute_type=compute_type, time_unit=time_unit)
        if transcript_csv is not None:
            transcript_csv = Path(transcript_csv).resolve()
            return Path(
                export_segment_embeddings_csv(
                    transcript_csv=transcript_csv,
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                )
            )
        else:
            return Path(
                export_audio_embeddings_csv(
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                    strategy=strategy,
                    window_s=window_s,
                    hop_s=hop_s,
                    min_seg_s=min_seg_s,
                    top_db=top_db,
                    aggregate=aggregate,
                )
            )

    # ---- Subprocess path (recommended) ----
    env = os.environ.copy()
    # Keep Transformers from importing heavy backends in the child
    env.setdefault("TRANSFORMERS_NO_TORCH", "1")
    env.setdefault("TRANSFORMERS_NO_TF", "1")
    env.setdefault("TRANSFORMERS_NO_FLAX", "1")

    if extra_env:
        env.update({k: str(v) for k, v in extra_env.items()})

    if device == "cpu":
        # Make sure the child won’t try CUDA
        env.update({"CUDA_VISIBLE_DEVICES": "", "USE_CUDA": "0", "FORCE_CPU": "1"})
    else:
        # Best-effort: prepend cuDNN wheel's lib dir if available
        try:
            import nvidia.cudnn, pathlib  # type: ignore
            cudnn_lib = str(pathlib.Path(nvidia.cudnn.__file__).with_name("lib"))
            env["LD_LIBRARY_PATH"] = cudnn_lib + ":" + env.get("LD_LIBRARY_PATH", "")
        except Exception:
            pass

    cmd = [
        sys.executable, "-m", extractor_module,
        "--source_wav", str(source_wav),
        "--output_dir", str(out_dir_final),
        "--model_name", model_name,
        "--device", device,
        "--compute_type", compute_type,
    ]

    if transcript_csv is not None:
        transcript_csv = Path(transcript_csv).resolve()
        cmd += ["--transcript_csv", str(transcript_csv), "--time_unit", time_unit]
    else:
        cmd += [
            "--strategy", strategy,
            "--window_s", str(window_s),
            "--hop_s", str(hop_s),
            "--min_seg_s", str(min_seg_s),
            "--top_db", str(top_db),
            "--aggregate", aggregate,
        ]

    if verbose:
        print("Launching embedding subprocess:")
        print(" ", shlex.join(cmd))

    try:
        res = subprocess.run(cmd, check=True, env=env, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if verbose and res.stdout:
            print(res.stdout.strip())
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"Embedding subprocess failed with code {e.returncode}\n"
            f"CMD: {shlex.join(cmd)}\n"
            f"STDOUT:\n{(e.stdout or '').strip()}\n\n"
            f"STDERR:\n{(e.stderr or '').strip()}"
        ) from e

    if not output_csv.exists():
        raise FileNotFoundError(f"Expected embeddings CSV not found: {output_csv}")

    if verbose:
        print(f"Embeddings CSV written to: {output_csv}")

    return output_csv

Vocal acoustics (Parselmouth/Praat)¶

This extractor turns a WAV (or a transcript-guided set of turns) into framewise tracks and clean summaries of voice acoustics. Under the hood it uses Praat/Parselmouth for f0, formants (F1–F4), loudness, HNR, jitter/shimmer, and optional tremor/glottal metrics, plus MFCC stats and simple pause/silence measures. Use it when you want interpretable, physiology-adjacent voice features that play nicely with clinical and social-science workflows.

Warning: I am not an acoustics person. Treat the output from this module with skepticism for the time being.

Preprocessing¶

By default, the audio is standardized to reduce "garbage in → garbage out" risk:

resample to target_sr (default 44.1 kHz)
remove DC offset (channel-wise when stereo)
loudness-normalize to target_dbfs (default −20 dBFS)

You can turn this off (preprocess=False) or tweak the targets as needed.
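
Two of the three preprocessing steps are easy to illustrate in plain Python. The sketch below removes DC offset and scales a mono signal to a target level, assuming float samples in [-1, 1] and an RMS-based dBFS measure; the module may measure loudness differently, and resampling is left out. All function names are hypothetical.

```python
import math
from typing import List, Sequence

def remove_dc(samples: Sequence[float]) -> List[float]:
    """Subtract the mean so the waveform is centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")

def normalize_rms(samples: Sequence[float], target_dbfs: float = -20.0) -> List[float]:
    """Apply a single gain so the RMS level lands on target_dbfs."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # pure silence: nothing to scale
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    return [s * gain for s in samples]

# One second of a quiet 220 Hz tone riding on a 0.3 DC offset.
sr = 8000
raw = [0.3 + 0.05 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
clean = normalize_rms(remove_dc(raw))
```

Removing the offset first matters: otherwise the DC component inflates the measured level and the gain comes out too small.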

Operating modes¶

Whole file: analyze a single WAV and write summary (+ framewise, by default).
Per turn (with transcript): segment the WAV using start_time/end_time from a CSV (optionally group/aggregate by columns like speaker). Framewise output includes segment_index and any IDs you pass through.
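
Grouping per-turn rows into one row per speaker can be pictured as below. This is a sketch under assumptions: the aggregation is a plain unweighted mean, and the column name `f0_mean_hz` and the helper `aggregate_turns` are invented for illustration. The real summary columns and aggregation rules may differ.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Sequence

def aggregate_turns(rows: Sequence[Dict], group_by: Sequence[str],
                    value_cols: Sequence[str]) -> List[Dict]:
    """Collapse per-turn rows into one row per unique group_by key."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in group_by)].append(row)
    aggregated = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        out["n_turns"] = len(members)
        for col in value_cols:
            values = [m[col] for m in members if m.get(col) is not None]
            out[col] = fmean(values) if values else None  # skip missing values
        aggregated.append(out)
    return aggregated

# Toy per-turn summaries (values are made up).
turns = [
    {"speaker": "A", "f0_mean_hz": 110.0},
    {"speaker": "B", "f0_mean_hz": 210.0},
    {"speaker": "A", "f0_mean_hz": 130.0},
]
per_speaker = aggregate_turns(turns, ["speaker"], ["f0_mean_hz"])
```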

Outputs¶

Two CSVs, with predictable default names under features/acoustics/:

Framewise (on by default): one row per short-time frame with time_s, f0_hz, f1_hz–f4_hz, loudness_db, hnr_db (+ IDs in per-turn mode).
Summary (always): mean/std/range of the framewise series (optionally only on voiced segments ≥ summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer variants, MFCC means/variances, CPP (if available), and optional tremor/glottal metrics.

Practical parameters (hey, that's alliteration!)¶

Voiced-only summaries: summarize_on_voiced_segments_ms=100 limits stats to voiced stretches ≥100 ms (helps avoid tiny blips).
Pause/VAD tuning: pause_top_db, pause_frame_length, pause_hop_length let you adjust librosa’s non-silence detection (useful for noisy rooms).
Pitch range: f0_min, f0_max bound the search for f0 (frame values outside the range are treated as unvoiced).
Cepstra: n_mfcc controls how many MFCCs are summarized.
Tremor/glottal: set mode="tremor" (requires a Praat script) or mode="advanced" (adds DisVoice-based glottal features).
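
The voiced-only summary idea can be sketched on a toy f0 track. The code assumes unvoiced frames are marked `None`, that a run's duration is its frame count times the hop, and that the standard deviation is the population SD; these are illustration choices, and `voiced_runs` and `summarize_voiced` are hypothetical helpers.

```python
from statistics import fmean, pstdev
from typing import List, Optional, Sequence, Tuple

def voiced_runs(f0: Sequence[Optional[float]], hop_ms: float,
                min_ms: float) -> List[Tuple[int, int]]:
    """Return [start, end) frame index pairs for voiced runs lasting at least min_ms."""
    runs, start = [], None
    for i, value in enumerate(list(f0) + [None]):  # sentinel closes a trailing run
        if value is not None and start is None:
            start = i
        elif value is None and start is not None:
            if (i - start) * hop_ms >= min_ms:
                runs.append((start, i))
            start = None
    return runs

def summarize_voiced(f0: Sequence[Optional[float]], hop_ms: float = 10.0,
                     min_ms: float = 100.0) -> Optional[dict]:
    """Mean/SD/range of f0 over sufficiently long voiced runs only."""
    kept = [v for a, b in voiced_runs(f0, hop_ms, min_ms) for v in f0[a:b]]
    if not kept:
        return None
    return {"n_frames": len(kept), "mean": fmean(kept),
            "std": pstdev(kept), "range": max(kept) - min(kept)}

# 120 ms of voicing around 105 Hz, then a 40 ms blip at 300 Hz that should be ignored.
track = [None] * 3 + [100.0, 110.0] * 6 + [None] * 2 + [300.0] * 4 + [None]
stats = summarize_voiced(track)
```

Without the duration filter, the short 300 Hz blip would pull the mean and range far from the speaker's actual pitch.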

CLI examples¶

Whole-file, default preprocessing and framewise+summary:

python -m taters.audio.analyze_vocal_acoustics \
  --wav "audio/session.wav" \
  --mode simple --overwrite_existing

Per-turn, aggregate by speaker:

python -m taters.audio.analyze_vocal_acoustics \
  --wav "audio/session.wav" \
  --transcript "transcripts/session.csv" \
  --time-unit ms \
  --group-by speaker \
  --mode simple \
  --overwrite_existing

Tuning pause detection (looser threshold, longer window):

python -m taters.audio.analyze_vocal_acoustics \
  --wav "audio/session.wav" \
  --pause-top-db 40 --pause-frame-length 4096 --pause-hop-length 1024

Python example¶

from taters import Taters
t = Taters()

res = t.audio.analyze_vocal_acoustics(
    wav_path="audio/session.wav",
    # or: transcript_csv="transcripts/session.csv", group_by=["speaker"],
    mode="simple",
    summarize_on_voiced_segments_ms=100,
    preprocess=True, target_sr=44100, target_dbfs=-20.0, remove_dc=True,
    pause_top_db=30, pause_frame_length=2048, pause_hop_length=512,
    overwrite_existing=True
)

print("Framewise CSV:", res["framewise_csv"])
print("Summary CSV:", res["summary_csv"])

Troubleshooting tips¶

Many Nones for formants/HNR: often means unvoiced frames or out-of-range pitch. Try widening f0_min/f0_max, or ensure preprocessing is on. Very narrow-band or music-heavy audio can also throw formant estimation off.
CPP/GNE missing: those depend on Praat commands that are not available in every environment; the extractor will warn and continue.
Weird pause counts: adjust pause_top_db upwards (e.g., 40) or use larger pause_frame_length to be more conservative in noisy recordings.
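
To see what the pause parameters control, here is a pure-Python approximation of energy-based non-silence detection. It is a simplified stand-in, not librosa's algorithm (librosa differs in details such as framing and padding): a frame counts as non-silent when its RMS is within `top_db` of the loudest frame. `nonsilent_intervals` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple

def nonsilent_intervals(samples: Sequence[float], top_db: float = 30.0,
                        frame_length: int = 2048,
                        hop_length: int = 512) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) spans whose frames are within top_db of the peak frame."""
    if not samples:
        return []
    rms = []
    for start in range(0, max(len(samples) - frame_length, 0) + 1, hop_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    reference = max(rms)
    if reference <= 0:
        return []
    threshold = reference * 10.0 ** (-top_db / 20.0)
    intervals, begin = [], None
    for i, level in enumerate(rms + [0.0]):  # sentinel closes a trailing run
        if level > threshold and begin is None:
            begin = i
        elif level <= threshold and begin is not None:
            end = min((i - 1) * hop_length + frame_length, len(samples))
            intervals.append((begin * hop_length, end))
            begin = None
    return intervals

# One second each of silence, a 440 Hz tone, and silence at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
signal = [0.0] * sr + tone + [0.0] * sr
intervals = nonsilent_intervals(signal, top_db=30.0, frame_length=256, hop_length=128)
```

Raising `top_db` lowers the threshold, so more frames count as speech and fewer pauses are reported; longer frames smooth over brief dips.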

API: acoustic analysis¶

Extract acoustic features and write a summary CSV and (by default) a framewise CSV.

This function computes a battery of speech/voice features using Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal measures. It supports two operating modes:

1) Whole-file analysis: features are derived across the entire WAV. Summary statistics can be restricted to voiced segments longer than a threshold.

2) Per-turn analysis (transcript-guided): the WAV is segmented using start_time/end_time from a transcript CSV, features are computed per segment, and (optionally) per-segment rows are aggregated via group_by (e.g., one row per speaker).

Two artifacts can be written:

Summary CSV (always): means/SDs/ranges of framewise series; silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal. With a transcript and group_by, the summary is aggregated per group.
Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR). Disable with include_framewise=False or set a custom path.

Parameters:

Name	Type	Description	Default
`wav_path`	`str or Path`	Path to a WAV file (mono or stereo, PCM). Required for both whole-file and per-turn modes; a `ValueError` is raised if it is `None`.	`None`
`transcript_csv`	`str or Path`	Path to a transcript CSV with at least `start_time`, `end_time` (and typically `speaker`). Intervals with non-positive duration are skipped. When provided, per-turn analysis is performed.	`None`
`time_unit`	`('ms', 's')`	Units for `start_time` and `end_time` in `transcript_csv`.	`"ms"`
`group_by`	`sequence of str`	Column names from `transcript_csv` used to aggregate per-turn summaries into higher-level rows (e.g., `["speaker"]`). If omitted, per-turn rows are written without aggregation.	`None`
`extra_id_cols`	`sequence of str`	Identifier/metadata columns to pass through when present (and to use as grouping keys where applicable). These are not numerically aggregated.	`("source", "speaker")`
`out_dir`	`str or Path`	Base directory for outputs if file paths are not given. Defaults to `./features/acoustics` (created if missing).	`None`
`out_framewise_csv`	`str or Path`	Path for the framewise CSV. If omitted and `include_framewise=True`, defaults to `<out_dir>/<stem>_framewise.csv`.	`None`
`out_summary_csv`	`str or Path`	Path for the summary CSV. If omitted, defaults to `<out_dir>/<stem>_summary.csv` (or an equivalent name in per-turn mode).	`None`
`overwrite_existing`	`bool`	If `False` and an output already exists, returns existing paths without recomputation. If `True`, outputs are recomputed and overwritten.	`False`
`include_framewise`	`bool`	If `True`, also write the framewise table. Set to `False` to write only the summary.	`True`
`mode`	`('simple', 'tremor', 'advanced')`	Feature families to compute: - `"simple"`: framewise f0, formants (F1–F4), loudness, HNR; summary stats; silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available). - `"tremor"`: everything in simple plus tremor metrics via a Praat script (requires `tremor_script`). - `"advanced"`: everything in tremor plus glottal features (requires DisVoice and dependencies).	`"simple"`
`summarize_on_voiced_segments_ms`	`int or None`	If an integer, summary statistics for framewise series are computed only on voiced segments whose duration is at least this many milliseconds. If `None`, all frames are used.	`100`
`f0_min`	`float`	Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0 values are treated as unvoiced (0 in framewise; excluded from voiced summaries).	`75.0`
`f0_max`	`float`	Maximum fundamental frequency (Hz) for pitch tracking.	`500.0`
`n_mfcc`	`int`	Number of MFCC coefficients to summarize (means and variances).	`14`
`tremor_script`	`str or Path`	Path to a Praat tremor script. Required when `mode in {"tremor","advanced"}`.	`None`
`preprocess`	`bool`	If `True` (whole-file mode), resample to `target_sr`, optionally remove DC offset (`remove_dc`), and normalize level toward `target_dbfs` with headroom protection. In per-turn mode, slices are analyzed with consistent parameters and are not re-normalized per slice.	`True`
`target_sr`	`int`	Target sample rate for preprocessing (whole-file mode).	`44100`
`target_dbfs`	`float`	Target loudness (dBFS) for level normalization (whole-file mode).	`-20.0`
`remove_dc`	`bool`	If `True`, attempt to remove DC offset during preprocessing (whole-file mode).	`True`
`pause_top_db`	`int`	Non-silence threshold for pause detection (higher → fewer speech segments). Passed to `librosa.effects.split`.	`30`
`pause_frame_length`	`int`	Frame length (samples) for pause detection.	`2048`
`pause_hop_length`	`int`	Hop length (samples) for pause detection.	`512`

Returns:

Type	Description
`dict`	Mapping with: `{"framewise_csv": pathlib.Path or None, "summary_csv": pathlib.Path}`.
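
A minimal sketch of consuming the returned mapping, assuming only what the return description states: `summary_csv` is always a path and `framewise_csv` may be `None`. The helper `load_rows` is hypothetical. The `utf-8-sig` encoding matches the writer in the source listing, which emits a byte-order mark.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def load_rows(path: Optional[Path]) -> List[Dict[str, str]]:
    """Read a CSV into a list of dicts; return [] when the path is None."""
    if path is None:  # e.g. framewise_csv when include_framewise=False
        return []
    with Path(path).open(newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


# Hypothetical usage with the mapping returned by analyze_acoustics:
# res = analyze_acoustics(wav_path="session.wav")
# summary_rows = load_rows(res["summary_csv"])
# frame_rows = load_rows(res["framewise_csv"])
```

Reading with plain `utf-8` would leave the byte-order mark glued to the first column name, so the first key would not match its expected header.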

Raises:

Type	Description
`ValueError`	If neither `wav_path` nor `transcript_csv` is provided; if `transcript_csv` is provided without `wav_path`; if required transcript columns are missing; or if `mode` requires unavailable dependencies (e.g., `tremor_script` for `"tremor"`, DisVoice for `"advanced"`).
`FileNotFoundError`	If provided paths do not exist.
`RuntimeError`	If feature extraction fails due to decoding errors, invalid audio, or downstream library issues.

Notes

Framewise CSV (written when include_framewise=True)
One row per short-time frame with: frame_index, time_s, f0_hz, f1_hz–f4_hz, loudness_db, hnr_db. In per-turn mode, also includes segment_index, start_s, end_s, and any extra_id_cols present.

Summary CSV (always written)
Whole-file: one row.
Per-turn (no group_by): one row per interval.
Per-turn with group_by: one row per group (e.g., per speaker).
Columns include summary stats of framewise series (on all frames or voiced segments ≥ summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer variants, MFCC means/variances, CPP (if available), optional tremor/glottal metrics, and any extra_id_cols/group_by columns.
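
The voiced-segment filter can be pictured as follows: find runs of consecutive voiced frames and keep only the runs that last long enough. This is a sketch under stated assumptions, not the library's algorithm: a frame counts as voiced when f0 > 0, run duration is measured from the first to the last frame time in the run, and `voiced_frames_in_long_runs` is a hypothetical name.

```python
from typing import List


def voiced_frames_in_long_runs(
    times_s: List[float], f0_hz: List[float], min_ms: int = 100
) -> List[int]:
    """Return indices of frames inside voiced runs lasting at least min_ms."""
    keep: List[int] = []
    run: List[int] = []

    def flush() -> None:
        # Keep the current run only if it spans at least min_ms, then reset it.
        if run and (times_s[run[-1]] - times_s[run[0]]) * 1000.0 >= min_ms:
            keep.extend(run)
        run.clear()

    for i, f0 in enumerate(f0_hz):
        if f0 > 0:
            run.append(i)
        else:
            flush()
    flush()  # handle a run that reaches the end of the series
    return keep


times = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
f0 = [120.0, 125.0, 130.0, 128.0, 0.0, 140.0, 0.0]
kept = voiced_frames_in_long_runs(times, f0, min_ms=100)
```

Here the first run spans 150 ms and is kept, while the isolated voiced frame at index 5 is dropped. Summary statistics would then be computed over the kept frames only.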

Performance

Per-turn analysis can be I/O intensive for long files with dense transcripts. Tremor/glottal metrics are substantially more expensive than simple mode.

Examples:

Whole-file analysis with framewise output:

>>> analyze_acoustics(
...     wav_path="session.wav",
...     out_dir="features/acoustics",
... )

Per-turn analysis aggregated by speaker:

>>> analyze_acoustics(
...     wav_path="session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     group_by=["speaker"],
...     extra_id_cols=["source", "speaker"],
...     out_summary_csv="features/acoustics/session_by_speaker.csv",
...     summarize_on_voiced_segments_ms=100,
...     mode="simple",
... )
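
When `out_summary_csv` is omitted, the default filename depends on whether a transcript and `group_by` were given. The sketch below mirrors the naming logic visible in the function's source; `default_summary_path` is a hypothetical helper, and `PurePosixPath` is used only to keep the output platform-independent.

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence


def default_summary_path(
    wav_path: str,
    out_dir: str = "features/acoustics",
    group_by: Optional[Sequence[str]] = None,
    has_transcript: bool = False,
) -> PurePosixPath:
    """Build <out_dir>/<stem>_summary[_by_<cols>].csv."""
    stem = PurePosixPath(wav_path).stem
    # The "_by_..." suffix is added only for transcript-guided runs with group_by
    suffix = "_by_" + "_".join(group_by) if (has_transcript and group_by) else ""
    return PurePosixPath(out_dir) / f"{stem}_summary{suffix}.csv"


whole_file = default_summary_path("audio/session.wav")
by_speaker = default_summary_path(
    "audio/session.wav", group_by=["speaker"], has_transcript=True
)
```

Because whole-file and grouped runs resolve to different filenames, they can share one output directory without overwriting each other.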

Source code in src\taters\audio\analyze_vocal_acoustics.py

def analyze_acoustics(
    *,
    # Inputs (wav_path is always required; add transcript_csv for per-turn analysis)
    wav_path: Optional[Union[str, Path]] = None,
    transcript_csv: Optional[Union[str, Path]] = None,  # if provided, we do per-turn analysis
    # Transcript options
    time_unit: Literal["ms","s"] = "ms",
    group_by: Optional[Sequence[str]] = None,       # e.g., ["speaker"]
    extra_id_cols: Sequence[str] = ("source","speaker"),
    # Output
    out_dir: Optional[Union[str, Path]] = None,
    out_framewise_csv: Optional[Union[str, Path]] = None,
    out_summary_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,
    include_framewise: bool = True,              # ← default ON now
    # Analysis options
    mode: Mode = "simple",
    summarize_on_voiced_segments_ms: Optional[int] = 100,
    f0_min: float = 75.0,
    f0_max: float = 500.0,
    n_mfcc: int = 14,
    tremor_script: Optional[Union[str, Path]] = None,
    # preprocessing controls
    preprocess: bool = True,
    target_sr: int = 44100,
    target_dbfs: float = -20.0,
    remove_dc: bool = True,
    # VAD/"pause" tuning
    pause_top_db: int = 30,
    pause_frame_length: int = 2048,
    pause_hop_length: int = 512,
) -> Dict[str, Optional[Path]]:
    """
    Extract acoustic features and write a summary CSV and (by default) a framewise CSV.

    This function computes a battery of speech/voice features using
    Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal
    measures. It supports two operating modes:

    1) Whole-file analysis
       Features are derived across the entire WAV. Summary statistics can be
       restricted to voiced segments longer than a threshold.

    2) Per-turn analysis (transcript-guided)
       The WAV is segmented using `start_time`/`end_time` from a transcript CSV,
       features are computed per segment, and (optionally) per-segment rows are
       aggregated via `group_by` (e.g., one row per speaker).

    Two artifacts can be written:
      • Summary CSV (always): means/SDs/ranges of framewise series; silence ratio;
        jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal.
        With a transcript and `group_by`, the summary is aggregated per group.
      • Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR).
        Disable with `include_framewise=False` or set a custom path.

    Parameters
    ----------
    wav_path : str or pathlib.Path, optional
        Path to a WAV file (mono or stereo, PCM). Required for both whole-file
        and per-turn modes. If both `wav_path` and `transcript_csv` are ``None``,
        a ``ValueError`` is raised.
    transcript_csv : str or pathlib.Path, optional
        Path to a transcript CSV with at least `start_time`, `end_time` (and
        typically `speaker`). Intervals with non-positive duration are skipped.
        When provided, per-turn analysis is performed.
    time_unit : {"ms", "s"}, default "ms"
        Units for `start_time` and `end_time` in `transcript_csv`.
    group_by : sequence of str, optional
        Column names from `transcript_csv` used to aggregate per-turn summaries
        into higher-level rows (e.g., `["speaker"]`). If omitted, per-turn rows
        are written without aggregation.
    extra_id_cols : sequence of str, default ("source", "speaker")
        Identifier/metadata columns to pass through when present (and to use as
        grouping keys where applicable). These are not numerically aggregated.
    out_dir : str or pathlib.Path, optional
        Base directory for outputs if file paths are not given. Defaults to
        ``./features/acoustics`` (created if missing).
    out_framewise_csv : str or pathlib.Path, optional
        Path for the framewise CSV. If omitted and `include_framewise=True`,
        defaults to ``<out_dir>/<stem>_framewise.csv``.
    out_summary_csv : str or pathlib.Path, optional
        Path for the summary CSV. If omitted, defaults to
        ``<out_dir>/<stem>_summary.csv`` (or an equivalent name in per-turn mode).
    overwrite_existing : bool, default False
        If ``False`` and an output already exists, returns existing paths without
        recomputation. If ``True``, outputs are recomputed and overwritten.
    include_framewise : bool, default True
        If ``True``, also write the framewise table. Set to ``False`` to write
        only the summary.
    mode : {"simple", "tremor", "advanced"}, default "simple"
        Feature families to compute:
          - ``"simple"``: framewise f0, formants (F1–F4), loudness, HNR; summary stats;
            silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available).
          - ``"tremor"``: everything in *simple* plus tremor metrics via a Praat script
            (requires `tremor_script`).
          - ``"advanced"``: everything in *tremor* plus glottal features (requires
            DisVoice and dependencies).
    summarize_on_voiced_segments_ms : int or None, default 100
        If an integer, summary statistics for framewise series are computed only
        on voiced segments whose duration is at least this many milliseconds.
        If ``None``, all frames are used.
    f0_min : float, default 75.0
        Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0
        values are treated as unvoiced (0 in framewise; excluded from voiced summaries).
    f0_max : float, default 500.0
        Maximum fundamental frequency (Hz) for pitch tracking.
    n_mfcc : int, default 14
        Number of MFCC coefficients to summarize (means and variances).
    tremor_script : str or pathlib.Path, optional
        Path to a Praat tremor script. Required when ``mode in {"tremor","advanced"}``.
    preprocess : bool, default True
        If ``True`` (whole-file mode), resample to `target_sr`, optionally remove
        DC offset (`remove_dc`), and normalize level toward `target_dbfs` with
        headroom protection. In per-turn mode, slices are analyzed with consistent
        parameters and are not re-normalized per slice.
    target_sr : int, default 44100
        Target sample rate for preprocessing (whole-file mode).
    target_dbfs : float, default -20.0
        Target loudness (dBFS) for level normalization (whole-file mode).
    remove_dc : bool, default True
        If ``True``, attempt to remove DC offset during preprocessing (whole-file mode).
    pause_top_db : int, default 30
        Non-silence threshold for pause detection (higher → fewer speech segments).
        Passed to ``librosa.effects.split``.
    pause_frame_length : int, default 2048
        Frame length (samples) for pause detection.
    pause_hop_length : int, default 512
        Hop length (samples) for pause detection.

    Returns
    -------
    dict
        Mapping with:
        ``{"framewise_csv": pathlib.Path or None, "summary_csv": pathlib.Path}``.

    Raises
    ------
    ValueError
        If neither `wav_path` nor `transcript_csv` is provided; if `transcript_csv`
        is provided without `wav_path`; if required transcript columns are missing; or
        if `mode` requires unavailable dependencies (e.g., `tremor_script` for
        ``"tremor"``, DisVoice for ``"advanced"``).
    FileNotFoundError
        If provided paths do not exist.
    RuntimeError
        If feature extraction fails due to decoding errors, invalid audio, or
        downstream library issues.

    Notes
    -----
    **Framewise CSV** (written when `include_framewise=True`)  
    One row per short-time frame with: `frame_index`, `time_s`, `f0_hz`,
    `f1_hz`–`f4_hz`, `loudness_db`, `hnr_db`. In per-turn mode, also includes
    `segment_index`, `start_s`, `end_s`, and any `extra_id_cols` present.

    **Summary CSV** (always written)  
    Whole-file: one row.  
    Per-turn (no `group_by`): one row per interval.  
    Per-turn with `group_by`: one row per group (e.g., per speaker).  
    Columns include summary stats of framewise series (on all frames or voiced
    segments ≥ `summarize_on_voiced_segments_ms`), silence ratio, jitter/shimmer
    variants, MFCC means/variances, CPP (if available), optional tremor/glottal
    metrics, and any `extra_id_cols`/`group_by` columns.

    Performance
    -----------
    Per-turn analysis can be I/O intensive for long files with dense transcripts.
    Tremor/glottal metrics are substantially more expensive than *simple* mode.

    Examples
    --------
    Whole-file analysis with framewise output:

    >>> analyze_acoustics(
    ...     wav_path="session.wav",
    ...     out_dir="features/acoustics",
    ... )

    Per-turn analysis aggregated by speaker:

    >>> analyze_acoustics(
    ...     wav_path="session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     group_by=["speaker"],
    ...     extra_id_cols=["source", "speaker"],
    ...     out_summary_csv="features/acoustics/session_by_speaker.csv",
    ...     summarize_on_voiced_segments_ms=100,
    ...     mode="simple",
    ... )
    """

    if wav_path is None:
        raise ValueError("wav_path is required")

    wav_path = Path(wav_path)
    if out_dir is None:
        out_dir = Path("features") / "acoustics"
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Default output paths
    stem = wav_path.stem
    if out_framewise_csv is None:
        out_framewise_csv = out_dir / f"{stem}_framewise.csv"
    else:
        out_framewise_csv = Path(out_framewise_csv)

    if out_summary_csv is None:
        suffix = "_by_" + "_".join(group_by) if (transcript_csv and group_by) else ""
        out_summary_csv = out_dir / f"{stem}_summary{suffix}.csv"
    else:
        out_summary_csv = Path(out_summary_csv)

    # Respect overwrite_existing
    if (not overwrite_existing) and out_summary_csv.exists():
        print(f"[acoustics] Summary output already exists; returning existing file: {out_summary_csv}")
        return {"framewise_csv": out_framewise_csv if out_framewise_csv.exists() else None,
                "summary_csv": out_summary_csv}

    # Run
    if transcript_csv:
        framewise_df, summary_df = _analyze_turns(
            wav_path=wav_path,
            transcript_csv=transcript_csv,
            time_unit=time_unit,
            group_by=group_by,
            extra_id_cols=extra_id_cols,
            mode=mode,
            summarize_on_voiced_segments_ms=summarize_on_voiced_segments_ms,
            include_framewise=include_framewise,
            tremor_script=tremor_script,
            f0_min=f0_min,
            f0_max=f0_max,
            n_mfcc=n_mfcc,
            preprocess=preprocess,
            target_sr=target_sr,
            target_dbfs=target_dbfs,
            remove_dc=remove_dc,
            pause_top_db=pause_top_db,
            pause_frame_length=pause_frame_length,
            pause_hop_length=pause_hop_length,
        )
    else:
        fdf, summ = _analyze_clip(
            wav_path,
            mode=mode,
            summarize_on_voiced_segments_ms=summarize_on_voiced_segments_ms,
            include_framewise=include_framewise,
            tremor_script=tremor_script,
            f0_min=f0_min,
            f0_max=f0_max,
            n_mfcc=n_mfcc,
            preprocess=preprocess,
            target_sr=target_sr,
            target_dbfs=target_dbfs,
            remove_dc=remove_dc,
            pause_top_db=pause_top_db,
            pause_frame_length=pause_frame_length,
            pause_hop_length=pause_hop_length,
        )
        # Build summary DF; carry extra_id_cols if user offered any via filename context later
        summary_df = pd.DataFrame([summ])
        framewise_df = fdf

    # Write outputs
    frame_path_out: Optional[Path] = None
    if include_framewise and framewise_df is not None:
        framewise_df.to_csv(out_framewise_csv, index=False, encoding="utf-8-sig")
        frame_path_out = out_framewise_csv

    summary_df.to_csv(out_summary_csv, index=False, encoding="utf-8-sig")

    return {"framewise_csv": frame_path_out, "summary_csv": out_summary_csv}


Split WAV by speaker¶

Given a diarization transcript, create one WAV per speaker by concatenating that speaker's segments. You can insert small silences to avoid clicks at joins and resample/downmix on the fly. Filenames are readable and stable.

What it does

reads a timestamped CSV with start_time,end_time,speaker
builds one output WAV per unique speaker
optional silence padding, resampling, mono mixdown
skips ultra-short segments; clamps times to audio bounds
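
The time handling in the last bullet can be sketched as a single conversion step: scale to milliseconds, clamp to the audio bounds, and drop anything that ends up empty. This follows the logic in the function's source, but `to_clamped_ms` is a hypothetical helper written for illustration.

```python
from typing import Optional, Tuple


def to_clamped_ms(
    start: float, end: float, time_unit: str, audio_len_ms: int
) -> Optional[Tuple[int, int]]:
    """Convert an interval to integer ms, clamp it to the audio, drop empty spans."""
    if time_unit not in ("ms", "s"):
        raise ValueError("time_unit must be 'ms' or 's'")
    factor = 1000.0 if time_unit == "s" else 1.0
    start_ms = int(round(start * factor))
    end_ms = int(round(end * factor))
    # Clamp both ends into [0, audio_len_ms]
    start_ms = max(0, min(start_ms, audio_len_ms))
    end_ms = max(0, min(end_ms, audio_len_ms))
    if end_ms <= start_ms:
        return None  # nothing left inside the audio
    return start_ms, end_ms
```

For a 2-second file, `to_clamped_ms(1.5, 3.0, "s", 2000)` is trimmed to the end of the audio, and an interval that starts past the end is discarded entirely.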

When to use

you want per-speaker audio for targeted feature extraction or human coding
you plan to model speakers separately or compute speaker-level aggregates

API: make per-speaker WAVs from a transcript¶

Concatenate speaker-specific segments into per-speaker WAV files.

If merge_consecutive=True (default), adjacent transcript rows with the same speaker are merged into a single, longer segment spanning from the first start to the last end — including any silence between those turns. If you need the strict per-row behavior, set merge_consecutive=False.
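
A compact sketch of that merging rule, following the logic in the function's source. Rows are taken in transcript order, and a row extends the speaker's last segment only when the immediately preceding row had the same speaker. The name `merge_consecutive_turns` and the tuple layout are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int, int]  # (speaker, start_ms, end_ms), in transcript order


def merge_consecutive_turns(rows: List[Row]) -> Dict[str, List[Tuple[int, int]]]:
    """Group segments by speaker, extending the last one when a speaker continues."""
    segs: Dict[str, List[Tuple[int, int]]] = {}
    prev: Optional[str] = None
    for speaker, start, end in rows:
        if prev == speaker and segs.get(speaker):
            s0, e0 = segs[speaker][-1]
            # Keep the earliest start and latest end; the gap between turns is included
            segs[speaker][-1] = (min(s0, start), max(e0, end))
        else:
            segs.setdefault(speaker, []).append((start, end))
        prev = speaker
    return segs


rows = [("A", 0, 1000), ("A", 1500, 2000), ("B", 2000, 3000), ("A", 3000, 4000)]
merged = merge_consecutive_turns(rows)
```

Speaker A's first two rows collapse into one 0 to 2000 ms span that includes the 500 ms gap, while A's later turn stays separate because B spoke in between.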

Parameters:

Name	Type	Description	Default
`source_wav`	`str \| Path`	Path to the source WAV.	required
`transcript_csv_path`	`str \| Path`	CSV with timing and speaker columns (e.g., diarization output).	required
`output_dir`	`str \| Path \| None`	Where to write the per-speaker files. If None, defaults to `./audio_split/<source_stem>/`.	`None`
`overwrite_existing`	`bool`	Keyword-only. If `False` and a speaker's output WAV already exists, that file is kept and its path is returned without rewriting. If `True`, the file is rebuilt.	`False`
`start_col`	`str`	Name of the start-time column in the transcript CSV.	`'start_time'`
`end_col`	`str`	Name of the end-time column in the transcript CSV.	`'end_time'`
`speaker_col`	`str`	Name of the speaker column in the transcript CSV.	`'speaker'`
`time_unit`	`('ms', 's')`	Units for start/end columns.	`"ms"`
`silence_ms`	`int`	If `pre_silence_ms`/`post_silence_ms` are None, use this for both sides.	`1000`
`pre_silence_ms`	`int \| None`	Explicit padding (ms) before each segment; overrides `silence_ms`.	`None`
`post_silence_ms`	`int \| None`	Explicit padding (ms) after each segment; overrides `silence_ms`.	`None`
`sr`	`int \| None`	Resample output to this rate. If None, keep original rate.	`16000`
`mono`	`bool`	Downmix to mono if True.	`True`
`min_dur_ms`	`int`	Skip segments shorter than this duration (ms).	`50`
`merge_consecutive`	`bool`	Merge back-to-back turns for the same speaker into one segment span (including any inter-turn silence). If False, emit one clip per row.	`True`

Returns:

Type	Description
`dict[str, Path]`	Mapping from friendly speaker label → output WAV path.

Behavior

Speaker labels are sanitized into internal keys; a more readable label, stripped only of path-hostile characters, is used in output filenames and as the key of the returned mapping.
Segments are concatenated per speaker in the order their rows appear in the transcript (the function does not re-sort them by start time).
If a speaker ends up with zero valid segments, no file is written.

Examples:

>>> make_speaker_wavs_from_csv(
...     source_wav="audio/session.wav",
...     transcript_csv_path="transcripts/session.csv",
...     time_unit="ms",
...     silence_ms=0,  # no padding
...     sr=16000,
...     mono=True,
... )
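
Because every kept segment is wrapped in padding, the default `silence_ms=1000` adds two seconds per segment. The sketch below estimates the length of one speaker's output from its segment list, assuming each kept segment contributes pre-padding, clip, and post-padding as in the function's source. `expected_duration_ms` is a hypothetical helper, and the real file may differ by a few milliseconds after resampling.

```python
from typing import List, Optional, Tuple


def expected_duration_ms(
    segments: List[Tuple[int, int]],
    silence_ms: int = 1000,
    pre_silence_ms: Optional[int] = None,
    post_silence_ms: Optional[int] = None,
    min_dur_ms: int = 50,
) -> int:
    """Estimate output length: each kept segment adds pre + clip + post."""
    pre = silence_ms if pre_silence_ms is None else pre_silence_ms
    post = silence_ms if post_silence_ms is None else post_silence_ms
    pre, post = max(0, pre), max(0, post)  # negative padding counts as zero
    total = 0
    for start, end in segments:
        dur = end - start
        if dur < min_dur_ms:  # too short: skipped
            continue
        total += pre + dur + post
    return total


segments = [(0, 2000), (3000, 4000), (5000, 5020)]
padded = expected_duration_ms(segments)
unpadded = expected_duration_ms(segments, silence_ms=0)
```

With the defaults, 3 seconds of speech becomes a 7-second file; with `silence_ms=0`, it stays at 3 seconds. The 20 ms segment is dropped in both cases.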

Source code in src\taters\audio\split_wav_by_speaker.py

def make_speaker_wavs_from_csv(
    source_wav: Union[str, Path],
    transcript_csv_path: Union[str, Path],
    output_dir: Union[str, Path, None] = None,
    *,
    overwrite_existing: bool = False,
    start_col: str = "start_time",
    end_col: str = "end_time",
    speaker_col: str = "speaker",
    time_unit: str = "ms",             # "ms" or "s"
    silence_ms: int = 1000,
    pre_silence_ms: Optional[int] = None,
    post_silence_ms: Optional[int] = None,
    sr: Optional[int] = 16000,
    mono: bool = True,
    min_dur_ms: int = 50,
    merge_consecutive: bool = True,    # NEW: merge back-to-back turns by same speaker
) -> Dict[str, Path]:
    """
    Concatenate speaker-specific segments into per-speaker WAV files.

    If `merge_consecutive=True` (default), adjacent transcript rows with the same
    speaker are merged into a single, longer segment spanning from the first
    start to the last end — including any silence between those turns. If you
    need the strict per-row behavior, set `merge_consecutive=False`.

    Parameters
    ----------
    source_wav : str | Path
        Path to the source WAV.
    transcript_csv_path : str | Path
        CSV with timing and speaker columns (e.g., diarization output).
    output_dir : str | Path | None, optional
        Where to write the per-speaker files. If None, defaults to
        ``./audio_split/<source_stem>/``.
    start_col, end_col, speaker_col : str
        Column names in the transcript CSV.
    time_unit : {"ms","s"}, default "ms"
        Units for start/end columns.
    silence_ms : int, default 1000
        If `pre_silence_ms`/`post_silence_ms` are None, use this for both sides.
    pre_silence_ms, post_silence_ms : int | None
        Explicit padding (ms) before/after each segment; overrides `silence_ms`.
    sr : int | None, default 16000
        Resample output to this rate. If None, keep original rate.
    mono : bool, default True
        Downmix to mono if True.
    min_dur_ms : int, default 50
        Skip segments shorter than this duration (ms).
    merge_consecutive : bool, default True
        Merge back-to-back turns for the same speaker into one segment span
        (including any inter-turn silence). If False, emit one clip per row.

    Returns
    -------
    dict[str, Path]
        Mapping from friendly speaker label → output WAV path.

    Behavior
    --------
    - Input speaker labels are sanitized for filenames but a more readable label
      (without path-hostile characters) is preserved for naming.
    - Segments are concatenated per speaker in transcript row order (no re-sorting).
    - If a speaker ends up with zero valid segments, no file is written.

    Examples
    --------
    >>> make_speaker_wavs_from_csv(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv_path="transcripts/session.csv",
    ...     time_unit="ms",
    ...     silence_ms=0,  # no padding
    ...     sr=16000,
    ...     mono=True,
    ... )
    """
    if time_unit not in ("ms", "s"):
        raise ValueError("time_unit must be 'ms' or 's'")

    def _friendly_filename_label(name: str) -> str:
        s = (name or "").strip()
        s = s.replace("/", "_").replace("\\", "_")
        s = re.sub(r'[<>:"|?*]', "", s)
        s = re.sub(r"\s+", " ", s)
        return s or "SPEAKER_0"

    source_wav = Path(source_wav)
    transcript_csv_path = Path(transcript_csv_path)
    out_dir = Path(output_dir) if output_dir is not None else (Path.cwd() / "audio_split" / source_wav.stem)
    out_dir.mkdir(parents=True, exist_ok=True)
    base_stem = source_wav.stem

    audio = AudioSegment.from_file(source_wav)
    if sr:
        audio = audio.set_frame_rate(sr)
    if mono:
        audio = audio.set_channels(1)

    factor = 1000.0 if time_unit == "s" else 1.0
    audio_len_ms = len(audio)

    with transcript_csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    segs_by_spk: Dict[str, List[tuple[int, int]]] = {}
    label_for_key: Dict[str, str] = {}

    # Build segments with awareness of original row order so that we can merge
    # adjacent turns for the same speaker when requested.
    prev_spk_key: Optional[str] = None
    for row in rows:
        try:
            start_raw = float(row[start_col])
            end_raw   = float(row[end_col])
            raw_spk   = str(row.get(speaker_col, "SPEAKER_0"))
        except Exception:
            continue

        start_ms = int(round(start_raw * factor))
        end_ms   = int(round(end_raw   * factor))
        if end_ms <= start_ms:
            continue

        start_ms = _clamp(start_ms, 0, audio_len_ms)
        end_ms   = _clamp(end_ms,   0, audio_len_ms)
        if end_ms <= start_ms:
            continue

        spk_key = _sanitize_speaker(raw_spk)
        label_for_key.setdefault(spk_key, _friendly_filename_label(raw_spk))

        if merge_consecutive and prev_spk_key == spk_key and segs_by_spk.get(spk_key):
            # Extend the last segment for this speaker to cover the new end
            s0, e0 = segs_by_spk[spk_key][-1]
            # Keep the earliest start, extend to the latest end
            s_new = min(s0, start_ms)
            e_new = max(e0, end_ms)
            segs_by_spk[spk_key][-1] = (s_new, e_new)
        else:
            # Strictly append a new segment
            segs_by_spk.setdefault(spk_key, []).append((start_ms, end_ms))

        prev_spk_key = spk_key

    # Optional: drop very short segments after merging
    for spk_key, segs in list(segs_by_spk.items()):
        segs_by_spk[spk_key] = [(s, e) for (s, e) in segs if (e - s) >= min_dur_ms]

    pre_ms  = silence_ms if pre_silence_ms  is None else pre_silence_ms
    post_ms = silence_ms if post_silence_ms is None else post_silence_ms
    pre_sil  = AudioSegment.silent(duration=max(0, pre_ms),  frame_rate=audio.frame_rate)
    post_sil = AudioSegment.silent(duration=max(0, post_ms), frame_rate=audio.frame_rate)
    if mono:
        pre_sil  = pre_sil.set_channels(1)
        post_sil = post_sil.set_channels(1)

    results: Dict[str, Path] = {}
    for spk_key, segs in segs_by_spk.items():
        if not segs:
            continue

        friendly = label_for_key.get(spk_key, spk_key)
        out_path = out_dir / f"{base_stem}_{friendly}.wav"

        if (not overwrite_existing) and out_path.is_file():
            results[friendly] = out_path
            continue

        out = AudioSegment.silent(duration=0, frame_rate=audio.frame_rate)
        if mono:
            out = out.set_channels(1)

        for (s, e) in segs:
            clip = audio[s:e]
            if len(clip) < min_dur_ms:
                continue
            out += pre_sil + clip + post_sil

        if len(out) == 0:
            continue

        out.export(out_path, format="wav", codec="pcm_s16le")
        results[friendly] = out_path

    return results


Practical notes¶

Paths: if you do not pass explicit outputs, tools write to predictable folders under the current working directory (for example, ./audio, ./features/whisper-embeddings).
Overwrite behavior: by default, functions will not overwrite existing files; pass the relevant flag (for example, overwrite_existing=True) to force a rebuild.
Device selection: where supported, device="auto" picks sensibly; set cuda or cpu explicitly when you need control.
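
The overwrite rule can be pictured as a small guard around any expensive step. The sketch below is a generic pattern, not a Taters function: `build_if_needed` and its `build` callback are hypothetical names for illustration.

```python
from pathlib import Path
from typing import Callable


def build_if_needed(
    out_path: Path,
    build: Callable[[Path], None],
    overwrite_existing: bool = False,
) -> bool:
    """Run build(out_path) unless the file exists and overwriting was not requested.

    Returns True when the file was (re)built, False when the existing file was kept.
    """
    out_path = Path(out_path)
    if out_path.is_file() and not overwrite_existing:
        return False
    out_path.parent.mkdir(parents=True, exist_ok=True)
    build(out_path)
    return True
```

Re-running a pipeline with this pattern is cheap: finished steps are skipped, and only missing outputs are rebuilt unless you ask otherwise.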