Audio Modules¶

convert_audio_to_wav ¶

convert_audio_to_wav(
    input_path,
    *,
    output_path=None,
    output_dir=None,
    sample_rate=16000,
    bit_depth=16,
    channels=1,
    overwrite_existing=False
)

Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

Parameters:

Name	Type	Description	Default
`input_path`	`str \| Path`	Source media file (audio or video container). FFmpeg must be able to read it.	required
`output_path`	`str \| Path \| None`	Target WAV path. If None, defaults to `<cwd>/audio/<input_stem>.wav`.	`None`
`sample_rate`	`int`	Desired sample rate (Hz).	`16000`
`bit_depth`	`(16, 24, 32)`	Output PCM bit depth; maps to `pcm_s{bit_depth}le` codec.	`16,24,32`
`channels`	`int \| None`	If provided, set number of output channels (e.g., 1=mono, 2=stereo). If None, keep original channel count.	`1`
`overwrite_existing`	`bool`	Overwrite `output_path` if it already exists.	`False`

Returns:

Type	Description
`Path`	Path to the written WAV file.

Raises:

Type	Description
`FileNotFoundError`	If `input_path` does not exist.
`RuntimeError`	If FFmpeg/FFprobe are missing or the conversion fails.

Notes

Video inputs are supported: the audio stream is extracted and converted.
For multi-channel sources and channels is None, channel layout is preserved.
We run FFmpeg with -nostdin to avoid TTY issues in pipelines.

Source code in src\taters\audio\convert_to_wav.py

def convert_audio_to_wav(
    input_path: Union[str, Path],
    *,
    output_path: Optional[Union[str, Path]] = None,
    output_dir: Optional[Union[str, Path]] = None,
    sample_rate: int = 16000,          # common for ASR
    bit_depth: int = 16,               # 16/24/32 signed PCM
    channels: int = 1,                 # 1=mono, 2=stereo
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default
) -> Path:
    """
    Convert any FFmpeg-readable audio/video file to a linear PCM WAV.

    Parameters
    ----------
    input_path : str | Path
        Source media file (audio or video container). FFmpeg must be able to read it.
    output_path : str | Path | None, optional
        Target WAV path. If None, defaults to
        ``<cwd>/audio/<input_stem>.wav``.
    sample_rate : int, default 16000
        Desired sample rate (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth; maps to ``pcm_s{bit_depth}le`` codec.
    channels : int | None, default 1
        If provided, set number of output channels (e.g., 1=mono, 2=stereo).
        If None, keep original channel count.
    overwrite_existing : bool, default False
        Overwrite `output_path` if it already exists.

    Returns
    -------
    Path
        Path to the written WAV file.

    Raises
    ------
    FileNotFoundError
        If `input_path` does not exist.
    RuntimeError
        If FFmpeg/FFprobe are missing or the conversion fails.

    Notes
    -----
    - Video inputs are supported: the audio stream is extracted and converted.
    - For multi-channel sources and `channels is None`, channel layout is preserved.
    - We run FFmpeg with ``-nostdin`` to avoid TTY issues in pipelines.
    """

    _check_ffmpeg()

    in_path = Path(input_path).resolve()
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    if output_path and output_dir:
        raise ValueError("Provide at most one of output_path or output_dir.")

    if output_path:
        out_path = Path(output_path).resolve()
    else:
        base = in_path.stem + ".wav"
        out_dir = Path(output_dir).resolve() if output_dir else Path.cwd() / "audio"
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / base

    if not overwrite_existing and Path(out_path).is_file():
        print("WAV file already exists; returning existing file.")
        return out_path

    pcm_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    if channels not in (1, 2):
        raise ValueError("channels must be 1 (mono) or 2 (stereo).")

    cmd = [
        "ffmpeg",
        "-nostdin",
        "-hide_banner", "-loglevel", "error",
        "-y" if overwrite_existing else "-n",
        "-i", str(in_path),
        "-vn",                        # ignore video
        "-acodec", pcm_map[bit_depth],
        "-ar", str(sample_rate),
        "-ac", str(channels),
        str(out_path),
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
    if result.returncode != 0:
        if not overwrite_existing and out_path.exists():
            raise FileExistsError(f"Target exists (use overwrite=True): {out_path}")
        raise RuntimeError(f"ffmpeg failed: {result.stderr.strip()}")

    return out_path

Thin CLI shim for the vendored Whisper diarization wrapper.

This module exists so you can run:

python -m taters.audio.diarize_with_thirdparty --audio_path ...

It simply delegates to the real implementation in taters/audio/diarizer/whisper_diar_wrapper.py. :contentReference[oaicite:0]{index=0}

Extract all audio streams from a video/container into standalone WAV files.

This utility probes the container with ffprobe, lists audio streams (with index and tags), and then maps each stream with ffmpeg to a separate PCM WAV. It is useful for multi-track recordings (e.g., Zoom, OBS, ProRes with stems). :contentReference[oaicite:1]{index=1}

split_audio_streams_to_wav ¶

split_audio_streams_to_wav(
    input_path,
    output_dir=None,
    sample_rate=48000,
    bit_depth=16,
    overwrite=True,
)

Extract each audio stream in a container to its own WAV file.

Parameters:

Name	Type	Description	Default
`input_path`	`str \| PathLike`	Video or audio container readable by FFmpeg.	required
`output_dir`	`str \| PathLike \| None`	Destination directory. If None, defaults to `./audio` in the current working directory (predictable write location).	`None`
`sample_rate`	`int`	Target sample rate for the output WAVs (Hz).	`48000`
`bit_depth`	`(16, 24, 32)`	Output PCM bit depth (little-endian).	`16,24,32`
`overwrite`	`bool`	If True, overwrite existing files. If False and a target exists, raises :class:`FileExistsError`.	`True`

Returns:

Type	Description
`list[str]`	Absolute paths to the created WAVs.

Behavior

Output file names are constructed from the input base name and stream metadata: <stem>_a<index>[_<lang>][_<title>].wav with safe slugs.
Uses -map 0:a:<N> to select the N-th audio stream in the container.
Runs FFmpeg with -nostdin and quiet loglevel to avoid TTY lockups.

Examples:

>>> split_audio_streams_to_wav("session.mp4")
['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']

Source code in src\taters\audio\extract_wav_from_video.py

def split_audio_streams_to_wav(
    input_path: str | os.PathLike,
    output_dir: str | os.PathLike | None = None,     # <-- now optional
    sample_rate: int = 48000,
    bit_depth: int = 16,
    overwrite: bool = True,
) -> List[str]:
    """
    Extract each audio stream in a container to its own WAV file.

    Parameters
    ----------
    input_path : str | os.PathLike
        Video or audio container readable by FFmpeg.
    output_dir : str | os.PathLike | None, optional
        Destination directory. If None, defaults to ``./audio`` in the current
        working directory (predictable write location).
    sample_rate : int, default 48000
        Target sample rate for the output WAVs (Hz).
    bit_depth : {16,24,32}, default 16
        Output PCM bit depth (little-endian).
    overwrite : bool, default True
        If True, overwrite existing files. If False and a target exists,
        raises :class:`FileExistsError`.

    Returns
    -------
    list[str]
        Absolute paths to the created WAVs.

    Behavior
    --------
    - Output file names are constructed from the input base name and stream
      metadata: ``<stem>_a<index>[_<lang>][_<title>].wav`` with safe slugs.
    - Uses ``-map 0:a:<N>`` to select the N-th audio stream in the container.
    - Runs FFmpeg with ``-nostdin`` and quiet loglevel to avoid TTY lockups.

    Examples
    --------
    >>> split_audio_streams_to_wav("session.mp4")
    ['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
    """

    _check_binaries()

    in_path = Path(input_path)
    if not in_path.exists():
        raise FileNotFoundError(f"Input file not found: {in_path}")

    # Default predictable location when none is provided
    if output_dir is None:
        out_dir = Path.cwd() / "audio"
    else:
        out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    print(f"Extracting audio streams from {in_path} to {out_dir} at {sample_rate} Hz, bit depth: {bit_depth}")

    streams = _probe_audio_streams(in_path)
    if not streams:
        raise ValueError("No audio streams found in input.")

    pcm_fmt_map = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}
    if bit_depth not in pcm_fmt_map:
        raise ValueError("bit_depth must be one of {16, 24, 32}.")
    pcm_codec = pcm_fmt_map[bit_depth]

    created_files: List[str] = []
    base = in_path.stem

    for s in streams:
        idx = s.get("index")
        tags = s.get("tags", {}) or {}
        lang = tags.get("language")
        title = tags.get("title")

        print(f"Extracting audio stream:\n"
              f"index: {idx}\n"
              f"tags: {tags}\n"
              f"language: {lang}\n"
              f"title: {title}\n")

        out_name = _build_wav_name(base, idx, lang, title)
        out_path = out_dir / out_name

        ffmpeg_cmd = [
            "ffmpeg",
            "-nostdin",
            "-hide_banner",
            "-loglevel", "error",
            "-y" if overwrite else "-n",
            "-i", str(in_path),
            "-map", f"0:a:{streams.index(s)}",  # Nth audio stream
            "-acodec", pcm_codec,
            "-ar", str(sample_rate),
            str(out_path),
        ]

        result = subprocess.run(ffmpeg_cmd, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if result.returncode != 0:
            if not overwrite and out_path.exists():
                raise FileExistsError(f"Target exists (use overwrite=True): {out_path}")
            raise RuntimeError(f"ffmpeg failed for stream {idx}: {result.stderr.strip()}")

        created_files.append(str(out_path))

    return created_files

Acoustic feature extraction (Praat/Parselmouth-based) with optional per-turn analysis and OpenWillis-style "simple / tremor / advanced" modes.

This module targets parity with the OpenWillis vocal acoustics stack: - Framewise tracks: f0, formants (F1–F4), loudness (intensity), HNR. - Summary stats of those tracks (mean, std, range), with an option to summarize only on voiced segments longer than 100 ms. - Phonation metrics via Praat (jitter/shimmer families, GNE). - Pause metrics (SPIR, DurMED, DurMAD) using energy-based VAD. - Cepstral features (MFCC mean/var; CPPS via Praat PowerCepstrogram). - Optional tremor metrics (requires the "tremor.praat" script). - Optional glottal features (HRF, NAQ, OQ) via DisVoice (if installed).

It also supports "per-turn" analysis using a transcript CSV, so you can compute speaker-level or utterance-level acoustics aligned with your diarized segments.

Outputs

1) Framewise CSV (optional): one row per analysis frame (or per frame per turn). 2) Summary CSV: one row per file (or per speaker/turn grouping), including pass-through metadata (e.g., source, speaker) if provided.

Dependencies

Required: parselmouth (Praat), numpy, pandas (for CSV I/O only), librosa
Optional: DisVoice (glottal metrics), pysptk (DisVoice dependency)
Optional: Praat tremor script file if you want tremor metrics

Notes on design choices

f0 range (75–500 Hz) matches OpenWillis defaults. Out-of-range f0 frames are set to 0 for "framewise", and are excluded from summary calculations (like OpenWillis).
Voiced frames are derived from Praat/Parselmouth tracks. A "voiced-segment >=100ms" filter is available for summary statistics (again, matching OpenWillis semantics).
Pause metrics (SPIR, DurMED, DurMAD) follow OpenWillis thresholds (50 ms < pause < 2 s).
CPPS is computed via Praat PowerCepstrogram calls; if Praat/Parselmouth lacks the function in your local build, we skip with a warning.
Tremor metrics require a Praat script (tremor.praat). Provide its path if you want them.

CLI

python -m taters.audio.analyze_acoustics --wav audio/speaker.wav --out-dir features/acoustics --mode simple --voiced-segments true --transcript-csv transcripts/X/X.csv --time-unit ms --group-by speaker --pass-through source speaker

FramewiseTracks `dataclass` ¶

FramewiseTracks(
    times, f0, f1, f2, f3, f4, loudness_db, hnr_db
)

Aligned framewise series at a fixed hop (e.g., 10 ms).

analyze_acoustics ¶

analyze_acoustics(
    *,
    wav_path=None,
    transcript_csv=None,
    time_unit="ms",
    group_by=None,
    extra_id_cols=("source", "speaker"),
    out_dir=None,
    out_framewise_csv=None,
    out_summary_csv=None,
    overwrite_existing=False,
    include_framewise=True,
    mode="simple",
    summarize_on_voiced_segments_ms=100,
    f0_min=75.0,
    f0_max=500.0,
    n_mfcc=14,
    tremor_script=None,
    preprocess=True,
    target_sr=44100,
    target_dbfs=-20.0,
    remove_dc=True,
    pause_top_db=30,
    pause_frame_length=2048,
    pause_hop_length=512
)

Extract acoustic features and write a summary CSV and (by default) a framewise CSV.

This function computes a battery of speech/voice features using Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal measures. It supports two operating modes:

1) Whole-file analysis Features are derived across the entire WAV. Summary statistics can be restricted to voiced segments longer than a threshold.

2) Per-turn analysis (transcript-guided) The WAV is segmented using start_time/end_time from a transcript CSV, features are computed per segment, and (optionally) per-segment rows are aggregated via group_by (e.g., one row per speaker).

Two artifacts can be written: • Summary CSV (always): means/SDs/ranges of framewise series; silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal. With a transcript and group_by, the summary is aggregated per group. • Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR). Disable with include_framewise=False or set a custom path.

Parameters:

Name	Type	Description	Default
`wav_path`	`str or Path`	Path to a WAV file (mono or stereo, PCM). Required for both whole-file and per-turn modes. If both `wav_path` and `transcript_csv` are `None`, a `ValueError` is raised.	`None`
`transcript_csv`	`str or Path`	Path to a transcript CSV with at least `start_time`, `end_time` (and typically `speaker`). Intervals with non-positive duration are skipped. When provided, per-turn analysis is performed.	`None`
`time_unit`	`('ms', 's')`	Units for `start_time` and `end_time` in `transcript_csv`.	`"ms"`
`group_by`	`sequence of str`	Column names from `transcript_csv` used to aggregate per-turn summaries into higher-level rows (e.g., `["speaker"]`). If omitted, per-turn rows are written without aggregation.	`None`
`extra_id_cols`	`sequence of str`	Identifier/metadata columns to pass through when present (and to use as grouping keys where applicable). These are not numerically aggregated.	`("source", "speaker")`
`out_dir`	`str or Path`	Base directory for outputs if file paths are not given. Defaults to `./features/acoustics` (created if missing).	`None`
`out_framewise_csv`	`str or Path`	Path for the framewise CSV. If omitted and `include_framewise=True`, defaults to `<out_dir>/<stem>_framewise.csv`.	`None`
`out_summary_csv`	`str or Path`	Path for the summary CSV. If omitted, defaults to `<out_dir>/<stem>_summary.csv` (or an equivalent name in per-turn mode).	`None`
`overwrite_existing`	`bool`	If `False` and an output already exists, returns existing paths without recomputation. If `True`, outputs are recomputed and overwritten.	`False`
`include_framewise`	`bool`	If `True`, also write the framewise table. Set to `False` to write only the summary.	`True`
`mode`	`('simple', 'tremor', 'advanced')`	Feature families to compute: - `"simple"`: framewise f0, formants (F1–F4), loudness, HNR; summary stats; silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available). - `"tremor"`: everything in simple plus tremor metrics via a Praat script (requires `tremor_script`). - `"advanced"`: everything in tremor plus glottal features (requires DisVoice and dependencies).	`"simple"`
`summarize_on_voiced_segments_ms`	`int or None`	If an integer, summary statistics for framewise series are computed only on voiced segments whose duration is at least this many milliseconds. If `None`, all frames are used.	`100`
`f0_min`	`float`	Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0 values are treated as unvoiced (0 in framewise; excluded from voiced summaries).	`75.0`
`f0_max`	`float`	Maximum fundamental frequency (Hz) for pitch tracking.	`500.0`
`n_mfcc`	`int`	Number of MFCC coefficients to summarize (means and variances).	`14`
`tremor_script`	`str or Path`	Path to a Praat tremor script. Required when `mode in {"tremor","advanced"}`.	`None`
`preprocess`	`bool`	If `True` (whole-file mode), resample to `target_sr`, optionally remove DC offset (`remove_dc`), and normalize level toward `target_dbfs` with headroom protection. In per-turn mode, slices are analyzed with consistent parameters and are not re-normalized per slice.	`True`
`target_sr`	`int`	Target sample rate for preprocessing (whole-file mode).	`44100`
`target_dbfs`	`float`	Target loudness (dBFS) for level normalization (whole-file mode).	`-20.0`
`remove_dc`	`bool`	If `True`, attempt to remove DC offset during preprocessing (whole-file mode).	`True`
`pause_top_db`	`int`	Non-silence threshold for pause detection (higher → fewer speech segments). Passed to `librosa.effects.split`.	`30`
`pause_frame_length`	`int`	Frame length (samples) for pause detection.	`2048`
`pause_hop_length`	`int`	Hop length (samples) for pause detection.	`512`

Returns:

Type	Description
`dict`	Mapping with: `{"framewise_csv": pathlib.Path or None, "summary_csv": pathlib.Path}`.

Raises:

Type	Description
`ValueError`	If neither `wav_path` nor `transcript_csv` is provided; if `transcript_csv` is provided without `wav_path`; if required transcript columns are missing; or if `mode` requires unavailable dependencies (e.g., `tremor_script` for `"tremor"`, DisVoice for `"advanced"`).
`FileNotFoundError`	If provided paths do not exist.
`RuntimeError`	If feature extraction fails due to decoding errors, invalid audio, or downstream library issues.

Notes

Framewise CSV (written when include_framewise=True)
One row per short-time frame with: frame_index, time_s, f0_hz, f1_hz–f4_hz, loudness_db, hnr_db. In per-turn mode, also includes segment_index, start_s, end_s, and any extra_id_cols present.

Summary CSV (always written)
Whole-file: one row.
Per-turn (no group_by): one row per interval.
Per-turn with group_by: one row per group (e.g., per speaker).
Columns include summary stats of framewise series (on all frames or voiced segments ≥ summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer variants, MFCC means/variances, CPP (if available), optional tremor/glottal metrics, and any extra_id_cols/group_by columns.

Performance

Per-turn analysis can be I/O intensive for long files with dense transcripts. Tremor/glottal metrics are substantially more expensive than simple mode.

Examples:

Whole-file analysis with framewise output:

>>> analyze_acoustics(
...     wav_path="session.wav",
...     out_dir="features/acoustics",
... )

Per-turn analysis aggregated by speaker:

>>> analyze_acoustics(
...     wav_path="session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     group_by=["speaker"],
...     extra_id_cols=["source", "speaker"],
...     out_summary_csv="features/acoustics/session_by_speaker.csv",
...     summarize_on_voiced_segments_ms=100,
...     mode="simple",
... )

Source code in src\taters\audio\analyze_vocal_acoustics.py

def analyze_acoustics(
    *,
    # Inputs (choose ONE of these two paths)
    wav_path: Optional[Union[str, Path]] = None,
    transcript_csv: Optional[Union[str, Path]] = None,  # if provided, we do per-turn analysis
    # Transcript options
    time_unit: Literal["ms","s"] = "ms",
    group_by: Optional[Sequence[str]] = None,       # e.g., ["speaker"]
    extra_id_cols: Sequence[str] = ("source","speaker"),
    # Output
    out_dir: Optional[Union[str, Path]] = None,
    out_framewise_csv: Optional[Union[str, Path]] = None,
    out_summary_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,
    include_framewise: bool = True,              # ← default ON now
    # Analysis options
    mode: Mode = "simple",
    summarize_on_voiced_segments_ms: Optional[int] = 100,
    f0_min: float = 75.0,
    f0_max: float = 500.0,
    n_mfcc: int = 14,
    tremor_script: Optional[Union[str, Path]] = None,
    # preprocessing controls
    preprocess: bool = True,
    target_sr: int = 44100,
    target_dbfs: float = -20.0,
    remove_dc: bool = True,
    # VAD/"pause" tuning
    pause_top_db: int = 30,
    pause_frame_length: int = 2048,
    pause_hop_length: int = 512,
) -> Dict[str, Optional[Path]]:
    """
    Extract acoustic features and write a summary CSV and (by default) a framewise CSV.

    This function computes a battery of speech/voice features using
    Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal
    measures. It supports two operating modes:

    1) Whole-file analysis
       Features are derived across the entire WAV. Summary statistics can be
       restricted to voiced segments longer than a threshold.

    2) Per-turn analysis (transcript-guided)
       The WAV is segmented using `start_time`/`end_time` from a transcript CSV,
       features are computed per segment, and (optionally) per-segment rows are
       aggregated via `group_by` (e.g., one row per speaker).

    Two artifacts can be written:
      • Summary CSV (always): means/SDs/ranges of framewise series; silence ratio;
        jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal.
        With a transcript and `group_by`, the summary is aggregated per group.
      • Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR).
        Disable with `include_framewise=False` or set a custom path.

    Parameters
    ----------
    wav_path : str or pathlib.Path, optional
        Path to a WAV file (mono or stereo, PCM). Required for both whole-file
        and per-turn modes. If both `wav_path` and `transcript_csv` are ``None``,
        a ``ValueError`` is raised.
    transcript_csv : str or pathlib.Path, optional
        Path to a transcript CSV with at least `start_time`, `end_time` (and
        typically `speaker`). Intervals with non-positive duration are skipped.
        When provided, per-turn analysis is performed.
    time_unit : {"ms", "s"}, default "ms"
        Units for `start_time` and `end_time` in `transcript_csv`.
    group_by : sequence of str, optional
        Column names from `transcript_csv` used to aggregate per-turn summaries
        into higher-level rows (e.g., `["speaker"]`). If omitted, per-turn rows
        are written without aggregation.
    extra_id_cols : sequence of str, default ("source", "speaker")
        Identifier/metadata columns to pass through when present (and to use as
        grouping keys where applicable). These are not numerically aggregated.
    out_dir : str or pathlib.Path, optional
        Base directory for outputs if file paths are not given. Defaults to
        ``./features/acoustics`` (created if missing).
    out_framewise_csv : str or pathlib.Path, optional
        Path for the framewise CSV. If omitted and `include_framewise=True`,
        defaults to ``<out_dir>/<stem>_framewise.csv``.
    out_summary_csv : str or pathlib.Path, optional
        Path for the summary CSV. If omitted, defaults to
        ``<out_dir>/<stem>_summary.csv`` (or an equivalent name in per-turn mode).
    overwrite_existing : bool, default False
        If ``False`` and an output already exists, returns existing paths without
        recomputation. If ``True``, outputs are recomputed and overwritten.
    include_framewise : bool, default True
        If ``True``, also write the framewise table. Set to ``False`` to write
        only the summary.
    mode : {"simple", "tremor", "advanced"}, default "simple"
        Feature families to compute:
          - ``"simple"``: framewise f0, formants (F1–F4), loudness, HNR; summary stats;
            silence ratio; jitter/shimmer; MFCC means/variances; CPP (if available).
          - ``"tremor"``: everything in *simple* plus tremor metrics via a Praat script
            (requires `tremor_script`).
          - ``"advanced"``: everything in *tremor* plus glottal features (requires
            DisVoice and dependencies).
    summarize_on_voiced_segments_ms : int or None, default 100
        If an integer, summary statistics for framewise series are computed only
        on voiced segments whose duration is at least this many milliseconds.
        If ``None``, all frames are used.
    f0_min : float, default 75.0
        Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0
        values are treated as unvoiced (0 in framewise; excluded from voiced summaries).
    f0_max : float, default 500.0
        Maximum fundamental frequency (Hz) for pitch tracking.
    n_mfcc : int, default 14
        Number of MFCC coefficients to summarize (means and variances).
    tremor_script : str or pathlib.Path, optional
        Path to a Praat tremor script. Required when ``mode in {"tremor","advanced"}``.
    preprocess : bool, default True
        If ``True`` (whole-file mode), resample to `target_sr`, optionally remove
        DC offset (`remove_dc`), and normalize level toward `target_dbfs` with
        headroom protection. In per-turn mode, slices are analyzed with consistent
        parameters and are not re-normalized per slice.
    target_sr : int, default 44100
        Target sample rate for preprocessing (whole-file mode).
    target_dbfs : float, default -20.0
        Target loudness (dBFS) for level normalization (whole-file mode).
    remove_dc : bool, default True
        If ``True``, attempt to remove DC offset during preprocessing (whole-file mode).
    pause_top_db : int, default 30
        Non-silence threshold for pause detection (higher → fewer speech segments).
        Passed to ``librosa.effects.split``.
    pause_frame_length : int, default 2048
        Frame length (samples) for pause detection.
    pause_hop_length : int, default 512
        Hop length (samples) for pause detection.

    Returns
    -------
    dict
        Mapping with:
        ``{"framewise_csv": pathlib.Path or None, "summary_csv": pathlib.Path}``.

    Raises
    ------
    ValueError
        If neither `wav_path` nor `transcript_csv` is provided; if `transcript_csv`
        is provided without `wav_path`; if required transcript columns are missing; or
        if `mode` requires unavailable dependencies (e.g., `tremor_script` for
        ``"tremor"``, DisVoice for ``"advanced"``).
    FileNotFoundError
        If provided paths do not exist.
    RuntimeError
        If feature extraction fails due to decoding errors, invalid audio, or
        downstream library issues.

    Notes
    -----
    **Framewise CSV** (written when `include_framewise=True`)  
    One row per short-time frame with: `frame_index`, `time_s`, `f0_hz`,
    `f1_hz`–`f4_hz`, `loudness_db`, `hnr_db`. In per-turn mode, also includes
    `segment_index`, `start_s`, `end_s`, and any `extra_id_cols` present.

    **Summary CSV** (always written)  
    Whole-file: one row.  
    Per-turn (no `group_by`): one row per interval.  
    Per-turn with `group_by`: one row per group (e.g., per speaker).  
    Columns include summary stats of framewise series (on all frames or voiced
    segments ≥ `summarize_on_voiced_segments_ms`), silence ratio, jitter/shimmer
    variants, MFCC means/variances, CPP (if available), optional tremor/glottal
    metrics, and any `extra_id_cols`/`group_by` columns.

    Performance
    -----------
    Per-turn analysis can be I/O intensive for long files with dense transcripts.
    Tremor/glottal metrics are substantially more expensive than *simple* mode.

    Examples
    --------
    Whole-file analysis with framewise output:

    >>> analyze_acoustics(
    ...     wav_path="session.wav",
    ...     out_dir="features/acoustics",
    ... )

    Per-turn analysis aggregated by speaker:

    >>> analyze_acoustics(
    ...     wav_path="session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     group_by=["speaker"],
    ...     extra_id_cols=["source", "speaker"],
    ...     out_summary_csv="features/acoustics/session_by_speaker.csv",
    ...     summarize_on_voiced_segments_ms=100,
    ...     mode="simple",
    ... )
    """

    if wav_path is None:
        raise ValueError("wav_path is required")

    wav_path = Path(wav_path)
    if out_dir is None:
        out_dir = Path("features") / "acoustics"
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Default output paths
    stem = wav_path.stem
    if out_framewise_csv is None:
        out_framewise_csv = out_dir / f"{stem}_framewise.csv"
    else:
        out_framewise_csv = Path(out_framewise_csv)

    if out_summary_csv is None:
        suffix = "_by_" + "_".join(group_by) if (transcript_csv and group_by) else ""
        out_summary_csv = out_dir / f"{stem}_summary{suffix}.csv"
    else:
        out_summary_csv = Path(out_summary_csv)

    # Respect overwrite_existing
    if (not overwrite_existing) and out_summary_csv.exists():
        print(f"[acoustics] Summary output already exists; returning existing file: {out_summary_csv}")
        return {"framewise_csv": out_framewise_csv if out_framewise_csv.exists() else None,
                "summary_csv": out_summary_csv}

    # Run
    if transcript_csv:
        framewise_df, summary_df = _analyze_turns(
            wav_path=wav_path,
            transcript_csv=transcript_csv,
            time_unit=time_unit,
            group_by=group_by,
            extra_id_cols=extra_id_cols,
            mode=mode,
            summarize_on_voiced_segments_ms=summarize_on_voiced_segments_ms,
            include_framewise=include_framewise,
            tremor_script=tremor_script,
            f0_min=f0_min,
            f0_max=f0_max,
            n_mfcc=n_mfcc,
            preprocess=preprocess,
            target_sr=target_sr,
            target_dbfs=target_dbfs,
            remove_dc=remove_dc,
            pause_top_db=pause_top_db,
            pause_frame_length=pause_frame_length,
            pause_hop_length=pause_hop_length,
        )
    else:
        fdf, summ = _analyze_clip(
            wav_path,
            mode=mode,
            summarize_on_voiced_segments_ms=summarize_on_voiced_segments_ms,
            include_framewise=include_framewise,
            tremor_script=tremor_script,
            f0_min=f0_min,
            f0_max=f0_max,
            n_mfcc=n_mfcc,
            preprocess=preprocess,
            target_sr=target_sr,
            target_dbfs=target_dbfs,
            remove_dc=remove_dc,
            pause_top_db=pause_top_db,
            pause_frame_length=pause_frame_length,
            pause_hop_length=pause_hop_length,
        )
        # Build summary DF; carry extra_id_cols if user offered any via filename context later
        summary_df = pd.DataFrame([summ])
        framewise_df = fdf

    # Write outputs
    frame_path_out: Optional[Path] = None
    if include_framewise and framewise_df is not None:
        framewise_df.to_csv(out_framewise_csv, index=False, encoding="utf-8-sig")
        frame_path_out = out_framewise_csv

    summary_df.to_csv(out_summary_csv, index=False, encoding="utf-8-sig")

    return {"framewise_csv": frame_path_out, "summary_csv": out_summary_csv}

High-level, environment-safe wrapper for exporting Whisper encoder embeddings.

This module provides a single entry point, :func:extract_whisper_embeddings, which (by default) launches a subprocess to extract embeddings using a dedicated worker module. The subprocess approach avoids CUDA/Torch collisions with other parts of your pipeline.

Two modes are supported:

1) Transcript-driven mode Pass transcript_csv to compute one embedding vector per transcript row (e.g., per diarized segment). The output is a CSV with columns start_time,end_time,speaker,e0..e{D-1}.

2) General-audio mode Omit transcript_csv to analyze the raw WAV. You can segment by fixed windows or by non-silent regions; optionally aggregate to a single mean row.

extract_whisper_embeddings ¶

extract_whisper_embeddings(
    *,
    source_wav,
    transcript_csv=None,
    time_unit="auto",
    strategy="windows",
    window_s=30.0,
    hop_s=15.0,
    min_seg_s=1.0,
    top_db=30.0,
    aggregate="none",
    output_dir=None,
    overwrite_existing=False,
    model_name="base",
    device="auto",
    compute_type="float16",
    run_in_subprocess=True,
    extra_env=None,
    verbose=True,
    extractor_module="taters.audio.extract_whisper_embeddings_subproc"
)

Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

Parameters:

Name	Type	Description	Default
`source_wav`	`str \| Path`	Path to the input WAV. Must be readable by `librosa`.	required
`transcript_csv`	`str \| Path \| None`	If provided, enables transcript-driven mode. The CSV is expected to contain timestamp columns and (optionally) a speaker column. A row is emitted per transcript segment.	`None`
`time_unit`	`('auto', 'ms', 's', 'samples')`	How to interpret timestamps in `transcript_csv`. In "auto", the worker heuristically infers the unit from max end time vs audio duration.	`"auto","ms","s","samples"`
`strategy`	`('windows', 'nonsilent')`	General-audio mode only. "windows" uses fixed sized windows with overlap; "nonsilent" uses an energy-based splitter (librosa.effects.split).	`"windows","nonsilent"`
`window_s`	`float`	General-audio mode only. Window length and hop (seconds).	`30.0, 15.0`
`hop_s`	`float`	General-audio mode only. Window length and hop (seconds).	`30.0, 15.0`
`min_seg_s`	`float`	General-audio mode only. Skip segments shorter than this many seconds.	`1.0`
`top_db`	`float`	General-audio mode only ("nonsilent"). Threshold (dB) below reference to consider as silence. Smaller → more segments; larger → fewer.	`30.0`
`aggregate`	`('none', 'mean')`	General-audio mode only. If "mean", a single pooled row is written covering the entire file; otherwise one row per segment.	`"none","mean"`
`output_dir`	`str \| Path \| None`	Directory for the output CSV. If None, defaults to `./features/whisper-embeddings`.	`None`
`model_name`	`str`	Model identifier passed through to the worker (e.g., "tiny", "base", "small", "large-v3" or a local CTranslate2 model directory).	`"base"`
`device`	`('auto', 'cuda', 'cpu')`	Runtime device. If "cpu", environment variables are set to disable CUDA in the child process.	`"auto","cuda","cpu"`
`compute_type`	`str`	CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to the worker module.	`"float16"`
`run_in_subprocess`	`bool`	If True (recommended), runs extraction in a separate Python process to isolate Torch/CUDA state from the parent process.	`True`
`extra_env`	`dict \| None`	Additional environment variables to inject into the child process.	`None`
`verbose`	`bool`	If True, print the launched command and the child's stdout.	`True`
`extractor_module`	`str`	Dotted module path whose `__main__` implements the extractor CLI.	`"chopshop.audio.extract_whisper_embeddings_subproc"`

Returns:

Type	Description
`Path`	Path to the written embeddings CSV. Pattern: `<output_dir>/<source_stem>_embeddings.csv`.

Notes

The subprocess writes and exits. The parent returns once the file exists.
If transcript_csv is supplied, the worker runs in transcript mode; otherwise general-audio mode is used with the given segmentation strategy.
Failures in the child process are re-raised with the captured stdout/stderr to ease debugging.

Examples:

Transcript per-segment embeddings:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     transcript_csv="transcripts/session.csv",
...     time_unit="ms",
...     model_name="small",
...     device="cuda",
... )

Whole-file mean embedding:

>>> extract_whisper_embeddings(
...     source_wav="audio/session.wav",
...     strategy="nonsilent",
...     aggregate="mean",
...     output_dir="features/whisper-embeddings",
... )

Source code in src\taters\audio\extract_whisper_embeddings.py

def extract_whisper_embeddings(
    *,
    # required
    source_wav: Union[str, Path],

    # optional transcript-driven mode
    transcript_csv: Optional[Union[str, Path]] = None,
    time_unit: Literal["auto", "ms", "s", "samples"] = "auto",

    # general-audio mode (used when transcript_csv is None)
    strategy: Literal["windows", "nonsilent"] = "windows",
    window_s: float = 30.0,
    hop_s: float = 15.0,
    min_seg_s: float = 1.0,
    top_db: float = 30.0,
    aggregate: Literal["none", "mean"] = "none",

    # outputs
    output_dir: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # model/runtime
    model_name: str = "base",
    device: Literal["auto", "cuda", "cpu"] = "auto",
    compute_type: str = "float16",

    # execution strategy
    run_in_subprocess: bool = True,
    extra_env: Optional[dict] = None,
    verbose: bool = True,

    # where the extractor lives (python -m <module>)
    extractor_module: str = "taters.audio.extract_whisper_embeddings_subproc",
) -> Path:
    """
    Export Whisper encoder embeddings to a CSV file, using a subprocess by default.

    Parameters
    ----------
    source_wav : str | Path
        Path to the input WAV. Must be readable by `librosa`.
    transcript_csv : str | Path | None, optional
        If provided, enables transcript-driven mode. The CSV is expected to contain
        timestamp columns and (optionally) a speaker column. A row is emitted per
        transcript segment.
    time_unit : {"auto","ms","s","samples"}, default "auto"
        How to interpret timestamps in `transcript_csv`. In "auto", the worker
        heuristically infers the unit from max end time vs audio duration.
    strategy : {"windows","nonsilent"}, default "windows"
        General-audio mode only. "windows" uses fixed sized windows with overlap;
        "nonsilent" uses an energy-based splitter (librosa.effects.split).
    window_s, hop_s : float, default 30.0, 15.0
        General-audio mode only. Window length and hop (seconds).
    min_seg_s : float, default 1.0
        General-audio mode only. Skip segments shorter than this many seconds.
    top_db : float, default 30.0
        General-audio mode only ("nonsilent"). Threshold (dB) below reference to
        consider as silence. Smaller → more segments; larger → fewer.
    aggregate : {"none","mean"}, default "none"
        General-audio mode only. If "mean", a single pooled row is written covering
        the entire file; otherwise one row per segment.
    output_dir : str | Path | None, optional
        Directory for the output CSV. If None, defaults to
        ``./features/whisper-embeddings``.
    model_name : str, default "base"
        Model identifier passed through to the worker (e.g., "tiny", "base",
        "small", "large-v3" or a local CTranslate2 model directory).
    device : {"auto","cuda","cpu"}, default "auto"
        Runtime device. If "cpu", environment variables are set to disable CUDA
        in the child process.
    compute_type : str, default "float16"
        CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to
        the worker module.
    run_in_subprocess : bool, default True
        If True (recommended), runs extraction in a separate Python process to
        isolate Torch/CUDA state from the parent process.
    extra_env : dict | None, optional
        Additional environment variables to inject into the child process.
    verbose : bool, default True
        If True, print the launched command and the child's stdout.
    extractor_module : str, default "chopshop.audio.extract_whisper_embeddings_subproc"
        Dotted module path whose ``__main__`` implements the extractor CLI.

    Returns
    -------
    Path
        Path to the written embeddings CSV. Pattern:
        ``<output_dir>/<source_stem>_embeddings.csv``.

    Notes
    -----
    - The subprocess writes and exits. The parent returns once the file exists.
    - If `transcript_csv` is supplied, the worker runs in transcript mode; otherwise
      general-audio mode is used with the given segmentation strategy.
    - Failures in the child process are re-raised with the captured stdout/stderr
      to ease debugging.

    Examples
    --------
    Transcript per-segment embeddings:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv="transcripts/session.csv",
    ...     time_unit="ms",
    ...     model_name="small",
    ...     device="cuda",
    ... )

    Whole-file mean embedding:

    >>> extract_whisper_embeddings(
    ...     source_wav="audio/session.wav",
    ...     strategy="nonsilent",
    ...     aggregate="mean",
    ...     output_dir="features/whisper-embeddings",
    ... )
    """

    source_wav = Path(source_wav).resolve()
    # default to ./features/whisper-embeddings when not provided
    out_dir_final = (
        Path(output_dir).resolve()
        if output_dir
        else (Path.cwd() / "features" / "whisper-embeddings")
    )

    out_dir_final.mkdir(parents=True, exist_ok=True)
    output_csv = out_dir_final / f"{source_wav.stem}_embeddings.csv"

    if not overwrite_existing and Path(output_csv).is_file():
        print("Whisper embedding feature output file already exists; returning existing file.")
        return output_csv

    if not run_in_subprocess:
        # ---- In-process path (only when you’re sure no Torch/CUDA conflicts) ----
        from ..audio.extract_whisper_embeddings import (  # type: ignore
            export_segment_embeddings_csv,
            export_audio_embeddings_csv,
            EmbedConfig,
        )
        cfg = EmbedConfig(model_name=model_name, device=device, compute_type=compute_type, time_unit=time_unit)
        if transcript_csv is not None:
            transcript_csv = Path(transcript_csv).resolve()
            return Path(
                export_segment_embeddings_csv(
                    transcript_csv=transcript_csv,
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                )
            )
        else:
            return Path(
                export_audio_embeddings_csv(
                    source_wav=source_wav,
                    output_dir=out_dir_final,
                    config=cfg,
                    strategy=strategy,
                    window_s=window_s,
                    hop_s=hop_s,
                    min_seg_s=min_seg_s,
                    top_db=top_db,
                    aggregate=aggregate,
                )
            )

    # ---- Subprocess path (recommended) ----
    env = os.environ.copy()
    # Keep Transformers from importing heavy backends in the child
    env.setdefault("TRANSFORMERS_NO_TORCH", "1")
    env.setdefault("TRANSFORMERS_NO_TF", "1")
    env.setdefault("TRANSFORMERS_NO_FLAX", "1")

    if extra_env:
        env.update({k: str(v) for k, v in extra_env.items()})

    if device == "cpu":
        # Make sure the child won’t try CUDA
        env.update({"CUDA_VISIBLE_DEVICES": "", "USE_CUDA": "0", "FORCE_CPU": "1"})
    else:
        # Best-effort: prepend cuDNN wheel's lib dir if available
        try:
            import nvidia.cudnn, pathlib  # type: ignore
            cudnn_lib = str(pathlib.Path(nvidia.cudnn.__file__).with_name("lib"))
            env["LD_LIBRARY_PATH"] = cudnn_lib + ":" + env.get("LD_LIBRARY_PATH", "")
        except Exception:
            pass

    cmd = [
        sys.executable, "-m", extractor_module,
        "--source_wav", str(source_wav),
        "--output_dir", str(out_dir_final),
        "--model_name", model_name,
        "--device", device,
        "--compute_type", compute_type,
    ]

    if transcript_csv is not None:
        transcript_csv = Path(transcript_csv).resolve()
        cmd += ["--transcript_csv", str(transcript_csv), "--time_unit", time_unit]
    else:
        cmd += [
            "--strategy", strategy,
            "--window_s", str(window_s),
            "--hop_s", str(hop_s),
            "--min_seg_s", str(min_seg_s),
            "--top_db", str(top_db),
            "--aggregate", aggregate,
        ]

    if verbose:
        print("Launching embedding subprocess:")
        print(" ", shlex.join(cmd))

    try:
        res = subprocess.run(cmd, check=True, env=env, capture_output=True, text=True, stdin=subprocess.DEVNULL)
        if verbose and res.stdout:
            print(res.stdout.strip())
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"Embedding subprocess failed with code {e.returncode}\n"
            f"CMD: {shlex.join(cmd)}\n"
            f"STDOUT:\n{(e.stdout or '').strip()}\n\n"
            f"STDERR:\n{(e.stderr or '').strip()}"
        ) from e

    if not output_csv.exists():
        raise FileNotFoundError(f"Expected embeddings CSV not found: {output_csv}")

    if verbose:
        print(f"Embeddings CSV written to: {output_csv}")

    return output_csv

Subprocess worker that computes Whisper encoder embeddings.

This module is meant to be executed with python -m ... by the wrapper in extract_whisper_embeddings.py. It avoids importing heavyweight torch packages in the parent process and keeps CUDA state isolated.

Two entry functions implement I/O and shape-handling:

:func:export_segment_embeddings_csv — transcript-driven, one vector per row.
:func:export_audio_embeddings_csv — general WAVs; segmentation + optional pooling.

Both functions use faster-whisper (CTranslate2 backend) and WhisperFeatureExtractor to produce encoder features, then pool the encoder outputs into fixed-length vectors.

export_audio_embeddings_csv ¶

export_audio_embeddings_csv(
    source_wav,
    output_dir=None,
    *,
    config=EmbedConfig(),
    sr=16000,
    strategy="windows",
    window_s=30.0,
    hop_s=15.0,
    min_seg_s=1.0,
    top_db=30.0,
    apply_l2_normalization=False,
    aggregate="none"
)

Compute Whisper encoder embeddings for an arbitrary WAV (no transcript).

Parameters:

Name	Type	Description	Default
`source_wav`	`str \| Path`	Input audio (any format `librosa` can read).	required
`output_dir`	`str \| Path \| None`	Directory for the output CSV. Defaults to the WAV's parent if None.	`None`
`config`	`(EmbedConfig, keyword - only)`	Model/device/compute configuration.	`EmbedConfig()`
`sr`	`int`	Resample rate used by the feature extractor.	`16000`
`strategy`	`('windows', 'nonsilent')`	"windows": fixed windows with hop (overlap allowed). "nonsilent": energy-based voice activity detection via librosa.	`"windows","nonsilent"`
`window_s`	`float`	Window length and hop size (seconds). Used by both strategies.	`30.0`
`hop_s`	`float`	Window length and hop size (seconds). Used by both strategies.	`30.0`
`min_seg_s`	`float`	Discard segments shorter than this length (seconds).	`1.0`
`top_db`	`float`	Silence threshold for "nonsilent". Higher → fewer segments.	`30.0`
`aggregate`	`('none', 'mean')`	If "mean", write a single pooled vector over the whole file.	`"none","mean"`

Returns:

Type	Description
`Path`	CSV path: `<output_dir>/<wav_stem>_embeddings.csv`.

Notes

When aggregate="none", rows are start_time,end_time,SEGMENT_i,e0...
When aggregate="mean", a single row 0.000,<dur>,GLOBAL_MEAN,e0.. is written.

Source code in src\taters\audio\extract_whisper_embeddings_subproc.py

def export_audio_embeddings_csv(
    source_wav: str | Path,
    output_dir: Optional[str | Path] = None,
    *,
    config: EmbedConfig = EmbedConfig(),
    sr: int = 16000,
    strategy: Literal["windows", "nonsilent"] = "windows",
    window_s: float = 30.0,
    hop_s: float = 15.0,
    min_seg_s: float = 1.0,
    top_db: float = 30.0,
    apply_l2_normalization: bool = False,
    aggregate: Literal["none", "mean"] = "none",
) -> Path:
    """
    Compute Whisper encoder embeddings for an arbitrary WAV (no transcript).

    Parameters
    ----------
    source_wav : str | Path
        Input audio (any format `librosa` can read).
    output_dir : str | Path | None, optional
        Directory for the output CSV. Defaults to the WAV's parent if None.
    config : EmbedConfig, keyword-only
        Model/device/compute configuration.
    sr : int, default 16000
        Resample rate used by the feature extractor.
    strategy : {"windows","nonsilent"}, default "windows"
        - "windows": fixed windows with hop (overlap allowed).
        - "nonsilent": energy-based voice activity detection via librosa.
    window_s, hop_s : float
        Window length and hop size (seconds). Used by both strategies.
    min_seg_s : float
        Discard segments shorter than this length (seconds).
    top_db : float
        Silence threshold for "nonsilent". Higher → fewer segments.
    aggregate : {"none","mean"}, default "none"
        If "mean", write a single pooled vector over the whole file.

    Returns
    -------
    Path
        CSV path: ``<output_dir>/<wav_stem>_embeddings.csv``.

    Notes
    -----
    - When `aggregate="none"`, rows are ``start_time,end_time,SEGMENT_i,e0..``.
    - When `aggregate="mean"`, a single row ``0.000,<dur>,GLOBAL_MEAN,e0..`` is written.
    """

    source_wav = Path(source_wav)
    if output_dir is None:
        output_dir = source_wav.parent
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    out_csv = output_dir / f"{source_wav.stem}_embeddings.csv"

    # 1) Load audio
    y, in_sr = librosa.load(str(source_wav), sr=sr, mono=True)
    n = len(y)
    if n == 0:
        # Write an empty header-only file
        with out_csv.open("w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerow(["start_time", "end_time", "speaker"])
        return out_csv

    # 2) Load faster-whisper + ct2 + feature extractor (same as transcript path)
    fw = WhisperModel(config.model_name, device=config.device, compute_type=config.compute_type)
    try:
        ct2_model: ctranslate2.models.Whisper = fw.model  # type: ignore[attr-defined]
    except AttributeError:
        model_dir = getattr(fw, "model_dir", None) or getattr(fw, "_model_dir", None)
        if not model_dir:
            raise RuntimeError(
                "Could not access the underlying CTranslate2 model from faster-whisper. "
                "Consider passing a local CTranslate2 model directory as model_name."
            )
        ct2_model = ctranslate2.models.Whisper(str(model_dir), device=config.device, compute_type=config.compute_type)

    fe = WhisperFeatureExtractor.from_pretrained(_hf_repo_for(config.model_name))

    # 3) Build segments (in samples)
    segs: list[tuple[int, int]] = []
    win = max(1, int(round(window_s * sr)))
    hop = max(1, int(round(hop_s * sr)))
    min_len = max(1, int(round(min_seg_s * sr)))

    if strategy == "windows":
        if n <= win:
            segs = [(0, n)]
        else:
            s = 0
            while s < n:
                e = min(n, s + win)
                segs.append((s, e))
                if e >= n:
                    break
                s += hop
    elif strategy == "nonsilent":
        # basic energy-based VAD; torch-free and fast
        intervals = librosa.effects.split(y, top_db=top_db)
        for s, e in intervals:
            if e - s < min_len:
                continue
            # subdivide very long spans into ~window_s chunks
            cur = s
            while cur < e:
                nxt = min(e, cur + win)
                if nxt - cur >= min_len:
                    segs.append((cur, nxt))
                cur = nxt
        if not segs:
            # fallback: whole file as one segment
            segs = [(0, n)]
    else:
        raise ValueError("strategy must be 'windows' or 'nonsilent'")

    # 4) Encode each segment
    rows_out: list[list[Any]] = []
    embed_dim: Optional[int] = None
    vectors: list[np.ndarray] = []

    for i, (s, e) in enumerate(segs):
        clip = y[s:e]
        feats = fe(clip, sampling_rate=sr, return_tensors="np")["input_features"]
        vec = _encode_features_any_layout(ct2_model, feats)
        if vec is None:
            continue
        if embed_dim is None:
            embed_dim = int(vec.shape[-1])
        vectors.append(vec)
        # keep per-chunk row unless we're aggregating
        if aggregate == "none":
            t0 = s / float(sr)
            t1 = e / float(sr)
            rows_out.append([f"{t0:.3f}", f"{t1:.3f}", f"SEGMENT_{i}"] + vec.tolist())

    # 5) Aggregate if requested
    if vectors and aggregate == "mean":
        vec = np.vstack(vectors).mean(axis=0)
        if apply_l2_normalization:
            vec = l2_normalize(vec)
        embed_dim = int(vec.shape[-1])
        rows_out = [["0.000", f"{n/float(sr):.3f}", "GLOBAL_MEAN"] + vec.tolist()]

    # 6) Write CSV (header even if empty)
    if embed_dim is None:
        header = ["start_time", "end_time", "speaker"]
    else:
        header = ["start_time", "end_time", "speaker"] + [f"e{i}" for i in range(embed_dim)]

    with out_csv.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows_out)

    if _os.environ.get("TATERS_DEBUG") == "1":
        print(f"[emb-any] segments={len(segs)}, kept={len(rows_out)}, aggregate={aggregate}")
        print(f"[emb-any] wrote: {out_csv}")

    return out_csv

export_segment_embeddings_csv ¶

export_segment_embeddings_csv(
    transcript_csv,
    source_wav,
    output_dir=None,
    *,
    config=EmbedConfig(),
    start_col="start_time",
    end_col="end_time",
    speaker_col="speaker",
    apply_l2_normalization=False,
    sr=16000
)

Compute Whisper encoder embeddings for each transcript segment and write a CSV.

Expected transcript columns (auto-resolved with fallbacks): - start_time (or: start, from, t0, start_ms, start_sec) - end_time (or: end, to, t1, end_ms, end_sec) - speaker (optional; fallbacks include speaker_label, spk, speaker_id, ...)

Parameters:

Name	Type	Description	Default
`transcript_csv`	`str \| Path`	CSV with segment timings (and optionally speaker labels).	required
`source_wav`	`str \| Path`	Audio file to slice. Will be resampled to `sr`.	required
`output_dir`	`str \| Path \| None`	Directory for the output CSV. If None, defaults to the WAV's parent.	`None`
`config`	`(EmbedConfig, keyword - only)`	Configuration for model name, device, compute type, and time unit.	`EmbedConfig()`
`start_col`	`str`	Column name hints. The function will fall back to common aliases if the exact names are not present.	`'start_time'`
`end_col`	`str`	Column name hints. The function will fall back to common aliases if the exact names are not present.	`'start_time'`
`speaker_col`	`str`	Column name hints. The function will fall back to common aliases if the exact names are not present.	`'start_time'`
`sr`	`int`	Sample rate for feature extraction (audio is resampled as needed).	`16000`

Returns:

Type	Description
`Path`	Path to the written CSV: `<output_dir>/<wav_stem>_embeddings.csv`

Behavior

Attempts to infer time units ("s", "ms", "samples") when config.time_unit == "auto".
Skips invalid or tiny segments (< 2 samples after rounding).
Pools encoder outputs to a fixed-length vector (mean over time).
Writes header even if no valid segments remain (empty payload).

make_speaker_wavs_from_csv ¶

make_speaker_wavs_from_csv(
    source_wav,
    transcript_csv_path,
    output_dir=None,
    *,
    overwrite_existing=False,
    start_col="start_time",
    end_col="end_time",
    speaker_col="speaker",
    time_unit="ms",
    silence_ms=1000,
    pre_silence_ms=None,
    post_silence_ms=None,
    sr=16000,
    mono=True,
    min_dur_ms=50,
    merge_consecutive=True
)

Concatenate speaker-specific segments into per-speaker WAV files.

If merge_consecutive=True (default), adjacent transcript rows with the same speaker are merged into a single, longer segment spanning from the first start to the last end — including any silence between those turns. If you need the strict per-row behavior, set merge_consecutive=False.

Parameters:

Name	Type	Description	Default
`source_wav`	`str \| Path`	Path to the source WAV.	required
`transcript_csv_path`	`str \| Path`	CSV with timing and speaker columns (e.g., diarization output).	required
`output_dir`	`str \| Path \| None`	Where to write the per-speaker files. If None, defaults to `./audio_split/<source_stem>/`.	`None`
`start_col`	`str`	Column names in the transcript CSV.	`'start_time'`
`end_col`	`str`	Column names in the transcript CSV.	`'start_time'`
`speaker_col`	`str`	Column names in the transcript CSV.	`'start_time'`
`time_unit`	`('ms', 's')`	Units for start/end columns.	`"ms","s"`
`silence_ms`	`int`	If `pre_silence_ms`/`post_silence_ms` are None, use this for both sides.	`1000`
`pre_silence_ms`	`int \| None`	Explicit padding (ms) before/after each segment; overrides `silence_ms`.	`None`
`post_silence_ms`	`int \| None`	Explicit padding (ms) before/after each segment; overrides `silence_ms`.	`None`
`sr`	`int \| None`	Resample output to this rate. If None, keep original rate.	`16000`
`mono`	`bool`	Downmix to mono if True.	`True`
`min_dur_ms`	`int`	Skip segments shorter than this duration (ms).	`50`
`merge_consecutive`	`bool`	Merge back-to-back turns for the same speaker into one segment span (including any inter-turn silence). If False, emit one clip per row.	`True`

Returns:

Type	Description
`dict[str, Path]`	Mapping from friendly speaker label → output WAV path.

Behavior

Input speaker labels are sanitized for filenames but a more readable label (without path-hostile characters) is preserved for naming.
Segments are sorted by start time per speaker before concatenation.
If a speaker ends up with zero valid segments, no file is written.

Examples:

>>> make_speaker_wavs_from_csv(
...     source_wav="audio/session.wav",
...     transcript_csv_path="transcripts/session.csv",
...     time_unit="ms",
...     silence_ms=0,  # no padding
...     sr=16000,
...     mono=True,
... )

Source code in src\taters\audio\split_wav_by_speaker.py

def make_speaker_wavs_from_csv(
    source_wav: Union[str, Path],
    transcript_csv_path: Union[str, Path],
    output_dir: Union[str, Path, None] = None,
    *,
    overwrite_existing: bool = False,
    start_col: str = "start_time",
    end_col: str = "end_time",
    speaker_col: str = "speaker",
    time_unit: str = "ms",             # "ms" or "s"
    silence_ms: int = 1000,
    pre_silence_ms: Optional[int] = None,
    post_silence_ms: Optional[int] = None,
    sr: Optional[int] = 16000,
    mono: bool = True,
    min_dur_ms: int = 50,
    merge_consecutive: bool = True,    # NEW: merge back-to-back turns by same speaker
) -> Dict[str, Path]:
    """
    Concatenate speaker-specific segments into per-speaker WAV files.

    If `merge_consecutive=True` (default), adjacent transcript rows with the same
    speaker are merged into a single, longer segment spanning from the first
    start to the last end — including any silence between those turns. If you
    need the strict per-row behavior, set `merge_consecutive=False`.

    Parameters
    ----------
    source_wav : str | Path
        Path to the source WAV.
    transcript_csv_path : str | Path
        CSV with timing and speaker columns (e.g., diarization output).
    output_dir : str | Path | None, optional
        Where to write the per-speaker files. If None, defaults to
        ``./audio_split/<source_stem>/``.
    start_col, end_col, speaker_col : str
        Column names in the transcript CSV.
    time_unit : {"ms","s"}, default "ms"
        Units for start/end columns.
    silence_ms : int, default 1000
        If `pre_silence_ms`/`post_silence_ms` are None, use this for both sides.
    pre_silence_ms, post_silence_ms : int | None
        Explicit padding (ms) before/after each segment; overrides `silence_ms`.
    sr : int | None, default 16000
        Resample output to this rate. If None, keep original rate.
    mono : bool, default True
        Downmix to mono if True.
    min_dur_ms : int, default 50
        Skip segments shorter than this duration (ms).
    merge_consecutive : bool, default True
        Merge back-to-back turns for the same speaker into one segment span
        (including any inter-turn silence). If False, emit one clip per row.

    Returns
    -------
    dict[str, Path]
        Mapping from friendly speaker label → output WAV path.

    Behavior
    --------
    - Input speaker labels are sanitized for filenames but a more readable label
      (without path-hostile characters) is preserved for naming.
    - Segments are sorted by start time per speaker before concatenation.
    - If a speaker ends up with zero valid segments, no file is written.

    Examples
    --------
    >>> make_speaker_wavs_from_csv(
    ...     source_wav="audio/session.wav",
    ...     transcript_csv_path="transcripts/session.csv",
    ...     time_unit="ms",
    ...     silence_ms=0,  # no padding
    ...     sr=16000,
    ...     mono=True,
    ... )
    """
    if time_unit not in ("ms", "s"):
        raise ValueError("time_unit must be 'ms' or 's'")

    def _friendly_filename_label(name: str) -> str:
        s = (name or "").strip()
        s = s.replace("/", "_").replace("\\", "_")
        s = re.sub(r'[<>:"|?*]', "", s)
        s = re.sub(r"\s+", " ", s)
        return s or "SPEAKER_0"

    source_wav = Path(source_wav)
    transcript_csv_path = Path(transcript_csv_path)
    out_dir = Path(output_dir) if output_dir is not None else (Path.cwd() / "audio_split" / source_wav.stem)
    out_dir.mkdir(parents=True, exist_ok=True)
    base_stem = source_wav.stem

    audio = AudioSegment.from_file(source_wav)
    if sr:
        audio = audio.set_frame_rate(sr)
    if mono:
        audio = audio.set_channels(1)

    factor = 1000.0 if time_unit == "s" else 1.0
    audio_len_ms = len(audio)

    with transcript_csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    segs_by_spk: Dict[str, List[tuple[int, int]]] = {}
    label_for_key: Dict[str, str] = {}

    # Build segments with awareness of original row order so that we can merge
    # adjacent turns for the same speaker when requested.
    prev_spk_key: Optional[str] = None
    for row in rows:
        try:
            start_raw = float(row[start_col])
            end_raw   = float(row[end_col])
            raw_spk   = str(row.get(speaker_col, "SPEAKER_0"))
        except Exception:
            continue

        start_ms = int(round(start_raw * factor))
        end_ms   = int(round(end_raw   * factor))
        if end_ms <= start_ms:
            continue

        start_ms = _clamp(start_ms, 0, audio_len_ms)
        end_ms   = _clamp(end_ms,   0, audio_len_ms)
        if end_ms <= start_ms:
            continue

        spk_key = _sanitize_speaker(raw_spk)
        label_for_key.setdefault(spk_key, _friendly_filename_label(raw_spk))

        if merge_consecutive and prev_spk_key == spk_key and segs_by_spk.get(spk_key):
            # Extend the last segment for this speaker to cover the new end
            s0, e0 = segs_by_spk[spk_key][-1]
            # Keep the earliest start, extend to the latest end
            s_new = min(s0, start_ms)
            e_new = max(e0, end_ms)
            segs_by_spk[spk_key][-1] = (s_new, e_new)
        else:
            # Strictly append a new segment
            segs_by_spk.setdefault(spk_key, []).append((start_ms, end_ms))

        prev_spk_key = spk_key

    # Optional: drop very short segments after merging
    for spk_key, segs in list(segs_by_spk.items()):
        segs_by_spk[spk_key] = [(s, e) for (s, e) in segs if (e - s) >= min_dur_ms]

    pre_ms  = silence_ms if pre_silence_ms  is None else pre_silence_ms
    post_ms = silence_ms if post_silence_ms is None else post_silence_ms
    pre_sil  = AudioSegment.silent(duration=max(0, pre_ms),  frame_rate=audio.frame_rate)
    post_sil = AudioSegment.silent(duration=max(0, post_ms), frame_rate=audio.frame_rate)
    if mono:
        pre_sil  = pre_sil.set_channels(1)
        post_sil = post_sil.set_channels(1)

    results: Dict[str, Path] = {}
    for spk_key, segs in segs_by_spk.items():
        if not segs:
            continue

        friendly = label_for_key.get(spk_key, spk_key)
        out_path = out_dir / f"{base_stem}_{friendly}.wav"

        if (not overwrite_existing) and out_path.is_file():
            results[friendly] = out_path
            continue

        out = AudioSegment.silent(duration=0, frame_rate=audio.frame_rate)
        if mono:
            out = out.set_channels(1)

        for (s, e) in segs:
            clip = audio[s:e]
            if len(clip) < min_dur_ms:
                continue
            out += pre_sil + clip + post_sil

        if len(out) == 0:
            continue

        out.export(out_path, format="wav", codec="pcm_s16le")
        results[friendly] = out_path

    return results

Audio Modules¶

convert_audio_to_wav ¶

split_audio_streams_to_wav ¶

FramewiseTracks dataclass ¶

analyze_acoustics ¶

extract_whisper_embeddings ¶

export_audio_embeddings_csv ¶

export_segment_embeddings_csv ¶

make_speaker_wavs_from_csv ¶

FramewiseTracks `dataclass` ¶