Analyzing Audio¶
Audio is where Taters earns its name. The goal is simple: make it easy to get from messy containers and long recordings to clean, analysis-ready artifacts you can iterate on. The core tools cover three phases:
- extract and standardize audio (WAVs at predictable locations)
- structure the speech (diarization and transcripts)
- turn waveforms into features (embeddings, per-speaker splits)
Everything follows the same philosophy as the text stack: predictable outputs, friendly defaults, and a "do not overwrite unless asked" rule.
Extract audio from video¶
Many recordings arrive as multi-track containers (Zoom, OBS, ProRes). This utility lists every audio stream and writes one WAV per stream with sensible names that include stream index and tags like language/title. It is handy both for audits and for preparing inputs to downstream steps.
What it does
- probes audio streams with ffprobe
- writes one PCM WAV per stream at your chosen sample rate and bit depth
- predictable filenames: <stem>_a<index>[_<lang>][_<title>].wav
- default output directory if you do not pass one
When to use
- you have a video/container and want clean WAVs for each embedded track
- you intend to diarize and embed only one stream (e.g., the mixed program feed)
API: split audio streams to WAV¶
Extract each audio stream in a container to its own WAV file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_path | str \| PathLike | Video or audio container readable by FFmpeg. | required |
| output_dir | str \| PathLike \| None | Destination directory. If None, defaults to | None |
| sample_rate | int | Target sample rate for the output WAVs (Hz). | 48000 |
| bit_depth | (16, 24, 32) | Output PCM bit depth (little-endian). | 16,24,32 |
| overwrite | bool | If True, overwrite existing files. If False and a target exists, raises :class: | True |
Returns:
| Type | Description |
|---|---|
| list[str] | Absolute paths to the created WAVs. |
Behavior
- Output file names are constructed from the input base name and stream metadata: <stem>_a<index>[_<lang>][_<title>].wav with safe slugs.
- Uses -map 0:a:<N> to select the N-th audio stream in the container.
- Runs FFmpeg with -nostdin and quiet loglevel to avoid TTY lockups.
Examples:
>>> split_audio_streams_to_wav("session.mp4")
['.../audio/session_a0_eng.wav', '.../audio/session_a1_eng.wav']
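The naming pattern described under Behavior can be sketched in a few lines. This is only an illustration: slugify and stream_wav_name are hypothetical helpers, and the slug rules Taters actually applies may differ.

```python
import re
from pathlib import Path


def slugify(value: str) -> str:
    """Hypothetical slug rule: keep letters, digits, '-' and '.'; collapse the rest to '_'."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", value.strip()).strip("_")


def stream_wav_name(input_path, index, lang=None, title=None):
    """Build <stem>_a<index>[_<lang>][_<title>].wav for one audio stream."""
    parts = [f"{Path(input_path).stem}_a{index}"]
    for tag in (lang, title):
        slug = slugify(tag) if tag else ""
        if slug:  # skip tags that are missing or that slug down to nothing
            parts.append(slug)
    return "_".join(parts) + ".wav"


print(stream_wav_name("session.mp4", 0, lang="eng"))
print(stream_wav_name("session.mp4", 1, lang="eng", title="Mic 2 / Host"))
```

Keeping the stream index in every name means two streams with identical language and title tags still get distinct files.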
Source code in src\taters\audio\extract_wav_from_video.py
Convert any audio to WAV¶
Standardize any FFmpeg-readable media into a linear PCM WAV at the sample rate, bit depth, and channel layout you specify. Defaults are sensible for ASR and most modeling pipelines (16 kHz, 16-bit, mono).
What it does
- converts audio or extracts audio from video into a single WAV
- preserves channel layout if you request it
- uses ffmpeg with quiet, pipeline-safe flags and clear error reporting
- predictable default output path if omitted
When to use
- you need consistent, model-friendly WAVs from heterogeneous sources
- you want a one-liner from notebooks or the CLI
API: convert audio to WAV¶
Convert any FFmpeg-readable audio/video file to a linear PCM WAV.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_path | str \| Path | Source media file (audio or video container). FFmpeg must be able to read it. | required |
| output_path | str \| Path \| None | Target WAV path. If None, defaults to | None |
| sample_rate | int | Desired sample rate (Hz). | 16000 |
| bit_depth | (16, 24, 32) | Output PCM bit depth; maps to | 16,24,32 |
| channels | int \| None | If provided, set number of output channels (e.g., 1=mono, 2=stereo). If None, keep original channel count. | 1 |
| overwrite_existing | bool | Overwrite | False |
Returns:
| Type | Description |
|---|---|
| Path | Path to the written WAV file. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If |
| RuntimeError | If FFmpeg/FFprobe are missing or the conversion fails. |
Notes
- Video inputs are supported: the audio stream is extracted and converted.
- For multi-channel sources when channels is None, channel layout is preserved.
- We run FFmpeg with -nostdin to avoid TTY issues in pipelines.
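To make the flags above concrete, here is a minimal sketch of how such an FFmpeg command line could be assembled. build_ffmpeg_wav_command and PCM_CODECS are hypothetical names, and the exact flags Taters passes are an assumption; the FFmpeg options themselves (-nostdin, -loglevel, -vn, -acodec, -ar, -ac, -y/-n) are standard.

```python
# FFmpeg's little-endian PCM encoders for each supported bit depth.
PCM_CODECS = {16: "pcm_s16le", 24: "pcm_s24le", 32: "pcm_s32le"}


def build_ffmpeg_wav_command(input_path, output_path, sample_rate=16000,
                             bit_depth=16, channels=1, overwrite=False):
    """Return the argv list for a quiet, non-interactive WAV conversion."""
    if bit_depth not in PCM_CODECS:
        raise ValueError(f"bit_depth must be one of {sorted(PCM_CODECS)}")
    cmd = [
        "ffmpeg", "-nostdin", "-hide_banner", "-loglevel", "error",
        "-y" if overwrite else "-n",      # overwrite vs. refuse to overwrite
        "-i", str(input_path),
        "-vn",                            # ignore any video stream
        "-acodec", PCM_CODECS[bit_depth],
        "-ar", str(sample_rate),
    ]
    if channels is not None:              # None keeps the source channel count
        cmd += ["-ac", str(channels)]
    cmd.append(str(output_path))
    return cmd


print(" ".join(build_ffmpeg_wav_command("talk.mp4", "talk.wav")))
```

The list can be handed to subprocess.run(cmd, check=True); building it as a list rather than a string avoids shell quoting problems with paths that contain spaces.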
Source code in src\taters\audio\convert_to_wav.py
Diarize and transcribe¶
This is a thin CLI shim that forwards to the vendored Whisper diarization wrapper. It handles device selection and writes transcripts alongside subtitles with helpful defaults. Use it to get a timestamped CSV (start_time,end_time,speaker,text) plus SRT/TXT that other Taters tools can consume immediately.
What it does
- delegates to the underlying whisper diarization wrapper
- writes transcripts to a predictable folder by input stem
- accepts device hints (cuda/cpu/auto)
When to use
- you are preparing per-segment text for embeddings or dictionary coding
- you plan to split a long recording into per-speaker WAVs
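Because the transcript is a plain CSV, it can be inspected with the standard library before any other tool touches it. The sketch below assumes the start_time,end_time,speaker,text layout mentioned above with times in milliseconds; the sample rows are made up and talk_time_by_speaker is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the start_time,end_time,speaker,text layout (times in ms).
SAMPLE_CSV = """start_time,end_time,speaker,text
0,2500,Speaker 0,Thanks for joining.
2500,4000,Speaker 1,Happy to be here.
4000,9000,Speaker 0,Let's get started.
"""


def talk_time_by_speaker(csv_file, time_unit="ms"):
    """Sum segment durations per speaker, returned in seconds."""
    divisor = 1000.0 if time_unit == "ms" else 1.0
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        duration = float(row["end_time"]) - float(row["start_time"])
        totals[row["speaker"]] += duration / divisor
    return dict(totals)


print(talk_time_by_speaker(io.StringIO(SAMPLE_CSV)))
```

For a real file, pass open("transcripts/session.csv", newline="") instead of the in-memory sample.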
API: diarize with third-party wrapper (CLI entry)¶
Important note: This function is also exposed more easily via taters.audio.diarize_with_thirdparty
Run the vendored Whisper diarization scripts and normalize their outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio_path | str \| Path | Input audio (WAV recommended). | required |
| out_dir | str \| Path | Output directory for transcript artifacts. If it does not exist, it will be created. | None |
| repo_dir | str \| Path \| None | Optional explicit location of the diarization repo. If None, the vendored copy is used. | None |
| whisper_model | str | Whisper ASR model to use (e.g., "small", "base", "large-v3"). | "medium.en" |
| language | str \| None | Language hint for Whisper (e.g., "en"); if None, autodetection is used. | None |
| device | ('cpu', 'cuda') | Runtime device. If "cpu", environment variables are set to hide GPUs. | "cpu","cuda" |
| batch_size | int | Whisper batch size; 0 disables batching. | 0 |
| no_stem | bool | Pass through to demucs/whisper scripts to disable vocal/instrument stems. | False |
| suppress_numerals | bool | Heuristic to reduce spurious numeral tokens. | False |
| parallel | bool | Use parallel diarization script if available. | False |
| timeout | int \| None | Subprocess timeout in seconds; None means no timeout. | None |
| use_custom | bool | Prefer the customized script if present (adds CSV emission and minor cleanup). | True |
| keep_temp | bool | If False (default), temporary folders created by demucs/whisper are removed. | False |
| num_speakers | int \| None | Force a fixed number of speakers, if the downstream diarizer supports it. | None |
Returns:
| Type | Description |
|---|---|
| DiarizationOutputFiles | Paths to |
Notes
- The function copies the input WAV to a per-file work directory before running, to ensure relative paths inside the third-party scripts resolve correctly.
- If device="cpu", CUDA is disabled in the child environment.
- On success, the local WAV copy is deleted and temporary folders are tidied up.
See Also
taters.audio.split_wav_by_speaker.make_speaker_wavs_from_csv : Build per-speaker WAVs from the diarization CSV.
Source code in src\taters\audio\diarizer\whisper_diar_wrapper.py
Whisper encoder embeddings¶
Export Whisper encoder embeddings as features. Two modes:
- Transcript-driven: one vector per transcript row (e.g., per diarized segment)
- General-audio: segment the WAV with fixed windows or non-silent spans; optionally mean-pool to a single vector for the whole file
Outputs land under ./features/whisper-embeddings/ by default with a stable <stem>_embeddings.csv name. The wrapper runs extraction in a subprocess by default to avoid CUDA/Torch collisions elsewhere in your pipeline.
What it does
- computes D-dimensional vectors using Faster-Whisper/CTranslate2 backends
- segmenting strategies for raw audio; segment-level when given a transcript
- optional single-row mean pooling for whole-file summaries
- isolates heavy GPU state in a child process by default
When to use
- you want robust, speech-centric features for clustering, retrieval, or as inputs to downstream models
- you have transcripts already and want segment-level representations aligned to text
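The fixed-window strategy and the optional mean pooling can be illustrated without any audio libraries. This is a sketch under assumptions: fixed_windows and mean_pool are hypothetical helpers, the handling of the final partial window is a guess rather than the extractor's documented behavior, and real embeddings would be much longer vectors.

```python
def fixed_windows(duration_s, window_s=30.0, hop_s=15.0, min_seg_s=1.0):
    """Return (start, end) pairs in seconds covering the audio with overlapping windows."""
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window_s and hop_s must be positive")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        if end - start >= min_seg_s:      # drop a too-short tail window
            spans.append((start, end))
        if end >= duration_s:             # the audio is fully covered
            break
        start += hop_s
    return spans


def mean_pool(vectors):
    """Average equal-length vectors element-wise into a single vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


print(fixed_windows(70.0))
print(mean_pool([[1.0, 2.0], [3.0, 6.0]]))
```

With a hop shorter than the window, neighboring segments overlap, so a whole-file mean weights the middle of the recording slightly more than its edges.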
API: extract Whisper embeddings¶
Export Whisper encoder embeddings to a CSV file, using a subprocess by default.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_wav | str \| Path | Path to the input WAV. Must be readable by | required |
| transcript_csv | str \| Path \| None | If provided, enables transcript-driven mode. The CSV is expected to contain timestamp columns and (optionally) a speaker column. A row is emitted per transcript segment. | None |
| time_unit | ('auto', 'ms', 's', 'samples') | How to interpret timestamps in | "auto","ms","s","samples" |
| strategy | ('windows', 'nonsilent') | General-audio mode only. "windows" uses fixed-size windows with overlap; "nonsilent" uses an energy-based splitter (librosa.effects.split). | "windows","nonsilent" |
| window_s | float | General-audio mode only. Window length (seconds). | 30.0 |
| hop_s | float | General-audio mode only. Hop between successive windows (seconds). | 15.0 |
| min_seg_s | float | General-audio mode only. Skip segments shorter than this many seconds. | 1.0 |
| top_db | float | General-audio mode only ("nonsilent"). Threshold (dB) below reference to consider as silence. Smaller → more segments; larger → fewer. | 30.0 |
| aggregate | ('none', 'mean') | General-audio mode only. If "mean", a single pooled row is written covering the entire file; otherwise one row per segment. | "none","mean" |
| output_dir | str \| Path \| None | Directory for the output CSV. If None, defaults to | None |
| model_name | str | Model identifier passed through to the worker (e.g., "tiny", "base", "small", "large-v3" or a local CTranslate2 model directory). | "base" |
| device | ('auto', 'cuda', 'cpu') | Runtime device. If "cpu", environment variables are set to disable CUDA in the child process. | "auto","cuda","cpu" |
| compute_type | str | CTranslate2 compute type (e.g., "float16", "int8", "float32"); passed to the worker module. | "float16" |
| run_in_subprocess | bool | If True (recommended), runs extraction in a separate Python process to isolate Torch/CUDA state from the parent process. | True |
| extra_env | dict \| None | Additional environment variables to inject into the child process. | None |
| verbose | bool | If True, print the launched command and the child's stdout. | True |
| extractor_module | str | Dotted module path whose | "chopshop.audio.extract_whisper_embeddings_subproc" |
Returns:
| Type | Description |
|---|---|
| Path | Path to the written embeddings CSV. Pattern: |
Notes
- The subprocess writes and exits. The parent returns once the file exists.
- If transcript_csv is supplied, the worker runs in transcript mode; otherwise general-audio mode is used with the given segmentation strategy.
- Failures in the child process are re-raised with the captured stdout/stderr to ease debugging.
Examples:
Transcript per-segment embeddings:
>>> extract_whisper_embeddings(
... source_wav="audio/session.wav",
... transcript_csv="transcripts/session.csv",
... time_unit="ms",
... model_name="small",
... device="cuda",
... )
Whole-file mean embedding:
>>> extract_whisper_embeddings(
... source_wav="audio/session.wav",
... strategy="nonsilent",
... aggregate="mean",
... output_dir="features/whisper-embeddings",
... )
Source code in src\taters\audio\extract_whisper_embeddings.py
Vocal acoustics (Parselmouth/Praat)¶
This extractor turns a WAV (or a transcript-guided set of turns) into framewise tracks and clean summaries of voice acoustics. Under the hood it uses Praat/Parselmouth for f0, formants (F1–F4), loudness, HNR, jitter/shimmer, and optional tremor/glottal metrics, plus MFCC stats and simple pause/silence measures. Use it when you want interpretable, physiology-adjacent voice features that play nicely with clinical and social-science workflows.
Warning: I am not an acoustics person. Treat the output from this module with skepticism for the time being.
Preprocessing¶
By default, the audio is standardized to reduce "garbage in → garbage out" risk:
- resample to target_sr (default 44.1 kHz)
- remove DC offset (channel-wise when stereo)
- loudness-normalize to target_dbfs (default −20 dBFS)
You can turn this off (preprocess=False) or tweak the targets as needed.
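As a rough picture of what two of these steps do, the sketch below removes a DC offset and scales a mono signal to a target level. It assumes RMS-based normalization, which may not match the extractor's actual method, and it skips resampling entirely; remove_dc, rms_dbfs, and normalize_to_dbfs are hypothetical helper names.

```python
import math


def remove_dc(samples):
    """Subtract the mean so the waveform is centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20.0 * math.log10(rms)


def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level lands on target_dbfs."""
    current = rms_dbfs(samples)
    if current == -math.inf:              # pure silence: nothing to scale
        return list(samples)
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    return [s * gain for s in samples]


# A 220 Hz tone with a DC offset of 0.1, sampled at 8 kHz for one second.
tone = [0.1 + 0.5 * math.sin(2 * math.pi * 220 * n / 8000) for n in range(8000)]
clean = normalize_to_dbfs(remove_dc(tone))
print(round(rms_dbfs(clean), 2))
```

Removing the offset first matters because a constant bias inflates the measured level and would otherwise skew the gain.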
Operating modes¶
- Whole file: analyze a single WAV and write summary (+ framewise, by default).
- Per turn (with transcript): segment the WAV using start_time/end_time from a CSV (optionally group/aggregate by columns like speaker). Framewise output includes segment_index and any IDs you pass through.
Outputs¶
Two CSVs, with predictable default names under features/acoustics/:
- Framewise (on by default): one row per short-time frame with time_s, f0_hz, f1_hz–f4_hz, loudness_db, hnr_db (+ IDs in per-turn mode).
- Summary (always): mean/std/range of the framewise series (optionally only on voiced segments ≥ summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer variants, MFCC means/variances, CPP (if available), and optional tremor/glottal metrics.
Practical parameters (hey, that's alliteration!)¶
- Voiced-only summaries: summarize_on_voiced_segments_ms=100 limits stats to voiced stretches ≥100 ms (helps avoid tiny blips).
- Pause/VAD tuning: pause_top_db, pause_frame_length, pause_hop_length let you adjust librosa’s non-silence detection (useful for noisy rooms).
- Pitch range: f0_min, f0_max bound the search for f0 (frame values outside the range are treated as unvoiced).
- Cepstra: n_mfcc controls how many MFCCs are summarized.
- Tremor/glottal: set mode="tremor" (requires a Praat script) or mode="advanced" (adds DisVoice-based glottal features).
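The voiced-only option is easiest to understand as a filter on the framewise f0 track. The sketch below treats 0 or None as unvoiced and keeps only voiced runs of at least min_ms; voiced_runs and voiced_mean_f0 are hypothetical helpers, and measuring duration as frame count times frame step is a simplifying assumption.

```python
def voiced_runs(f0_hz, frame_step_ms, min_ms=100):
    """Return (start, stop) frame index pairs for voiced runs lasting at least min_ms."""
    runs, start = [], None
    for i, value in enumerate(list(f0_hz) + [0.0]):   # sentinel closes a trailing run
        voiced = value is not None and value > 0
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            if (i - start) * frame_step_ms >= min_ms:
                runs.append((start, i))
            start = None
    return runs


def voiced_mean_f0(f0_hz, frame_step_ms, min_ms=100):
    """Mean f0 over frames that fall inside sufficiently long voiced runs."""
    kept = [f0_hz[i]
            for a, b in voiced_runs(f0_hz, frame_step_ms, min_ms)
            for i in range(a, b)]
    return sum(kept) / len(kept) if kept else None


# 10 ms frames: a 30 ms blip, then a 120 ms voiced stretch.
track = [0, 300, 310, 305, 0, 0] + [120.0] * 12 + [0, 0]
print(voiced_runs(track, frame_step_ms=10))
print(voiced_mean_f0(track, frame_step_ms=10))
```

In this made-up track the short 300 Hz blip is discarded, so the summary reflects only the sustained 120 Hz stretch.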
CLI examples¶
Whole-file, default preprocessing and framewise+summary:
python -m taters.audio.analyze_vocal_acoustics \
--wav "audio/session.wav" \
--mode simple --overwrite_existing
Per-turn, aggregate by speaker:
python -m taters.audio.analyze_vocal_acoustics \
--wav "audio/session.wav" \
--transcript "transcripts/session.csv" \
--time-unit ms \
--group-by speaker \
--mode simple \
--overwrite_existing
Tuning pause detection (looser threshold, longer window):
python -m taters.audio.analyze_vocal_acoustics \
--wav "audio/session.wav" \
--pause-top-db 40 --pause-frame-length 4096 --pause-hop-length 1024
Python example¶
from taters import Taters
t = Taters()
res = t.audio.analyze_vocal_acoustics(
wav_path="audio/session.wav",
# or: transcript_csv="transcripts/session.csv", group_by=["speaker"],
mode="simple",
summarize_on_voiced_segments_ms=100,
preprocess=True, target_sr=44100, target_dbfs=-20.0, remove_dc=True,
pause_top_db=30, pause_frame_length=2048, pause_hop_length=512,
overwrite_existing=True
)
print("Framewise CSV:", res["framewise_csv"])
print("Summary CSV:", res["summary_csv"])
Troubleshooting tips¶
- Many Nones for formants/HNR: often means unvoiced frames or out-of-range pitch. Try widening f0_min/f0_max, or ensure preprocessing is on. Very narrow-band or music-heavy audio can also throw formant estimation off.
- CPP/GNE missing: those depend on Praat commands that are not available in every environment; the extractor will warn and continue.
- Weird pause counts: adjust pause_top_db upwards (e.g., 40) or use larger pause_frame_length to be more conservative in noisy recordings.
API: acoustic analysis¶
Extract acoustic features and write a summary CSV and (by default) a framewise CSV.
This function computes a battery of speech/voice features using Praat/Parselmouth-style workflows with optional cepstral, tremor, and glottal measures. It supports two operating modes:
1) Whole-file analysis Features are derived across the entire WAV. Summary statistics can be restricted to voiced segments longer than a threshold.
2) Per-turn analysis (transcript-guided)
The WAV is segmented using start_time/end_time from a transcript CSV,
features are computed per segment, and (optionally) per-segment rows are
aggregated via group_by (e.g., one row per speaker).
Two artifacts can be written:
• Summary CSV (always): means/SDs/ranges of framewise series; silence ratio;
jitter/shimmer; MFCC means/variances; CPP (if available); optional tremor/glottal.
With a transcript and group_by, the summary is aggregated per group.
• Framewise CSV (default): one row per short-time frame (f0, F1–F4, loudness, HNR).
Disable with include_framewise=False or set a custom path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| wav_path | str or Path | Path to a WAV file (mono or stereo, PCM). Required for both whole-file and per-turn modes. If both | None |
| transcript_csv | str or Path | Path to a transcript CSV with at least | None |
| time_unit | ('ms', 's') | Units for | "ms" |
| group_by | sequence of str | Column names from | None |
| extra_id_cols | sequence of str | Identifier/metadata columns to pass through when present (and to use as grouping keys where applicable). These are not numerically aggregated. | ("source", "speaker") |
| out_dir | str or Path | Base directory for outputs if file paths are not given. Defaults to | None |
| out_framewise_csv | str or Path | Path for the framewise CSV. If omitted and | None |
| out_summary_csv | str or Path | Path for the summary CSV. If omitted, defaults to | None |
| overwrite_existing | bool | If | False |
| include_framewise | bool | If | True |
| mode | ('simple', 'tremor', 'advanced') | Feature families to compute: | "simple" |
| summarize_on_voiced_segments_ms | int or None | If an integer, summary statistics for framewise series are computed only on voiced segments whose duration is at least this many milliseconds. If | 100 |
| f0_min | float | Minimum fundamental frequency (Hz) for pitch tracking. Out-of-range f0 values are treated as unvoiced (0 in framewise; excluded from voiced summaries). | 75.0 |
| f0_max | float | Maximum fundamental frequency (Hz) for pitch tracking. | 500.0 |
| n_mfcc | int | Number of MFCC coefficients to summarize (means and variances). | 14 |
| tremor_script | str or Path | Path to a Praat tremor script. Required when | None |
| preprocess | bool | If | True |
| target_sr | int | Target sample rate for preprocessing (whole-file mode). | 44100 |
| target_dbfs | float | Target loudness (dBFS) for level normalization (whole-file mode). | -20.0 |
| remove_dc | bool | If | True |
| pause_top_db | int | Non-silence threshold for pause detection (higher → fewer speech segments). Passed to | 30 |
| pause_frame_length | int | Frame length (samples) for pause detection. | 2048 |
| pause_hop_length | int | Hop length (samples) for pause detection. | 512 |
Returns:
| Type | Description |
|---|---|
| dict | Mapping with: |
Raises:
| Type | Description |
|---|---|
| ValueError | If neither |
| FileNotFoundError | If provided paths do not exist. |
| RuntimeError | If feature extraction fails due to decoding errors, invalid audio, or downstream library issues. |
Notes
Framewise CSV (written when include_framewise=True)
One row per short-time frame with: frame_index, time_s, f0_hz,
f1_hz–f4_hz, loudness_db, hnr_db. In per-turn mode, also includes
segment_index, start_s, end_s, and any extra_id_cols present.
Summary CSV (always written)
Whole-file: one row.
Per-turn (no group_by): one row per interval.
Per-turn with group_by: one row per group (e.g., per speaker).
Columns include summary stats of framewise series (on all frames or voiced
segments ≥ summarize_on_voiced_segments_ms), silence ratio, jitter/shimmer
variants, MFCC means/variances, CPP (if available), optional tremor/glottal
metrics, and any extra_id_cols/group_by columns.
Performance
Per-turn analysis can be I/O intensive for long files with dense transcripts. Tremor/glottal metrics are substantially more expensive than simple mode.
Examples:
Whole-file analysis with framewise output:
>>> analyze_acoustics(
... wav_path="session.wav",
... out_dir="features/acoustics",
... )
Per-turn analysis aggregated by speaker:
>>> analyze_acoustics(
... wav_path="session.wav",
... transcript_csv="transcripts/session.csv",
... time_unit="ms",
... group_by=["speaker"],
... extra_id_cols=["source", "speaker"],
... out_summary_csv="features/acoustics/session_by_speaker.csv",
... summarize_on_voiced_segments_ms=100,
... mode="simple",
... )
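To show what aggregating per-segment rows by group_by amounts to, here is a small sketch that averages numeric columns within each group. It assumes an unweighted mean across segments, which may not be how the real function combines them; aggregate_by, the n_segments column, and the f0_mean column name are all hypothetical.

```python
from collections import defaultdict


def aggregate_by(rows, group_by, numeric_cols):
    """Average numeric columns across rows that share the same group_by values."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[col] for col in group_by)].append(row)
    out = []
    for key, members in buckets.items():
        merged = dict(zip(group_by, key))
        merged["n_segments"] = len(members)
        for col in numeric_cols:
            values = [m[col] for m in members if m.get(col) is not None]
            merged[col] = sum(values) / len(values) if values else None
        out.append(merged)
    return out


# Made-up per-segment summaries; f0_mean is in Hz.
segments = [
    {"speaker": "A", "f0_mean": 110.0},
    {"speaker": "B", "f0_mean": 205.0},
    {"speaker": "A", "f0_mean": 130.0},
]
for row in aggregate_by(segments, ["speaker"], ["f0_mean"]):
    print(row)
```

An unweighted mean lets a very short turn count as much as a long one; weighting by segment duration is a reasonable alternative when turn lengths vary widely.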
Source code in src\taters\audio\analyze_vocal_acoustics.py
Split WAV by speaker¶
Given a diarization transcript, create one WAV per speaker by concatenating that speaker's segments. You can insert small silences to avoid clicks at joins and resample/downmix on the fly. Filenames are readable and stable.
What it does
- reads a timestamped CSV with start_time,end_time,speaker
- builds one output WAV per unique speaker
- optional silence padding, resampling, mono mixdown
- skips ultra-short segments; clamps times to audio bounds
When to use
- you want per-speaker audio for targeted feature extraction or human coding
- you plan to model speakers separately or compute speaker-level aggregates
API: make per-speaker WAVs from a transcript¶
Concatenate speaker-specific segments into per-speaker WAV files.
If merge_consecutive=True (default), adjacent transcript rows with the same
speaker are merged into a single, longer segment spanning from the first
start to the last end — including any silence between those turns. If you
need the strict per-row behavior, set merge_consecutive=False.
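The merging rule can be sketched on plain tuples. This is an illustration of the behavior described above, not the library's code: merge_consecutive_turns is a hypothetical helper, and sorting rows by start time before merging is an assumption.

```python
def merge_consecutive_turns(rows):
    """Merge adjacent rows with the same speaker into one (start, end, speaker) span."""
    merged = []
    for start, end, speaker in sorted(rows, key=lambda r: r[0]):
        if merged and merged[-1][2] == speaker:
            prev_start, prev_end, _ = merged[-1]
            # The merged span runs from the first start to the last end,
            # so any silence between the two turns is included.
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


# Times in ms; speaker A talks twice in a row with a 400 ms gap.
turns = [(0, 1000, "A"), (1400, 2000, "A"), (2100, 3000, "B"), (3200, 4000, "A")]
print(merge_consecutive_turns(turns))
```

Speaker A's final turn stays separate because speaker B's turn sits between it and the earlier merged span.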
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_wav | str \| Path | Path to the source WAV. | required |
| transcript_csv_path | str \| Path | CSV with timing and speaker columns (e.g., diarization output). | required |
| output_dir | str \| Path \| None | Where to write the per-speaker files. If None, defaults to | None |
| start_col | str | Name of the start-time column in the transcript CSV. | 'start_time' |
| end_col | str | Name of the end-time column in the transcript CSV. | 'end_time' |
| speaker_col | str | Name of the speaker column in the transcript CSV. | 'speaker' |
| time_unit | ('ms', 's') | Units for start/end columns. | "ms","s" |
| silence_ms | int | If | 1000 |
| pre_silence_ms | int \| None | Explicit padding (ms) before each segment; overrides | None |
| post_silence_ms | int \| None | Explicit padding (ms) after each segment; overrides | None |
| sr | int \| None | Resample output to this rate. If None, keep original rate. | 16000 |
| mono | bool | Downmix to mono if True. | True |
| min_dur_ms | int | Skip segments shorter than this duration (ms). | 50 |
| merge_consecutive | bool | Merge back-to-back turns for the same speaker into one segment span (including any inter-turn silence). If False, emit one clip per row. | True |
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | Mapping from friendly speaker label → output WAV path. |
Behavior
- Input speaker labels are sanitized for filenames but a more readable label (without path-hostile characters) is preserved for naming.
- Segments are sorted by start time per speaker before concatenation.
- If a speaker ends up with zero valid segments, no file is written.
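Clamping to the audio bounds and skipping ultra-short segments comes down to a little index arithmetic. The sketch below converts one transcript row to sample indices; segment_to_samples is a hypothetical helper and the rounding choice is an assumption.

```python
def segment_to_samples(start, end, sr, n_samples, time_unit="ms", min_dur_ms=50):
    """Convert one (start, end) pair to clamped sample indices, or None if too short."""
    scale = 1000.0 if time_unit == "ms" else 1.0          # timestamp units per second
    first = max(0, min(n_samples, int(round(start / scale * sr))))
    last = max(0, min(n_samples, int(round(end / scale * sr))))
    # Skip spans that are empty or shorter than min_dur_ms after clamping.
    if last <= first or (last - first) * 1000.0 / sr < min_dur_ms:
        return None
    return first, last


sr, n_samples = 16000, 16000 * 10                         # a 10-second clip
print(segment_to_samples(1000, 2500, sr, n_samples))      # ordinary segment
print(segment_to_samples(9500, 12000, sr, n_samples))     # end clamped to the clip
print(segment_to_samples(3000, 3020, sr, n_samples))      # 20 ms: skipped
```

Checking the duration after clamping means a segment that lies mostly past the end of the file is dropped rather than written as a sliver.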
Examples:
>>> make_speaker_wavs_from_csv(
... source_wav="audio/session.wav",
... transcript_csv_path="transcripts/session.csv",
... time_unit="ms",
... silence_ms=0, # no padding
... sr=16000,
... mono=True,
... )
Source code in src\taters\audio\split_wav_by_speaker.py
Practical notes¶
- Paths: if you do not pass explicit outputs, tools write to predictable folders next to your project root (for example, ./audio, ./features/whisper-embeddings).
- Overwrite behavior: by default, functions will not overwrite existing files; pass the relevant flag to force a rebuild.
- Device selection: where supported, device="auto" picks sensibly; set cuda or cpu explicitly when you need control.
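If you wrap these tools in your own pipeline steps, the same "do not overwrite unless asked" convention is easy to mirror. The sketch below is an assumption about how such a guard could look, not Taters code: resolve_output is a hypothetical helper, and individual Taters functions may skip, return the existing path, or raise instead.

```python
import tempfile
from pathlib import Path


def resolve_output(path, overwrite_existing=False):
    """Return (path, should_write); create parent folders, never clobber unless asked."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists() and not overwrite_existing:
        return path, False        # reuse the existing artifact
    return path, True


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "features" / "whisper-embeddings" / "session_embeddings.csv"
    print(resolve_output(target)[1])                            # first run: write
    target.write_text("placeholder\n")
    print(resolve_output(target)[1])                            # file exists: skip
    print(resolve_output(target, overwrite_existing=True)[1])   # forced rebuild
```

Returning the path even when nothing is written lets downstream steps consume a cached artifact without special-casing reruns.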