Utilities & Helpers¶
The helpers do the unglamorous work that makes everything else feel simple. Well... simpler, I suppose. They help you do things like (1) find files without writing a custom glob every time, (2) turn raw text sources into analysis-ready CSVs, and (3) gather or aggregate many per-file feature CSVs into tidy datasets. The goal is predictable inputs and outputs, with just enough power for large projects.
File discovery: find_files¶
Find media files under a folder using smart, FFmpeg-friendly filters. You can choose a built-in group (audio, video, image, subtitle, any) or pass explicit extensions. Hidden items are ignored by default; optional glob includes/excludes give you surgical control. For audio/video, ffprobe_verify=True
keeps only files where a matching stream actually exists (handy for odd containers).
Highlights

- Groups: audio, video, image, subtitle, archive, any.
- Or set extensions=[".wav",".flac"] to override groups.
- include_globs / exclude_globs to narrow by name or path.
- ffprobe_verify=True to confirm playable streams (audio/video).
Python
from taters.helpers.find_files import find_files
# All videos under "dataset", absolute paths
videos = find_files("dataset", file_type="video")
# Only WAV/FLAC, relative paths, include/exclude patterns
wavs = find_files(
"dataset",
extensions=[".wav", ".flac"],
absolute=False,
include_globs=["**/*session*"],
exclude_globs=["**/tmp/**"],
)
# Audio files that actually contain audio streams
aud = find_files("dataset", file_type="audio", ffprobe_verify=True)
print(len(aud), "usable audio files")
CLI
python -m taters.helpers.find_files dataset --file_type video
python -m taters.helpers.find_files dataset --ext .wav --ext .flac --relative
python -m taters.helpers.find_files dataset --file_type audio --ffprobe-verify
API: find_files¶
Discover media files under a folder using smart, FFmpeg-friendly filters.
You can either (a) choose a built-in group of extensions via file_type
("audio"|"video"|"image"|"subtitle"|"archive"|"any"
) or (b) pass an explicit
list of extensions
to match. Matching is case-insensitive; dots are optional
(e.g., ".wav"
and "wav"
are equivalent). Hidden files and directories are
excluded by default.
For audio/video, ffprobe_verify=True
additionally checks that at least one
corresponding stream is present (e.g., exclude MP4s with no audio when
file_type="audio"
). This is slower but robust when your dataset contains
“container only” files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | str or PathLike | Folder to scan. | required |
file_type | str | Built-in group selector ("audio", "video", "image", "subtitle", "archive", "any"). Ignored if extensions is provided. | 'video' |
extensions | Optional[Sequence[str]] | Explicit extensions to include (e.g., [".wav", ".flac"]); overrides file_type. | None |
recursive | bool | Recurse into subfolders. | True |
follow_symlinks | bool | Follow directory symlinks during traversal. | False |
include_hidden | bool | Include dot-files and dot-dirs. | False |
include_globs | Optional[Sequence[str]] | Additional glob filters applied after extension filtering. | None |
exclude_globs | Optional[Sequence[str]] | Glob filters that remove matches after extension filtering. | None |
absolute | bool | Return absolute paths when True; paths relative to root_dir otherwise. | True |
sort | bool | Sort lexicographically (case-insensitive). | True |
ffprobe_verify | bool | For audio/video, keep only files where ffprobe reports at least one matching stream. | False |
Returns:
Type | Description |
---|---|
list[Path] | The matched files. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If root_dir does not exist. |
ValueError | If file_type is not one of the supported groups. |
Examples:
Find all videos (recursive), as absolute paths:
>>> find_files("dataset", file_type="video")
Use explicit extensions and keep paths relative:
>>> find_files("dataset", extensions=[".wav",".flac"], absolute=False)
Only include files matching a glob and exclude temp folders:
>>> find_files("dataset", file_type="audio",
... include_globs=["**/*session*"], exclude_globs=["**/tmp/**"])
Verify playable audio streams exist:
>>> find_files("dataset", file_type="audio", ffprobe_verify=True)
Source code in src\taters\helpers\find_files.py
Text gathering: text_gather¶
Turn either a CSV or a folder of .txt files into an analysis-ready CSV with the stable schema text_id,text[,group_count][,source_col][,source_path]. Two entry points:

- csv_to_analysis_ready_csv(...) – stream a possibly huge CSV, pick text_cols, optionally group_by, choose mode="concat"|"separate", and write text_id,text (plus extras). With grouping, group_count records how many pieces contributed. CSV mode supports external grouping via on-disk bucket partitioning, so you do not need to pre-sort giant files.
- txt_folder_to_analysis_ready_csv(...) – scan a directory of .txt files and emit one row per file; choose how to derive text_id (stem, name, or path), and include source_path if you want provenance columns.
Key ideas

- Modes:
  - concat joins multiple text columns (or rows within a group) using joiner.
  - separate emits one row per text column and fills source_col.
- Grouping at scale: Two-pass, external hash partitioning with an LRU of open writers; tune with num_buckets and max_open_bucket_files.
- Defaults: Encoding tolerant of Excel (utf-8-sig), delimiter sniffing if you do not specify one, don’t-overwrite-unless-asked.
Python (CSV source → per-speaker text)
from taters.helpers.text_gather import csv_to_analysis_ready_csv
analysis_csv = csv_to_analysis_ready_csv(
csv_path="transcripts/session.csv",
text_cols=["text"],
id_cols=["speaker"], # optional; composes text_id when not grouping
group_by=["speaker"], # aggregate all utterances by speaker
mode="concat",
delimiter=",", # sniffed if None
joiner=" ",
num_buckets=1024, # scale knobs for big files
max_open_bucket_files=64,
)
print("Wrote:", analysis_csv)
Python (folder of .txt)
from taters.helpers.text_gather import txt_folder_to_analysis_ready_csv
analysis_csv = txt_folder_to_analysis_ready_csv(
root_dir="notes/",
recursive=True,
id_from="path",
include_source_path=True,
)
CLI
# CSV mode (repeat --text-col and --group-by as needed)
python -m taters.helpers.text_gather \
--csv transcripts/session.csv \
--text-col text --group-by speaker \
--delimiter "," --overwrite_existing false
# TXT folder mode
python -m taters.helpers.text_gather \
--txt-dir corpus/ --recursive --id-from path
API: csv_to_analysis_ready_csv¶
Stream a (possibly huge) CSV into a compact analysis-ready CSV with a stable schema and optional external grouping.
Output schema
Always writes a header and enforces a consistent column order:
• No grouping: text_id,text (plus source_col if mode="separate")
• With grouping: text_id,text,group_count (plus source_col if mode="separate")
Where:
- text_id is either the composed ID from id_cols, or row_<n> when id_cols=None.
- mode="concat" joins all text_cols using joiner per row or group.
- mode="separate" emits one row per (row_or_group, text_col) and fills source_col with the contributing column name.
Grouping at scale
If group_by is provided, the function performs a two-pass external grouping that does not require presorting:
1) Hash-partition rows to on-disk “bucket” CSVs (bounded writers with LRU).
2) Aggregate each bucket into final rows (concat or separate mode), writing group_count to record how many pieces contributed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
csv_path | PathLike | Source CSV containing at least the columns named in text_cols (and any id_cols / group_by columns). | required |
out_csv | PathLike or None | Destination CSV. If None, a default output path is derived. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
text_cols | Sequence[str] | One or more text fields to concatenate or emit separately. | required |
id_cols | Sequence[str] or None | Optional columns used to compose text_id when not grouping. | None |
mode | str | "concat" joins the text columns per row or group; "separate" emits one row per text column and fills source_col. | 'concat' |
group_by | Sequence[str] or None | Optional list of columns to aggregate by; works on unsorted CSVs. | None |
delimiter | str or None | CSV delimiter. If None, the delimiter is sniffed from the file. | None |
encoding | str or None | Text encoding. If None, an Excel-tolerant default (utf-8-sig) is used. | None |
joiner | str or None | String used to join text pieces in concat mode. If None, a default joiner is used. | None |
num_buckets | int | External grouping: number of on-disk bucket partitions. | 1024 |
max_open_bucket_files | int | External grouping: maximum number of simultaneously open bucket writers (LRU). | 1024 |
tmp_root | PathLike or None | External grouping: root directory for temporary bucket files. | None |
Returns:
Type | Description |
---|---|
Path | Path to the analysis-ready CSV. |
Raises:
Type | Description |
---|---|
ValueError | If required columns are missing or an unsupported mode is given. |
Examples:
Concatenate two text fields per row:
>>> csv_to_analysis_ready_csv(
... csv_path="transcripts.csv",
... text_cols=["prompt","response"],
... id_cols=["speaker"],
... )
Group by speaker and join rows:
>>> csv_to_analysis_ready_csv(
... csv_path="transcripts.csv",
... text_cols=["text"],
... group_by=["speaker"],
... )
Source code in src\taters\helpers\text_gather.py
API: txt_folder_to_analysis_ready_csv¶
Stream a folder of .txt files into an analysis-ready CSV with predictable, reproducible IDs.
For each file matching pattern, the emitted row contains:
- text_id: the basename (stem), full filename, or relative path (see id_from),
- text: the file contents, and
- source_path: optional column with the path relative to root_dir.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing .txt files. | required |
out_csv | PathLike or None | Destination CSV. If None, a default output path is derived. | None |
recursive | bool | Recurse into subfolders. | False |
pattern | str | Glob for matching text files. | '*.txt' |
encoding | str | File decoding. | 'utf-8' |
id_from | str | How to derive text_id: "stem", "name", or "path". | 'stem' |
include_source_path | bool | If True, add a source_path column with the path relative to root_dir. | True |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the analysis-ready CSV. |
Examples:
>>> txt_folder_to_analysis_ready_csv(root_dir="notes", recursive=True, id_from="path")
Source code in src\taters\helpers\text_gather.py
Feature gatherer: concatenate or aggregate many CSVs¶
Once you have lots of per-file outputs (e.g., embeddings per segment, dictionary scores per file), these helpers build a single dataset for modeling or visualization.
- gather_csvs_to_one(...) – just stack CSVs, inserting a leading source column (and optionally source_path). Output defaults to <root_dir.name>.csv next to the folder.
- feature_gather(...) – a single entry point that either concatenates (default) or performs aggregation if aggregate=True. You can pass an AggregationPlan or let it build one from quick arguments like group_by, stats, and per_file.
- aggregate_features(...) – discover, filter, coerce to numeric, group, and compute statistics; output columns are flattened like feature__mean. Group keys lead the output.
What aggregation means here

- Choose group keys (e.g., ["speaker"]).
- Decide if grouping is per file (per_file=True adds the source key) or across all files.
- Optionally filter columns before numeric selection using exclude_cols, include_regex, and exclude_regex. Only numeric columns (after coercion) are aggregated.
- Compute stats per numeric feature (mean, std, median, etc.).
Python (concatenate only)
from taters.helpers.feature_gather import gather_csvs_to_one
out = gather_csvs_to_one(
root_dir="features/whisper-embeddings",
pattern="*.csv",
recursive=True,
add_source_path=False,
)
print("Merged:", out)
Python (aggregate per speaker within each file)
from taters.helpers.feature_gather import feature_gather
agg = feature_gather(
root_dir="features/sentence-embeddings",
aggregate=True,
group_by=["speaker"], # quick-plan keys
per_file=True, # include 'source' in group keys
stats=("mean","std"), # compute per numeric column
exclude_cols=("start_time","end_time","text"), # drop non-features first
include_regex=None, # or narrow to specific features
out_csv=None, # defaults to ./features/sentence-embeddings.csv
)
print("Aggregated:", agg)
Python (explicit plan)
from taters.helpers.feature_gather import make_plan, aggregate_features
plan = make_plan(
group_by=["speaker"],
per_file=False, # aggregate across files
stats=("mean","std","median"),
exclude_cols=("text",),
include_regex=r"^e\d+$", # only columns like e0,e1,...
)
out = aggregate_features(
root_dir="features/sentence-embeddings",
plan=plan,
)
CLI
# Concatenate
python -m taters.helpers.feature_gather gather \
--root_dir features/archetypes --pattern "*.csv"
# Aggregate (per file, by speaker)
python -m taters.helpers.feature_gather aggregate \
--root_dir features/sentence-embeddings \
--group-by speaker --per-file --stats mean std \
--exclude-cols start_time end_time text
Notes and tips

- Outputs will not be overwritten unless you pass overwrite_existing=True.
- For aggregation, if no numeric columns remain after filtering/coercion, you will get a clear error; adjust filters or check inputs.
- add_source_path=True is great for audits; otherwise keep datasets lean.
API: feature_gather (single entry)¶
Single entry point to concatenate or aggregate feature CSVs from one folder.
If aggregate=False, CSVs are concatenated with origin metadata (see gather_csvs_to_one). If aggregate=True, numeric feature columns are aggregated per the provided or constructed plan (see aggregate_features).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing per-item CSVs (or a single CSV file). | required |
pattern | str | Glob pattern for selecting CSV files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding. | "utf-8-sig" |
add_source_path | bool | If True, include a source_path column with the absolute input path. | False |
aggregate | bool | Toggle aggregation mode. If False, files are concatenated. | False |
plan | AggregationPlan or None | Explicit plan for aggregation; if omitted, one is built from the quick-plan arguments below. | None |
group_by | Sequence[str] or None | Quick-plan keys. Used only when no explicit plan is given. | None |
per_file | bool | Quick-plan flag; include source in the group keys to aggregate within each file. | True |
stats | Sequence[str] | Quick-plan statistics to compute per numeric column. | ("mean", "std") |
exclude_cols | Sequence[str] | Quick-plan columns to drop before numeric selection. | () |
include_regex | str or None | Quick-plan regex to include feature columns by name. | None |
exclude_regex | str or None | Quick-plan regex to exclude feature columns by name. | None |
dropna | bool | Quick-plan NA handling for group keys. | True |
out_csv | PathLike or None | Output CSV path. If None, defaults to <root_dir.name>.csv next to root_dir. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the resulting CSV. |
Raises:
Type | Description |
---|---|
ValueError | If aggregation is requested but neither plan nor quick-plan group_by is provided. |
See Also
gather_csvs_to_one : Concatenate CSVs with origin metadata.
aggregate_features : Aggregate numeric columns according to a plan.
Source code in src\taters\helpers\feature_gather.py
API: gather_csvs_to_one¶
Concatenate many CSVs into a single CSV with origin metadata.
Each input CSV is loaded (all columns as object dtype), a leading "source" column is inserted (and optionally "source_path"), and rows are appended. The final CSV ensures "source" (and, if present, "source_path") lead the column order.
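For example, if two per-file CSVs each have the columns speaker,e0,e1, the merged output looks roughly like this (file names and values are hypothetical; the source value reflects the originating file):
source,speaker,e0,e1
session_01,alice,0.12,0.88
session_01,bob,0.05,0.91
session_02,alice,0.33,0.47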
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing CSVs, or a single CSV file. | required |
pattern | str | Glob pattern for selecting files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding for read/write. | "utf-8-sig" |
add_source_path | bool | If True, include the absolute input path in a source_path column. | False |
out_csv | PathLike or None | Output path. If None, defaults to <root_dir.name>.csv next to root_dir. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the written CSV. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If no files match the pattern under root_dir. |
RuntimeError | If files were found but none could be read successfully. |
Notes
Input rows are not type-coerced beyond object dtype. Column order from inputs is preserved after the leading origin columns.
Source code in src\taters\helpers\feature_gather.py
API: aggregate_features¶
Discover files, read, concatenate, and aggregate numeric columns per plan.
This function consolidates CSVs from a single folder, filters columns,
coerces candidate features to numeric, groups by the specified keys,
and computes the requested statistics. Output columns for aggregated features are flattened with the pattern "{column}__{stat}".
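As an illustration, grouping by speaker with stats=("mean", "std") over numeric columns named e0 and e1 (hypothetical names) yields a header shaped like:
source,speaker,e0__mean,e0__std,e1__mean,e1__std
where source leads only when per-file grouping places it among the group keys.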
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing per-item CSVs, or a single CSV file. | required |
pattern | str | Glob pattern for selecting files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding for read/write. | "utf-8-sig" |
add_source_path | bool | If True, include the absolute input path in a source_path column. | False |
plan | AggregationPlan | Aggregation configuration (group keys, stats, filters, NA handling). | required |
out_csv | PathLike or None | Output path. If None, defaults to <root_dir.name>.csv next to root_dir. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the written CSV of aggregated features. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If no files match the pattern under root_dir. |
RuntimeError | If files were found but none could be read successfully. |
ValueError | If required group-by columns are missing, if no numeric columns remain after filtering, or if per-file grouping is requested but the source column is absent. |
Notes
Group keys are preserved as leading columns in the output. The output places "source" (and optionally "source_path") first when present.
Source code in src\taters\helpers\feature_gather.py
API: make_plan / AggregationPlan¶
Create an AggregationPlan from simple arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group_by | Sequence[str] | Grouping key(s) to use (e.g., ["speaker"]). | required |
per_file | bool | If True, group within files by including source as an additional key. | True |
stats | Sequence[str] | Statistical reductions to compute per numeric column. | ("mean", "std") |
exclude_cols | Sequence[str] | Columns to drop prior to feature selection. | () |
include_regex | str or None | Regex to include feature columns by name. | None |
exclude_regex | str or None | Regex to exclude feature columns by name. | None |
dropna | bool | Drop rows with NA in any group key. | True |
Returns:
Type | Description |
---|---|
AggregationPlan | A configured plan instance for aggregate_features. |
Source code in src\taters\helpers\feature_gather.py
Plan describing how numeric feature columns should be aggregated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group_by | Sequence[str] | One or more column names used as grouping keys (e.g., ["speaker"]). | required |
per_file | bool | If True, include source as an additional group key so aggregation happens within each file. | True |
stats | Sequence[str] | Statistical reductions to compute for each numeric feature column (e.g., "mean", "std", "median"). | ("mean", "std") |
exclude_cols | Sequence[str] | Columns to drop before filtering/selecting numeric features (e.g., timestamps or free text). | () |
include_regex | str or None | Optional regex; if provided, only columns matching this pattern are kept (after exclude_cols is applied). | None |
exclude_regex | str or None | Optional regex; if provided, columns matching this pattern are removed (after applying include_regex). | None |
dropna | bool | Whether to drop rows with NA in any of the group-by keys before grouping. | True |
Notes
This plan is consumed by aggregate_features. Column filtering happens before numeric selection; only columns that remain and can be coerced to numeric will be aggregated.
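If you prefer to build the plan directly rather than through make_plan, the fields documented above suggest a construction like the following. This is a sketch; it assumes AggregationPlan accepts these keyword arguments, which is what make_plan fills in for you:
>>> from taters.helpers.feature_gather import AggregationPlan, aggregate_features
>>> plan = AggregationPlan(
...     group_by=["speaker"],
...     per_file=False,
...     stats=("mean", "std", "median"),
...     exclude_cols=("text",),
...     include_regex=r"^e\d+$",
... )
>>> aggregate_features(root_dir="features/sentence-embeddings", plan=plan)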
Practical patterns¶
- Prepare text for analysis: Use csv_to_analysis_ready_csv to gather and optionally group text (per speaker, per session), then feed into dictionary/archetype or embedding steps.
- Unify features across runs: After extracting features for many files, run feature_gather to produce a single modeling table; add per-file grouping for summary statistics (see the sketch below).
- Curate inputs up front: Use find_files with include_globs/exclude_globs and ffprobe_verify to build clean file lists for pipelines.
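To tie these patterns together, here is a minimal end-to-end sketch. The transcript path, the "speaker"/"text" column names, and the features/sentence-embeddings folder are assumptions for the example, not fixed conventions of the library; the feature-extraction step in the middle is outside the helpers and is only indicated by a comment.
Python (end-to-end sketch)
from taters.helpers.text_gather import csv_to_analysis_ready_csv
from taters.helpers.feature_gather import feature_gather

# 1) Prepare text: group a transcript by speaker into an analysis-ready CSV
ready_csv = csv_to_analysis_ready_csv(
    csv_path="transcripts/session.csv",   # hypothetical transcript
    text_cols=["text"],
    group_by=["speaker"],
    mode="concat",
)

# 2) ...run a feature extractor over ready_csv, writing per-file CSVs
#    into features/sentence-embeddings/ (not shown here)...

# 3) Unify features: one modeling table, aggregated per speaker within each file
table = feature_gather(
    root_dir="features/sentence-embeddings",   # hypothetical folder of per-file CSVs
    aggregate=True,
    group_by=["speaker"],
    per_file=True,
    stats=("mean", "std"),
)
print("Modeling table:", table)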
This guide keeps things high level and practical; the API sections above have every parameter if you want to customize further.