Utilities & Helpers
taters.helpers.feature_gather
AggregationPlan
dataclass

AggregationPlan(
    group_by,
    per_file=True,
    stats=("mean", "std"),
    exclude_cols=(),
    include_regex=None,
    exclude_regex=None,
    dropna=True,
)
Plan describing how numeric feature columns should be aggregated.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
group_by | Sequence[str] | One or more column names used as grouping keys. | required |
per_file | bool | If True, include "source" as an additional grouping key so statistics are computed within each input file. | True |
stats | Sequence[str] | Statistical reductions to compute for each numeric feature column. Values are passed to pandas' groupby aggregation (e.g., "mean", "std", "median"). | ("mean", "std") |
exclude_cols | Sequence[str] | Columns to drop before filtering/selecting numeric features (e.g., timestamps or free text). | () |
include_regex | str or None | Optional regex; if provided, only columns matching this pattern are kept (after excluding exclude_cols). | None |
exclude_regex | str or None | Optional regex; if provided, columns matching this pattern are removed (after applying include_regex). | None |
dropna | bool | Whether to drop rows with NA in any of the group-by keys before grouping. | True |
Notes
This plan is consumed by aggregate_features. Column filtering happens before
numeric selection; only columns that remain and can be coerced to numeric
will be aggregated.
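For illustration, a plan that averages features per speaker within each source
file might look like the following (a minimal sketch; the "speaker" key and
"timestamp" column are hypothetical names):

>>> from taters.helpers.feature_gather import AggregationPlan
>>> plan = AggregationPlan(
...     group_by=["speaker"],
...     per_file=True,
...     stats=("mean", "std"),
...     exclude_cols=("timestamp",),
... )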
aggregate_features

aggregate_features(
    *,
    root_dir,
    pattern="*.csv",
    recursive=True,
    delimiter=",",
    encoding="utf-8-sig",
    add_source_path=False,
    plan,
    out_csv=None,
    overwrite_existing=False
)

Discover files, read, concatenate, and aggregate numeric columns per plan.
This function consolidates CSVs from a single folder, filters columns,
coerces candidate features to numeric, groups by the specified keys,
and computes the requested statistics. Output columns for aggregated
features are flattened with the pattern "{column}__{stat}".
Parameters:

Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing per-item CSVs, or a single CSV file. | required |
pattern | str | Glob pattern for selecting files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding for read/write. | "utf-8-sig" |
add_source_path | bool | If True, include the absolute file path in a "source_path" column. | False |
plan | AggregationPlan | Aggregation configuration (group keys, stats, filters, NA handling). | required |
out_csv | PathLike or None | Output path. If None, a default output location is derived. | None |
overwrite_existing | bool | If False and the output file already exists, it is not regenerated. | False |
Returns:

Type | Description |
---|---|
Path | Path to the written CSV of aggregated features. |
Raises:

Type | Description |
---|---|
FileNotFoundError | If no files match the pattern under root_dir. |
RuntimeError | If files were found but none could be read successfully. |
ValueError | If required group-by columns are missing, if no numeric columns remain after filtering, or if per-file grouping is requested but the "source" column is absent. |
Notes
Group keys are preserved as leading columns in the output. The output places
"source" (and optionally "source_path") first when present.
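A typical call (the folder layout and "speaker" column are hypothetical; a
numeric input column such as "pitch" would surface in the output as
"pitch__mean"):

>>> from taters.helpers.feature_gather import aggregate_features, make_plan
>>> out_path = aggregate_features(
...     root_dir="features/",
...     plan=make_plan(group_by=["speaker"], stats=("mean",)),
... )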
Source code in src\taters\helpers\feature_gather.py
feature_gather

feature_gather(
    *,
    root_dir,
    pattern="*.csv",
    recursive=True,
    delimiter=",",
    encoding="utf-8-sig",
    add_source_path=False,
    aggregate=False,
    plan=None,
    group_by=None,
    per_file=True,
    stats=("mean", "std"),
    exclude_cols=(),
    include_regex=None,
    exclude_regex=None,
    dropna=True,
    out_csv=None,
    overwrite_existing=False
)

Single entry point to concatenate or aggregate feature CSVs from one folder.
If aggregate=False, CSVs are concatenated with origin metadata (see
gather_csvs_to_one). If aggregate=True, numeric feature columns are
aggregated per the provided or constructed plan (see aggregate_features).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing per-item CSVs (or a single CSV file). | required |
pattern | str | Glob pattern for selecting CSV files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding. | "utf-8-sig" |
add_source_path | bool | If True, include a "source_path" column with the absolute file path. | False |
aggregate | bool | Toggle aggregation mode. If False, files are concatenated. | False |
plan | AggregationPlan or None | Explicit plan for aggregation. If None, a quick plan is built from the arguments below. | None |
group_by | Sequence[str] or None | Quick-plan keys. Used only when plan is None. | None |
per_file | bool | Quick-plan flag; include "source" as an additional grouping key. | True |
stats | Sequence[str] | Quick-plan statistics to compute per numeric column. | ("mean", "std") |
exclude_cols | Sequence[str] | Quick-plan columns to drop before numeric selection. | () |
include_regex | str or None | Quick-plan regex to include feature columns by name. | None |
exclude_regex | str or None | Quick-plan regex to exclude feature columns by name. | None |
dropna | bool | Quick-plan NA handling for group keys. | True |
out_csv | PathLike or None | Output CSV path. If None, a default output location is derived. | None |
overwrite_existing | bool | If False and the output file already exists, it is not regenerated. | False |
Returns:

Type | Description |
---|---|
Path | Path to the resulting CSV. |
Raises:

Type | Description |
---|---|
ValueError | If aggregate=True but neither plan nor group_by is provided. |
See Also
gather_csvs_to_one : Concatenate CSVs with origin metadata.
aggregate_features : Aggregate numeric columns according to a plan.
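Both modes in one place (the paths and the "speaker" key are hypothetical):

>>> from taters.helpers.feature_gather import feature_gather
>>> # Concatenate every CSV under features/ with a leading "source" column.
>>> combined = feature_gather(root_dir="features/")
>>> # Aggregate instead, building a quick plan from keyword arguments.
>>> aggregated = feature_gather(
...     root_dir="features/",
...     aggregate=True,
...     group_by=["speaker"],
...     stats=("mean", "std"),
... )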
Source code in src\taters\helpers\feature_gather.py
gather_csvs_to_one

gather_csvs_to_one(
    *,
    root_dir,
    pattern="*.csv",
    recursive=True,
    delimiter=",",
    encoding="utf-8-sig",
    add_source_path=False,
    out_csv=None,
    overwrite_existing=False
)

Concatenate many CSVs into a single CSV with origin metadata.
Each input CSV is loaded (all columns as object dtype), a leading "source"
column is inserted (and optionally "source_path"), and rows are appended.
The final CSV ensures "source" (and, if present, "source_path") lead the
column order.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing CSVs, or a single CSV file. | required |
pattern | str | Glob pattern for selecting files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding for read/write. | "utf-8-sig" |
add_source_path | bool | If True, include the absolute file path in a "source_path" column. | False |
out_csv | PathLike or None | Output path. If None, a default output location is derived. | None |
overwrite_existing | bool | If False and the output file already exists, it is not regenerated. | False |
Returns:

Type | Description |
---|---|
Path | Path to the written CSV. |
Raises:

Type | Description |
---|---|
FileNotFoundError | If no files match the pattern under root_dir. |
RuntimeError | If files were found but none could be read successfully. |
Notes
Input rows are not type-coerced beyond object dtype. Column order from inputs is preserved after the leading origin columns.
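A minimal call (hypothetical folder name; the output keeps every input column
and prepends "source", plus "source_path" here since it is requested):

>>> from taters.helpers.feature_gather import gather_csvs_to_one
>>> combined = gather_csvs_to_one(root_dir="features/", add_source_path=True)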
Source code in src\taters\helpers\feature_gather.py
main

main()

Entry point for the command-line interface.
Parses arguments, dispatches to gather_csvs_to_one, aggregate_features, or
feature_gather depending on the selected subcommand, and prints the resulting
output path.
Notes
This function is invoked when the module is executed as a script:
python -m taters.helpers.feature_gather <subcommand> [options]
Source code in src\taters\helpers\feature_gather.py
make_plan

make_plan(
    *,
    group_by,
    per_file=True,
    stats=("mean", "std"),
    exclude_cols=(),
    include_regex=None,
    exclude_regex=None,
    dropna=True
)

Create an AggregationPlan from simple arguments.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
group_by | Sequence[str] | Grouping key(s) to use. | required |
per_file | bool | If True, group within files by including "source" as a key. | True |
stats | Sequence[str] | Statistical reductions to compute per numeric column. | ("mean", "std") |
exclude_cols | Sequence[str] | Columns to drop prior to feature selection. | () |
include_regex | str or None | Regex to include feature columns by name. | None |
exclude_regex | str or None | Regex to exclude feature columns by name. | None |
dropna | bool | Drop rows with NA in any group key. | True |
Returns:

Type | Description |
---|---|
AggregationPlan | A configured plan instance for aggregate_features. |
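For example, a plan restricted to one feature family via regex (the "mfcc_"
prefix and "speaker" key are hypothetical; the stat names assume valid pandas
reductions):

>>> from taters.helpers.feature_gather import make_plan
>>> plan = make_plan(
...     group_by=["speaker"],
...     include_regex=r"^mfcc_",
...     stats=("mean", "std"),
... )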
Source code in src\taters\helpers\feature_gather.py
taters.helpers.find_files
find_files

find_files(
    root_dir,
    *,
    file_type="video",
    extensions=None,
    recursive=True,
    follow_symlinks=False,
    include_hidden=False,
    include_globs=None,
    exclude_globs=None,
    absolute=True,
    sort=True,
    ffprobe_verify=False
)
Discover media files under a folder using smart, FFmpeg-friendly filters.
You can either (a) choose a built-in group of extensions via file_type
("audio" | "video" | "image" | "subtitle" | "archive" | "any") or (b) pass an
explicit list of extensions to match. Matching is case-insensitive; dots are
optional (e.g., ".wav" and "wav" are equivalent). Hidden files and directories
are excluded by default.
For audio/video, ffprobe_verify=True additionally checks that at least one
corresponding stream is present (e.g., exclude MP4s with no audio when
file_type="audio"). This is slower but robust when your dataset contains
“container only” files.
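Conceptually, the stream check boils down to asking ffprobe whether any stream
of the wanted type exists. A minimal sketch of that idea (not the module's
exact implementation; it assumes ffprobe is on PATH):

>>> import subprocess
>>> def has_stream(path, kind="a"):  # "a" = audio, "v" = video
...     # List only streams of the requested type; any output means one exists.
...     result = subprocess.run(
...         ["ffprobe", "-v", "error", "-select_streams", kind,
...          "-show_entries", "stream=codec_type", "-of", "csv=p=0", str(path)],
...         capture_output=True, text=True,
...     )
...     return bool(result.stdout.strip())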
Parameters:

Name | Type | Description | Default |
---|---|---|---|
root_dir | str or PathLike | Folder to scan. | required |
file_type | str | Built-in group selector. Ignored if extensions is provided. | 'video' |
extensions | Optional[Sequence[str]] | Explicit extensions to include (e.g., [".wav", ".flac"]). | None |
recursive | bool | Recurse into subfolders. | True |
follow_symlinks | bool | Follow directory symlinks during traversal. | False |
include_hidden | bool | Include dot-files and dot-dirs. | False |
include_globs | Optional[Sequence[str]] | Additional glob filters applied after extension filtering; when given, a file must match at least one to be kept. | None |
exclude_globs | Optional[Sequence[str]] | Glob filters that remove matching files after inclusion filtering. | None |
absolute | bool | Return absolute paths when True. | True |
sort | bool | Sort lexicographically (case-insensitive). | True |
ffprobe_verify | bool | For audio/video, verify with ffprobe that at least one corresponding stream is present. | False |
Returns:

Type | Description |
---|---|
list[Path] | The matched files. |
Raises:

Type | Description |
---|---|
FileNotFoundError | If root_dir does not exist. |
ValueError | If file_type is not one of the recognized groups (when no explicit extensions are given). |
Examples:
Find all videos (recursive), as absolute paths:
>>> find_files("dataset", file_type="video")
Use explicit extensions and keep paths relative:
>>> find_files("dataset", extensions=[".wav",".flac"], absolute=False)
Only include files matching a glob and exclude temp folders:
>>> find_files("dataset", file_type="audio",
... include_globs=["**/*session*"], exclude_globs=["**/tmp/**"])
Verify playable audio streams exist:
>>> find_files("dataset", file_type="audio", ffprobe_verify=True)
Source code in src\taters\helpers\find_files.py
taters.helpers.text_gather
csv_to_analysis_ready_csv

csv_to_analysis_ready_csv(
    *,
    csv_path,
    out_csv=None,
    overwrite_existing=False,
    text_cols,
    id_cols=None,
    mode="concat",
    group_by=None,
    delimiter=None,
    encoding=DEFAULT_ENCODING,
    joiner=DEFAULT_JOINER,
    num_buckets=1024,
    max_open_bucket_files=64,
    tmp_root=None
)
Stream a (possibly huge) CSV into a compact analysis-ready CSV with a stable
schema and optional external grouping.
Output schema
Always writes a header and enforces a consistent column order:
- No grouping: text_id,text (plus source_col if mode="separate")
- With grouping: text_id,text,group_count (plus source_col if mode="separate")
Where:
- text_id is either the composed ID from id_cols or row_<n> when id_cols=None.
- mode="concat" joins all text_cols using joiner per row or group.
- mode="separate" emits one row per (row_or_group, text_col) pair and fills
  source_col with the contributing column name.
Grouping at scale
If group_by is provided, the function performs a two-pass external grouping
that does not require presorting:
1) Hash-partition rows into on-disk “bucket” CSVs (bounded writers with an
   LRU cache).
2) Aggregate each bucket into final rows (concat or separate mode), writing
   group_count to record how many pieces contributed.
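The first pass only needs a stable row-to-bucket assignment. A minimal sketch
of that idea (a hypothetical helper; the real implementation also bounds the
number of simultaneously open bucket files):

>>> import hashlib
>>> def bucket_index(key_parts, num_buckets=1024):
...     # Hash the group key deterministically so every row of a group
...     # lands in the same on-disk bucket CSV.
...     joined = "\x1f".join(key_parts).encode("utf-8")
...     return int(hashlib.md5(joined).hexdigest(), 16) % num_buckets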
Parameters:

Name | Type | Description | Default |
---|---|---|---|
csv_path | PathLike | Source CSV with at least the columns named in text_cols (plus any id_cols and group_by columns). | required |
out_csv | PathLike or None | Destination CSV. If None, a default output location is derived. | None |
overwrite_existing | bool | If False and the destination exists, it is not rewritten. | False |
text_cols | Sequence[str] | One or more text fields to concatenate or emit separately. | required |
id_cols | Sequence[str] or None | Optional columns to compose text_id; if None, IDs are row_<n>. | None |
mode | str | "concat" joins text_cols per row or group; "separate" emits one row per text column and fills source_col. | 'concat' |
group_by | Sequence[str] or None | Optional list of columns to aggregate by; works on unsorted CSVs. | None |
delimiter | str or None | CSV delimiter. If None, a default is used. | None |
encoding | str | CSV encoding for read/write. | DEFAULT_ENCODING |
joiner | str | String used to join text pieces in concat mode. | DEFAULT_JOINER |
num_buckets | int | Number of hash partitions used for external grouping. | 1024 |
max_open_bucket_files | int | Maximum number of bucket writers kept open at once (LRU-bounded). | 64 |
tmp_root | PathLike or None | Root directory for temporary bucket files used during external grouping. | None |
Returns:

Type | Description |
---|---|
Path | Path to the analysis-ready CSV. |
Raises:

Type | Description |
---|---|
ValueError | If required columns are missing or mode is not a recognized value. |
Examples:
Concatenate two text fields per row:
>>> csv_to_analysis_ready_csv(
... csv_path="transcripts.csv",
... text_cols=["prompt","response"],
... id_cols=["speaker"],
... )
Group by speaker and join rows:
>>> csv_to_analysis_ready_csv(
... csv_path="transcripts.csv",
... text_cols=["text"],
... group_by=["speaker"],
... )
Source code in src\taters\helpers\text_gather.py
txt_folder_to_analysis_ready_csv

txt_folder_to_analysis_ready_csv(
    *,
    root_dir,
    out_csv=None,
    recursive=False,
    pattern="*.txt",
    encoding="utf-8",
    id_from="stem",
    include_source_path=True,
    overwrite_existing=False
)

Stream a folder of .txt files into an analysis-ready CSV with predictable,
reproducible IDs.
For each file matching pattern, the emitted row contains:
- text_id: the basename (stem), full filename, or relative path (see id_from),
- text: the file contents, and
- source_path: optional column with the path relative to root_dir.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing .txt files. | required |
out_csv | PathLike or None | Destination CSV. If None, a default output location is derived. | None |
recursive | bool | Recurse into subfolders. | False |
pattern | str | Glob for matching text files. | '*.txt' |
encoding | str | File decoding. | 'utf-8' |
id_from | str | How to derive text_id: the file stem, the full filename, or the path relative to root_dir. | 'stem' |
include_source_path | bool | If True, add a source_path column with each file's path relative to root_dir. | True |
overwrite_existing | bool | If False and the destination exists, it is not rewritten. | False |
Returns:

Type | Description |
---|---|
Path | Path to the analysis-ready CSV. |
Examples:
>>> txt_folder_to_analysis_ready_csv(root_dir="notes", recursive=True, id_from="path")
Source code in src\taters\helpers\text_gather.py