Utilities & Helpers¶
The helpers do the unglamorous work that makes everything else feel simple. Well... simpler, I suppose. They help you do things like (1) find files without writing a custom glob every time, (2) turn raw text sources into analysis-ready CSVs, and (3) gather or aggregate many per-file feature CSVs into tidy datasets. The goal is predictable inputs and outputs, with just enough power for large projects.
File discovery: find_files¶
Find media files under a folder using smart, FFmpeg-friendly filters. You can choose a built-in group (audio, video, image, subtitle, any) or pass explicit extensions. Hidden items are ignored by default; optional glob includes/excludes give you surgical control. For audio/video, ffprobe_verify=True
keeps only files where a matching stream actually exists (handy for odd containers).
Highlights

- Groups: audio, video, image, subtitle, archive, any.
- Or set extensions=[".wav",".flac"] to override groups.
- include_globs / exclude_globs to narrow by name or path.
- ffprobe_verify=True to confirm playable streams (audio/video).
Python
from taters.helpers.find_files import find_files
# All videos under "dataset", absolute paths
videos = find_files("dataset", file_type="video")
# Only WAV/FLAC, relative paths, include/exclude patterns
wavs = find_files(
"dataset",
extensions=[".wav", ".flac"],
absolute=False,
include_globs=["**/*session*"],
exclude_globs=["**/tmp/**"],
)
# Audio files that actually contain audio streams
aud = find_files("dataset", file_type="audio", ffprobe_verify=True)
print(len(aud), "usable audio files")
CLI
python -m taters.helpers.find_files dataset --file_type video
python -m taters.helpers.find_files dataset --ext .wav --ext .flac --relative
python -m taters.helpers.find_files dataset --file_type audio --ffprobe-verify
API: find_files¶
Discover media files under a folder using smart, FFmpeg-friendly filters.
You can either (a) choose a built-in group of extensions via file_type
("audio"|"video"|"image"|"subtitle"|"archive"|"any"
) or (b) pass an explicit
list of extensions
to match. Matching is case-insensitive; dots are optional
(e.g., ".wav"
and "wav"
are equivalent). Hidden files and directories are
excluded by default.
For audio/video, ffprobe_verify=True
additionally checks that at least one
corresponding stream is present (e.g., exclude MP4s with no audio when
file_type="audio"
). This is slower but robust when your dataset contains
“container only” files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | str or PathLike | Folder to scan. | required |
file_type | str | Built-in group selector ("audio", "video", "image", "subtitle", "archive", "any"). Ignored if extensions is provided. | 'video' |
extensions | Optional[Sequence[str]] | Explicit extensions to include (e.g., [".wav", ".flac"]); overrides file_type. | None |
recursive | bool | Recurse into subfolders. | True |
follow_symlinks | bool | Follow directory symlinks during traversal. | False |
include_hidden | bool | Include dot-files and dot-dirs. | False |
include_globs | Optional[Sequence[str]] | Additional glob filters applied after extension filtering. | None |
exclude_globs | Optional[Sequence[str]] | Glob filters that remove matches after extension filtering. | None |
absolute | bool | Return absolute paths when True; paths relative to root_dir otherwise. | True |
sort | bool | Sort lexicographically (case-insensitive). | True |
ffprobe_verify | bool | For audio/video, keep only files where ffprobe reports at least one matching stream. | False |
Returns:
Type | Description |
---|---|
list[Path] | The matched files. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If root_dir does not exist. |
ValueError | If file_type is not one of the supported groups. |
Examples:
Find all videos (recursive), as absolute paths:
>>> find_files("dataset", file_type="video")
Use explicit extensions and keep paths relative:
>>> find_files("dataset", extensions=[".wav",".flac"], absolute=False)
Only include files matching a glob and exclude temp folders:
>>> find_files("dataset", file_type="audio",
... include_globs=["**/*session*"], exclude_globs=["**/tmp/**"])
Verify playable audio streams exist:
>>> find_files("dataset", file_type="audio", ffprobe_verify=True)
Source code in src\taters\helpers\find_files.py
Text gathering: text_gather¶
Turn either a CSV or a folder of .txt files into an analysis-ready CSV with the stable schema text_id,text[,group_count][,source_col][,source_path]. Two entry points:

- csv_to_analysis_ready_csv(...) – stream a possibly huge CSV, pick text_cols, optionally group_by, choose mode="concat"|"separate", and write text_id,text (plus extras). With grouping, group_count records how many pieces contributed. CSV mode supports external grouping via on-disk bucket partitioning, so you do not need to pre-sort giant files.
- txt_folder_to_analysis_ready_csv(...) – scan a directory of .txt files and emit one row per file; choose how to derive text_id (stem, name, or path), and include source_path if you want provenance columns.
Key ideas

- Modes:
  - concat joins multiple text columns (or rows within a group) using joiner.
  - separate emits one row per text column and fills source_col.
- Grouping at scale: Two-pass, external hash partitioning with an LRU of open writers; tune with num_buckets and max_open_bucket_files.
- Defaults: Encoding tolerant of Excel (utf-8-sig), delimiter sniffing if you do not specify one, don’t-overwrite-unless-asked.
Python (CSV source → per-speaker text)
from taters.helpers.text_gather import csv_to_analysis_ready_csv
analysis_csv = csv_to_analysis_ready_csv(
csv_path="transcripts/session.csv",
text_cols=["text"],
id_cols=["speaker"], # optional; composes text_id when not grouping
group_by=["speaker"], # aggregate all utterances by speaker
mode="concat",
delimiter=",", # sniffed if None
joiner=" ",
num_buckets=1024, # scale knobs for big files
max_open_bucket_files=64,
)
print("Wrote:", analysis_csv)
Python (folder of .txt)
from taters.helpers.text_gather import txt_folder_to_analysis_ready_csv
analysis_csv = txt_folder_to_analysis_ready_csv(
root_dir="notes/",
recursive=True,
id_from="path",
include_source_path=True,
)
CLI
# CSV mode (repeat --text-col and --group-by as needed)
python -m taters.helpers.text_gather \
--csv transcripts/session.csv \
--text-col text --group-by speaker \
--delimiter "," --overwrite_existing false
# TXT folder mode
python -m taters.helpers.text_gather \
--txt-dir corpus/ --recursive --id-from path
API: csv_to_analysis_ready_csv¶
Stream a (possibly huge) CSV into a compact analysis-ready CSV with a stable schema and optional external grouping.
Output schema
Always writes a header and enforces a consistent column order:
• No grouping: text_id,text (plus source_col if mode="separate")
• With grouping: text_id,text,group_count (plus source_col if mode="separate")
Where:
- text_id is either the composed ID from id_cols, or row_<n> when id_cols=None.
- mode="concat" joins all text_cols using joiner per row or group.
- mode="separate" emits one row per (row_or_group, text_col) and fills source_col with the contributing column name.
Grouping at scale
If group_by is provided, the function performs a two-pass external grouping that does not require presorting:
1) Hash-partition rows to on-disk “bucket” CSVs (bounded writers with LRU).
2) Aggregate each bucket into final rows (concat or separate mode), writing group_count to record how many pieces contributed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
csv_path | PathLike | Source CSV containing at least the columns named in text_cols (and any id_cols / group_by columns). | required |
out_csv | PathLike or None | Destination CSV. If None, a default output path is derived. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
text_cols | Sequence[str] | One or more text fields to concatenate or emit separately. | required |
id_cols | Sequence[str] or None | Optional columns used to compose text_id when not grouping. | None |
mode | str | "concat" joins the text columns per row or group; "separate" emits one row per text column and fills source_col. | 'concat' |
group_by | Sequence[str] or None | Optional list of columns to aggregate by; works on unsorted CSVs. | None |
delimiter | str or None | CSV delimiter. If None, the delimiter is sniffed from the file. | None |
encoding | str or None | Text encoding. If None, an Excel-tolerant default (utf-8-sig) is used. | None |
joiner | str or None | String used to join text pieces in concat mode. If None, a default joiner is used. | None |
num_buckets | int | External grouping: number of on-disk bucket partitions. | 1024 |
max_open_bucket_files | int | External grouping: maximum number of simultaneously open bucket writers (LRU). | 1024 |
tmp_root | PathLike or None | External grouping: root directory for temporary bucket files. | None |
Returns:
Type | Description |
---|---|
Path | Path to the analysis-ready CSV. |
Raises:
Type | Description |
---|---|
ValueError | If required columns are missing or an unsupported mode is given. |
Examples:
Concatenate two text fields per row:
>>> csv_to_analysis_ready_csv(
... csv_path="transcripts.csv",
... text_cols=["prompt","response"],
... id_cols=["speaker"],
... )
Group by speaker and join rows:
>>> csv_to_analysis_ready_csv(
... csv_path="transcripts.csv",
... text_cols=["text"],
... group_by=["speaker"],
... )
Source code in src\taters\helpers\text_gather.py
API: txt_folder_to_analysis_ready_csv¶
Stream a folder of .txt files into an analysis-ready CSV with predictable, reproducible IDs.
For each file matching pattern, the emitted row contains:
- text_id: the basename (stem), full filename, or relative path (see id_from),
- text: the file contents, and
- source_path: optional column with the path relative to root_dir.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing .txt files. | required |
out_csv | PathLike or None | Destination CSV. If None, a default output path is derived. | None |
recursive | bool | Recurse into subfolders. | False |
pattern | str | Glob for matching text files. | '*.txt' |
encoding | str | File decoding. | 'utf-8' |
id_from | str | How to derive text_id: "stem", "name", or "path". | 'stem' |
include_source_path | bool | If True, add a source_path column with the path relative to root_dir. | True |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the analysis-ready CSV. |
Examples:
>>> txt_folder_to_analysis_ready_csv(root_dir="notes", recursive=True, id_from="path")
Source code in src\taters\helpers\text_gather.py
Feature gatherer: concatenate or aggregate many CSVs¶
Once you have lots of per-file outputs (e.g., embeddings per segment, dictionary scores per file), these helpers build a single dataset for modeling or visualization.
- gather_csvs_to_one(...) – just stack CSVs, inserting a leading source column (and optionally source_path). Output defaults to <root_dir.name>.csv next to the folder.
- feature_gather(...) – a single entry point that either concatenates (default) or performs aggregation if aggregate=True. You can pass an AggregationPlan or let it build one from quick arguments like group_by, stats, and per_file.
- aggregate_features(...) – discover, filter, coerce to numeric, group, and compute statistics; output columns are flattened like feature__mean. Group keys lead the output.
What aggregation means here

- Choose group keys (e.g., ["speaker"]).
- Decide if grouping is per file (per_file=True adds the source key) or across all files.
- Optionally filter columns before numeric selection using exclude_cols, include_regex, and exclude_regex. Only numeric columns (after coercion) are aggregated.
- Compute stats per numeric feature (mean, std, median, etc.).
Python (concatenate only)
from taters.helpers.feature_gather import gather_csvs_to_one
out = gather_csvs_to_one(
root_dir="features/whisper-embeddings",
pattern="*.csv",
recursive=True,
add_source_path=False,
)
print("Merged:", out)
Python (aggregate per speaker within each file)
from taters.helpers.feature_gather import feature_gather
agg = feature_gather(
root_dir="features/sentence-embeddings",
aggregate=True,
group_by=["speaker"], # quick-plan keys
per_file=True, # include 'source' in group keys
stats=("mean","std"), # compute per numeric column
exclude_cols=("start_time","end_time","text"), # drop non-features first
include_regex=None, # or narrow to specific features
out_csv=None, # defaults to ./features/sentence-embeddings.csv
)
print("Aggregated:", agg)
Python (explicit plan)
from taters.helpers.feature_gather import make_plan, aggregate_features
plan = make_plan(
group_by=["speaker"],
per_file=False, # aggregate across files
stats=("mean","std","median"),
exclude_cols=("text",),
include_regex=r"^e\d+$", # only columns like e0,e1,...
)
out = aggregate_features(
root_dir="features/sentence-embeddings",
plan=plan,
)
CLI
# Concatenate
python -m taters.helpers.feature_gather gather \
--root_dir features/archetypes --pattern "*.csv"
# Aggregate (per file, by speaker)
python -m taters.helpers.feature_gather aggregate \
--root_dir features/sentence-embeddings \
--group-by speaker --per-file --stats mean std \
--exclude-cols start_time end_time text
Notes and tips

- Outputs will not be overwritten unless you pass overwrite_existing=True.
- For aggregation, if no numeric columns remain after filtering/coercion, you will get a clear error; adjust filters or check inputs.
- add_source_path=True is great for audits; otherwise keep datasets lean.
API: feature_gather (single entry)¶
Single entry point to concatenate or aggregate feature CSVs from one folder.
If aggregate=False, CSVs are concatenated with origin metadata (see gather_csvs_to_one). If aggregate=True, numeric feature columns are aggregated per the provided or constructed plan (see aggregate_features).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing per-item CSVs (or a single CSV file). | required |
pattern | str | Glob pattern for selecting CSV files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding. | "utf-8-sig" |
add_source_path | bool | If True, include a source_path column with the absolute input path. | False |
aggregate | bool | Toggle aggregation mode. If False, files are concatenated. | False |
plan | AggregationPlan or None | Explicit plan for aggregation; if omitted, one is built from the quick-plan arguments below. | None |
group_by | Sequence[str] or None | Quick-plan keys. Used only when no explicit plan is given. | None |
per_file | bool | Quick-plan flag; include source in the group keys to aggregate within each file. | True |
stats | Sequence[str] | Quick-plan statistics to compute per numeric column. | ("mean", "std") |
exclude_cols | Sequence[str] | Quick-plan columns to drop before numeric selection. | () |
include_regex | str or None | Quick-plan regex to include feature columns by name. | None |
exclude_regex | str or None | Quick-plan regex to exclude feature columns by name. | None |
dropna | bool | Quick-plan NA handling for group keys. | True |
out_csv | PathLike or None | Output CSV path. If None, defaults to <root_dir.name>.csv next to root_dir. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the resulting CSV. |
Raises:
Type | Description |
---|---|
ValueError | If aggregation is requested but neither plan nor quick-plan group_by is provided. |
See Also
gather_csvs_to_one : Concatenate CSVs with origin metadata.
aggregate_features : Aggregate numeric columns according to a plan.
Source code in src\taters\helpers\feature_gather.py
API: gather_csvs_to_one¶
Concatenate many CSVs into a single CSV with origin metadata.
Each input CSV is loaded (all columns as object dtype), a leading "source" column is inserted (and optionally "source_path"), and rows are appended. The final CSV ensures "source" (and, if present, "source_path") lead the column order.
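For example, if two per-file CSVs each have the columns speaker,e0,e1, the merged output looks roughly like this (file names and values are hypothetical; the source value reflects the originating file):
source,speaker,e0,e1
session_01,alice,0.12,0.88
session_01,bob,0.05,0.91
session_02,alice,0.33,0.47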
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing CSVs, or a single CSV file. | required |
pattern | str | Glob pattern for selecting files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding for read/write. | "utf-8-sig" |
add_source_path | bool | If True, include the absolute input path in a source_path column. | False |
out_csv | PathLike or None | Output path. If None, defaults to <root_dir.name>.csv next to root_dir. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the written CSV. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If no files match the pattern under root_dir. |
RuntimeError | If files were found but none could be read successfully. |
Notes
Input rows are not type-coerced beyond object dtype. Column order from inputs is preserved after the leading origin columns.
Source code in src\taters\helpers\feature_gather.py
API: aggregate_features¶
Discover files, read, concatenate, and aggregate numeric columns per plan.
This function consolidates CSVs from a single folder, filters columns,
coerces candidate features to numeric, groups by the specified keys,
and computes the requested statistics. Output columns for aggregated features are flattened with the pattern "{column}__{stat}".
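As an illustration, grouping by speaker with stats=("mean", "std") over numeric columns named e0 and e1 (hypothetical names) yields a header shaped like:
source,speaker,e0__mean,e0__std,e1__mean,e1__std
where source leads only when per-file grouping places it among the group keys.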
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root_dir | PathLike | Folder containing per-item CSVs, or a single CSV file. | required |
pattern | str | Glob pattern for selecting files. | "*.csv" |
recursive | bool | Recurse into subdirectories when True. | True |
delimiter | str | CSV delimiter. | "," |
encoding | str | CSV encoding for read/write. | "utf-8-sig" |
add_source_path | bool | If True, include the absolute input path in a source_path column. | False |
plan | AggregationPlan | Aggregation configuration (group keys, stats, filters, NA handling). | required |
out_csv | PathLike or None | Output path. If None, defaults to <root_dir.name>.csv next to root_dir. | None |
overwrite_existing | bool | If False and the output already exists, it is not overwritten. | False |
Returns:
Type | Description |
---|---|
Path | Path to the written CSV of aggregated features. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If no files match the pattern under root_dir. |
RuntimeError | If files were found but none could be read successfully. |
ValueError | If required group-by columns are missing, if no numeric columns remain after filtering, or if per-file grouping is requested but the source column is absent. |
Notes
Group keys are preserved as leading columns in the output. The output places "source" (and optionally "source_path") first when present.
Source code in src\taters\helpers\feature_gather.py
API: make_plan / AggregationPlan¶
Create an AggregationPlan from simple arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group_by | Sequence[str] | Grouping key(s) to use (e.g., ["speaker"]). | required |
per_file | bool | If True, group within files by including source as an additional key. | True |
stats | Sequence[str] | Statistical reductions to compute per numeric column. | ("mean", "std") |
exclude_cols | Sequence[str] | Columns to drop prior to feature selection. | () |
include_regex | str or None | Regex to include feature columns by name. | None |
exclude_regex | str or None | Regex to exclude feature columns by name. | None |
dropna | bool | Drop rows with NA in any group key. | True |
Returns:
Type | Description |
---|---|
AggregationPlan | A configured plan instance for aggregate_features. |
Source code in src\taters\helpers\feature_gather.py
Plan describing how numeric feature columns should be aggregated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group_by | Sequence[str] | One or more column names used as grouping keys (e.g., ["speaker"]). | required |
per_file | bool | If True, include source as an additional group key so aggregation happens within each file. | True |
stats | Sequence[str] | Statistical reductions to compute for each numeric feature column (e.g., "mean", "std", "median"). | ("mean", "std") |
exclude_cols | Sequence[str] | Columns to drop before filtering/selecting numeric features (e.g., timestamps or free text). | () |
include_regex | str or None | Optional regex; if provided, only columns matching this pattern are kept (after exclude_cols is applied). | None |
exclude_regex | str or None | Optional regex; if provided, columns matching this pattern are removed (after applying include_regex). | None |
dropna | bool | Whether to drop rows with NA in any of the group-by keys before grouping. | True |
Notes
This plan is consumed by aggregate_features. Column filtering happens before numeric selection; only columns that remain and can be coerced to numeric will be aggregated.
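If you prefer to build the plan directly rather than through make_plan, the fields documented above suggest a construction like the following. This is a sketch; it assumes AggregationPlan accepts these keyword arguments, which is what make_plan fills in for you:
>>> from taters.helpers.feature_gather import AggregationPlan, aggregate_features
>>> plan = AggregationPlan(
...     group_by=["speaker"],
...     per_file=False,
...     stats=("mean", "std", "median"),
...     exclude_cols=("text",),
...     include_regex=r"^e\d+$",
... )
>>> aggregate_features(root_dir="features/sentence-embeddings", plan=plan)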
Practical patterns¶
- Prepare text for analysis: Use csv_to_analysis_ready_csv to gather and optionally group text (per speaker, per session), then feed into dictionary/archetype or embedding steps.
- Unify features across runs: After extracting features for many files, run feature_gather to produce a single modeling table; add per-file grouping for summary statistics (see the sketch below).
- Curate inputs up front: Use find_files with include_globs/exclude_globs and ffprobe_verify to build clean file lists for pipelines.
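To tie these patterns together, here is a minimal end-to-end sketch. The transcript path, the "speaker"/"text" column names, and the features/sentence-embeddings folder are assumptions for the example, not fixed conventions of the library; the feature-extraction step in the middle is outside the helpers and is only indicated by a comment.
Python (end-to-end sketch)
from taters.helpers.text_gather import csv_to_analysis_ready_csv
from taters.helpers.feature_gather import feature_gather

# 1) Prepare text: group a transcript by speaker into an analysis-ready CSV
ready_csv = csv_to_analysis_ready_csv(
    csv_path="transcripts/session.csv",   # hypothetical transcript
    text_cols=["text"],
    group_by=["speaker"],
    mode="concat",
)

# 2) ...run a feature extractor over ready_csv, writing per-file CSVs
#    into features/sentence-embeddings/ (not shown here)...

# 3) Unify features: one modeling table, aggregated per speaker within each file
table = feature_gather(
    root_dir="features/sentence-embeddings",   # hypothetical folder of per-file CSVs
    aggregate=True,
    group_by=["speaker"],
    per_file=True,
    stats=("mean", "std"),
)
print("Modeling table:", table)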
This guide keeps things high level and practical; the API sections above have every parameter if you want to customize further.