Text Modules¶

taters.text.analyze_with_archetypes ¶

analyze_with_archetypes ¶

analyze_with_archetypes(
    *,
    csv_path=None,
    txt_dir=None,
    analysis_csv=None,
    out_features_csv=None,
    overwrite_existing=False,
    archetype_csvs,
    encoding="utf-8-sig",
    delimiter=",",
    text_cols=("text",),
    id_cols=None,
    mode="concat",
    group_by=None,
    joiner=" ",
    num_buckets=512,
    max_open_bucket_files=64,
    tmp_root=None,
    recursive=True,
    pattern="*.txt",
    id_from="stem",
    include_source_path=True,
    model_name="sentence-transformers/all-roberta-large-v1",
    mean_center_vectors=True,
    fisher_z_transform=False,
    rounding=4
)

Compute archetype scores for text rows and write a wide, analysis-ready features CSV.

This function supports three input modes:

analysis_csv — Use a prebuilt CSV with exactly two columns: text_id and text.
csv_path — Gather text from an arbitrary CSV by specifying text_cols (and optionally id_cols and group_by) to construct an analysis-ready CSV on the fly.
txt_dir — Gather text from a folder of .txt files.

Archetype scoring is delegated to a middle layer that embeds text with a Sentence-Transformers model and evaluates cosine similarity to one or more archetype CSVs. If out_features_csv is omitted, the default path is ./features/archetypes/<analysis_ready_filename>.

Parameters:

Name	Type	Description	Default
`csv_path`	`str or Path`	Source CSV for gathering. Mutually exclusive with `txt_dir` and `analysis_csv`.	`None`
`txt_dir`	`str or Path`	Folder of `.txt` files to gather from. Mutually exclusive with the other input modes.	`None`
`analysis_csv`	`str or Path`	Precomputed analysis-ready CSV containing exactly the columns `text_id` and `text`.	`None`
`out_features_csv`	`str or Path`	Output path for the features CSV. If `None`, defaults to `./features/archetypes/<analysis_ready_filename>`.	`None`
`overwrite_existing`	`bool`	If `False` and the output file already exists, skip recomputation and return the existing path.	`False`
`archetype_csvs`	`Sequence[str or Path]`	One or more archetype CSVs (name → seed phrases). Directories are allowed and expanded recursively to all `.csv` files.	required
`encoding`	`str`	Text encoding for CSV I/O.	`"utf-8-sig"`
`delimiter`	`str`	Field delimiter for CSV I/O.	`","`
`text_cols`	`Sequence[str]`	When gathering from a CSV: column(s) that contain text. Used only if `csv_path` is provided.	`("text",)`
`id_cols`	`Sequence[str]`	When gathering from a CSV: optional ID columns to carry into grouping (e.g., `["speaker"]`).	`None`
`mode`	`(concat, separate)`	Gathering behavior when multiple `text_cols` are provided. `"concat"` joins into a single text field; `"separate"` creates one row per text column.	`"concat"`
`group_by`	`Sequence[str]`	Optional grouping keys used during gathering (e.g., `["speaker"]`). In `"concat"` mode, members are concatenated into one row per group.	`None`
`joiner`	`str`	Separator used when concatenating multiple text chunks.	`" "`
`num_buckets`	`int`	Number of temporary hash buckets used for scalable CSV gathering.	`512`
`max_open_bucket_files`	`int`	Maximum number of bucket files to keep open concurrently during gathering.	`64`
`tmp_root`	`str or Path`	Root directory for temporary files used by gathering.	`None`
`recursive`	`bool`	When gathering from a text folder, whether to recurse into subdirectories.	`True`
`pattern`	`str`	Filename glob used when gathering from a text folder.	`"*.txt"`
`id_from`	`(stem, name, path)`	How to derive the `text_id` when gathering from a text folder.	`"stem"`
`include_source_path`	`bool`	Whether to include the absolute source path as an additional column when gathering from a text folder.	`True`
`model_name`	`str`	Sentence-Transformers model used to embed text for archetype scoring.	`"sentence-transformers/all-roberta-large-v1"`
`mean_center_vectors`	`bool`	If `True`, mean-center embedding vectors prior to scoring.	`True`
`fisher_z_transform`	`bool`	If `True`, apply the Fisher z-transform to correlations.	`False`
`rounding`	`int`	Number of decimal places to round numeric outputs. Use `None` to disable rounding.	`4`

Returns:

Type	Description
`Path`	Path to the written features CSV.

Raises:

Type	Description
`FileNotFoundError`	If an input file or folder does not exist, or an archetype CSV path is invalid.
`ValueError`	If required arguments are incompatible or missing (e.g., no input mode chosen), or if the analysis-ready CSV lacks `text_id`/`text` columns.

Examples:

Run on a transcript CSV, grouped by speaker:

>>> analyze_with_archetypes(
...     csv_path="transcripts/session.csv",
...     text_cols=["text"],
...     id_cols=["speaker"],
...     group_by=["speaker"],
...     archetype_csvs=["dictionaries/archetypes"],
...     model_name="sentence-transformers/all-roberta-large-v1",
... )
PosixPath('.../features/archetypes/session.csv')

Notes

If out_features_csv exists and overwrite_existing=False, the existing path is returned without recomputation. Directories passed in archetype_csvs are expanded recursively to all .csv files and deduplicated before scoring.

Source code in src\taters\text\analyze_with_archetypes.py

def analyze_with_archetypes(
    *,
    # ----- Input source (choose exactly one, OR pass analysis_csv to skip gathering) -----
    csv_path: Optional[Union[str, Path]] = None,
    txt_dir: Optional[Union[str, Path]] = None,
    analysis_csv: Optional[Union[str, Path]] = None,   # <- NEW: skip gathering if provided

    # ----- Output -----
    out_features_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # ----- Archetype CSVs (one or more) -----
    archetype_csvs: Sequence[Union[str, Path]],

    # ====== SHARED I/O OPTIONS ======
    encoding: str = "utf-8-sig",
    delimiter: str = ",",

    # ====== CSV GATHER OPTIONS (when csv_path is provided) ======
    text_cols: Sequence[str] = ("text",),
    id_cols: Optional[Sequence[str]] = None,
    mode: Literal["concat", "separate"] = "concat",
    group_by: Optional[Sequence[str]] = None,
    joiner: str = " ",
    num_buckets: int = 512,
    max_open_bucket_files: int = 64,
    tmp_root: Optional[Union[str, Path]] = None,

    # ====== TXT FOLDER GATHER OPTIONS (when txt_dir is provided) ======
    recursive: bool = True,
    pattern: str = "*.txt",
    id_from: Literal["stem", "name", "path"] = "stem",
    include_source_path: bool = True,

    # ====== Archetyper scoring options ======
    model_name: str = "sentence-transformers/all-roberta-large-v1",
    mean_center_vectors: bool = True,
    fisher_z_transform: bool = False,
    rounding: int = 4,
) -> Path:
    """
    Compute archetype scores for text rows and write a wide, analysis-ready features CSV.

    This function supports three input modes:

    1. ``analysis_csv`` — Use a prebuilt CSV with exactly two columns: ``text_id`` and ``text``.
    2. ``csv_path`` — Gather text from an arbitrary CSV by specifying ``text_cols`` (and optionally
    ``id_cols`` and ``group_by``) to construct an analysis-ready CSV on the fly.
    3. ``txt_dir`` — Gather text from a folder of ``.txt`` files.

    Archetype scoring is delegated to a middle layer that embeds text with a Sentence-Transformers
    model and evaluates cosine similarity to one or more archetype CSVs. If ``out_features_csv`` is
    omitted, the default path is ``./features/archetypes/<analysis_ready_filename>``.

    Parameters
    ----------
    csv_path : str or pathlib.Path, optional
        Source CSV for gathering. Mutually exclusive with ``txt_dir`` and ``analysis_csv``.
    txt_dir : str or pathlib.Path, optional
        Folder of ``.txt`` files to gather from. Mutually exclusive with the other input modes.
    analysis_csv : str or pathlib.Path, optional
        Precomputed analysis-ready CSV containing exactly the columns ``text_id`` and ``text``.
    out_features_csv : str or pathlib.Path, optional
        Output path for the features CSV. If ``None``, defaults to
        ``./features/archetypes/<analysis_ready_filename>``.
    overwrite_existing : bool, default=False
        If ``False`` and the output file already exists, skip recomputation and return the existing path.
    archetype_csvs : Sequence[str or pathlib.Path]
        One or more archetype CSVs (name → seed phrases). Directories are allowed and expanded
        recursively to all ``.csv`` files.
    encoding : str, default="utf-8-sig"
        Text encoding for CSV I/O.
    delimiter : str, default=","
        Field delimiter for CSV I/O.
    text_cols : Sequence[str], default=("text",)
        When gathering from a CSV: column(s) that contain text. Used only if ``csv_path`` is provided.
    id_cols : Sequence[str], optional
        When gathering from a CSV: optional ID columns to carry into grouping (e.g., ``["speaker"]``).
    mode : {"concat", "separate"}, default="concat"
        Gathering behavior when multiple ``text_cols`` are provided. ``"concat"`` joins into a single
        text field; ``"separate"`` creates one row per text column.
    group_by : Sequence[str], optional
        Optional grouping keys used during gathering (e.g., ``["speaker"]``). In ``"concat"`` mode,
        members are concatenated into one row per group.
    joiner : str, default=" "
        Separator used when concatenating multiple text chunks.
    num_buckets : int, default=512
        Number of temporary hash buckets used for scalable CSV gathering.
    max_open_bucket_files : int, default=64
        Maximum number of bucket files to keep open concurrently during gathering.
    tmp_root : str or pathlib.Path, optional
        Root directory for temporary files used by gathering.
    recursive : bool, default=True
        When gathering from a text folder, whether to recurse into subdirectories.
    pattern : str, default="*.txt"
        Filename glob used when gathering from a text folder.
    id_from : {"stem", "name", "path"}, default="stem"
        How to derive the ``text_id`` when gathering from a text folder.
    include_source_path : bool, default=True
        Whether to include the absolute source path as an additional column when gathering from a text folder.
    model_name : str, default="sentence-transformers/all-roberta-large-v1"
        Sentence-Transformers model used to embed text for archetype scoring.
    mean_center_vectors : bool, default=True
        If ``True``, mean-center embedding vectors prior to scoring.
    fisher_z_transform : bool, default=False
        If ``True``, apply the Fisher z-transform to correlations.
    rounding : int, default=4
        Number of decimal places to round numeric outputs. Use ``None`` to disable rounding.

    Returns
    -------
    pathlib.Path
        Path to the written features CSV.

    Raises
    ------
    FileNotFoundError
        If an input file or folder does not exist, or an archetype CSV path is invalid.
    ValueError
        If required arguments are incompatible or missing (e.g., no input mode chosen),
        or if the analysis-ready CSV lacks ``text_id``/``text`` columns.

    Examples
    --------
    Run on a transcript CSV, grouped by speaker:

    >>> analyze_with_archetypes(
    ...     csv_path="transcripts/session.csv",
    ...     text_cols=["text"],
    ...     id_cols=["speaker"],
    ...     group_by=["speaker"],
    ...     archetype_csvs=["dictionaries/archetypes"],
    ...     model_name="sentence-transformers/all-roberta-large-v1",
    ... )
    PosixPath('.../features/archetypes/session.csv')

    Notes
    -----
    If ``out_features_csv`` exists and ``overwrite_existing=False``, the existing path is returned
    without recomputation. Directories passed in ``archetype_csvs`` are expanded recursively to
    all ``.csv`` files and deduplicated before scoring.
    """


    # 1) Use analysis-ready CSV if given; otherwise gather from csv_path or txt_dir
    if analysis_csv is not None:
        analysis_ready = Path(analysis_csv)
        if not analysis_ready.exists():
            raise FileNotFoundError(f"analysis_csv not found: {analysis_ready}")
    else:
        if (csv_path is None) == (txt_dir is None):
            raise ValueError("Provide exactly one of csv_path or txt_dir (or pass analysis_csv).")
        if csv_path is not None:
            analysis_ready = Path(
                csv_to_analysis_ready_csv(
                    csv_path=csv_path,
                    text_cols=list(text_cols),
                    id_cols=list(id_cols) if id_cols else None,
                    mode=mode,
                    group_by=list(group_by) if group_by else None,
                    delimiter=delimiter,
                    encoding=encoding,
                    joiner=joiner,
                    num_buckets=num_buckets,
                    max_open_bucket_files=max_open_bucket_files,
                    tmp_root=tmp_root,
                )
            )
        else:
            analysis_ready = Path(
                txt_folder_to_analysis_ready_csv(
                    root_dir=txt_dir,
                    recursive=recursive,
                    pattern=pattern,
                    encoding=encoding,
                    id_from=id_from,
                    include_source_path=include_source_path,
                )
            )

    # 1b) Decide default features path if not provided:
    #     <analysis_ready_dir>/features/archetypes/<analysis_ready_filename>
    if out_features_csv is None:
        out_features_csv = Path.cwd() / "features" / "archetypes" / analysis_ready.name
    out_features_csv = Path(out_features_csv)
    out_features_csv.parent.mkdir(parents=True, exist_ok=True)

    if not overwrite_existing and Path(out_features_csv).is_file():
        print("Archetypes output file already exists; returning existing file.")
        return out_features_csv


    # 2) Resolve/validate archetype CSVs
    # Allow passing either:
    #   • one or more CSV files, or
    #   • one or more directories containing CSVs (recursively).
    #
    # We lean on the shared find_files helper to avoid redundancy.

    # 2) Resolve/validate archetype CSVs
    resolved_archetype_csvs: list[Path] = []

    for src in archetype_csvs:
        src_path = Path(src)
        if src_path.is_dir():
            # find all *.csv under this folder (recursive)
            found = find_files(
                root_dir=src_path,
                extensions=[".csv"],
                recursive=True,
                absolute=True,
                sort=True,
            )
            resolved_archetype_csvs.extend(Path(f) for f in found)
        else:
            resolved_archetype_csvs.append(src_path)

    # De-dup, normalize, and sort
    archetype_csvs = sorted({p.resolve() for p in resolved_archetype_csvs})

    if not archetype_csvs:
        raise ValueError(
            "No archetype CSVs found. Pass one or more CSV files, or a directory containing CSV files with your archetypes."
        )
    for p in archetype_csvs:
        if not p.exists():
            raise FileNotFoundError(f"Archetype CSV not found: {p}")



        # 3) Stream (text_id, text, meta) → middle layer → features CSV
    def _iter_items_from_csv_with_meta(
        path: Path,
        *,
        id_col: str = "text_id",
        text_col: str = "text",
        wanted: Optional[Sequence[str]] = None,
    ) -> Iterable[Tuple[str, str, dict]]:
        """
        Stream (text_id, text, meta) from an analysis-ready CSV.

        Enforces that all requested `wanted` columns exist (fail fast).
        """
        wanted = list(wanted or [])
        with path.open("r", newline="", encoding=encoding) as f:
            reader = csv.DictReader(f, delimiter=delimiter)
            fields = reader.fieldnames or []
            if id_col not in fields or text_col not in fields:
                raise ValueError(
                    f"Expected columns '{id_col}' and '{text_col}' in {path}; found {fields}"
                )
            missing = [c for c in wanted if c not in fields]
            if missing:
                raise ValueError(
                    f"Requested id_cols not present in analysis-ready CSV {path}: {missing}"
                )

            for row in reader:
                tid = str(row.get(id_col, "") or "")
                txt = str(row.get(text_col, "") or "")
                meta = {c: str(row.get(c, "") or "") for c in wanted}
                yield tid, txt, meta

    maa.analyze_texts_to_csv(
        items=_iter_items_from_csv_with_meta(analysis_ready, wanted=id_cols or []),
        archetype_csvs=archetype_csvs,
        out_csv=out_features_csv,
        model_name=model_name,
        mean_center_vectors=mean_center_vectors,
        fisher_z_transform=fisher_z_transform,
        rounding=rounding,
        encoding=encoding,
        delimiter=delimiter,
        id_col_name="text_id",
        pass_through_cols=list(id_cols or []),  # ← inject id_cols right after text_id
    )

    return out_features_csv

main ¶

main()

Command-line entry point for archetype scoring.

Parses arguments using :func:_build_arg_parser, normalizes list-like defaults, invokes :func:analyze_with_archetypes, and prints the resulting output path.

Notes

This function is executed when the module is run as a script:

python -m taters.text.analyze_with_archetypes             --analysis-csv transcripts/X/X.csv             --archetype dictionaries/archetypes             --model-name sentence-transformers/all-roberta-large-v1

Source code in src\taters\text\analyze_with_archetypes.py

def main():
    """
    Command-line entry point for archetype scoring.

    Parses arguments using :func:`_build_arg_parser`, normalizes list-like defaults,
    invokes :func:`analyze_with_archetypes`, and prints the resulting output path.

    Notes
    -----
    This function is executed when the module is run as a script:

        python -m taters.text.analyze_with_archetypes \
            --analysis-csv transcripts/X/X.csv \
            --archetype dictionaries/archetypes \
            --model-name sentence-transformers/all-roberta-large-v1
    """

    args = _build_arg_parser().parse_args()

    # Defaults for list-ish args
    text_cols = args.text_cols if args.text_cols else ["text"]
    id_cols = args.id_cols if args.id_cols else None
    group_by = args.group_by if args.group_by else None

    out = analyze_with_archetypes(
        csv_path=args.csv_path,
        txt_dir=args.txt_dir,
        analysis_csv=args.analysis_csv,
        out_features_csv=args.out_features_csv,
        overwrite_existing=args.overwrite_existing,
        archetype_csvs=args.archetype_csvs,
        encoding=args.encoding,
        delimiter=args.delimiter,
        text_cols=text_cols,
        id_cols=id_cols,
        mode=args.mode,
        group_by=group_by,
        joiner=args.joiner,
        num_buckets=args.num_buckets,
        max_open_bucket_files=args.max_open_bucket_files,
        tmp_root=args.tmp_root,
        recursive=args.recursive,
        pattern=args.pattern,
        id_from=args.id_from,
        include_source_path=args.include_source_path,
        model_name=args.model_name,
        mean_center_vectors=args.mean_center_vectors,
        fisher_z_transform=args.fisher_z_transform,
        rounding=args.rounding,
    )
    print(str(out))

taters.text.analyze_with_dictionaries ¶

analyze_with_dictionaries ¶

analyze_with_dictionaries(
    *,
    csv_path=None,
    txt_dir=None,
    analysis_csv=None,
    out_features_csv=None,
    overwrite_existing=False,
    dict_paths,
    encoding="utf-8-sig",
    text_cols=("text",),
    id_cols=None,
    mode="concat",
    group_by=None,
    delimiter=",",
    joiner=" ",
    num_buckets=512,
    max_open_bucket_files=64,
    tmp_root=None,
    recursive=True,
    pattern="*.txt",
    id_from="stem",
    include_source_path=True,
    relative_freq=True,
    drop_punct=True,
    rounding=4,
    retain_captures=False,
    wildcard_mem=True
)

Compute LIWC-style dictionary features for text rows and write a wide features CSV.

The function supports exactly one of three input modes:

analysis_csv — Use a prebuilt file with columns text_id and text.
csv_path — Gather text from an arbitrary CSV using text_cols (and optional id_cols/group_by) to produce an analysis-ready file.
txt_dir — Gather text from a folder of .txt files.

If out_features_csv is omitted, the default output path is ./features/dictionary/<analysis_ready_filename>. Multiple dictionaries are supported; passing a directory discovers all .dic, .dicx, and .csv dictionary files recursively in a stable order. Global columns (e.g., word counts, punctuation) are emitted once (from the first dictionary) and each dictionary contributes a namespaced block.

Parameters:

Name	Type	Description	Default
`csv_path`	`str or Path`	Source CSV to gather from. Mutually exclusive with `txt_dir` and `analysis_csv`.	`None`
`txt_dir`	`str or Path`	Folder containing `.txt` files to gather from. Mutually exclusive with other modes.	`None`
`analysis_csv`	`str or Path`	Prebuilt analysis-ready CSV with exactly two columns: `text_id` and `text`.	`None`
`out_features_csv`	`str or Path`	Output file path. If `None`, defaults to `./features/dictionary/<analysis_ready_filename>`.	`None`
`overwrite_existing`	`bool`	If `False` and the output file already exists, skip processing and return the path.	`False`
`dict_paths`	`Sequence[str or Path]`	One or more dictionary inputs (files or directories). Supported extensions: `.dic`, `.dicx`, `.csv`. Directories are expanded recursively.	required
`encoding`	`str`	Text encoding used for reading/writing CSV files.	`"utf-8-sig"`
`text_cols`	`Sequence[str]`	When gathering from a CSV, name(s) of the column(s) containing text.	`("text",)`
`id_cols`	`Sequence[str] or None`	Optional ID columns to carry into grouping when gathering from CSV.	`None`
`mode`	`(concat, separate)`	Gathering behavior when multiple text columns are provided. `"concat"` joins them into one text field using `joiner`; `"separate"` creates one row per column.	`"concat"`
`group_by`	`Sequence[str] or None`	Optional grouping keys used during CSV gathering (e.g., `["speaker"]`).	`None`
`delimiter`	`str`	Delimiter for reading/writing CSV files.	`","`
`joiner`	`str`	Separator used when concatenating multiple text chunks in `"concat"` mode.	`" "`
`num_buckets`	`int`	Number of temporary hash buckets used during scalable CSV gathering.	`512`
`max_open_bucket_files`	`int`	Maximum number of bucket files kept open concurrently during gathering.	`64`
`tmp_root`	`str or Path or None`	Root directory for temporary gathering artifacts.	`None`
`recursive`	`bool`	When gathering from a text folder, recurse into subdirectories.	`True`
`pattern`	`str`	Glob pattern for selecting text files when gathering from a folder.	`"*.txt"`
`id_from`	`(stem, name, path)`	How to derive `text_id` for gathered `.txt` files.	`"stem"`
`include_source_path`	`bool`	If `True`, include the absolute source path as an additional column when gathering from a text folder.	`True`
`relative_freq`	`bool`	Emit relative frequencies instead of raw counts, when supported by the dictionary engine.	`True`
`drop_punct`	`bool`	Drop punctuation prior to analysis (dictionary-dependent).	`True`
`rounding`	`int`	Decimal places to round numeric outputs. Use `None` to disable rounding.	`4`
`retain_captures`	`bool`	Pass-through flag to the underlying analyzer to retain capture groups, if applicable.	`False`
`wildcard_mem`	`bool`	Pass-through optimization flag for wildcard handling in the analyzer.	`True`

Returns:

Type	Description
`Path`	Path to the written features CSV.

Raises:

Type	Description
`FileNotFoundError`	If input files/folders or any dictionary file cannot be found.
`ValueError`	If input modes are misconfigured (e.g., multiple sources provided or none), required columns are missing from the analysis-ready CSV, or unsupported dictionary extensions are encountered.

Examples:

Run on a transcript CSV, grouped by speaker:

>>> analyze_with_dictionaries(
...     csv_path="transcripts/session.csv",
...     text_cols=["text"], id_cols=["speaker"], group_by=["speaker"],
...     dict_paths=["dictionaries/liwc/LIWC-22 Dictionary (2022-01-27).dicx"]
... )
PosixPath('.../features/dictionary/session.csv')

Notes

If overwrite_existing is False and the output exists, the existing file path is returned without recomputation.

Source code in src\taters\text\analyze_with_dictionaries.py

def analyze_with_dictionaries(
    *,
    # ----- Input source (choose exactly one, or pass analysis_csv directly) -----
    csv_path: Optional[Union[str, Path]] = None,
    txt_dir: Optional[Union[str, Path]] = None,
    analysis_csv: Optional[Union[str, Path]] = None,  # if provided, gathering is skipped

    # ----- Output -----
    out_features_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # ----- Dictionaries -----
    dict_paths: Sequence[Union[str, Path]], # LIWC2007 (.dic) or LIWC-22 format (.dicx, .csv)

    # ====== SHARED I/O OPTIONS ======
    encoding: str = "utf-8-sig",

    # ====== CSV GATHER OPTIONS ======
    # Only used when csv_path is provided
    text_cols: Sequence[str] = ("text",),
    id_cols: Optional[Sequence[str]] = None,
    mode: Literal["concat", "separate"] = "concat",
    group_by: Optional[Sequence[str]] = None,
    delimiter: str = ",",
    joiner: str = " ",
    num_buckets: int = 512,
    max_open_bucket_files: int = 64,
    tmp_root: Optional[Union[str, Path]] = None,

    # ====== TXT FOLDER GATHER OPTIONS ======
    # Only used when txt_dir is provided
    recursive: bool = True,
    pattern: str = "*.txt",
    id_from: Literal["stem", "name", "path"] = "stem",
    include_source_path: bool = True,

    # ====== ANALYZER OPTIONS (passed through to ContentCoder) ======
    relative_freq: bool = True,
    drop_punct: bool = True,
    rounding: int = 4,
    retain_captures: bool = False,
    wildcard_mem: bool = True,
) -> Path:
    """
    Compute LIWC-style dictionary features for text rows and write a wide features CSV.

    The function supports exactly one of three input modes:

    1. ``analysis_csv`` — Use a prebuilt file with columns ``text_id`` and ``text``.
    2. ``csv_path`` — Gather text from an arbitrary CSV using ``text_cols`` (and optional
    ``id_cols``/``group_by``) to produce an analysis-ready file.
    3. ``txt_dir`` — Gather text from a folder of ``.txt`` files.

    If ``out_features_csv`` is omitted, the default output path is
    ``./features/dictionary/<analysis_ready_filename>``. Multiple dictionaries are supported;
    passing a directory discovers all ``.dic``, ``.dicx``, and ``.csv`` dictionary files
    recursively in a stable order. Global columns (e.g., word counts, punctuation) are emitted
    once (from the first dictionary) and each dictionary contributes a namespaced block.

    Parameters
    ----------
    csv_path : str or pathlib.Path, optional
        Source CSV to gather from. Mutually exclusive with ``txt_dir`` and ``analysis_csv``.
    txt_dir : str or pathlib.Path, optional
        Folder containing ``.txt`` files to gather from. Mutually exclusive with other modes.
    analysis_csv : str or pathlib.Path, optional
        Prebuilt analysis-ready CSV with exactly two columns: ``text_id`` and ``text``.
    out_features_csv : str or pathlib.Path, optional
        Output file path. If ``None``, defaults to
        ``./features/dictionary/<analysis_ready_filename>``.
    overwrite_existing : bool, default=False
        If ``False`` and the output file already exists, skip processing and return the path.
    dict_paths : Sequence[str or pathlib.Path]
        One or more dictionary inputs (files or directories). Supported extensions:
        ``.dic``, ``.dicx``, ``.csv``. Directories are expanded recursively.
    encoding : str, default="utf-8-sig"
        Text encoding used for reading/writing CSV files.
    text_cols : Sequence[str], default=("text",)
        When gathering from a CSV, name(s) of the column(s) containing text.
    id_cols : Sequence[str] or None, optional
        Optional ID columns to carry into grouping when gathering from CSV.
    mode : {"concat", "separate"}, default="concat"
        Gathering behavior when multiple text columns are provided. ``"concat"`` joins them
        into one text field using ``joiner``; ``"separate"`` creates one row per column.
    group_by : Sequence[str] or None, optional
        Optional grouping keys used during CSV gathering (e.g., ``["speaker"]``).
    delimiter : str, default=","
        Delimiter for reading/writing CSV files.
    joiner : str, default=" "
        Separator used when concatenating multiple text chunks in ``"concat"`` mode.
    num_buckets : int, default=512
        Number of temporary hash buckets used during scalable CSV gathering.
    max_open_bucket_files : int, default=64
        Maximum number of bucket files kept open concurrently during gathering.
    tmp_root : str or pathlib.Path or None, optional
        Root directory for temporary gathering artifacts.
    recursive : bool, default=True
        When gathering from a text folder, recurse into subdirectories.
    pattern : str, default="*.txt"
        Glob pattern for selecting text files when gathering from a folder.
    id_from : {"stem", "name", "path"}, default="stem"
        How to derive ``text_id`` for gathered ``.txt`` files.
    include_source_path : bool, default=True
        If ``True``, include the absolute source path as an additional column when gathering
        from a text folder.
    relative_freq : bool, default=True
        Emit relative frequencies instead of raw counts, when supported by the dictionary engine.
    drop_punct : bool, default=True
        Drop punctuation prior to analysis (dictionary-dependent).
    rounding : int, default=4
        Decimal places to round numeric outputs. Use ``None`` to disable rounding.
    retain_captures : bool, default=False
        Pass-through flag to the underlying analyzer to retain capture groups, if applicable.
    wildcard_mem : bool, default=True
        Pass-through optimization flag for wildcard handling in the analyzer.

    Returns
    -------
    pathlib.Path
        Path to the written features CSV.

    Raises
    ------
    FileNotFoundError
        If input files/folders or any dictionary file cannot be found.
    ValueError
        If input modes are misconfigured (e.g., multiple sources provided or none),
        required columns are missing from the analysis-ready CSV, or unsupported
        dictionary extensions are encountered.

    Examples
    --------
    Run on a transcript CSV, grouped by speaker:

    >>> analyze_with_dictionaries(
    ...     csv_path="transcripts/session.csv",
    ...     text_cols=["text"], id_cols=["speaker"], group_by=["speaker"],
    ...     dict_paths=["dictionaries/liwc/LIWC-22 Dictionary (2022-01-27).dicx"]
    ... )
    PosixPath('.../features/dictionary/session.csv')

    Notes
    -----
    If ``overwrite_existing`` is ``False`` and the output exists, the existing file path
    is returned without recomputation.
    """


    # 1) Produce or accept the analysis-ready CSV (must have columns: text_id,text)
    if analysis_csv is not None:
        analysis_ready = Path(analysis_csv)
        if not analysis_ready.exists():
            raise FileNotFoundError(f"analysis_csv not found: {analysis_ready}")
    else:
        if (csv_path is None) == (txt_dir is None):
            raise ValueError("Provide exactly one of csv_path or txt_dir (or pass analysis_csv).")

        if csv_path is not None:
            analysis_ready = Path(
                csv_to_analysis_ready_csv(
                    csv_path=csv_path,
                    text_cols=list(text_cols),
                    id_cols=list(id_cols) if id_cols else None,
                    mode=mode,
                    group_by=list(group_by) if group_by else None,
                    delimiter=delimiter,
                    encoding=encoding,
                    joiner=joiner,
                    num_buckets=num_buckets,
                    max_open_bucket_files=max_open_bucket_files,
                    tmp_root=tmp_root,
                )
            )
        else:
            analysis_ready = Path(
                txt_folder_to_analysis_ready_csv(
                    root_dir=txt_dir,
                    recursive=recursive,
                    pattern=pattern,
                    encoding=encoding,
                    id_from=id_from,
                    include_source_path=include_source_path,
                )
            )

    # 1b) Decide default features path if not provided:
    #     <cwd>/features/dictionary/<analysis_ready_filename>
    if out_features_csv is None:
        out_features_csv = Path.cwd() / "features" / "dictionary" / analysis_ready.name
    out_features_csv = Path(out_features_csv)
    out_features_csv.parent.mkdir(parents=True, exist_ok=True)

    if not overwrite_existing and Path(out_features_csv).is_file():
        print("Dictionary content coding output file already exists; returning existing file.")
        return out_features_csv


    # 2) Validate dictionaries
    def _expand_dict_inputs(paths):
        """
        Normalize dictionary inputs into a unique, ordered list of files.

        Parameters
        ----------
        paths : Iterable[Union[str, pathlib.Path]]
            Files or directories. Directories are expanded recursively to files with
            extensions ``.dic``, ``.dicx``, or ``.csv``.

        Returns
        -------
        list[pathlib.Path]
            Deduplicated, resolved file paths in stable order.

        Raises
        ------
        FileNotFoundError
            If a referenced file or directory does not exist.
        ValueError
            If a file has an unsupported extension or if no dictionary files are found.
        """

        out = []
        seen = set()
        for p in map(Path, paths):
            if p.is_dir():
                # Find .dic/.dicx/.csv under this folder (recursive), stable order
                found = find_files(
                    root_dir=p,
                    extensions=[".dic", ".dicx", ".csv"],
                    recursive=True,
                    absolute=True,
                    sort=True,
                )
                for f in found:
                    fp = Path(f).resolve()
                    if fp.suffix.lower().lstrip(".") in {"dic", "dicx", "csv"}:
                        if fp not in seen:
                            out.append(fp)
                            seen.add(fp)
            else:
                if not p.exists():
                    raise FileNotFoundError(f"Dictionary path not found: {p}")
                fp = p.resolve()
                if fp.suffix.lower().lstrip(".") not in {"dic", "dicx", "csv"}:
                    raise ValueError(f"Unsupported dictionary extension: {fp.name}")
                if fp not in seen:
                    out.append(fp)
                    seen.add(fp)
        if not out:
            raise ValueError("No dictionary files found. Supply .dic/.dicx/.csv files or folders containing them.")
        return out

    dict_paths = _expand_dict_inputs(dict_paths)

        # 3) Stream the analysis-ready CSV into the analyzer → features CSV
    def _iter_items_from_csv_with_meta(
        path: Path,
        *,
        id_col: str = "text_id",
        text_col: str = "text",
        pass_through_cols: Optional[Sequence[str]] = None,
    ) -> Iterable[Tuple[str, str, dict]]:
        """
        Stream (text_id, text, meta) from an analysis-ready CSV.

        Parameters
        ----------
        path : pathlib.Path
            Path to the analysis-ready CSV file.
        id_col : str, default="text_id"
            Identifier column.
        text_col : str, default="text"
            Text column.
        pass_through_cols : Sequence[str] or None
            Extra columns to fetch per row and forward to the analyzer.

        Yields
        ------
        tuple[str, str, dict]
            (text_id, text, meta_dict) where meta_dict maps each pass-through column
            to its string value ('' if missing).
        """
        wanted = list(pass_through_cols or [])
        with path.open("r", newline="", encoding=encoding) as f:
            reader = csv.DictReader(f, delimiter=delimiter)
            fields = reader.fieldnames or []
            if id_col not in fields or text_col not in fields:
                raise ValueError(
                    f"Expected columns '{id_col}' and '{text_col}' in {path}; found {fields}"
                )
            # If id_cols were requested, enforce they exist up-front (fail fast)
            missing = [c for c in wanted if c not in fields]
            if missing:
                raise ValueError(
                    f"Requested id_cols not present in analysis-ready CSV {path}: {missing}"
                )

            for row in reader:
                tid = str(row.get(id_col, "") or "")
                text = str(row.get(text_col, "") or "")
                meta = {c: str(row.get(c, "") or "") for c in wanted}
                yield tid, text, meta

    # Use multi_dict_analyzer as the middle layer (new API)
    mda.analyze_texts_to_csv(
        items=_iter_items_from_csv_with_meta(analysis_ready, pass_through_cols=id_cols or []),
        dict_files=dict_paths,
        out_csv=out_features_csv,
        relative_freq=relative_freq,
        drop_punct=drop_punct,
        rounding=rounding,
        retain_captures=retain_captures,
        wildcard_mem=wildcard_mem,
        id_col_name="text_id",
        pass_through_cols=list(id_cols or []),  # <-- inject id_cols right after text_id
        encoding=encoding,
    )


    return out_features_csv

main ¶

main()

Command-line entry point for multi-dictionary content coding.

Parses CLI arguments via :func:_build_arg_parser, normalizes list-like defaults, invokes :func:analyze_with_dictionaries, and prints the resulting output path.

Examples:

Basic usage on a CSV with grouping by speaker:

$ python -m taters.text.analyze_with_dictionaries \ --csv transcripts/session.csv \ --text-col text --id-col speaker --group-by speaker \ --dict dictionaries/liwc/LIWC-22\ Dictionary\ (2022-01-27).dicx

Notes

Boolean flags include positive/negative pairs (e.g., --recursive / --no-recursive, --relative-freq / --no-relative-freq) to make CLI behavior explicit.

Source code in src\taters\text\analyze_with_dictionaries.py

def main():
    r"""
    Command-line entry point for multi-dictionary content coding.

    Parses CLI arguments via :func:`_build_arg_parser`, normalizes list-like defaults,
    invokes :func:`analyze_with_dictionaries`, and prints the resulting output path.

    Examples
    --------
    Basic usage on a CSV with grouping by speaker:

    $ python -m taters.text.analyze_with_dictionaries \
        --csv transcripts/session.csv \
        --text-col text --id-col speaker --group-by speaker \
        --dict dictionaries/liwc/LIWC-22\ Dictionary\ (2022-01-27).dicx

    Notes
    -----
    Boolean flags include positive/negative pairs (e.g., ``--recursive`` /
    ``--no-recursive``, ``--relative-freq`` / ``--no-relative-freq``) to make
    CLI behavior explicit.
    """

    args = _build_arg_parser().parse_args()

    # Defaults for list-ish args
    text_cols = args.text_cols if args.text_cols else ["text"]
    id_cols = args.id_cols if args.id_cols else None
    group_by = args.group_by if args.group_by else None

    out = analyze_with_dictionaries(
        csv_path=args.csv_path,
        txt_dir=args.txt_dir,
        analysis_csv=args.analysis_csv,
        out_features_csv=args.out_features_csv,
        overwrite_existing=args.overwrite_existing,
        dict_paths=args.dict_paths,
        encoding=args.encoding,
        text_cols=text_cols,
        id_cols=id_cols,
        mode=args.mode,
        group_by=group_by,
        delimiter=args.delimiter,
        joiner=args.joiner,
        num_buckets=args.num_buckets,
        max_open_bucket_files=args.max_open_bucket_files,
        tmp_root=args.tmp_root,
        recursive=args.recursive,
        pattern=args.pattern,
        id_from=args.id_from,
        include_source_path=args.include_source_path,
        relative_freq=args.relative_freq,
        drop_punct=args.drop_punct,
        rounding=args.rounding,
        retain_captures=args.retain_captures,
        wildcard_mem=args.wildcard_mem,
    )
    print(str(out))

taters.text.analyze_lexical_richness ¶

analyze_lexical_richness ¶

analyze_lexical_richness(
    *,
    csv_path=None,
    txt_dir=None,
    analysis_csv=None,
    out_features_csv=None,
    overwrite_existing=False,
    encoding="utf-8-sig",
    text_cols=("text",),
    id_cols=None,
    mode="concat",
    group_by=None,
    delimiter=",",
    joiner=" ",
    num_buckets=512,
    max_open_bucket_files=64,
    tmp_root=None,
    recursive=True,
    pattern="*.txt",
    id_from="stem",
    include_source_path=True,
    msttr_window=100,
    mattr_window=100,
    mtld_threshold=0.72,
    hdd_draws=42,
    vocd_ntokens=50,
    vocd_within_sample=100,
    vocd_iterations=3,
    vocd_seed=42,
    pass_through_cols=None
)

Compute lexical richness/diversity metrics for each text row and write a features CSV. Draws heavily from https://github.com/LSYS/lexicalrichness but makes several key changes with the goals of minimizing dependencies, attempting to make some speed optimizations with grid search instead of precise curve specifications, and making some principled decisions around punctuation/hyphenization that differ from the original Note that these decisions are not objectively "better" than the original but, instead, reflect my own experiences/intuitions about what makes sense.

This function accepts (a) an analysis-ready CSV (with columns text_id,text), (b) a raw CSV plus instructions for gathering/aggregation, or (c) a folder of .txt files. For each resulting row of text, it tokenizes words and computes a suite of classical lexical richness measures (e.g., TTR, Herdan's C, Yule's K, MTLD, MATTR, HDD, VOCD). Results are written as a wide CSV whose rows align with the rows in the analysis-ready table (or the gathered group_by rows), preserving any non-text metadata columns.

Parameters:

Name	Type	Description	Default
`csv_path`	`str or Path`	Source CSV to gather from. Use with `text_cols`, optional `id_cols`, and optional `group_by`. Exactly one of `csv_path`, `txt_dir`, or `analysis_csv` must be provided (unless `analysis_csv` is given, which skips gathering).	`None`
`txt_dir`	`str or Path`	Folder of `.txt` files to gather. File identifiers are created from filenames via `id_from` and (optionally) a `source_path` column when `include_source_path=True`.	`None`
`analysis_csv`	`str or Path`	Existing analysis-ready CSV with columns `text_id,text`. When provided, all gathering options are ignored and the file is used as-is.	`None`
`out_features_csv`	`str or Path`	Output CSV path. If omitted, defaults to `./features/lexical-richness/<analysis_ready_filename>`.	`None`
`overwrite_existing`	`bool`	If `False` and `out_features_csv` exists, the function short-circuits and returns the existing path without recomputation.	`False`
`encoding`	`str`	Encoding for reading/writing CSVs.	`"utf-8-sig"`
`text_cols`	`sequence of str`	Text column(s) to use when `csv_path` is provided. When multiple columns are given, they are combined according to `mode` (`concat` or `separate`).	`("text",)`
`id_cols`	`sequence of str`	Columns to carry through unchanged into the analysis-ready CSV prior to analysis (e.g., `["source","speaker"]`). These will also appear in the output features CSV.	`None`
`mode`	`('concat', 'separate')`	Gathering behavior when multiple `text_cols` are provided. `"concat"` joins values using `joiner`; `"separate"` produces separate rows per text column.	`"concat"`
`group_by`	`sequence of str`	If provided, texts are grouped by these columns before analysis (e.g., `["source","speaker"]`). With `mode="concat"`, all texts in a group are joined into one blob per group; with `mode="separate"`, they remain separate rows.	`None`
`delimiter`	`str`	CSV delimiter used for input and output.	`","`
`joiner`	`str`	String used to join text fields when `mode="concat"`.	`" "`
`num_buckets`	`int`	Internal streaming/gather parameter to control temporary file bucketing (passed through to the gatherer).	`512`
`max_open_bucket_files`	`int`	Maximum number of temporary files simultaneously open during gathering.	`64`
`tmp_root`	`str or Path`	Temporary directory root for the gatherer. Defaults to a system temp location.	`None`
`recursive`	`bool`	When `txt_dir` is provided, whether to search subdirectories for `.txt` files.	`True`
`pattern`	`str`	Glob pattern for discovering text files under `txt_dir`.	`"*.txt"`
`id_from`	`('stem', 'name', 'path')`	How to construct `text_id` for `.txt` inputs: file stem, full name, or relative path.	`"stem"`
`include_source_path`	`bool`	When `txt_dir` is used, include a `source_path` column in the analysis-ready CSV.	`True`
`msttr_window`	`int`	Window size for MSTTR (Mean Segmental TTR). Must be smaller than the number of tokens in the text to produce a value.	`100`
`mattr_window`	`int`	Window size for MATTR (Moving-Average TTR). Must be smaller than the number of tokens.	`100`
`mtld_threshold`	`float`	MTLD threshold for factor completion. A higher threshold yields shorter factors and typically lower MTLD values; the default follows common practice.	`0.72`
`hdd_draws`	`int`	Sample size `n` for HD-D (Hypergeometric Distribution Diversity). Must be less than the number of tokens to produce a value.	`42`
`vocd_ntokens`	`int`	Maximum sample size used to estimate VOCD (D). For each `N` in 35..`vocd_ntokens`, the function computes the average TTR over many random samples (`vocd_within_sample`).	`50`
`vocd_within_sample`	`int`	Number of random samples drawn per `N` when estimating VOCD.	`100`
`vocd_iterations`	`int`	Repeat-estimate count for VOCD. The best-fit D from each repetition is averaged.	`3`
`vocd_seed`	`int`	Seed for the VOCD random sampler (controls reproducibility across runs).	`42`

Returns:

Type	Description
`Path`	Path to the written features CSV.

Output shape

The features CSV starts with::

text_id, <pass-through columns...>, ttr, rttr, cttr, ...

Pass-through behavior:

If pass_through_cols is provided, those columns are included in that order.
Otherwise, if id_cols were used during gathering, they are included in that order.
Otherwise (backward compatible), all non-text columns from the analysis-ready CSV are passed through.

Metrics emitted per row (None if the text is too short): ttr, rttr, cttr, herdan_c, summer_s, dugast, maas, yule_k, yule_i, herdan_vm, simpson_d, msttr_{msttr_window}, mattr_{mattr_window}, mtld_{mtld_threshold}, hdd_{hdd_draws}, vocd_{vocd_ntokens}.

Notes

Tokenization and preprocessing. Texts are lowercased, digits are removed, and punctuation characters are replaced with spaces prior to tokenization. As a result, hyphenated forms such as "state-of-the-art" will be split into separate tokens ("state", "of", "the", "art"). This choice yields robust behavior across corpora but can produce different numeric results than implementations that remove hyphens (treating "state-of-the-art" as a single token). If you require strict parity with a hyphen-removal scheme, adapt the internal preprocessing accordingly.

Metrics. The following measures are emitted per row (values are None when a text is too short to support the computation): - ttr: Type-Token Ratio (|V| / N) - rttr: Root TTR (|V| / sqrt(N)) - cttr: Corrected TTR (|V| / sqrt(2N)) - herdan_c: Herdan's C (log |V| / log N) - summer_s: Summer's S (log log |V| / log log N) - dugast: Dugast's U ((log N)^2 / (log N − log |V|)) - maas: Maas a^2 ((log N − log |V|) / (log N)^2) - yule_k: Yule's K (dispersion of frequencies; higher = less diverse) - yule_i: Yule's I (inverse of K, scaled) - herdan_vm: Herdan's Vm - simpson_d: Simpson's D (repeat-probability across tokens) - msttr_{msttr_window}: Mean Segmental TTR over fixed segments - mattr_{mattr_window}: Moving-Average TTR over a sliding window - mtld_{mtld_threshold}: Measure of Textual Lexical Diversity (bidirectional) - hdd_{hdd_draws}: HD-D (expected proportion of types in a sample of size hdd_draws) - vocd_{vocd_ntokens}: VOCD (D) estimated by fitting TTR(N) to a theoretical curve

VOCD estimation. VOCD is fit without external optimization libraries: the function performs a coarse grid search over candidate D values (minimizing squared error between observed mean TTRs and a theoretical TTR(N; D) curve) for multiple repetitions, then averages the best D across repetitions. This generally tracks SciPy-based curve fits closely; you can widen the search grid or add a fine local search if tighter agreement is desired.

Raises:

Type	Description
`FileNotFoundError`	If `analysis_csv` is provided but the file does not exist.
`ValueError`	If none or more than one of `csv_path`, `txt_dir`, or `analysis_csv` are provided, or if the analysis-ready CSV is missing required columns (`text_id`, `text`).

Examples:

Analyze an existing analysis-ready CSV (utterance-level):

>>> analyze_lexical_richness(
...     analysis_csv="transcripts_all.csv",
...     out_features_csv="features/lexical-richness.csv",
...     overwrite_existing=True,
... )

Gather from a transcript CSV and aggregate per (source, speaker):

>>> analyze_lexical_richness(
...     csv_path="transcripts/session.csv",
...     text_cols=["text"],
...     id_cols=["source", "speaker"],
...     group_by=["source", "speaker"],
...     mode="concat",
...     out_features_csv="features/lexical-richness.csv",
... )

hdd ¶

hdd(tokens, draws=42)

HD-D (McCarthy & Jarvis): sum over types of (1 - P(X=0)) / draws, where X ~ Hypergeom(N, K, n) with N=len(tokens), K=freq(term), n=draws.

Source code in src\taters\text\analyze_lexical_richness.py

def hdd(tokens: List[str], draws: int = 42) -> Optional[float]:
    """
    HD-D (McCarthy & Jarvis): sum over types of (1 - P(X=0)) / draws,
    where X ~ Hypergeom(N, K, n) with N=len(tokens), K=freq(term), n=draws.
    """
    N = len(tokens)
    if N == 0 or draws <= 0 or draws > N:
        return None
    term_freq = Counter(tokens)
    contribs = []
    for K in term_freq.values():
        p0 = _hypergeom_pmf_zero(N, K, draws)
        contribs.append((1 - p0) / draws)
    return sum(contribs)

main ¶

main()

CLI entry point.

Examples:

Analysis-ready CSV¶

$ python -m taters.text.analyze_lexical_richness --analysis-csv transcripts_all.csv

Gather from a CSV and group by source/speaker first (utterances -> per speaker)¶

$ python -m taters.text.analyze_lexical_richness \ --csv transcripts/session.csv \ --text-col text --id-col source --id-col speaker \ --group-by source --group-by speaker --mode concat

Source code in src\taters\text\analyze_lexical_richness.py

def main():
    """
    CLI entry point.

    Examples
    --------
    # Analysis-ready CSV
    $ python -m taters.text.analyze_lexical_richness --analysis-csv transcripts_all.csv

    # Gather from a CSV and group by source/speaker first (utterances -> per speaker)
    $ python -m taters.text.analyze_lexical_richness \\
        --csv transcripts/session.csv \\
        --text-col text --id-col source --id-col speaker \\
        --group-by source --group-by speaker --mode concat
    """
    args = _build_arg_parser().parse_args()

    text_cols = args.text_cols if args.text_cols else ["text"]
    id_cols = args.id_cols if args.id_cols else None
    group_by = args.group_by if args.group_by else None

    out = analyze_lexical_richness(
        csv_path=args.csv_path,
        txt_dir=args.txt_dir,
        analysis_csv=args.analysis_csv,
        out_features_csv=args.out_features_csv,
        overwrite_existing=args.overwrite_existing,
        encoding=args.encoding,
        text_cols=text_cols,
        id_cols=id_cols,
        mode=args.mode,
        group_by=group_by,
        delimiter=args.delimiter,
        joiner=args.joiner,
        num_buckets=args.num_buckets,
        max_open_bucket_files=args.max_open_bucket_files,
        tmp_root=args.tmp_root,
        recursive=args.recursive,
        pattern=args.pattern,
        id_from=args.id_from,
        include_source_path=args.include_source_path,
        msttr_window=args.msttr_window,
        mattr_window=args.mattr_window,
        mtld_threshold=args.mtld_threshold,
        hdd_draws=args.hdd_draws,
        vocd_ntokens=args.vocd_ntokens,
        vocd_within_sample=args.vocd_within_sample,
        vocd_iterations=args.vocd_iterations,
        vocd_seed=args.vocd_seed,
    )
    print(str(out))

vocd ¶

vocd(
    tokens,
    ntokens=50,
    within_sample=100,
    iterations=3,
    seed=42,
)

Estimate D by: - for N in 35..ntokens: * sample 'within_sample' subsets of size N, compute TTR, average - grid search D over a reasonable range to minimize squared error to _ttr_nd - repeat 'iterations' times and average the best D

Source code in src\taters\text\analyze_lexical_richness.py

def vocd(tokens: List[str], ntokens: int = 50, within_sample: int = 100,
         iterations: int = 3, seed: int = 42) -> Optional[float]:
    """
    Estimate D by:
      - for N in 35..ntokens:
          * sample 'within_sample' subsets of size N, compute TTR, average
      - grid search D over a reasonable range to minimize squared error to _ttr_nd
      - repeat 'iterations' times and average the best D
    """
    if len(tokens) <= ntokens or ntokens < 35:
        return None
    rng = random.Random(seed)
    Ds: List[float] = []

    # Preselect D search grid (log-like spread 5..200)
    grid: List[float] = []
    # denser where typical D lives (10..120)
    for d in range(5, 201):
        grid.append(float(d))

    for it in range(iterations):
        x_vals: List[int] = []
        y_means: List[float] = []
        for N in range(35, ntokens + 1):
            ttrs: List[float] = []
            for _ in range(within_sample):
                sample = rng.sample(tokens, k=N)
                ttrs.append(len(set(sample)) / N)
            x_vals.append(N)
            y_means.append(mean(ttrs))

        # find D that minimizes squared error
        best_D = None
        best_err = float("inf")
        for D in grid:
            err = 0.0
            for N, y in zip(x_vals, y_means):
                yhat = _ttr_nd(N, D)
                diff = (y - yhat)
                err += diff * diff
            if err < best_err:
                best_err = err
                best_D = D
        if best_D is not None:
            Ds.append(best_D)

    return mean(Ds) if Ds else None

taters.text.analyze_readability ¶

analyze_readability ¶

analyze_readability(
    *,
    csv_path=None,
    txt_dir=None,
    analysis_csv=None,
    out_features_csv=None,
    overwrite_existing=False,
    encoding="utf-8-sig",
    text_cols=("text",),
    id_cols=None,
    mode="concat",
    group_by=None,
    delimiter=",",
    joiner=" ",
    num_buckets=512,
    max_open_bucket_files=64,
    tmp_root=None,
    recursive=True,
    pattern="*.txt",
    id_from="stem",
    include_source_path=True,
    pass_through_cols=None
)

Compute per-row readability metrics using textstat and write a wide features CSV.

The function supports exactly one of three input modes:

analysis_csv — Use a prebuilt file with at least columns text_id and text.
csv_path — Gather text from an arbitrary CSV using text_cols (and optional id_cols/group_by) to produce an analysis-ready file.
txt_dir — Gather text from a folder of .txt files.

If out_features_csv is omitted, the default output path is ./features/readability/<analysis_ready_filename>. All metrics below are computed for every row. Non-numeric metrics (e.g., text_standard) are retained as strings.

Metrics (columns)

The following metrics are emitted as columns (subject to textstat availability):

flesch_reading_ease
smog_index
flesch_kincaid_grade
coleman_liau_index
automated_readability_index
dale_chall_readability_score
difficult_words
linsear_write_formula
gunning_fog
text_standard (string label)
spache_readability (for shorter/children texts; may be None)
readability_consensus (string label)
syllable_count (on entire text)
lexicon_count (word count)
sentence_count
char_count
avg_sentence_length
avg_syllables_per_word
avg_letter_per_word

Parameters:

Name	Type	Description	Default
`csv_path`	`str or Path`	Source CSV to gather from. Mutually exclusive with `txt_dir` and `analysis_csv`.	`None`
`txt_dir`	`str or Path`	Folder containing `.txt` files to gather from. Mutually exclusive with other modes.	`None`
`analysis_csv`	`str or Path`	Prebuilt analysis-ready CSV with columns `text_id` and `text` (additional columns such as `source`/`speaker` will be copied through to the output).	`None`
`out_features_csv`	`str or Path`	Output file path. If `None`, defaults to `./features/readability/<analysis_ready_filename>`.	`None`
`overwrite_existing`	`bool`	If `False` and the output file already exists, skip processing and return the path.	`False`
`encoding`	`str`	Text encoding used for reading/writing CSV files.	`"utf-8-sig"`
`text_cols`	`Sequence[str]`	When gathering from a CSV, name(s) of the column(s) containing text.	`("text",)`
`id_cols`	`Sequence[str] or None`	Optional ID columns to carry into grouping when gathering from CSV.	`None`
`mode`	`('concat', 'separate')`	Gathering behavior when multiple text columns are provided. `"concat"` joins them using `joiner`; `"separate"` creates one row per column.	`"concat"`
`group_by`	`Sequence[str] or None`	Optional grouping keys used during CSV gathering (e.g., `["speaker"]`).	`None`
`delimiter`	`str`	Delimiter for reading/writing CSV files.	`","`
`joiner`	`str`	Separator used when concatenating multiple text chunks in `"concat"` mode.	`" "`
`num_buckets`	`int`	Number of temporary hash buckets used during scalable CSV gathering.	`512`
`max_open_bucket_files`	`int`	Maximum number of bucket files kept open concurrently during gathering.	`64`
`tmp_root`	`str or Path or None`	Root directory for temporary gathering artifacts.	`None`
`recursive`	`bool`	When gathering from a text folder, recurse into subdirectories.	`True`
`pattern`	`str`	Glob pattern for selecting text files when gathering from a folder.	`"*.txt"`
`id_from`	`('stem', 'name', 'path')`	How to derive `text_id` for gathered `.txt` files.	`"stem"`
`include_source_path`	`bool`	If `True`, include the absolute source path as an additional column when gathering from a text folder.	`True`

Returns:

Type	Description
`Path`	Path to the written features CSV.

Output layout

The output CSV starts with: text_id, ,

Pass-through behavior: - If pass_through_cols is provided, those columns are included immediately after text_id in that order. - Else if id_cols were supplied during gathering, they are included in that order. - Else (backward compatible), all non-text columns in the analysis-ready CSV are copied through (e.g., source, speaker, etc.).

Raises:

Type	Description
`FileNotFoundError`	If an input is missing.
`ValueError`	If input modes are misconfigured or required columns are absent.
`RuntimeError`	If `textstat` is not installed.

Notes

All rows are processed; blank or missing text yields benign defaults (metrics may be 0 or None).
Additional columns present in the analysis-ready CSV (beyond text) are copied through to the output (e.g., source, speaker, group_count), aiding joins/aggregation.

Source code in src\taters\text\analyze_readability.py

def analyze_readability(
    *,
    # ----- Input source (choose exactly one, or pass analysis_csv directly) -----
    csv_path: Optional[Union[str, Path]] = None,
    txt_dir: Optional[Union[str, Path]] = None,
    analysis_csv: Optional[Union[str, Path]] = None,  # if provided, gathering is skipped

    # ----- Output -----
    out_features_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,

    # ====== SHARED I/O OPTIONS ======
    encoding: str = "utf-8-sig",

    # ====== CSV GATHER OPTIONS ======
    # Only used when csv_path is provided
    text_cols: Sequence[str] = ("text",),
    id_cols: Optional[Sequence[str]] = None,
    mode: Literal["concat", "separate"] = "concat",
    group_by: Optional[Sequence[str]] = None,
    delimiter: str = ",",
    joiner: str = " ",
    num_buckets: int = 512,
    max_open_bucket_files: int = 64,
    tmp_root: Optional[Union[str, Path]] = None,

    # ====== TXT FOLDER GATHER OPTIONS ======
    # Only used when txt_dir is provided
    recursive: bool = True,
    pattern: str = "*.txt",
    id_from: Literal["stem", "name", "path"] = "stem",
    include_source_path: bool = True,

    # ====== NEW: passthrough control (optional) ======
    pass_through_cols: Optional[Sequence[str]] = None
    ) -> Path:
    """
    Compute per-row readability metrics using `textstat` and write a wide features CSV.

    The function supports exactly one of three input modes:

    1. ``analysis_csv`` — Use a prebuilt file with at least columns ``text_id`` and ``text``.
    2. ``csv_path`` — Gather text from an arbitrary CSV using ``text_cols`` (and optional
       ``id_cols``/``group_by``) to produce an analysis-ready file.
    3. ``txt_dir`` — Gather text from a folder of ``.txt`` files.

    If ``out_features_csv`` is omitted, the default output path is
    ``./features/readability/<analysis_ready_filename>``. All metrics below are computed
    for every row. Non-numeric metrics (e.g., ``text_standard``) are retained as strings.

    Metrics (columns)
    -----------------
    The following metrics are emitted as columns (subject to `textstat` availability):

    - ``flesch_reading_ease``
    - ``smog_index``
    - ``flesch_kincaid_grade``
    - ``coleman_liau_index``
    - ``automated_readability_index``
    - ``dale_chall_readability_score``
    - ``difficult_words``
    - ``linsear_write_formula``
    - ``gunning_fog``
    - ``text_standard``                 (string label)
    - ``spache_readability``            (for shorter/children texts; may be None)
    - ``readability_consensus``         (string label)
    - ``syllable_count``                (on entire text)
    - ``lexicon_count``                 (word count)
    - ``sentence_count``
    - ``char_count``
    - ``avg_sentence_length``
    - ``avg_syllables_per_word``
    - ``avg_letter_per_word``

    Parameters
    ----------
    csv_path : str or pathlib.Path, optional
        Source CSV to gather from. Mutually exclusive with ``txt_dir`` and ``analysis_csv``.
    txt_dir : str or pathlib.Path, optional
        Folder containing ``.txt`` files to gather from. Mutually exclusive with other modes.
    analysis_csv : str or pathlib.Path, optional
        Prebuilt analysis-ready CSV with columns ``text_id`` and ``text`` (additional columns
        such as ``source``/``speaker`` will be copied through to the output).
    out_features_csv : str or pathlib.Path, optional
        Output file path. If ``None``, defaults to
        ``./features/readability/<analysis_ready_filename>``.
    overwrite_existing : bool, default=False
        If ``False`` and the output file already exists, skip processing and return the path.
    encoding : str, default="utf-8-sig"
        Text encoding used for reading/writing CSV files.
    text_cols : Sequence[str], default=("text",)
        When gathering from a CSV, name(s) of the column(s) containing text.
    id_cols : Sequence[str] or None, optional
        Optional ID columns to carry into grouping when gathering from CSV.
    mode : {"concat", "separate"}, default="concat"
        Gathering behavior when multiple text columns are provided. ``"concat"`` joins them
        using ``joiner``; ``"separate"`` creates one row per column.
    group_by : Sequence[str] or None, optional
        Optional grouping keys used during CSV gathering (e.g., ``["speaker"]``).
    delimiter : str, default=","
        Delimiter for reading/writing CSV files.
    joiner : str, default=" "
        Separator used when concatenating multiple text chunks in ``"concat"`` mode.
    num_buckets : int, default=512
        Number of temporary hash buckets used during scalable CSV gathering.
    max_open_bucket_files : int, default=64
        Maximum number of bucket files kept open concurrently during gathering.
    tmp_root : str or pathlib.Path or None, optional
        Root directory for temporary gathering artifacts.
    recursive : bool, default=True
        When gathering from a text folder, recurse into subdirectories.
    pattern : str, default="*.txt"
        Glob pattern for selecting text files when gathering from a folder.
    id_from : {"stem", "name", "path"}, default="stem"
        How to derive ``text_id`` for gathered ``.txt`` files.
    include_source_path : bool, default=True
        If ``True``, include the absolute source path as an additional column when gathering
        from a text folder.

    Returns
    -------
    pathlib.Path
        Path to the written features CSV.

    Output layout
    -------------
    The output CSV starts with:
        text_id, <pass-through columns...>, <metrics...>

    Pass-through behavior:
    - If `pass_through_cols` is provided, those columns are included immediately
      after `text_id` in that order.
    - Else if `id_cols` were supplied during gathering, they are included in that order.
    - Else (backward compatible), *all* non-`text` columns in the analysis-ready CSV
      are copied through (e.g., `source`, `speaker`, etc.).

    Raises
    ------
    FileNotFoundError
        If an input is missing.
    ValueError
        If input modes are misconfigured or required columns are absent.
    RuntimeError
        If ``textstat`` is not installed.

    Notes
    -----
    - All rows are processed; blank or missing text yields benign defaults (metrics may be 0 or None).
    - Additional columns present in the analysis-ready CSV (beyond ``text``) are copied through
      to the output (e.g., ``source``, ``speaker``, ``group_count``), aiding joins/aggregation.
    """
    textstat = _require_textstat()

    # 1) Accept or produce the analysis-ready CSV (must have: text_id, text)
    if analysis_csv is not None:
        analysis_ready = Path(analysis_csv)
        if not analysis_ready.exists():
            raise FileNotFoundError(f"analysis_csv not found: {analysis_ready}")
    else:
        if (csv_path is None) == (txt_dir is None):
            raise ValueError("Provide exactly one of csv_path or txt_dir (or pass analysis_csv).")

        if csv_path is not None:
            analysis_ready = Path(
                csv_to_analysis_ready_csv(
                    csv_path=csv_path,
                    text_cols=list(text_cols),
                    id_cols=list(id_cols) if id_cols else None,
                    mode=mode,
                    group_by=list(group_by) if group_by else None,
                    delimiter=delimiter,
                    encoding=encoding,
                    joiner=joiner,
                    num_buckets=num_buckets,
                    max_open_bucket_files=max_open_bucket_files,
                    tmp_root=tmp_root,
                )
            )
        else:
            analysis_ready = Path(
                txt_folder_to_analysis_ready_csv(
                    root_dir=txt_dir,
                    recursive=recursive,
                    pattern=pattern,
                    encoding=encoding,
                    id_from=id_from,
                    include_source_path=include_source_path,
                )
            )

    # 2) Decide default features path if not provided:
    if out_features_csv is None:
        out_features_csv = Path.cwd() / "features" / "readability" / analysis_ready.name
    out_features_csv = Path(out_features_csv)
    out_features_csv.parent.mkdir(parents=True, exist_ok=True)

    if not overwrite_existing and out_features_csv.is_file():
        print(f"Readability output file already exists; returning existing file: {out_features_csv}")
        return out_features_csv

    # 3) Metrics list
    metrics = [
        "flesch_reading_ease",
        "smog_index",
        "flesch_kincaid_grade",
        "coleman_liau_index",
        "automated_readability_index",
        "dale_chall_readability_score",
        "difficult_words",
        "linsear_write_formula",
        "gunning_fog",
        "text_standard",
        "spache_readability",
        "readability_consensus",
        "syllable_count",
        "lexicon_count",
        "sentence_count",
        "char_count",
        "avg_sentence_length",
        "avg_syllables_per_word",
        "avg_letter_per_word",
    ]

    # 4) Stream input and write output
    with analysis_ready.open("r", newline="", encoding=encoding) as fin, \
         out_features_csv.open("w", newline="", encoding=encoding) as fout:
        reader = csv.DictReader(fin, delimiter=delimiter)

        if "text_id" not in reader.fieldnames or "text" not in reader.fieldnames:
            raise ValueError(
                f"Expected columns 'text_id' and 'text' in {analysis_ready}; found {reader.fieldnames}"
            )

        # Decide pass-through columns:
        requested_pt = list(pass_through_cols or [])
        if not requested_pt and id_cols:
            requested_pt = list(id_cols)

        if not requested_pt:
            # Back-compat: pass through all non-text, excluding text_id (we add it explicitly)
            auto = [c for c in (reader.fieldnames or []) if c not in ("text", "text_id")]
            requested_pt = auto

        # Validate presence
        fields = set(reader.fieldnames or [])
        missing = [c for c in requested_pt if c not in fields]
        if missing:
            raise ValueError(
                f"Requested pass-through columns not present in analysis-ready CSV {analysis_ready}: {missing}"
            )

        # Output header: text_id + pass-through + metrics
        passthrough_cols = list(dict.fromkeys(requested_pt))  # preserve order, dedupe
        fieldnames = ["text_id", *passthrough_cols, *metrics]
        writer = csv.DictWriter(fout, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()

        # Safe metric caller (handles version differences)
        def _call_metric(name: str, txt: str) -> Any:
            fn = getattr(textstat, name, None)
            if fn is None:
                if name == "readability_consensus" and hasattr(textstat, "text_standard"):
                    try:
                        return textstat.text_standard(txt)
                    except Exception:
                        return None
                return None
            try:
                return fn(txt)
            except Exception:
                return None

        for row in reader:
            txt = (row.get("text") or "").strip()
            out_row: Dict[str, Any] = {
                "text_id": row.get("text_id"),
                **{k: row.get(k, "") for k in passthrough_cols},
            }
            for m in metrics:
                out_row[m] = _call_metric(m, txt)
            writer.writerow(out_row)

    return out_features_csv

main ¶

main()

Command-line entry point for readability metrics.

Examples:

On a prebuilt analysis-ready CSV:

$ python -m taters.text.analyze_readability --analysis-csv transcripts.csv

Gather from a transcript CSV and group by speaker before scoring:

$ python -m taters.text.analyze_readability \ --csv transcripts/session.csv \ --text-col text --id-col source --id-col speaker \ --group-by source --group-by speaker --mode concat

Source code in src\taters\text\analyze_readability.py

def main():
    r"""
    Command-line entry point for readability metrics.

    Examples
    --------
    On a prebuilt analysis-ready CSV:

    $ python -m taters.text.analyze_readability --analysis-csv transcripts.csv

    Gather from a transcript CSV and group by speaker before scoring:

    $ python -m taters.text.analyze_readability \
        --csv transcripts/session.csv \
        --text-col text --id-col source --id-col speaker \
        --group-by source --group-by speaker --mode concat
    """
    args = _build_arg_parser().parse_args()

    # Defaults for list-ish args
    text_cols = args.text_cols if args.text_cols else ["text"]
    id_cols = args.id_cols if args.id_cols else None
    group_by = args.group_by if args.group_by else None

    out = analyze_readability(
        csv_path=args.csv_path,
        txt_dir=args.txt_dir,
        analysis_csv=args.analysis_csv,
        out_features_csv=args.out_features_csv,
        overwrite_existing=args.overwrite_existing,
        encoding=args.encoding,
        text_cols=text_cols,
        id_cols=id_cols,
        mode=args.mode,
        group_by=group_by,
        delimiter=args.delimiter,
        joiner=args.joiner,
        num_buckets=args.num_buckets,
        max_open_bucket_files=args.max_open_bucket_files,
        tmp_root=args.tmp_root,
        recursive=args.recursive,
        pattern=args.pattern,
        id_from=args.id_from,
        include_source_path=args.include_source_path,
    )
    print(str(out))

taters.text.extract_sentence_embeddings ¶

analyze_with_sentence_embeddings ¶

analyze_with_sentence_embeddings(
    *,
    csv_path=None,
    txt_dir=None,
    analysis_csv=None,
    out_features_csv=None,
    overwrite_existing=False,
    encoding="utf-8-sig",
    delimiter=",",
    text_cols=("text",),
    id_cols=None,
    mode="concat",
    group_by=None,
    joiner=" ",
    num_buckets=512,
    max_open_bucket_files=64,
    tmp_root=None,
    recursive=True,
    pattern="*.txt",
    id_from="stem",
    include_source_path=True,
    model_name="sentence-transformers/all-roberta-large-v1",
    batch_size=32,
    normalize_l2=True,
    rounding=None,
    show_progress=False,
    pass_through_cols=None
)

Average sentence embeddings per row of text and write a wide features CSV.

Supports three mutually exclusive input modes:

analysis_csv — Use a prebuilt file with columns text_id and text.
csv_path — Gather from a CSV using text_cols (and optional id_cols/group_by) to build an analysis-ready CSV.
txt_dir — Gather from a folder of .txt files.

For each row, the text is split into sentences (NLTK if available; otherwise a regex fallback). Each sentence is embedded with a Sentence-Transformers model and the vectors are averaged into one row-level embedding. Optionally, vectors are L2-normalized. The output CSV schema is:

text_id[, <pass_through_cols...>], e0, e1, ..., e{D-1}

If out_features_csv is omitted, the default is ./features/sentence-embeddings/<analysis_ready_filename>. When overwrite_existing is False and the output exists, the function returns the existing path without recomputation.

Parameters:

Name	Type	Description	Default
`csv_path`	`str or Path`	Source CSV to gather from. Mutually exclusive with `txt_dir` and `analysis_csv`.	`None`
`txt_dir`	`str or Path`	Folder of `.txt` files to gather from. Mutually exclusive with the other modes.	`None`
`analysis_csv`	`str or Path`	Prebuilt analysis-ready CSV containing exactly `text_id` and `text`.	`None`
`out_features_csv`	`str or Path`	Output features CSV path. If `None`, a default path is derived from the analysis-ready filename under `./features/sentence-embeddings/`.	`None`
`overwrite_existing`	`bool`	If `False` and the output file already exists, skip processing and return it.	`False`
`pass_through_cols`	`Sequence[str]`	Column names from the analysis-ready CSV to copy into the output alongside `text_id` (e.g., `["source","speaker"]`). Any names given in `id_cols` are always included automatically, even if not listed here. Missing columns are ignored with a warning.	`None`
`encoding`	`str`	CSV I/O encoding.	`"utf-8-sig"`
`delimiter`	`str`	CSV field delimiter.	`","`
`text_cols`	`Sequence[str]`	When gathering from a CSV: column(s) containing text.	`("text",)`
`id_cols`	`Sequence[str]`	When gathering from a CSV: optional ID columns to carry through.	`None`
`mode`	`('concat', 'separate')`	Gathering behavior if multiple `text_cols` are provided. `"concat"` joins them with `joiner`; `"separate"` creates one row per column.	`"concat"`
`group_by`	`Sequence[str]`	Optional grouping keys used during CSV gathering (e.g., `["speaker"]`).	`None`
`joiner`	`str`	Separator used when concatenating text in `"concat"` mode.	`" "`
`num_buckets`	`int`	Number of temporary hash buckets for scalable gathering.	`512`
`max_open_bucket_files`	`int`	Maximum number of bucket files kept open concurrently during gathering.	`64`
`tmp_root`	`str or Path`	Root directory for temporary gathering artifacts.	`None`
`recursive`	`bool`	When gathering from a text folder, recurse into subdirectories.	`True`
`pattern`	`str`	Glob pattern for selecting text files.	`"*.txt"`
`id_from`	`('stem', 'name', 'path')`	How to derive `text_id` when gathering from a text folder.	`"stem"`
`include_source_path`	`bool`	Whether to include the absolute source path as an additional column when gathering from a text folder.	`True`
`model_name`	`str`	Sentence-Transformers model name or path.	`"sentence-transformers/all-roberta-large-v1"`
`batch_size`	`int`	Batch size for model encoding.	`32`
`normalize_l2`	`bool`	If `True`, L2-normalize each row's final vector.	`True`
`rounding`	`int or None`	If provided, round floats to this many decimals (useful for smaller files).	`None`
`show_progress`	`bool`	Show a progress bar during embedding.	`False`

Returns:

Type	Description
`Path`	Path to the written features CSV.

Raises:

Type	Description
`FileNotFoundError`	If an input file or directory does not exist.
`ImportError`	If `sentence-transformers` is not installed.
`ValueError`	If input modes are misconfigured (e.g., multiple or none provided), or if the analysis-ready CSV lacks `text_id`/`text`.

Examples:

Compute row-level embeddings from a transcript CSV, grouped by speaker:

>>> analyze_with_sentence_embeddings(
...     csv_path="transcripts/session.csv",
...     text_cols=["text"], id_cols=["speaker"], group_by=["speaker"],
...     model_name="sentence-transformers/all-roberta-large-v1",
...     normalize_l2=True
... )
PosixPath('.../features/sentence-embeddings/session.csv')

Notes

Rows with no recoverable sentences produce empty feature cells (not zeros).
The embedding dimensionality D is taken from the model and used to construct header columns e0..e{D-1}.

Source code in src\taters\text\extract_sentence_embeddings.py

def analyze_with_sentence_embeddings(
    *,
    # ----- Input source (choose exactly one, or pass analysis_csv directly) -----
    csv_path: Optional[Union[str, Path]] = None,
    txt_dir: Optional[Union[str, Path]] = None,
    analysis_csv: Optional[Union[str, Path]] = None,

    # ----- Output -----
    out_features_csv: Optional[Union[str, Path]] = None,
    overwrite_existing: bool = False,  # if the file already exists, let's not overwrite by default

    # ====== SHARED I/O OPTIONS ======
    encoding: str = "utf-8-sig",
    delimiter: str = ",",

    # ====== CSV GATHER OPTIONS (when csv_path is provided) ======
    text_cols: Sequence[str] = ("text",),
    id_cols: Optional[Sequence[str]] = None,
    mode: Literal["concat", "separate"] = "concat",
    group_by: Optional[Sequence[str]] = None,
    joiner: str = " ",
    num_buckets: int = 512,
    max_open_bucket_files: int = 64,
    tmp_root: Optional[Union[str, Path]] = None,

    # ====== TXT FOLDER GATHER OPTIONS (when txt_dir is provided) ======
    recursive: bool = True,
    pattern: str = "*.txt",
    id_from: Literal["stem", "name", "path"] = "stem",
    include_source_path: bool = True,

    # ====== SentenceTransformer options ======
    model_name: str = "sentence-transformers/all-roberta-large-v1",
    batch_size: int = 32,
    normalize_l2: bool = True,       # set True if you want unit-length vectors
    rounding: Optional[int] = None,   # None = full precision; e.g., 6 for ~float32-ish text
    show_progress: bool = False,
    pass_through_cols: Optional[Sequence[str]] = None,
) -> Path:
    """
    Average sentence embeddings per row of text and write a wide features CSV.

    Supports three mutually exclusive input modes:

    1. ``analysis_csv`` — Use a prebuilt file with columns ``text_id`` and ``text``.
    2. ``csv_path`` — Gather from a CSV using ``text_cols`` (and optional
    ``id_cols``/``group_by``) to build an analysis-ready CSV.
    3. ``txt_dir`` — Gather from a folder of ``.txt`` files.

    For each row, the text is split into sentences (NLTK if available; otherwise
    a regex fallback). Each sentence is embedded with a Sentence-Transformers
    model and the vectors are averaged into one row-level embedding. Optionally,
    vectors are L2-normalized. The output CSV schema is:

    ``text_id[, <pass_through_cols...>], e0, e1, ..., e{D-1}``

    If ``out_features_csv`` is omitted, the default is
    ``./features/sentence-embeddings/<analysis_ready_filename>``. When
    ``overwrite_existing`` is ``False`` and the output exists, the function
    returns the existing path without recomputation.

    Parameters
    ----------
    csv_path : str or pathlib.Path, optional
        Source CSV to gather from. Mutually exclusive with ``txt_dir`` and ``analysis_csv``.
    txt_dir : str or pathlib.Path, optional
        Folder of ``.txt`` files to gather from. Mutually exclusive with the other modes.
    analysis_csv : str or pathlib.Path, optional
        Prebuilt analysis-ready CSV containing exactly ``text_id`` and ``text``.
    out_features_csv : str or pathlib.Path, optional
        Output features CSV path. If ``None``, a default path is derived from the
        analysis-ready filename under ``./features/sentence-embeddings/``.
    overwrite_existing : bool, default=False
        If ``False`` and the output file already exists, skip processing and return it.
    pass_through_cols : Sequence[str], optional
        Column names from the analysis-ready CSV to copy into the output
        alongside ``text_id`` (e.g., ``["source","speaker"]``). **Any names
        given in ``id_cols`` are always included automatically**, even if not
        listed here. Missing columns are ignored with a warning.


    encoding : str, default="utf-8-sig"
        CSV I/O encoding.
    delimiter : str, default=","
        CSV field delimiter.

    text_cols : Sequence[str], default=("text",)
        When gathering from a CSV: column(s) containing text.
    id_cols : Sequence[str], optional
        When gathering from a CSV: optional ID columns to carry through.
    mode : {"concat", "separate"}, default="concat"
        Gathering behavior if multiple ``text_cols`` are provided. ``"concat"`` joins
        them with ``joiner``; ``"separate"`` creates one row per column.
    group_by : Sequence[str], optional
        Optional grouping keys used during CSV gathering (e.g., ``["speaker"]``).
    joiner : str, default=" "
        Separator used when concatenating text in ``"concat"`` mode.
    num_buckets : int, default=512
        Number of temporary hash buckets for scalable gathering.
    max_open_bucket_files : int, default=64
        Maximum number of bucket files kept open concurrently during gathering.
    tmp_root : str or pathlib.Path, optional
        Root directory for temporary gathering artifacts.

    recursive : bool, default=True
        When gathering from a text folder, recurse into subdirectories.
    pattern : str, default="*.txt"
        Glob pattern for selecting text files.
    id_from : {"stem", "name", "path"}, default="stem"
        How to derive ``text_id`` when gathering from a text folder.
    include_source_path : bool, default=True
        Whether to include the absolute source path as an additional column when
        gathering from a text folder.

    model_name : str, default="sentence-transformers/all-roberta-large-v1"
        Sentence-Transformers model name or path.
    batch_size : int, default=32
        Batch size for model encoding.
    normalize_l2 : bool, default=True
        If ``True``, L2-normalize each row's final vector.
    rounding : int or None, default=None
        If provided, round floats to this many decimals (useful for smaller files).
    show_progress : bool, default=False
        Show a progress bar during embedding.

    Returns
    -------
    pathlib.Path
        Path to the written features CSV.

    Raises
    ------
    FileNotFoundError
        If an input file or directory does not exist.
    ImportError
        If ``sentence-transformers`` is not installed.
    ValueError
        If input modes are misconfigured (e.g., multiple or none provided),
        or if the analysis-ready CSV lacks ``text_id``/``text``.

    Examples
    --------
    Compute row-level embeddings from a transcript CSV, grouped by speaker:

    >>> analyze_with_sentence_embeddings(
    ...     csv_path="transcripts/session.csv",
    ...     text_cols=["text"], id_cols=["speaker"], group_by=["speaker"],
    ...     model_name="sentence-transformers/all-roberta-large-v1",
    ...     normalize_l2=True
    ... )
    PosixPath('.../features/sentence-embeddings/session.csv')

    Notes
    -----
    - Rows with no recoverable sentences produce **empty** feature cells (not zeros).
    - The embedding dimensionality ``D`` is taken from the model and used to
    construct header columns ``e0..e{D-1}``.
    """

    # pre-check that nltk sent_tokenizer is usable
    use_nltk = _ensure_nltk_punkt(verbose=True)

    # 1) analysis-ready CSV
    if analysis_csv is not None:
        analysis_ready = Path(analysis_csv)
        if not analysis_ready.exists():
            raise FileNotFoundError(f"analysis_csv not found: {analysis_ready}")
    else:
        if (csv_path is None) == (txt_dir is None):
            raise ValueError("Provide exactly one of csv_path or txt_dir (or pass analysis_csv).")
        if csv_path is not None:
            analysis_ready = Path(
                csv_to_analysis_ready_csv(
                    csv_path=csv_path,
                    text_cols=list(text_cols),
                    id_cols=list(id_cols) if id_cols else None,
                    mode=mode,
                    group_by=list(group_by) if group_by else None,
                    delimiter=delimiter,
                    encoding=encoding,
                    joiner=joiner,
                    num_buckets=num_buckets,
                    max_open_bucket_files=max_open_bucket_files,
                    tmp_root=tmp_root,
                )
            )
        else:
            analysis_ready = Path(
                txt_folder_to_analysis_ready_csv(
                    root_dir=txt_dir,
                    recursive=recursive,
                    pattern=pattern,
                    encoding=encoding,
                    id_from=id_from,
                    include_source_path=include_source_path,
                )
            )

    # 1b) default output path
    if out_features_csv is None:
        out_features_csv = Path.cwd() / "features" / "sentence-embeddings" / analysis_ready.name
    out_features_csv = Path(out_features_csv)
    out_features_csv.parent.mkdir(parents=True, exist_ok=True)

    if not overwrite_existing and Path(out_features_csv).is_file():
        print("Sentence embedding feature output file already exists; returning existing file.")
        return out_features_csv

    # 2) load model
    if SentenceTransformer is None:
        raise ImportError(
            "sentence-transformers is required. Install with `pip install sentence-transformers`."
        )
    print(f"Loading sentence-transformer model: {model_name}")
    model = SentenceTransformer(model_name)
    dim = int(getattr(model, "get_sentence_embedding_dimension", lambda: 768)())

    # 3) header
    def _merge_cols(preferred: Optional[Sequence[str]], ensure: Optional[Sequence[str]]) -> list[str]:
        """
        Merge two sequences while preserving order and removing duplicates.
        'preferred' order is kept; any 'ensure' items not present are appended.
        """
        out: list[str] = []
        seen: set[str] = set()
        for seq in (preferred or []), (ensure or []):
            for c in seq:
                if c not in seen:
                    out.append(c)
                    seen.add(c)
        return out

    # Always include id_cols in pass-through set (user doesn't need to repeat them)
    pt_cols: list[str] = _merge_cols(pass_through_cols, id_cols)
    header = ["text_id"] + pt_cols + [f"e{i}" for i in range(dim)]


    # 4) stream rows → split → encode → average → (optional) L2 normalize → write
    def _norm(v: np.ndarray) -> np.ndarray:
        if not normalize_l2:
            return v
        n = float(np.linalg.norm(v))
        return v if n < 1e-12 else (v / n)

    print("Extracting embeddings...")
    with out_features_csv.open("w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(header)

        # Open the analysis-ready CSV as dicts so we can read extra cols
        with analysis_ready.open("r", newline="", encoding=encoding) as rf:
            reader = csv.DictReader(rf, delimiter=delimiter)
            # Light validation: warn if any requested pass-through column is missing
                        # Light validation: warn if any requested pass-through column is missing
            missing = [c for c in pt_cols if c not in (reader.fieldnames or [])]
            if missing:
                print(f"[sentence-embeddings] WARNING: pass-through columns missing in source: {missing}")

            for row in reader:
                text_id = str(row.get("text_id", ""))
                text = (row.get("text") or "")
                pt_vals = [row.get(c, "") for c in pt_cols]

                sents = _split_sentences(text)
                if not sents:
                    vec = None
                else:
                    emb = model.encode(
                        sents,
                        batch_size=batch_size,
                        convert_to_numpy=True,
                        normalize_embeddings=False,
                        show_progress_bar=show_progress,
                    )
                    vec = emb.mean(axis=0).astype(np.float32, copy=False)

                if vec is None:
                    values = [""] * dim
                else:
                    if normalize_l2:
                        n = float(np.linalg.norm(vec))
                        if n > 1e-12:
                            vec = vec / n
                    values = [float(x) for x in vec.tolist()]
                    if rounding is not None:
                        values = [round(v, int(rounding)) for v in values]

                writer.writerow([text_id] + pt_vals + values)


    return out_features_csv

main ¶

main()

Command-line entry point for row-level sentence embeddings.

Parses CLI arguments via :func:_build_arg_parser, normalizes list-like defaults (e.g., --text-col, --id-col, --group-by), invokes :func:analyze_with_sentence_embeddings, and prints the resulting path.

Examples:

$ python -m taters.text.extract_sentence_embeddings \ --csv transcripts/session.csv \ --text-col text --id-col speaker --group-by speaker \ --model-name sentence-transformers/all-roberta-large-v1 \ --normalize-l2

Source code in src\taters\text\extract_sentence_embeddings.py

def main():
    """
    Command-line entry point for row-level sentence embeddings.

    Parses CLI arguments via :func:`_build_arg_parser`, normalizes list-like
    defaults (e.g., ``--text-col``, ``--id-col``, ``--group-by``), invokes
    :func:`analyze_with_sentence_embeddings`, and prints the resulting path.

    Examples
    --------
    $ python -m taters.text.extract_sentence_embeddings \\
        --csv transcripts/session.csv \\
        --text-col text --id-col speaker --group-by speaker \\
        --model-name sentence-transformers/all-roberta-large-v1 \\
        --normalize-l2
    """
    args = _build_arg_parser().parse_args()

    # Defaults for list-ish args
    text_cols = args.text_cols if args.text_cols else ["text"]
    id_cols = args.id_cols if args.id_cols else None
    group_by = args.group_by if args.group_by else None

    out = analyze_with_sentence_embeddings(
        csv_path=args.csv_path,
        txt_dir=args.txt_dir,
        analysis_csv=args.analysis_csv,
        out_features_csv=args.out_features_csv,
        overwrite_existing=args.overwrite_existing,
        encoding=args.encoding,
        delimiter=args.delimiter,
        text_cols=text_cols,
        id_cols=id_cols,
        mode=args.mode,
        group_by=group_by,
        joiner=args.joiner,
        num_buckets=args.num_buckets,
        max_open_bucket_files=args.max_open_bucket_files,
        tmp_root=args.tmp_root,
        recursive=args.recursive,
        pattern=args.pattern,
        id_from=args.id_from,
        include_source_path=args.include_source_path,
        model_name=args.model_name,
        batch_size=args.batch_size,
        normalize_l2=args.normalize_l2,
        rounding=args.rounding,
        show_progress=args.show_progress,
        pass_through_cols=(args.pass_through_cols or None),
    )
    print(str(out))

taters.text.subtitle_parser ¶

SubtitleSegment `dataclass` ¶

SubtitleSegment(number, start_ms, end_ms, text, name=None)

Normalized subtitle cue spanning a time interval.

Parameters:

Name	Type	Description	Default
`number`	`int or None`	SRT block index if present; `None` for VTT or SRTs without explicit numbering.	required
`start_ms`	`int`	Start time in milliseconds.	required
`end_ms`	`int`	End time in milliseconds.	required
`text`	`str`	Cue text content. May contain embedded newlines if the source had multiple lines.	required
`name`	`str or None`	Optional speaker/name field (not populated by the built-in parsers).	`None`

Notes

Instances are immutable (frozen=True) so they can be safely shared and hashed.

convert_subtitles ¶

convert_subtitles(
    *,
    input,
    to,
    output=None,
    encoding=None,
    include_name=False
)

Convert an SRT/VTT file to CSV/SRT/VTT.

Reads a subtitle file, parses into normalized segments, and renders to the requested format. When output is omitted, a default path is created at ./features/subtitles/<input_stem>.<ext>.

Parameters:

Name	Type	Description	Default
`input`	`str or Path`	Path to the input `.srt` or `.vtt` file.	required
`to`	`('csv', 'srt', 'vtt')`	Desired output format.	`'csv'`
`output`	`str or Path`	Explicit output path. If `None`, use the default location.	`None`
`encoding`	`str`	Input encoding override; otherwise auto-detected (or UTF-8).	`None`
`include_name`	`bool`	When `to='csv'`, include a `name` column if available.	`False`

Returns:

Type	Description
`Path`	Path to the written output file.

Raises:

Type	Description
`FileNotFoundError`	If the input file does not exist.
`ValueError`	If the output format is unsupported or input content is malformed.

Source code in src\taters\text\subtitle_parser.py

def convert_subtitles(
    *,
    input: Union[str, Path],
    to: Literal["csv", "srt", "vtt"],
    output: Optional[Union[str, Path]] = None,
    encoding: Optional[str] = None,
    include_name: bool = False,
) -> Path:
    """
    Convert an SRT/VTT file to CSV/SRT/VTT.

    Reads a subtitle file, parses into normalized segments, and renders to the
    requested format. When ``output`` is omitted, a default path is created at
    ``./features/subtitles/<input_stem>.<ext>``.

    Parameters
    ----------
    input : str or pathlib.Path
        Path to the input ``.srt`` or ``.vtt`` file.
    to : {'csv', 'srt', 'vtt'}
        Desired output format.
    output : str or pathlib.Path, optional
        Explicit output path. If ``None``, use the default location.
    encoding : str, optional
        Input encoding override; otherwise auto-detected (or UTF-8).
    include_name : bool, default=False
        When ``to='csv'``, include a ``name`` column if available.

    Returns
    -------
    pathlib.Path
        Path to the written output file.

    Raises
    ------
    FileNotFoundError
        If the input file does not exist.
    ValueError
        If the output format is unsupported or input content is malformed.
    """

    in_path = Path(input).resolve()
    segs = parse_subtitles(in_path, encoding=encoding)

    # Default output location if not provided
    if output is not None:
        out_path = Path(output)
    else:
        out_dir = _default_out_dir()
        ext = {"csv": ".csv", "srt": ".srt", "vtt": ".vtt"}[to]
        out_path = out_dir / f"{in_path.stem}{ext}"

    # Render
    if to == "csv":
        return render_to_csv(segs, out_path, include_name=include_name)
    elif to == "srt":
        return render_to_srt(segs, out_path)
    else:  # "vtt"
        return render_to_vtt(segs, out_path)

main ¶

main()

Command-line entry point for subtitle parsing and conversion.

Parses arguments via :func:_build_arg_parser, calls :func:convert_subtitles, and prints the resulting output path.

Examples:

$ python -m taters.text.subtitle_parser --input transcript.srt --to csv --output features/subtitles/transcript.csv

Source code in src\taters\text\subtitle_parser.py

def main():
    """
    Command-line entry point for subtitle parsing and conversion.

    Parses arguments via :func:`_build_arg_parser`, calls
    :func:`convert_subtitles`, and prints the resulting output path.

    Examples
    --------
    $ python -m taters.text.subtitle_parser \
        --input transcript.srt --to csv \
        --output features/subtitles/transcript.csv
    """

    args = _build_arg_parser().parse_args()
    out = convert_subtitles(
        input=args.input,
        to=args.to,
        output=args.output,
        encoding=args.encoding,
        include_name=args.include_name,
    )
    print(str(out))

parse_srt ¶

parse_srt(text)

Parse SRT content into normalized subtitle segments.

The parser tolerates extra whitespace and the optional numeric index line. Each cue must include a timestamp line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm (a dot separator for milliseconds is also accepted for robustness).

Parameters:

Name	Type	Description	Default
`text`	`str`	Entire SRT file content.	required

Returns:

Type	Description
`list[SubtitleSegment]`	Parsed cues with millisecond times and original (joined) text.

Raises:

Type	Description
`ValueError`	If a well-formed timestamp line is missing where expected.

Source code in src\taters\text\subtitle_parser.py

def parse_srt(text: str) -> List[SubtitleSegment]:
    """
    Parse SRT content into normalized subtitle segments.

    The parser tolerates extra whitespace and the optional numeric index line.
    Each cue must include a timestamp line of the form
    ``HH:MM:SS,mmm --> HH:MM:SS,mmm`` (a dot separator for milliseconds is also
    accepted for robustness).

    Parameters
    ----------
    text : str
        Entire SRT file content.

    Returns
    -------
    list[SubtitleSegment]
        Parsed cues with millisecond times and original (joined) text.

    Raises
    ------
    ValueError
        If a well-formed timestamp line is missing where expected.
    """

    lines = [ln.rstrip("\r") for ln in text.splitlines()]
    i = 0
    n = len(lines)
    out: List[SubtitleSegment] = []

    while i < n:
        # Skip blank lines
        while i < n and lines[i].strip() == "":
            i += 1
        if i >= n:
            break

        # Optional numeric index
        number = None
        maybe_num = lines[i].strip()
        ts_line_idx = i
        if maybe_num.isdigit():
            number = int(maybe_num)
            i += 1
            ts_line_idx = i

        if i >= n:
            break

        # Timestamp line
        m = _SRT_TS_LINE.match(lines[ts_line_idx].strip())
        if not m:
            # Some SRTs omit numeric indices—allow timestamps immediately
            m = _SRT_TS_LINE.match(lines[i].strip())
            if not m:
                raise ValueError(f"SRT parse error: expected timestamp near line {ts_line_idx+1}")
            ts_line_idx = i
        i = ts_line_idx + 1

        start_ms = _parse_timestamp(m.group("start"))
        end_ms = _parse_timestamp(m.group("end"))

        # Content lines until blank
        content: List[str] = []
        while i < n and lines[i].strip() != "":
            content.append(lines[i])
            i += 1

        if not content:
            # SRT often allows empty entries, but we'll keep it consistent:
            # accept empty text as empty string.
            content = [""]

        text_block = "\n".join(content)
        out.append(SubtitleSegment(number=number, start_ms=start_ms, end_ms=end_ms, text=text_block, name=None))

        # Skip the trailing blank between blocks
        while i < n and lines[i].strip() == "":
            i += 1

    return out

parse_subtitles ¶

parse_subtitles(input_path, *, encoding=None)

Auto-detect and parse a subtitle file by extension.

.vtt files are parsed as WebVTT; .srt and unknown extensions are parsed as SRT. Input encoding is detected with chardet when available, otherwise UTF-8 is assumed. Decoding errors are replaced.

Parameters:

Name	Type	Description	Default
`input_path`	`str or Path`	Path to an SRT or VTT file.	required
`encoding`	`str`	Override input encoding. If omitted, try detect then fall back to UTF-8.	`None`

Returns:

Type	Description
`list[SubtitleSegment]`	Normalized subtitle segments.

Raises:

Type	Description
`FileNotFoundError`	If the path does not exist.

Source code in src\taters\text\subtitle_parser.py

def parse_subtitles(input_path: Union[str, Path], *, encoding: Optional[str] = None) -> List[SubtitleSegment]:
    """
    Auto-detect and parse a subtitle file by extension.

    ``.vtt`` files are parsed as WebVTT; ``.srt`` and unknown extensions are
    parsed as SRT. Input encoding is detected with ``chardet`` when available,
    otherwise UTF-8 is assumed. Decoding errors are replaced.

    Parameters
    ----------
    input_path : str or pathlib.Path
        Path to an SRT or VTT file.
    encoding : str, optional
        Override input encoding. If omitted, try detect then fall back to UTF-8.

    Returns
    -------
    list[SubtitleSegment]
        Normalized subtitle segments.

    Raises
    ------
    FileNotFoundError
        If the path does not exist.
    """

    path = Path(input_path)
    if not path.exists():
        raise FileNotFoundError(f"Subtitle file not found: {path}")

    enc = encoding or _detect_encoding(path)
    raw = path.read_text(encoding=enc, errors="replace")

    ext = path.suffix.lower()
    if ext == ".vtt":
        return parse_vtt(raw)
    else:
        # Default to SRT for .srt or any other unknown extension (common in the wild)
        return parse_srt(raw)

parse_vtt ¶

parse_vtt(text)

Parse WebVTT content into normalized subtitle segments.

Behavior: - Skips the WEBVTT header and any header metadata. - Skips NOTE and STYLE blocks. - Ignores optional cue identifiers. - Requires a timestamp line of the form HH:MM:SS.mmm --> HH:MM:SS.mmm (comma also accepted).

Parameters:

Name	Type	Description	Default
`text`	`str`	Entire VTT file content.	required

Returns:

Type	Description
`list[SubtitleSegment]`	Parsed cues with millisecond times and original (joined) text.

Raises:

Type	Description
`ValueError`	If a required timestamp line is malformed or missing.

Source code in src\taters\text\subtitle_parser.py

def parse_vtt(text: str) -> List[SubtitleSegment]:
    """
    Parse WebVTT content into normalized subtitle segments.

    Behavior:
    - Skips the ``WEBVTT`` header and any header metadata.
    - Skips ``NOTE`` and ``STYLE`` blocks.
    - Ignores optional cue identifiers.
    - Requires a timestamp line of the form
    ``HH:MM:SS.mmm --> HH:MM:SS.mmm`` (comma also accepted).

    Parameters
    ----------
    text : str
        Entire VTT file content.

    Returns
    -------
    list[SubtitleSegment]
        Parsed cues with millisecond times and original (joined) text.

    Raises
    ------
    ValueError
        If a required timestamp line is malformed or missing.
    """

    lines = [ln.rstrip("\r") for ln in text.splitlines()]
    i = 0
    n = len(lines)

    # Header
    if i < n and lines[i].strip().upper().startswith("WEBVTT"):
        i += 1
        # Skip header meta until blank line
        while i < n and lines[i].strip() != "":
            i += 1
        while i < n and lines[i].strip() == "":
            i += 1

    out: List[SubtitleSegment] = []

    while i < n:
        # Skip NOTE/STYLE blocks
        if lines[i].strip().startswith("NOTE") or lines[i].strip().upper() == "STYLE":
            # Skip until blank line
            i += 1
            while i < n and lines[i].strip() != "":
                i += 1
            while i < n and lines[i].strip() == "":
                i += 1
            continue

        # Optional cue identifier line (not used here)
        # If next non-empty line contains '-->' treat as timestamp; else it's an ID.
        # Lookahead 2 lines max
        if i < n and "-->" not in lines[i]:
            # Might be identifier; check next line
            if i + 1 < n and "-->" in lines[i + 1]:
                i += 1  # consume ID; ignore value
            # else fall through; if invalid, timestamp line will fail below

        if i >= n:
            break

        # Timestamp line
        line = lines[i].strip()
        if "-->" not in line:
            raise ValueError(f"VTT parse error: expected timestamp at line {i+1}")
        parts = [p.strip() for p in line.split("-->")]
        if len(parts) < 2:
            raise ValueError(f"VTT parse error: invalid timestamp at line {i+1}")

        start_ms = _parse_timestamp(parts[0])
        end_ms = _parse_timestamp(parts[1].split(" ")[0])  # drop cue settings if present
        i += 1

        # Content until blank
        content: List[str] = []
        while i < n and lines[i].strip() != "":
            content.append(lines[i])
            i += 1

        if not content:
            content = [""]

        text_block = "\n".join(content)
        out.append(SubtitleSegment(number=None, start_ms=start_ms, end_ms=end_ms, text=text_block, name=None))

        while i < n and lines[i].strip() == "":
            i += 1

    return out

render_to_csv ¶

render_to_csv(segs, out_path, *, include_name=False)

Write segments to a CSV file.

The CSV schema is:

start_time,end_time[,name],text

Times are written as integer milliseconds (stringified) to preserve exact alignment for downstream tools.

Parameters:

Name	Type	Description	Default
`segs`	`Iterable[SubtitleSegment]`	Segments to write.	required
`out_path`	`str or Path`	Output CSV path.	required
`include_name`	`bool`	Include a `name` column (useful if upstream added speaker names).	`False`

Returns:

Type	Description
`Path`	Path to the written CSV file.

Source code in src\taters\text\subtitle_parser.py

def render_to_csv(segs: Iterable[SubtitleSegment], out_path: Union[str, Path], *, include_name: bool = False) -> Path:
    """
    Write segments to a CSV file.

    The CSV schema is:

    ``start_time,end_time[,name],text``

    Times are written as integer milliseconds (stringified) to preserve exact
    alignment for downstream tools.

    Parameters
    ----------
    segs : Iterable[SubtitleSegment]
        Segments to write.
    out_path : str or pathlib.Path
        Output CSV path.
    include_name : bool, default=False
        Include a ``name`` column (useful if upstream added speaker names).

    Returns
    -------
    pathlib.Path
        Path to the written CSV file.
    """

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        header = ["start_time", "end_time"]
        if include_name:
            header.append("name")
        header.append("text")
        w.writerow(header)
        for s in segs:
            row: List[str] = [f"{s.start_ms}", f"{s.end_ms}"]
            if include_name:
                row.append(s.name or "")
            row.append(s.text)
            w.writerow(row)
    return out_path

render_to_srt ¶

render_to_srt(segs, out_path)

Write segments to SRT format.

Blocks are 1-indexed and use HH:MM:SS,mmm timestamps.

Parameters:

Name	Type	Description	Default
`segs`	`Iterable[SubtitleSegment]`	Segments to write.	required
`out_path`	`str or Path`	Output `.srt` path.	required

Returns:

Type	Description
`Path`	Path to the written SRT file.

Source code in src\taters\text\subtitle_parser.py

def render_to_srt(segs: Iterable[SubtitleSegment], out_path: Union[str, Path]) -> Path:
    """
    Write segments to SRT format.

    Blocks are 1-indexed and use ``HH:MM:SS,mmm`` timestamps.

    Parameters
    ----------
    segs : Iterable[SubtitleSegment]
        Segments to write.
    out_path : str or pathlib.Path
        Output ``.srt`` path.

    Returns
    -------
    pathlib.Path
        Path to the written SRT file.
    """

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as f:
        for i, s in enumerate(segs, start=1):
            f.write(f"{i}\n")
            f.write(f"{_fmt_ms_srt(s.start_ms)} --> {_fmt_ms_srt(s.end_ms)}\n")
            f.write(f"{s.text}\n\n")
    return out_path

render_to_vtt ¶

render_to_vtt(segs, out_path)

Write segments to WebVTT format.

Includes a standard WEBVTT header and uses HH:MM:SS.mmm timestamps.

Parameters:

Name	Type	Description	Default
`segs`	`Iterable[SubtitleSegment]`	Segments to write.	required
`out_path`	`str or Path`	Output `.vtt` path.	required

Returns:

Type	Description
`Path`	Path to the written VTT file.

Source code in src\taters\text\subtitle_parser.py

def render_to_vtt(segs: Iterable[SubtitleSegment], out_path: Union[str, Path]) -> Path:
    """
    Write segments to WebVTT format.

    Includes a standard ``WEBVTT`` header and uses ``HH:MM:SS.mmm`` timestamps.

    Parameters
    ----------
    segs : Iterable[SubtitleSegment]
        Segments to write.
    out_path : str or pathlib.Path
        Output ``.vtt`` path.

    Returns
    -------
    pathlib.Path
        Path to the written VTT file.
    """

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8", newline="") as f:
        f.write("WEBVTT\n\n")
        for s in segs:
            f.write(f"{_fmt_ms_vtt(s.start_ms)} --> {_fmt_ms_vtt(s.end_ms)}\n")
            f.write(f"{s.text}\n\n")
    return out_path

Text Modules¶

taters.text.analyze_with_archetypes ¶

analyze_with_archetypes ¶

main ¶

taters.text.analyze_with_dictionaries ¶

analyze_with_dictionaries ¶

main ¶

taters.text.analyze_lexical_richness ¶

analyze_lexical_richness ¶

hdd ¶

main ¶

Analysis-ready CSV¶

Gather from a CSV and group by source/speaker first (utterances -> per speaker)¶

vocd ¶

taters.text.analyze_readability ¶

analyze_readability ¶

main ¶

taters.text.extract_sentence_embeddings ¶

analyze_with_sentence_embeddings ¶

main ¶

taters.text.subtitle_parser ¶

SubtitleSegment dataclass ¶

convert_subtitles ¶

main ¶

parse_srt ¶

parse_subtitles ¶

parse_vtt ¶

render_to_csv ¶

render_to_srt ¶

render_to_vtt ¶

SubtitleSegment `dataclass` ¶