🥔 TATERS

Takes All Things, Extracts Relevant Stuff

Available on both GitHub and PyPI!

Status: very early and evolving. It already works for many common workflows, but expect some rough edges and the occasional renaming as things mature. (If you like stability, pin versions.)

Taters is a Python toolkit (and CLI) for getting from raw media to analysis-ready artifacts — quickly, repeatably, and without a weekend's worth of glue code. Point it at video, audio, or text, and it will turn your raw data into analyzable features.

So... what does that actually mean? Say you have a folder full of text data, and you want to extract readability scores for each row. You can do that with a single command. What if you have a CSV file full of social media posts, and you want to aggregate all text for each user, then calculate dictionary-based scores for each user? Surprise! You can do that with a single command.

But, wait! What if you have a folder full of video files, and you want to 1) extract the audio from those files, 2) transcribe and diarize the audio, 3) extract transformer-based embeddings for each text, 4) aggregate text by speaker and calculate dictionary-based scores, 5) compute Whisper embeddings for each utterance... and so on? Ohhh buddy, you guessed it — a single command.

Really, Taters is a big (and growing) toolbox of single-purpose tools that make feature extraction easy for social scientists. You can mix and match different types of feature extraction, and you can integrate them into a Python script or run them directly from the CLI. What's more, it comes with pre-made "pipelines" that bolt together lots of different steps, allowing you to run many, many tools all at once in a fixed and predictable order. And yes, you can build your own custom pipeline files that run whatever steps you need, in any order that you want, with a single command.

Think of Taters as a small, dependable kitchen crew for your data: you bring the potatoes (files), and it handles the peeling, chopping, and plating. You can tell them to prepare a single dish, or you can give them a list of instructions telling them how to prepare an entire banquet.

I'm starting to get hungry. Let's move on...

What problems Taters tries to solve

Lower the "first mile" cost of multimodal analysis (A/V + text) by shipping batteries-included tools that work well together.
Standardize I/O so outputs land in predictable places (e.g., ./features/<kind>/<file>.csv) and can be piped into downstream tools without spelunking through folders.
Keep the knobs. Every step is a clear, reusable function (and CLI entry point), not a black box. Use them à la carte or chain them together with reusable, customizable YAML presets.
Make batch runs sane: a pipeline runner coordinates per-item steps (e.g., one run per input file) and/or global steps (e.g., processing a single file or aggregating data in a single step).

Each tool follows the same "don't overwrite unless asked" behavior and ships with sensible defaults. If you don't pass an output path, Taters picks one for you.
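To make that convention concrete, here is a minimal sketch of how a default output path and the "don't overwrite unless asked" rule could fit together. This is not Taters' actual implementation: `default_output_path`, `resolve_output`, the `overwrite` flag, and the `dictionary` kind name are hypothetical and exist only for illustration.

```python
from pathlib import Path


def default_output_path(input_path, kind, root="features"):
    """Derive <root>/<kind>/<file>.csv from an input path (hypothetical helper)."""
    return Path(root) / kind / (Path(input_path).stem + ".csv")


def resolve_output(input_path, kind, out_path=None, overwrite=False, root="features"):
    """Pick an output path and decide whether writing is allowed.

    Assumption: an existing file is only replaced when overwrite=True.
    """
    if out_path is not None:
        path = Path(out_path)
    else:
        path = default_output_path(input_path, kind, root)
    should_write = overwrite or not path.exists()
    return path, should_write


# No output path given, so one is derived from the input file name.
path, should_write = resolve_output("transcripts/session.csv", kind="dictionary")
print(path.as_posix(), should_write)
```

A caller would skip the expensive work entirely when `should_write` is false, which is what keeps repeated batch runs cheap.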

How you'll use it

Python Example: Processing a Video File with Individual Functions

from taters.Taters import Taters
t = Taters()

# Extract audio from video
wavs = t.audio.extract_wavs_from_video(input_path="input.mp4")

# Diarize & transcribe (CSV/SRT/TXT)
diar = t.audio.diarize_with_thirdparty(audio_path=wavs[0], device="cuda")

# Features land under ./features/<kind>/ by default
t.audio.extract_whisper_embeddings(source_wav=wavs[0], transcript_csv=diar["csv"])
t.text.analyze_with_dictionaries(csv_path=diar["csv"], dict_paths=["dicts/EPrime-Dictionary.dicx"])
t.text.analyze_with_archetypes(csv_path=diar["csv"], archetype_csvs=["archetypes/Resilience.csv"])
t.text.extract_sentence_embeddings(csv_path=diar["csv"], text_cols=["text"], id_cols=["speaker"], group_by=["speaker"])

CLI Example

# Diarize a single file
python -m taters.audio.diarize_with_thirdparty --audio_path audio/session.wav --device cuda

# Gather text from CSV (auto-names the output if --out omitted)
python -m taters.helpers.text_gather \
  --csv transcripts/session.csv \
  --text-col text \
  --group-by speaker \
  --delimiter ,

Every CLI entry point ships with a helpful --help page, and you can compose these pieces into YAML pipelines to batch entire studies.
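If you would rather drive the CLI from a script before reaching for a pipeline, one option is to build the same command with Python's standard `subprocess` module. The sketch below reuses only the module name and flags from the `text_gather` example above; the helper names `build_text_gather_cmd` and `run_for_all` and the `dry_run` switch are hypothetical, and the file names are placeholders.

```python
import subprocess
import sys


def build_text_gather_cmd(csv_path, text_col, group_by, delimiter=","):
    """Build the argv list for the text_gather CLI call shown above."""
    return [
        sys.executable, "-m", "taters.helpers.text_gather",
        "--csv", str(csv_path),
        "--text-col", text_col,
        "--group-by", group_by,
        "--delimiter", delimiter,
    ]


def run_for_all(csv_paths, text_col="text", group_by="speaker", dry_run=True):
    """Build one command per CSV; execute them only when dry_run is False."""
    cmds = [build_text_gather_cmd(p, text_col, group_by) for p in csv_paths]
    if not dry_run:
        for cmd in cmds:
            # check=True raises CalledProcessError if the CLI exits non-zero.
            subprocess.run(cmd, check=True)
    return cmds


# Dry run: inspect the commands without executing anything.
for cmd in run_for_all(["transcripts/a.csv", "transcripts/b.csv"]):
    print(" ".join(cmd))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with paths that contain spaces.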

Pipelines (do it all at once)

Pick a preset, point at a dataset, and Taters runs the steps in order, feeding each step's output into the next. Override variables (e.g., device, models, overwrite behavior) on the command line. It's early days for presets, but they're already useful — and you can write your own when you need something custom.
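The idea of "steps in order, each feeding the next, with overridable variables" can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the preset is a plain dict rather than real Taters YAML, the `scope`/`fn`/`vars` keys are invented, and the three `fake_*` steps are stand-ins that only manipulate strings. The real preset schema and runner may look quite different.

```python
from pathlib import Path


# Stand-in steps (hypothetical): real steps would call Taters functions instead.
def fake_extract_audio(video_path, **_):
    return Path(video_path).with_suffix(".wav").name


def fake_transcribe(wav_name, device="cpu", **_):
    return f"{wav_name}:transcribed-on-{device}"


def fake_collect(per_item_results, **_):
    return sorted(per_item_results.values())


# A preset expressed as a plain dict; the real YAML schema may differ.
PRESET = {
    "vars": {"device": "cpu"},
    "steps": [
        {"scope": "item", "fn": fake_extract_audio},
        {"scope": "item", "fn": fake_transcribe},
        {"scope": "global", "fn": fake_collect},
    ],
}


def run_pipeline(preset, inputs, overrides=None):
    """Run item-scoped steps once per input, then global steps once overall."""
    variables = {**preset.get("vars", {}), **(overrides or {})}  # overrides win
    per_item = {}
    for item in inputs:
        value = item
        for step in preset["steps"]:
            if step["scope"] == "item":
                # The previous step's output becomes this step's input.
                value = step["fn"](value, **variables)
        per_item[item] = value
    result = per_item
    for step in preset["steps"]:
        if step["scope"] == "global":
            result = step["fn"](result, **variables)
    return result


# Override the preset's default device, as you might on the command line.
results = run_pipeline(PRESET, ["a.mp4", "b.mp4"], overrides={"device": "cuda"})
print(results)
```

Keeping variables separate from the step list is what lets one preset serve many machines and datasets: the steps stay fixed while the device, model, or overwrite setting changes per run.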

Who it's for

Researchers and engineers who:

wrangle interviews, conversations, text data, or any combination of those data types (and others)
need consistent, analysis-ready features extracted from their raw data
value reproducible pipelines but are still complete control freaks who enjoy micromanaging everything

What it isn't

It isn't edible. But, also, it isn't a single monolithic "do everything in one-click" application. Instead, Taters gives you small, composable building blocks with predictable I/O — and a pipeline runner to tie them together when you need it.