AI Data Preprocessing Audit Workpaper: Pandas Data Handling Risks

Version: v0.1 | Status: Internal Working Draft

Audit Focus Area

Evaluating potential data integrity risks in Pandas-based preprocessing pipelines that may lead to:

Evidence Collection Procedures

1. Silent Row/Column Drops

Where to Find Evidence:

Test:

assert "dropped_rows" in preprocessing_logs, "No audit trail for row drops"
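A minimal sketch of how such an audit trail might be produced, assuming the pipeline keeps a plain dict named preprocessing_logs. The helper dropna_with_audit and the fields of each log record are hypothetical; they are not part of pandas or of the audited pipeline.

```python
import pandas as pd

def dropna_with_audit(df, logs, step="dropna", **kwargs):
    """Run DataFrame.dropna and record what was removed (hypothetical helper)."""
    cleaned = df.dropna(**kwargs)
    logs.setdefault("dropped_rows", []).append({
        "step": step,
        "rows_before": len(df),
        "rows_after": len(cleaned),
        # Index labels present before the call but absent afterwards.
        "dropped_index": df.index.difference(cleaned.index).tolist(),
    })
    return cleaned

preprocessing_logs = {}
raw = pd.DataFrame({"text": ["a", None, "c"], "label": [1, 0, None]})
clean = dropna_with_audit(raw, preprocessing_logs, subset=["text"])
```

Recording the dropped index labels, not only the count, lets a reviewer trace each removed row back to the raw data.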

2. No Lineage Tracking

Where to Find Evidence:

Test:

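# DataFrame.attrs is pandas' (experimental) metadata store; ad-hoc attributes such as df._file_origin are lost on copy.
assert "file_origin" in df.attrs, "No lineage metadata attached"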
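One possible way to attach lineage metadata, sketched under the assumption that pandas' experimental DataFrame.attrs dictionary is acceptable for this purpose. The helper tag_origin, the key file_origin, and the file path are all hypothetical.

```python
import pandas as pd

def tag_origin(df, source_path):
    """Store the source file path in the frame's metadata dict (hypothetical helper)."""
    df.attrs["file_origin"] = source_path
    return df

# Hypothetical path, used only for illustration.
raw = tag_origin(pd.DataFrame({"text": ["a", "b"]}), "data/raw/comments.csv")

# Re-check after a transformation: whether attrs carries over can depend on
# the operation and the pandas version, so the result is not assumed here.
filtered = raw[raw["text"] != "a"]
lineage_survived = "file_origin" in filtered.attrs
```

Because attrs is documented as experimental and may not survive every operation, it is probably safer to run the lineage check at the end of the pipeline as well as at load time.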

3. Overaggressive Regex Filtering

Where to Find Evidence:

Test:

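import re
# Check behavior rather than pattern text (assumes allowed_chars is a regex pattern string).
if not re.search(allowed_chars, "مرحبا"):  # Arabic script example
    raise ValueError("Regex filters out non-English scripts")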
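To illustrate how a character filter can silently erase a whole script, the sketch below compares two hypothetical cleaning patterns on sample strings. Neither pattern is taken from the audited pipeline; ascii_only and unicode_aware are assumptions chosen for contrast.

```python
import re

# Hypothetical filter: strips everything except ASCII letters, digits, whitespace.
ascii_only = re.compile(r"[^A-Za-z0-9\s]")
# Hypothetical alternative: strips only non-word characters (\w is Unicode-aware for str).
unicode_aware = re.compile(r"[^\w\s]")

samples = {"arabic": "مرحبا بالعالم", "english": "hello world"}

def retained_fraction(pattern, text):
    """Share of non-space characters that survive pattern.sub('')."""
    cleaned = pattern.sub("", text)
    original = sum(not ch.isspace() for ch in text)
    kept = sum(not ch.isspace() for ch in cleaned)
    return kept / original

report = {
    name: (retained_fraction(ascii_only, text), retained_fraction(unicode_aware, text))
    for name, text in samples.items()
}
```

A retained fraction near zero for one script and near one for another is the kind of evidence an auditor would flag.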

4. .dropna() Erasing Minority Data

Where to Find Evidence:

Test:

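# fill_value=0 keeps groups that vanish from df_clean; otherwise they become NaN and max() skips them.
minority_loss = df["ethnicity"].value_counts(normalize=True).sub(df_clean["ethnicity"].value_counts(normalize=True), fill_value=0)
assert minority_loss.max() < 0.05, "dropna() disproportionately affected minority groups"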
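A difference in population share can understate losses for small groups: a group that makes up 3% of the data can be removed entirely and still fall under a 0.05 threshold. A complementary view is the per-group retention rate, sketched below on toy data. The column values and the income column are hypothetical.

```python
import pandas as pd

# Toy data: every row of group "B" has a missing value in another column.
df = pd.DataFrame({
    "ethnicity": ["A"] * 8 + ["B"] * 2,
    "income": [1.0] * 8 + [None, None],
})
df_clean = df.dropna()

before = df["ethnicity"].value_counts(dropna=False)
# Reindex so that groups removed entirely show up as 0 instead of disappearing.
after = df_clean["ethnicity"].value_counts(dropna=False).reindex(before.index, fill_value=0)

# Fraction of each group's rows that survived dropna().
retention = after / before
worst_group = retention.idxmin()
```

An auditor might then require a minimum retention per group; the right cutoff is a policy decision and is not implied by this sketch.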

5. Slang/Dialects Filtered by Regex

Where to Find Evidence:

Test:

dialect_phrases = ["finna", "hella", "yinz"]
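# `phrase in df.text` tests the Series index, not its values, so search the strings explicitly.
assert any(df["text"].str.contains(p, case=False, regex=False, na=False).any() for p in dialect_phrases), "Dialect removed by preprocessing"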
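A presence check on the cleaned data alone cannot distinguish "removed by preprocessing" from "never present". The sketch below compares phrase counts before and after cleaning, assuming both the raw and cleaned text are available. The sample sentences and the helper phrase_counts are hypothetical.

```python
import re
import pandas as pd

dialect_phrases = ["finna", "hella", "yinz"]

# Toy stand-ins for the raw and cleaned text columns.
raw_text = pd.Series(["I'm finna head out", "that was hella good", "plain sentence"])
clean_text = pd.Series(["that was hella good", "plain sentence"])

def phrase_counts(texts, phrases):
    """Number of rows containing each phrase as a whole word, case-insensitively."""
    return {
        p: int(texts.str.contains(rf"\b{re.escape(p)}\b", case=False, regex=True, na=False).sum())
        for p in phrases
    }

raw_counts = phrase_counts(raw_text, dialect_phrases)
clean_counts = phrase_counts(clean_text, dialect_phrases)

# Phrases that occurred in the raw data but are absent after cleaning.
lost = [p for p in dialect_phrases if raw_counts[p] > 0 and clean_counts[p] == 0]
```

Here "yinz" is not reported as lost because it never appeared in the raw sample, which avoids a false finding.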