Version: v0.1 | Status: Internal Working Draft
Evaluating potential data integrity risks in Pandas-based preprocessing pipelines that may lead to:

- Silent data loss from unlogged row drops
- Loss of provenance and lineage for derived datasets
- Exclusion of non-Latin scripts by overly strict text filters
- Disproportionate removal of minority-group records by dropna()
- Erasure of dialect features during text cleaning
Risk 1: Silent row drops without an audit trail

Where to Find Evidence:

- Calls to .drop(), df[df.col.isna()], or boolean masking with no accompanying logging.
- Absence of before/after row counts (e.g., print(f"Dropped {len(df_before) - len(df_after)} rows")).

Test:

assert "dropped_rows" in preprocessing_logs, "No audit trail for row drops"

Risk 2: Missing dataset lineage and provenance

Where to Find Evidence:

- Use of .attrs or custom lineage tags (e.g., df.attrs["source"]).
- Versioned output files (e.g., df.to_parquet("data_v1.2.parquet")).

Test:

assert hasattr(df, "_file_origin"), "No lineage metadata attached"
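
For lineage, a loader can tag each DataFrame at read time. The sketch below assumes the conventions implied by the check above (an .attrs entry plus a _file_origin attribute); load_with_lineage and the file path are illustrative names, not existing pipeline code. Note that plain instance attributes such as _file_origin do not survive most DataFrame operations, so .attrs is the safer home for provenance.

import pandas as pd

def load_with_lineage(path: str) -> pd.DataFrame:
    """Hypothetical loader: tag the DataFrame with where it came from."""
    df = pd.read_csv(path)
    df.attrs["source"] = path     # lineage tag carried in .attrs
    df._file_origin = path        # mirrors the hasattr(df, "_file_origin") check above
    return df

# Usage (assumes the file exists):
# df = load_with_lineage("data_v1.2.csv")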

Risk 3: Regex filters that strip non-Latin scripts

Where to Find Evidence:

- df[col].str.replace() or .str.contains() with overly strict rules (e.g., r"[^a-zA-Z0-9]" removing non-Latin scripts).
- Whether filtered-out rows are logged for review (e.g., df[~df.text.str.match(regex)] saved to a log).

Test:

if "[\u0600-\u06FF]" not in allowed_chars:  # Arabic script example
    raise ValueError("Regex filters out non-English scripts")
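
A minimal sketch of a script-aware filter that also logs what it removes, assuming a text column named text; the allowed character set and the filtered_rows.csv audit path are illustrative choices, not the pipeline's actual configuration.

import pandas as pd

# Allowed characters: Latin alphanumerics, the Arabic block (one example), and a space.
# Written as a plain (non-raw) string so the \u escapes become real characters and the
# substring check in the test above ("\u0600-\u06FF" in allowed_chars) holds.
allowed_chars = "a-zA-Z0-9\u0600-\u06FF "
cleaning_pattern = f"[^{allowed_chars}]"

df = pd.DataFrame({"text": ["hello world", "مرحبا بالعالم", "£$%^&*"]})

# Log rows the filter would wipe out entirely instead of discarding them silently.
emptied = df[df.text.str.replace(cleaning_pattern, "", regex=True).str.strip() == ""]
emptied.to_csv("filtered_rows.csv", index=False)  # hypothetical audit log path

df["text"] = df.text.str.replace(cleaning_pattern, "", regex=True)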

Risk 4: dropna() disproportionately removing minority groups

Where to Find Evidence:

- .isna().sum() per subgroup before/after drops.
- Whether thresh= or subset= params disproportionately affect rare categories.
- Whether imputation (.fillna()) was considered but not used.

Test:

minority_loss = (df["ethnicity"].value_counts(normalize=True) - df_clean["ethnicity"].value_counts(normalize=True))
assert minority_loss.max() < 0.05, "dropna() disproportionately affected minority groups"
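
The same evidence can be gathered with a small per-group audit around dropna(). The helper below is hypothetical (audit_dropna_by_group is not an existing function) and assumes a categorical grouping column such as ethnicity; the 5% threshold mirrors the test above.

import pandas as pd

def audit_dropna_by_group(df: pd.DataFrame, group_col: str, threshold: float = 0.05) -> pd.Series:
    """Hypothetical audit helper: report each group's share of rows lost to dropna()."""
    df_clean = df.dropna()
    before = df.groupby(group_col).size()
    after = df_clean.groupby(group_col).size().reindex(before.index, fill_value=0)
    loss_rate = 1 - after / before  # fraction of each group's rows removed
    flagged = loss_rate[loss_rate > threshold]
    if not flagged.empty:
        print(f"dropna() removed >{threshold:.0%} of rows for: {list(flagged.index)}")
    return loss_rate

df = pd.DataFrame({
    "ethnicity": ["A", "A", "A", "B", "B"],
    "income": [50_000, 60_000, 55_000, None, 42_000],
})
audit_dropna_by_group(df, "ethnicity")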

Risk 5: Dialect features erased by text cleaning

Where to Find Evidence:

- Regex substitutions or filters that target dialect forms (e.g., r"\bain't\b|\by'all\b").

Test:

dialect_phrases = ["finna", "hella", "yinz"]
assert any(phrase in df.text for phrase in dialect_phrases), "Dialect removed by preprocessing"
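
To run that retention check around an actual cleaning step, compare which dialect markers survive the transformation. In the sketch below, clean_text is a hypothetical stand-in for the pipeline's real normalization, and the phrase list is the same illustrative one used in the test.

import re
import pandas as pd

def clean_text(s: str) -> str:
    """Hypothetical cleaning step; stands in for the pipeline's real normalization."""
    return re.sub(r"\bfinna\b", "going to", s.lower())

df = pd.DataFrame({"text": ["I'm finna head out", "hella traffic today", "yinz coming?"]})
df_clean = df.assign(text=df.text.map(clean_text))

dialect_phrases = ["finna", "hella", "yinz"]
retained_before = {p for p in dialect_phrases if df.text.str.contains(p, na=False).any()}
retained_after = {p for p in dialect_phrases if df_clean.text.str.contains(p, na=False).any()}
lost = retained_before - retained_after
if lost:
    print(f"Preprocessing removed dialect markers: {sorted(lost)}")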