Version: v0.1 | Status: Internal Working Draft
Evaluating potential data integrity risks in Pandas-based preprocessing pipelines that may lead to:
Where to Find Evidence:
Look for .drop(), df[df.col.isna()], or boolean masking without logging.
Check for logged row counts (e.g., print(f"Dropped {len(df_before) - len(df_after)} rows")).
Test:
assert "dropped_rows" in preprocessing_logs, "No audit trail for row drops"
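One way to produce the kind of audit trail this test looks for is to route each filter through a small wrapper that records how many rows it removed. The sketch below is illustrative only: the helper name drop_rows_logged and the dict layout of preprocessing_logs are assumptions, not part of the pipeline under review.

```python
import pandas as pd

# Hypothetical audit structure; only the "dropped_rows" key comes from the test above.
preprocessing_logs = {"dropped_rows": []}

def drop_rows_logged(df, keep_mask, step):
    """Keep rows where keep_mask is True and record how many were removed."""
    df_after = df[keep_mask]
    preprocessing_logs["dropped_rows"].append({
        "step": step,
        "rows_before": len(df),
        "rows_after": len(df_after),
        "dropped": len(df) - len(df_after),
    })
    return df_after

# Made-up example data: two of the four rows have a missing value.
df = pd.DataFrame({"col": [1.0, None, 3.0, None]})
df_clean = drop_rows_logged(df, df["col"].notna(), step="drop_missing_col")
```

Because every entry carries a step name, a reviewer can trace each loss back to the filter that caused it rather than seeing only a final row count.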
Where to Find Evidence:
Look for .attrs or custom lineage tags (e.g., df.attrs["source"]).
Check for versioned output files (e.g., df.to_parquet("data_v1.2.parquet")).
Test:
assert "source" in df.attrs, "No lineage metadata attached"
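A lineage tag such as df.attrs["source"] could be attached at load time with a helper along these lines. This is a minimal sketch: tag_source and has_lineage are hypothetical names, and the file name is reused from the example above purely for illustration. pandas documents DataFrame.attrs as experimental, so the tag may not survive every operation and is worth re-checking after merges or concatenations.

```python
import pandas as pd

def tag_source(df, source):
    """Record where the data came from in DataFrame.attrs (hypothetical helper)."""
    df.attrs["source"] = source
    return df

def has_lineage(df):
    """Return True if a source tag is present on the frame."""
    return "source" in df.attrs

# Made-up frame standing in for data read from a versioned file.
df = tag_source(pd.DataFrame({"text": ["a", "b"]}), "data_v1.2.parquet")
```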
Where to Find Evidence:
Look for .str.replace() or .str.contains() calls with overly strict rules (e.g., r"[^a-zA-Z0-9]" removing non-Latin scripts).
Check whether rejected rows are kept for review (e.g., df[~df.text.str.match(regex)] saved to a log).
Test:
if "[\u0600-\u06FF]" not in allowed_chars:  # Arabic script example
    raise ValueError("Regex filters out non-English scripts")
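To see what a strict pattern actually removes, the rejected rows can be collected and scanned for a non-Latin script. The sketch below assumes a text column named text, a hypothetical helper rejected_rows, and a made-up three-row sample. The Arabic range is the one from the test above.

```python
import pandas as pd

def rejected_rows(df, keep_pattern, column="text"):
    """Return the rows that a keep-pattern would drop, so they can be logged."""
    keep = df[column].str.match(keep_pattern, na=False)
    return df[~keep]

# Made-up sample: one Arabic string between two ASCII strings.
df = pd.DataFrame({"text": ["hello", "مرحبا", "abc123"]})
strict = r"^[a-zA-Z0-9 ]+$"  # illustrative strict rule, ASCII letters and digits only

rejected = rejected_rows(df, strict)
# Count rejected rows that contain Arabic-script characters.
arabic_rejected = int(rejected["text"].str.contains("[\u0600-\u06FF]", na=False).sum())
```

A non-zero arabic_rejected means the rule is removing text by script rather than by content quality.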
Where to Find Evidence:
Compare .isna().sum() per subgroup before/after drops.
Check whether thresh= or subset= params disproportionately affect rare categories.
Check whether imputation (e.g., .fillna()) was considered but not used.
Test:
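before = df["ethnicity"].value_counts(normalize=True)
after = df_clean["ethnicity"].value_counts(normalize=True)
minority_loss = before.sub(after, fill_value=0)  # a group that vanished entirely counts as a loss, not NaN
assert minority_loss.max() < 0.05, "dropna() disproportionately affected minority groups"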
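The per-subgroup comparison can be sketched as follows. The data is made up, and the income column is an assumption chosen only so that missing values cluster in one group. Retention is reported per group because a change in overall share can hide a small group losing most of its rows.

```python
import pandas as pd

# Made-up data: group B is smaller and holds all of the missing values.
df = pd.DataFrame({
    "ethnicity": ["A"] * 6 + ["B"] * 4,
    "income": [1, 2, 3, 4, 5, 6, None, None, None, 10],
})
df_clean = df.dropna(subset=["income"])

# Missing values per subgroup before the drop.
missing_by_group = df["income"].isna().groupby(df["ethnicity"]).sum()

# Fraction of each subgroup that survived the drop (0 if the group vanished).
counts_before = df["ethnicity"].value_counts()
counts_after = df_clean["ethnicity"].value_counts().reindex(counts_before.index, fill_value=0)
retention = counts_after / counts_before
```

Here group A keeps every row while group B keeps one in four, a gap that the overall row count alone would not reveal.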
Where to Find Evidence:
Look for regex patterns that match dialect forms (e.g., r"\bain’t\b|\by’all\b").
Test:
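dialect_phrases = ["finna", "hella", "yinz"]
# `phrase in df.text` would test the index labels, so search the values instead
assert any(df.text.str.contains(phrase, regex=False, na=False).any() for phrase in dialect_phrases), "Dialect removed by preprocessing"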
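A before/after comparison is more informative than a single check on the cleaned data, because a phrase that never appeared in the raw text cannot have been removed. The sketch below is illustrative: phrase_counts is a hypothetical helper, the sample sentences are made up, and the filter stands in for whatever preprocessing step is under review.

```python
import pandas as pd

dialect_phrases = ["finna", "hella", "yinz"]

def phrase_counts(texts, phrases):
    """Count rows containing each phrase (case-insensitive substring match)."""
    lowered = texts.str.lower()
    return {p: int(lowered.str.contains(p, regex=False, na=False).sum()) for p in phrases}

# Made-up raw text and a stand-in filter that removes two of the phrases.
raw = pd.Series(["I'm finna go", "That is hella good", "Plain sentence"])
cleaned = raw[~raw.str.contains("finna|hella", na=False)]

before = phrase_counts(raw, dialect_phrases)
after = phrase_counts(cleaned, dialect_phrases)
# Phrases present in the raw data that disappeared completely after cleaning.
lost = [p for p in dialect_phrases if before[p] > 0 and after[p] == 0]
```

Substring matching can over-count when a phrase occurs inside a longer word, so a word-boundary regex may be preferable for short phrases.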