# AI Data Preprocessing Audit Workpaper: Scikit-Learn Risks
## Audit Focus Area

Evaluating transparency and statistical-assumption risks in scikit-learn preprocessing pipelines:

- Black-box pipelines: reproducible but opaque transformations
- Normality assumptions: implicit Gaussian distribution requirements
- Silent behavior: default behaviors that may distort data
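The third risk can be made concrete with a minimal sketch (toy data, not from the audited system): `SimpleImputer`'s silent default of `strategy='mean'` fills missing values with a statistic that an outlier can distort, and nothing is logged about what was replaced.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column: three ordinary values, one missing, one outlier
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

imputer = SimpleImputer()  # strategy='mean' is the silent default
X_out = imputer.fit_transform(X)

# The outlier (100.0) pulls the learned fill value far from the typical range
print(imputer.statistics_)                 # learned fill value (~34.33)
print(np.isnan(X).sum(), "value(s) replaced with no audit trail")
```

An audit control here is to record `imputer.statistics_` and the pre-fit NaN counts alongside the fitted pipeline.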
## Evidence Collection Methodology

### 1. Pipeline Transparency Risks

**Where to Find Evidence:**

| Risk Indicator | Investigation Method | Documentation Reference |
|---|---|---|
| Composite transformers masking steps | Inspect `Pipeline`/`ColumnTransformer` objects for nested transforms | `sklearn.compose.make_column_transformer` |
| Untracked hyperparameters | Compare `get_params()` output against actual behavior | `OneHotEncoder(drop='first')` default changes across versions |
| Non-invertible transformations | Identify `PowerTransformer`, `QuantileTransformer` steps | `inverse_transform` availability |
**Test Code:**

```python
for i, (name, step) in enumerate(pipeline.steps):
    if not hasattr(step, 'get_feature_names_out'):
        print(f"Step {i} ({name}) lacks feature naming")
```
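The loop above only inspects top-level steps; nested composites are the first risk indicator in the table. A recursive walk surfaces every transformer hidden inside `Pipeline` and `ColumnTransformer` containers. The `nested` pipeline below is a hypothetical stand-in for the audited one:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def flatten_steps(est, prefix=""):
    """Recursively list every transformer nested in Pipelines/ColumnTransformers."""
    found = []
    if isinstance(est, Pipeline):
        for name, step in est.steps:
            found += flatten_steps(step, f"{prefix}{name}/")
    elif isinstance(est, ColumnTransformer):
        for name, trans, _cols in est.transformers:
            found += flatten_steps(trans, f"{prefix}{name}/")
    else:
        found.append((prefix.rstrip("/"), type(est).__name__))
    return found

# Hypothetical nested pipeline similar to the kind flagged in the findings
nested = Pipeline([
    ('prep', ColumnTransformer([
        ('num', Pipeline([('impute', SimpleImputer()),
                          ('scale', StandardScaler())]), [0, 1]),
    ])),
])
for path, cls in flatten_steps(nested):
    print(path, cls)  # e.g. prep/num/impute SimpleImputer
```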
### 2. Normality Assumptions

**Where to Find Evidence:**

| Assumption | Problematic Transformers | Data Check |
|---|---|---|
| Gaussian features | `StandardScaler`, `PCA` | Check feature skewness/kurtosis |
| Linear relationships | Preprocessors feeding `LinearRegression` | Rank correlation tests |
| Symmetric outliers | `RobustScaler` | Quantile analysis |
**Statistical Tests:**

```python
from scipy import stats

# Flag scaled features that fail D'Agostino-Pearson's normality test
for col in X_train.columns:
    _, p = stats.normaltest(X_train[col])
    if p < 0.05 and col in scaler.feature_names_in_:
        print(f"Non-Gaussian feature {col} being scaled")
```
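The test above flags the problem; a short sketch (synthetic lognormal data, assumed income-like) shows why z-scoring alone does not fix it: `StandardScaler` is an affine transform, so the skewness of the distribution is unchanged.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # heavily right-skewed

z = StandardScaler().fit_transform(x)

# Standardization shifts and rescales only; the shape of the distribution survives
print(stats.skew(x.ravel()), stats.skew(z.ravel()))  # both equally skewed
```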
## Workpaper Template

### Pipeline Transparency Findings

| Pipeline Component | Transparency Issue | Evidence Location | Severity |
|---|---|---|---|
| ColumnTransformer | Nested Pipeline obscures 3 processing steps | preprocessing.py L45-62 | High |
| SimpleImputer | No record of missing value counts | Fit log missing | Medium |
| PowerTransformer | No `inverse_transform` available | Model deployment code | Critical |
### Normality Assumption Findings

| Feature | Preprocessor | Normality Test (p-value) | Impact |
|---|---|---|---|
| income | StandardScaler | 2.1e-16 (non-normal) | High |
| age | QuantileTransformer | 0.83 (normal) | Low |
| purchase_freq | RobustScaler | 3.4e-9 (non-normal, heavy-tailed) | Medium |
## Key Risks

- Critical: `PowerTransformer` applied to financial data with no `inverse_transform` path for explainability
- High: Nested pipelines hide 4 sequential transformations of sensitive features
- Medium: 68% of scaled features violate normality assumptions (Shapiro-Wilk p < 0.01)
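The 68% figure comes from the audited data; the per-feature test behind it can be reproduced with a sketch like the following. The frame here is synthetic, with assumed column names matching the findings table, not the audited dataset:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Illustrative frame: one roughly normal feature, two skewed ones
X_train = pd.DataFrame({
    'age': rng.normal(40, 10, 500),
    'income': rng.lognormal(10, 1, 500),
    'purchase_freq': rng.pareto(2.0, 500),
})

non_normal = [c for c in X_train.columns
              if stats.shapiro(X_train[c]).pvalue < 0.01]
print(f"{len(non_normal)}/{X_train.shape[1]} features fail Shapiro-Wilk at p<0.01")
```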
## Recommendations

### Transparency Fixes

```python
import numpy as np
from sklearn import set_config
from sklearn.base import BaseEstimator, TransformerMixin

set_config(display='diagram')  # Render pipelines as HTML diagrams for visual tracing

class LoggingTransformer(BaseEstimator, TransformerMixin):
    # Pass-through step that logs shape and NaN counts at transform time
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(f"Shape: {X.shape}, NaNs: {np.isnan(X).sum()}")
        return X
```
### Normality Solutions

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import RobustScaler, QuantileTransformer

# DistributionChecker is a custom audit transformer (not part of scikit-learn)
pipeline.steps.insert(1, ('dist_check', DistributionChecker()))

# Note: columns matched by both entries are emitted twice; keep selections disjoint
auto_scaler = ColumnTransformer([
    ('robust', RobustScaler(), make_column_selector(dtype_include=np.number)),
    ('quantile', QuantileTransformer(), ['income', 'purchase_amount'])
])
```
## Auditor Notes

- Pipeline visualization (HTML diagram via `set_config(display='diagram')` or `sklearn.utils.estimator_html_repr`)
- Pre/post-scaling distribution plots
- `get_feature_names_out()` completeness report
## Sign-off

| Auditor | Date | ML Engineer | Compliance |
|---|---|---|---|
| [Your Name] | [Date] | [Engineer Name] | [Approver Name] |
## Standards References

- EU AI Act, Article 13 (Transparency and provision of information)
- NIST AI Risk Management Framework (Explainability)
- IEEE 7001-2021 (Transparency of Autonomous Systems)