AI Data Preprocessing Audit Workpaper: Scikit-Learn Risks

Audit Focus Area

Evaluating transparency and statistical assumption risks in Scikit-Learn preprocessing pipelines.

Evidence Collection Methodology

1. Pipeline Transparency Risks

Where to Find Evidence:

| Risk Indicator | Investigation Method | Documentation Reference |
|---|---|---|
| Composite transformers masking steps | Inspect `Pipeline`/`ColumnTransformer` objects for nested transforms | `sklearn.compose.make_column_transformer` |
| Untracked hyperparameters | Compare `get_params()` output against actual behavior | `OneHotEncoder(drop='first')` default changes |
| Non-invertible transformations | Identify `PowerTransformer`, `QuantileTransformer` `inverse_transform` availability | |
Test Code:

```python
for i, (name, step) in enumerate(pipeline.steps):
    if not hasattr(step, 'get_feature_names_out'):
        print(f"Step {i} ({name}) lacks feature naming")
```
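The other two risk indicators in the table can be probed the same way. A minimal sketch (the two-step pipeline and its step names are illustrative, not from any audited codebase):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
pipeline = Pipeline([('poly', PolynomialFeatures(degree=2)),
                     ('scale', StandardScaler())])
pipeline.fit(X)

# Flag steps whose output cannot be mapped back to inputs
non_invertible = [name for name, step in pipeline.steps
                  if not hasattr(step, 'inverse_transform')]
print(non_invertible)  # ['poly'] -- PolynomialFeatures has no inverse_transform

# Snapshot hyperparameters at fit time so silent default changes are auditable
fitted_params = {name: step.get_params() for name, step in pipeline.steps}
```

Archiving the `fitted_params` snapshot alongside the model gives the audit trail a fixed record to diff against when library defaults change between versions.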

2. Normality Assumptions

Where to Find Evidence:

| Assumption | Problematic Transformers | Data Check |
|---|---|---|
| Gaussian features | `StandardScaler`, PCA | Check feature kurtosis |
| Linear relationships | `LinearRegression` preprocessors | Rank correlation tests |
| Symmetric outliers | `RobustScaler` | Quantile analysis |
Statistical Tests:

```python
from scipy import stats

# D'Agostino-Pearson test; requires at least 20 samples per feature
for col in X_train.columns:
    _, p = stats.normaltest(X_train[col])
    if p < 0.05 and col in scaler.feature_names_in_:
        print(f"Non-Gaussian feature {col} being scaled")
```
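The kurtosis check from the table above can be scripted the same way. A sketch on synthetic data (the column names and the 3.0 cutoff are audit-team conventions assumed here, not library defaults):

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    'income': rng.lognormal(mean=10, sigma=1, size=500),  # heavy-tailed
    'age': rng.normal(loc=40, scale=10, size=500),        # roughly Gaussian
})

# Fisher (excess) kurtosis is ~0 for a Gaussian; large positive values flag
# heavy tails that StandardScaler/PCA assumptions do not cover
KURTOSIS_THRESHOLD = 3.0  # assumed audit convention
for col in X_train.columns:
    k = kurtosis(X_train[col])
    if abs(k) > KURTOSIS_THRESHOLD:
        print(f"{col}: excess kurtosis {k:.1f} exceeds threshold")
```

Kurtosis complements `normaltest` as evidence: the p-value says the feature is non-normal, while the kurtosis value indicates how severe the tail behavior is.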

Workpaper Template

Pipeline Transparency Findings

| Pipeline Component | Transparency Issue | Evidence Location | Severity |
|---|---|---|---|
| `ColumnTransformer` | Nested `Pipeline` obscures 3 processing steps | preprocessing.py L45-62 | High |
| `SimpleImputer` | No record of missing value counts | Fit log missing | Medium |
| `PowerTransformer` | No `inverse_transform` available | Model deployment code | Critical |

Normality Assumption Findings

| Feature | Preprocessor | Normality Test (p-value) | Impact |
|---|---|---|---|
| income | `StandardScaler` | 2.1e-16 (non-normal) | High |
| age | `QuantileTransformer` | 0.83 (normal) | Low |
| purchase_freq | `RobustScaler` | 3.4e-9 (power law) | Medium |

Key Risks

Recommendations

Transparency Fixes

```python
import numpy as np
from sklearn import set_config
from sklearn.base import BaseEstimator, TransformerMixin

set_config(display='diagram')  # Enable visual tracing of nested pipelines

class LoggingTransformer(BaseEstimator, TransformerMixin):
    """Pass-through step that logs array shape and NaN count for the audit trail."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(f"Shape: {X.shape}, NaNs: {np.isnan(X).sum()}")
        return X
```
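Slotting the logger between existing steps gives a per-stage audit trail. A self-contained sketch (the surrounding two-step pipeline is illustrative):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class LoggingTransformer(BaseEstimator, TransformerMixin):
    """Pass-through step that logs array shape and NaN count."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(f"Shape: {X.shape}, NaNs: {np.isnan(X).sum()}")
        return X

# Audit trail: record array state between imputation and scaling
pipe = Pipeline([
    ('impute', SimpleImputer()),
    ('log', LoggingTransformer()),
    ('scale', StandardScaler()),
])
X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
Xt = pipe.fit_transform(X)  # prints "Shape: (3, 2), NaNs: 0"
```

Because the logger returns `X` unchanged, it can be inserted or removed without affecting model outputs, which makes it safe to leave in production pipelines under review.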

Normality Solutions

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import QuantileTransformer, RobustScaler

# Insert a distribution check early in the pipeline (DistributionChecker is a
# custom audit transformer, not a scikit-learn class)
pipeline.steps.insert(1, ('dist_check', DistributionChecker()))

# Route heavy-tailed features to QuantileTransformer. Caution: the numeric
# selector also matches 'income' and 'purchase_amount', so those columns would
# be emitted twice; exclude them (e.g. via pattern=) before deployment
auto_scaler = ColumnTransformer([
    ('robust', RobustScaler(),
     make_column_selector(dtype_include=np.number)),
    ('quantile', QuantileTransformer(), ['income', 'purchase_amount']),
])
```
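`DistributionChecker` above is not part of scikit-learn. A minimal sketch of what such a pass-through audit step might look like (warn-only behavior and the `alpha` default are assumptions):

```python
import warnings
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class DistributionChecker(BaseEstimator, TransformerMixin):
    """Pass-through step that warns when a column fails a normality test."""
    def __init__(self, alpha=0.05):
        self.alpha = alpha

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.flagged_ = []
        for j in range(X.shape[1]):
            _, p = stats.normaltest(X[:, j])
            if p < self.alpha:
                self.flagged_.append(j)
                warnings.warn(f"Column {j} looks non-Gaussian (p={p:.2g})")
        return self

    def transform(self, X):
        return X
```

Recording flagged columns on `self.flagged_` (sklearn's trailing-underscore convention for fitted attributes) lets the auditor inspect findings after `fit` without interrupting the pipeline.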

Auditor Notes

Sign-off

| Auditor | Date | ML Engineer | Compliance |
|---|---|---|---|
| [Your Name] | [Date] | [Engineer Name] | [Approver Name] |

Standards References