AI Data Preprocessing Audit Workpaper: Scikit-Learn Risks
    Audit Focus Area
    Evaluating transparency and statistical assumption risks in Scikit-Learn preprocessing pipelines:
    
        - Black Box Pipelines: Reproducible but opaque transformations
 
        - Normality Assumptions: Implicit Gaussian distribution requirements
 
        - Silent Behavior: Default behaviors that may distort data
 
    
    Evidence Collection Methodology
    1. Pipeline Transparency Risks
    Where to Find Evidence:
    
        
            | Risk Indicator | Investigation Method | Documentation Reference | 
            | Composite transformers masking steps | Inspect Pipeline/ColumnTransformer objects for nested transforms | sklearn.compose.make_column_transformer | 
            | Untracked hyperparameters | Check get_params() vs actual behavior | OneHotEncoder drop parameter (default None; drop='first' removes one category column per feature) | 
            | Non-invertible transformations | Check each step for an inverse_transform method (PowerTransformer and QuantileTransformer both implement one) | inverse_transform availability | 
        
    
    Test Code:
    # Flag top-level steps that cannot report their output feature names
    for i, (name, step) in enumerate(pipeline.steps):
        if not hasattr(step, 'get_feature_names_out'):
            print(f"Step {i} ({name}) lacks feature naming")
    2. Normality Assumptions
    Where to Find Evidence:
    
        
            | Assumption | Problematic Transformers | Data Check | 
        
        
            | Gaussian features | StandardScaler, PCA | Check feature kurtosis | 
            | Linear relationships | LinearRegression preprocessors | Rank correlation tests | 
            | Symmetric outliers | RobustScaler | Quantile analysis | 
        
    
    Statistical Tests:
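    from scipy import stats

    # D'Agostino-Pearson normality test per column.
    # Assumes the scaler was fitted on a DataFrame, so feature_names_in_ exists.
    for col in X_train.columns:
        _, p = stats.normaltest(X_train[col], nan_policy='omit')
        if p < 0.05 and col in scaler.feature_names_in_:
            print(f"Non-Gaussian feature {col} being scaled")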
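    The Data Check column also lists kurtosis and quantile analysis. A minimal sketch of a per-feature summary covering those checks follows; the function name distribution_report, the tail_ratio measure, and the synthetic demo data are assumptions for illustration, and the 0.05 threshold simply mirrors the test above.

```python
import numpy as np
import pandas as pd
from scipy import stats


def distribution_report(X, alpha=0.05):
    """Summarise shape statistics for each numeric column of a DataFrame."""
    rows = []
    for col in X.select_dtypes(include=np.number).columns:
        values = X[col].dropna().to_numpy(dtype=float)
        q05, q50, q95 = np.quantile(values, [0.05, 0.5, 0.95])
        _, p = stats.normaltest(values)
        rows.append({
            "feature": col,
            "skew": stats.skew(values),
            # Fisher definition: 0 for a normal distribution.
            "excess_kurtosis": stats.kurtosis(values),
            # Upper-tail width over lower-tail width: 1 for symmetric tails.
            "tail_ratio": (q95 - q50) / (q50 - q05) if q50 > q05 else np.nan,
            "normaltest_p": p,
            "reject_normality": p < alpha,
        })
    return pd.DataFrame(rows).set_index("feature")


# Synthetic data for demonstration only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "normal_like": rng.normal(size=2000),
    "skewed": rng.lognormal(size=2000),
})
dist_report = distribution_report(demo)
print(dist_report.round(3))
```

    A small p-value only says the sample is unlikely to be normal; with large samples even minor deviations are flagged, so the skew, kurtosis, and tail ratio help judge whether the deviation matters for the chosen scaler.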
    Workpaper Template
    Pipeline Transparency Findings
    
        | Pipeline Component | Transparency Issue | Evidence Location | Severity | 
        
            | ColumnTransformer | Nested Pipeline obscures 3 processing steps | preprocessing.py L45-62 | High | 
            | SimpleImputer | No record of missing value counts | Fit log missing | Medium | 
            | PowerTransformer | inverse_transform not applied to outputs (the method exists in scikit-learn but is unused) | Model deployment code | Critical | 
        
    
    Normality Assumption Findings
    
        | Feature | Preprocessor | Normality Test (p-value) | Impact | 
        
            | income | StandardScaler | 2.1e-16 (non-normal) | High | 
            | age | QuantileTransformer | 0.83 (normality not rejected) | Low | 
            | purchase_freq | RobustScaler | 3.4e-9 (non-normal; suspected power law) | Medium | 
        
    
    Key Risks
    
        - Critical: PowerTransformer on financial data with no inverse_transform step applied for explainability
 
        - High: Nested pipelines hide 4 sequential transformations of sensitive features
 
        - Medium: 68% of scaled features violate normality assumptions (Shapiro-Wilk p<0.01)
 
    
    Recommendations
    Transparency Fixes
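    import numpy as np
    from sklearn import set_config
    from sklearn.base import BaseEstimator, TransformerMixin

    set_config(display='diagram')  # Render pipelines as HTML diagrams in notebooks

    class LoggingTransformer(TransformerMixin, BaseEstimator):
        def fit(self, X, y=None):
            return self  # Stateless pass-through; Pipeline still requires fit
        def transform(self, X):
            print(f"Shape: {X.shape}, NaNs: {np.isnan(X).sum()}")
            return X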
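    For the SimpleImputer finding (no record of missing value counts), a pass-through step can also store the counts as fitted attributes instead of printing them. The sketch below is one possible approach under stated assumptions: MissingCountRecorder is a hypothetical class name, the input is assumed to be numeric, and the small array is demo data only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class MissingCountRecorder(TransformerMixin, BaseEstimator):
    """Pass-through step that stores per-column NaN counts seen during fit."""

    def fit(self, X, y=None):
        arr = np.asarray(X, dtype=float)
        self.n_rows_ = arr.shape[0]
        self.n_missing_ = np.isnan(arr).sum(axis=0)
        return self

    def transform(self, X):
        return X  # Data is never modified.


# Demo data: column 0 has one NaN, column 1 has two.
X_missing = np.array([[1.0, np.nan],
                      [np.nan, np.nan],
                      [3.0, 4.0]])

recording_pipeline = Pipeline([
    ("before", MissingCountRecorder()),
    ("impute", SimpleImputer(strategy="mean")),
    ("after", MissingCountRecorder()),
])
recording_pipeline.fit_transform(X_missing)

before = recording_pipeline.named_steps["before"]
after = recording_pipeline.named_steps["after"]
print("missing before imputation:", before.n_missing_.tolist())
print("missing after imputation:", after.n_missing_.tolist())
```

    Because the counts live on the fitted pipeline, they are serialized with it and can be cited as workpaper evidence without relying on console logs.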
    Normality Solutions
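    import numpy as np
    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.preprocessing import RobustScaler, QuantileTransformer

    # DistributionChecker is assumed to be a project-defined transformer (not in scikit-learn)
    pipeline.steps.insert(1, ('dist_check', DistributionChecker()))
    # Columns matched by both entries are emitted twice, once per transformer
    auto_scaler = ColumnTransformer([
        ('robust', RobustScaler(), make_column_selector(dtype_include=np.number)),
        ('quantile', QuantileTransformer(), ['income', 'purchase_amount'])
    ])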
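    DistributionChecker is not a scikit-learn class, and its implementation is not shown in this workpaper. One way such a step could look is sketched below: a pass-through transformer that runs a normality test per column at fit time and warns when columns fail it. The constructor argument alpha, the fitted attribute names, and the demo data are all assumptions made for illustration.

```python
import warnings

import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin


class DistributionChecker(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that flags non-normal columns at fit time."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha

    def fit(self, X, y=None):
        data = np.asarray(X, dtype=float)  # Assumes 2-D numeric input.
        # D'Agostino-Pearson test per column, ignoring NaNs.
        self.p_values_ = np.array([
            stats.normaltest(col[~np.isnan(col)]).pvalue for col in data.T
        ])
        self.non_normal_ = self.p_values_ < self.alpha
        if self.non_normal_.any():
            warnings.warn(
                f"{int(self.non_normal_.sum())} column(s) reject normality "
                f"at alpha={self.alpha}"
            )
        return self

    def transform(self, X):
        return X  # Data is never modified.


# Synthetic demo data: column 0 is normal, column 1 is exponential.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(size=1000), rng.exponential(size=1000)])

checker = DistributionChecker(alpha=0.05)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # Silence the warning for the demo run.
    checked = checker.fit_transform(X_demo)

print("flagged columns:", np.flatnonzero(checker.non_normal_).tolist())
```

    Emitting a warning instead of raising keeps the pipeline usable while still leaving an auditable trace; a stricter deployment could raise an exception instead.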
    Auditor Notes
    
        - Pipeline visualization (HTML diagram rendered with set_config(display='diagram'))
 
        - Pre/post-scaling distribution plots
 
        - get_feature_names_out() completeness report
 
    
    Sign-off
    
        | Auditor | Date | ML Engineer | Compliance | 
        | [Your Name] | [Date] | [Engineer Name] | [Approver Name] | 
    
    Standards References
    
        - EU AI Act Article 13 (Transparency)
 
        - NIST AI RMF (Explainability)
 
        - IEEE 7001-2021 (Transparency of Autonomous Systems)