tf.data / PyTorch DataLoader Preprocessing Audit Workpaper
Audit Focus Area
Evaluating risks in deep learning data pipeline implementations:
    - Entangled Preprocessing: transformations baked into training loops or the model graph
    - Bias Confounding: difficulty separating model bias from data bias
    - Augmentation Distortion: truth-altering modifications applied during loading
Evidence Collection Methodology
1. Baked-in Preprocessing
    | Risk Indicator | Investigation Method | Code Pattern Example |
    | --- | --- | --- |
    | Lambda layers in model | Inspect model.summary() / print(model) | tf.keras.layers.Lambda(lambda x: x / 255) |
    | Opaque map() calls | Audit dataset.map() operations | .map(preprocess_fn, num_parallel_calls=4) |
    | Training-loop transforms | Check for inline tensor ops | batch[:, 0:1] *= 100 in training step |
```python
# Detect entangled preprocessing (TensorFlow/Keras)
if any('lambda' in layer.name for layer in model.layers):
    print("Warning: preprocessing baked into model architecture")

# PyTorch DataLoader inspection
# (assumes dataset.transform is a torchvision Compose exposing .transforms)
for transform in train_loader.dataset.transform.transforms:
    if isinstance(transform, torch.nn.Module):
        print("Warning: NN-based transform in data pipeline")
```
2. Bias Confounding
    | Bias Type | Detection Approach | Diagnostic Tools |
    | --- | --- | --- |
    | Data sampling | Check WeightedRandomSampler usage | Class distribution before/after loading |
    | Feature bias | Compare raw vs preprocessed statistics | pandas_profiling comparison |
    | Label leakage | Inspect augmentation transforms | Test whether transform(x) affects y |
```python
# Compare raw vs post-pipeline feature distributions
# (single-batch spot check; average several batches for a less noisy estimate)
raw_mean = raw_data[:, 0].mean()
loader_mean = next(iter(train_loader))[0][:, 0].mean()
if abs(raw_mean - loader_mean) > 3 * raw_data[:, 0].std():
    print(f"Significant distribution shift: {raw_mean} → {loader_mean}")
3. Augmentation Distortion
    | Distortion Risk | Augmentation Examples | Validation Method |
    | --- | --- | --- |
    | Truth-altering | Random erasing of lesions | Medical image QA checks |
    | Label corruption | Rotation of oriented objects | Bounding box verification |
    | Signal destruction | Aggressive audio noise | Waveform similarity index |
```python
# Validate that augmentations preserve essential truth
# (medical_qa_check is a domain-specific QA routine supplied by the reviewer)
original_img, label = raw_dataset[0]
augmented_img = train_loader.dataset.transform(original_img)
if not medical_qa_check(original_img, augmented_img):
    print("Augmentation violates diagnostic truth criteria")
```
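The bounding-box verification in the table above can be done mechanically: rotate the box corners by the same angle the augmentation used and check that the resulting box still fits the frame. A minimal framework-free sketch; `rotate_box`, `box_inside_image`, and the sample values are illustrative:

```python
import math

def rotate_box(box, angle_deg, img_w, img_h):
    """Rotate an axis-aligned box (x1, y1, x2, y2) about the image center and
    return the tight axis-aligned box around the rotated corners."""
    x1, y1, x2, y2 = box
    cx, cy = img_w / 2, img_h / 2
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
    rotated = [(cx + (x - cx) * cos_t - (y - cy) * sin_t,
                cy + (x - cx) * sin_t + (y - cy) * cos_t) for x, y in corners]
    xs, ys = zip(*rotated)
    return min(xs), min(ys), max(xs), max(ys)

def box_inside_image(box, img_w, img_h):
    x1, y1, x2, y2 = box
    return 0 <= x1 and 0 <= y1 and x2 <= img_w and y2 <= img_h

# Flag rotations that push a box outside the frame (misaligned label)
rotated = rotate_box((80, 80, 120, 120), 45, 200, 200)
if not box_inside_image(rotated, 200, 200):
    print("Warning: rotated box clipped; label may be corrupted")
```

Running this over a sample of annotated images gives the misalignment rate recorded later in the workpaper.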
Workpaper Template
Entangled Preprocessing Findings
    | Location | Preprocessing Type | Entanglement Level | Severity |
    | --- | --- | --- | --- |
    | Model layer 1 | Normalization (Lambda) | Hardcoded in graph | Critical |
    | dataset.map() | Text tokenization | Opaque parallel ops | High |
    | Training step | Channel swapping | Inline tensor ops | Medium |
Bias Confounding Findings
    | Bias Source | Preprocessing Stage | Impact Metric | Confounding Factor |
    | --- | --- | --- | --- |
    | Sampler weights | DataLoader init | Class F1 variance | ±8% from raw data |
    | Normalization | TFRecord decoding | Feature skew | 2.3σ shift |
    | Crop padding | TorchVision transform | Edge artifact rate | 12% false positives |
Augmentation Distortion Findings
    | Augmentation | Domain | Truth Alteration | QA Failure Rate |
    | --- | --- | --- | --- |
    | RandomErasing | Medical Imaging | 5% lesion removal | 18/200 cases |
    | ColorJitter | Satellite | NDVI corruption | 32% spectral shift |
    | SpeedPerturb | Audio | Phoneme distortion | 0.4 WER increase |
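The waveform similarity index used for the audio finding can be computed as cosine similarity between the clean and augmented signals; a score near 1 means the augmentation left the signal shape largely intact. A minimal numpy sketch; the 0.8 threshold is an illustrative assumption, not a standard:

```python
import numpy as np

def waveform_similarity(clean, augmented):
    """Cosine similarity between two equal-length waveforms (1.0 = identical shape)."""
    clean = np.asarray(clean, dtype=float)
    augmented = np.asarray(augmented, dtype=float)
    denom = np.linalg.norm(clean) * np.linalg.norm(augmented)
    return float(np.dot(clean, augmented) / denom) if denom else 0.0

# Demo: aggressive additive noise destroys most of the signal structure
t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + np.random.default_rng(0).normal(0, 2.0, t.shape)
if waveform_similarity(clean, noisy) < 0.8:
    print("Warning: augmentation destroyed signal structure")
```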
Key Risks
    - Critical: CT scan window-leveling baked into the model prevents DICOM validation
    - High: text preprocessing removes rare dialects (AAVE retention <2%)
    - Medium: 14% of bounding boxes misaligned after rotation augmentations
Recommendations
For Disentanglement
```python
from torchvision.transforms.functional import adjust_brightness
import tensorflow as tf

# PyTorch best practice: a transform that records its parameters for audit
class AuditableTransform:
    def __call__(self, x):
        self.last_params = {'operation': 'brightness', 'factor': 0.2}
        return adjust_brightness(x, 0.2)

# TensorFlow: keep preprocessing in its own SavedModel, versioned separately
# (tf.saved_model.save writes to disk and returns None, so don't assign it)
class Preprocessor(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None, None, None, 3], tf.float32)])
    def preprocess(self, x):
        return x / 255.0

tf.saved_model.save(Preprocessor(), 'preprocessor')
```
For Bias Isolation
```python
# Bias attribution framework
# (calculate_stats and concat_batches are placeholders; the stats they return
#  must support elementwise subtraction, e.g. numpy arrays of per-feature moments)
def compare_biases(raw_data, loader):
    raw_stats = calculate_stats(raw_data)
    loaded_stats = calculate_stats(concat_batches(loader))
    return {
        'data_bias': raw_stats,
        'pipeline_bias': loaded_stats - raw_stats,  # shift introduced by the pipeline
    }
```
For Truth-Preserving Augmentation
```python
from random import random

# Medical imaging guardrails: refuse augmentations that change the lesion mask
# (rotate is assumed to be a deterministic, area-preserving rotation)
class SafeAugment:
    def __call__(self, img, mask):
        if random() < 0.5:
            rotated_mask = rotate(mask)  # rotate once; reuse for check and return
            assert mask.sum() == rotated_mask.sum(), "Augmentation destroyed lesions"
            return rotate(img), rotated_mask
        return img, mask
```
Auditor Notes
    - Required Attachments:
        - Model architecture diagram highlighting preprocessing layers
        - Side-by-side raw vs augmented samples (minimum 10 examples)
        - Statistical comparison tables of input/output distributions
Sign-off:
    | Auditor | ML Engineer | Domain Expert | Date |
    | --- | --- | --- | --- |
    | [Your Name] | [Eng Name] | [MD/PhD Name] | [Date] |
Standards References:
    - DICOM PS3.15 (Medical Image Transformations)
    - ACL Rolling Review (NLP Preprocessing)
    - MIL-STD-2525D (Geospatial Truth Preservation)