tf.data / PyTorch DataLoader Preprocessing Audit Workpaper
Audit Focus Area
Evaluating risks in deep learning data pipeline implementations:
- Entangled Preprocessing: Transformations baked into training loops
- Bias Confounding: Difficulty separating model vs data bias
- Augmentation Distortion: Truth-altering modifications during loading
Evidence Collection Methodology
1. Baked-in Preprocessing
| Risk Indicator | Investigation Method | Code Pattern Examples |
|---|---|---|
| Lambda layers in model | Inspect `model.summary()` / `print(model)` | `tf.keras.layers.Lambda(lambda x: x / 255)` |
| Opaque `map()` calls | Audit `dataset.map()` operations (see the pipeline-walk sketch below) | `.map(preprocess_fn, num_parallel_calls=4)` |
| Training-loop transforms | Check for inline tensor ops | `batch[:, 0:1] *= 100` inside the training step |
```python
# Detect entangled preprocessing (TensorFlow): Lambda layers hide
# arbitrary preprocessing inside the model graph.
import tensorflow as tf

if any(isinstance(layer, tf.keras.layers.Lambda) for layer in model.layers):
    print("Warning: Preprocessing baked into model architecture")

# PyTorch DataLoader inspection: assumes dataset.transform is a
# torchvision Compose, whose .transforms attribute lists each step.
import torch

for transform in train_loader.dataset.transform.transforms:
    if isinstance(transform, torch.nn.Module):
        print("Warning: NN-based transform in data pipeline")
```
2. Bias Confounding
| Bias Type | Detection Approach | Diagnostic Tools |
|---|---|---|
| Data sampling | Check `WeightedRandomSampler` usage (class-distribution sketch below) | Class distribution before/after loading |
| Feature bias | Compare raw vs preprocessed statistics | `pandas_profiling` comparison |
| Label leakage | Inspect augmentation transforms | Test whether `transform(x)` affects `y` |
```python
# Compare raw vs preprocessed distributions of the first feature.
# Heuristic threshold: flag a mean shift larger than three raw
# standard deviations between the source data and one loaded batch.
raw_mean = raw_data[:, 0].mean()
loader_mean = next(iter(train_loader))[0][:, 0].mean()
if abs(raw_mean - loader_mean) > 3 * raw_data[:, 0].std():
    print(f"Significant distribution shift: {raw_mean} → {loader_mean}")
```
3. Augmentation Distortion
| Distortion Risk | Augmentation Examples | Validation Method |
|---|---|---|
| Truth-altering | Random erasing of lesions | Medical image QA checks |
| Label corruption | Rotation of oriented objects | Bounding box verification (sketch below) |
| Signal destruction | Aggressive audio noise | Waveform similarity index |
```python
# Validate that augmentations preserve essential truth: run a
# domain-specific QA check on an original/augmented pair.
original_img, label = raw_dataset[0]
augmented_img = train_loader.dataset.transform(original_img)
if not medical_qa_check(original_img, augmented_img):  # domain QA hook
    print("Augmentation violates diagnostic truth criteria")
```
Workpaper Template
Entangled Preprocessing Findings
| Location | Preprocessing Type | Entanglement Level | Severity |
|---|---|---|---|
| Model Layer 1 | Normalization (Lambda) | Hardcoded in graph | Critical |
| `dataset.map()` | Text tokenization | Opaque parallel ops | High |
| Training step | Channel swapping | Inline tensor ops | Medium |
Bias Confounding Findings
| Bias Source | Preprocessing Stage | Impact Metric | Confounding Factor |
|---|---|---|---|
| Sampler weights | DataLoader init | Class F1 variance | ±8% from raw data |
| Normalization | TFRecord decoding | Feature skew | 2.3σ shift |
| Crop padding | TorchVision transform | Edge artifact rate | 12% false positives |
Augmentation Distortion Findings
| Augmentation | Domain | Truth Alteration | QA Failure Rate |
|---|---|---|---|
| RandomErasing | Medical imaging | 5% lesion removal | 18/200 cases |
| ColorJitter | Satellite | NDVI corruption | 32% spectral shift |
| SpeedPerturb | Audio | Phoneme distortion | 0.4 WER increase |
Key Risks
- Critical: CT scan window-leveling baked into the model prevents DICOM validation
- High: Text preprocessing removes rare dialects (AAVE retention <2%)
- Medium: 14% of bounding boxes misaligned after rotation augmentations
Recommendations
For Disentanglement
```python
# PyTorch best practice: a transform that records its own parameters
# so each preprocessing step can be audited after the fact.
from torchvision.transforms.functional import adjust_brightness

class AuditableTransform:
    def __call__(self, x):
        self.last_params = {'operation': 'brightness', 'factor': 0.2}
        return adjust_brightness(x, 0.2)

# TensorFlow solution: package preprocessing as its own SavedModel,
# versioned separately from the model graph.
import tensorflow as tf

class Preprocessor(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec(shape=None, dtype=tf.float32)])
    def __call__(self, x):
        return x / 255.0

tf.saved_model.save(Preprocessor(), 'preprocessor')
```
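A usage sketch, assuming the `preprocessor` SavedModel written above: loading it back and applying it as an explicit tf.data stage keeps preprocessing out of the model graph and separately versioned. The sample dataset is hypothetical.

```python
# Hypothetical wiring: the preprocessor is loaded from its own artifact
# and applied as a visible tf.data stage, not inside the model graph.
loaded = tf.saved_model.load('preprocessor')
ds = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([8, 32], maxval=255.0))
ds = ds.map(loaded)  # auditable, separately versioned preprocessing step
```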
For Bias Isolation
```python
# Bias attribution framework: separate bias already present in the raw
# data ('data_bias') from bias the pipeline itself introduces.
def compare_biases(raw_data, loader):
    raw_stats = calculate_stats(raw_data)
    loaded_stats = calculate_stats(concat_batches(loader))
    return {
        'data_bias': raw_stats,
        'pipeline_bias': loaded_stats - raw_stats,
    }
```
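`calculate_stats` and `concat_batches` are left abstract above; minimal hypothetical implementations (per-feature means over NumPy/PyTorch data) could look like:

```python
import numpy as np
import torch

def calculate_stats(data):
    # Per-feature means; extend with std, quantiles, etc. as needed.
    return np.asarray(data, dtype=np.float64).mean(axis=0)

def concat_batches(loader):
    # Stack every batch's features into one array (labels are dropped).
    return torch.cat([x for x, _ in loader]).numpy()
```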
For Truth-Preserving Augmentation
```python
# Medical imaging guardrails: abort if rotation changes the lesion
# mask's pixel count. Assumes rotate() is a deterministic helper
# (e.g. a fixed 90° turn) so the image and mask stay aligned.
from random import random

class SafeAugment:
    def __call__(self, img, mask):
        if random() < 0.5:
            rotated_mask = rotate(mask)  # rotate once, then verify
            assert mask.sum() == rotated_mask.sum(), "Augmentation destroyed lesions"
            return rotate(img), rotated_mask
        return img, mask
```
Auditor Notes
- Required Attachments:
  - Model architecture diagram highlighting preprocessing layers
  - Side-by-side raw vs augmented samples (minimum 10 examples)
  - Statistical comparison tables of input/output distributions
Sign-off:
| Auditor | ML Engineer | Domain Expert | Date |
|---|---|---|---|
| [Your Name] | [Eng Name] | [MD/PhD Name] | [Date] |
Standards References:
- DICOM PS3.15 (Medical Image Transformations)
- ACL Rolling Review (NLP Preprocessing)
- MIL-STD-2525D (Geospatial Truth Preservation)