tf.data / PyTorch DataLoader Preprocessing Audit Workpaper

Audit Focus Area

Evaluating risks in deep learning data pipeline implementations:

- Preprocessing entangled with model graphs or training loops
- Bias introduced or masked by the loading pipeline
- Augmentations that distort ground truth

Evidence Collection Methodology

1. Baked-in Preprocessing

| Risk Indicator | Investigation Method | Code Pattern Examples |
|---|---|---|
| Lambda layers in model | Inspect `model.summary()` / `print(model)` | `tf.keras.layers.Lambda(lambda x: x/255)` |
| Opaque `map()` calls | Audit `dataset.map()` operations | `.map(preprocess_fn, num_parallel_calls=4)` |
| Training-loop transforms | Check for inline tensor ops | `batch[:, 0:1] *= 100` in training step |
```python
import torch

# Detect entangled preprocessing (TensorFlow/Keras):
# Lambda layers hardcode preprocessing into the model graph
if any('lambda' in layer.name for layer in model.layers):
    print("Warning: Preprocessing baked into model architecture")

# PyTorch DataLoader inspection (assumes dataset.transform is a Compose)
for transform in train_loader.dataset.transform.transforms:
    if isinstance(transform, torch.nn.Module):
        print("Warning: NN-based transform in data pipeline")
```

2. Bias Confounding

| Bias Type | Detection Approach | Diagnostic Tools |
|---|---|---|
| Data sampling | Check `WeightedRandomSampler` usage | Class distribution before/after loading |
| Feature bias | Compare raw vs. preprocessed statistics | `pandas_profiling` comparison |
| Label leakage | Inspect augmentation transforms | Test whether `transform(x)` affects `y` |
```python
# Compare raw vs. preprocessed feature distributions.
# A mean shift beyond 3 standard deviations of the raw feature is a
# crude but effective red flag for pipeline-introduced skew.
raw_mean = raw_data[:, 0].mean()
loader_mean = next(iter(train_loader))[0][:, 0].mean()

if abs(raw_mean - loader_mean) > 3 * raw_data[:, 0].std():
    print(f"Significant distribution shift: {raw_mean:.3f} → {loader_mean:.3f}")
```

3. Augmentation Distortion

| Distortion Risk | Augmentation Examples | Validation Method |
|---|---|---|
| Truth-altering | Random erasing of lesions | Medical image QA checks |
| Label corruption | Rotation of oriented objects | Bounding box verification |
| Signal destruction | Aggressive audio noise | Waveform similarity index |
```python
# Validate that augmentations preserve essential truth.
# medical_qa_check is a domain-specific validator supplied by the
# audit team, not a library function.
original_img, label = raw_dataset[0]
augmented_img = train_loader.dataset.transform(original_img)

if not medical_qa_check(original_img, augmented_img):
    print("Augmentation violates diagnostic truth criteria")
```

Workpaper Template

Entangled Preprocessing Findings

| Location | Preprocessing Type | Entanglement Level | Severity |
|---|---|---|---|
| Model Layer 1 | Normalization (Lambda) | Hardcoded in graph | Critical |
| `dataset.map()` | Text tokenization | Opaque parallel ops | High |
| Training Step | Channel swapping | Inline tensor ops | Medium |

Bias Confounding Findings

| Bias Source | Preprocessing Stage | Impact Metric | Confounding Factor |
|---|---|---|---|
| Sampler weights | DataLoader init | Class F1 variance | ±8% from raw data |
| Normalization | TFRecord decoding | Feature skew | 2.3σ shift |
| Crop padding | TorchVision transform | Edge artifact rate | 12% false positives |

Augmentation Distortion Findings

| Augmentation | Domain | Truth Alteration | QA Failure Rate |
|---|---|---|---|
| RandomErasing | Medical Imaging | 5% lesion removal | 18/200 cases |
| ColorJitter | Satellite | NDVI corruption | 32% spectral shift |
| SpeedPerturb | Audio | Phoneme distortion | 0.4 WER increase |

Key Risks

- Entangled preprocessing: transformations hardcoded into model graphs, `dataset.map()` calls, or training loops cannot be independently audited, versioned, or reproduced at serving time.
- Bias confounding: skew introduced by the loading pipeline (sampling, normalization, cropping) becomes indistinguishable from bias already present in the raw data.
- Augmentation distortion: transforms that alter ground truth (erased lesions, rotated oriented objects, destroyed audio signal) silently corrupt labels.

Recommendations

For Disentanglement

```python
import tensorflow as tf
from torchvision.transforms.functional import adjust_brightness

# PyTorch best practice: transforms that record the parameters they applied
class AuditableTransform:
    def __call__(self, x):
        self.last_params = {'operation': 'brightness', 'factor': 0.2}
        return adjust_brightness(x, 0.2)

# TensorFlow solution: make the preprocessor its own tf.Module and export
# it as a SavedModel, versioned separately from the trained model
class Preprocessor(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None, None, None, 3], tf.float32)])
    def __call__(self, x):
        return x / 255.0

tf.saved_model.save(Preprocessor(), 'preprocessor')
```
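
At serving or audit time the versioned preprocessor can then be reloaded and applied independently of the model; a sketch using the standard `tf.saved_model.load` API, where `raw_batch` and `model` stand in for the serving input and trained model:

```python
preprocessor = tf.saved_model.load('preprocessor')
inputs = preprocessor(raw_batch)   # identical transform at train and serve time
predictions = model(inputs)
```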

For Bias Isolation

```python
# Bias attribution framework: separate bias present in the raw data
# from bias added by the loading pipeline
def compare_biases(raw_data, loader):
    raw_stats = calculate_stats(raw_data)
    loaded_stats = calculate_stats(concat_batches(loader))
    return {
        'data_bias': raw_stats,
        'pipeline_bias': loaded_stats - raw_stats,  # pipeline's contribution
    }
```
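
The `calculate_stats` and `concat_batches` helpers above are left abstract. A minimal sketch of one possible implementation, assuming batches of `(features, labels)` tensors and per-feature means as the statistic:

```python
import torch

def concat_batches(loader):
    """Drain the DataLoader and stack all feature batches."""
    return torch.cat([features for features, _ in loader], dim=0)

def calculate_stats(data):
    """Per-feature means, returned as a tensor so stats can be subtracted."""
    if not torch.is_tensor(data):
        data = torch.as_tensor(data)
    return data.float().mean(dim=0)
```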

For Truth-Preserving Augmentation

```python
from random import random
import torch

# Medical imaging guardrails: reject any augmentation that changes the
# lesion mask's content. A 90° rotation (torch.rot90) is exactly lossless,
# so the mask sum must be preserved bit-for-bit.
class SafeAugment:
    def __call__(self, img, mask):
        if random() < 0.5:
            rotated_mask = torch.rot90(mask, k=1, dims=(-2, -1))
            assert mask.sum() == rotated_mask.sum(), "Augmentation destroyed lesions"
            return torch.rot90(img, k=1, dims=(-2, -1)), rotated_mask
        return img, mask
```
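
Wiring the guard into a paired image/mask dataset is then a one-line change inside `__getitem__` (sketch):

```python
safe_aug = SafeAugment()
img, mask = safe_aug(img, mask)   # raises if lesion pixels would be lost
```

Because the check runs on every sample, a truth-violating transform fails fast during training instead of silently corrupting labels.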

Auditor Notes

Sign-off:

| Auditor | ML Engineer | Domain Expert | Date |
|---|---|---|---|
| [Your Name] | [Eng Name] | [MD/PhD Name] | [Date] |

Standards References: