This audit evaluates potential data integrity risks in NumPy-based preprocessing pipelines: type coercion, semantic misuse of ordinal data, missing audit trails, silent value clipping, and integer overflow. The following checks probe for each of these failure modes:
```python
import hashlib

import numpy as np

# Assumes the audited pipeline defines df, input_array, output_array,
# transformation_metadata, saved_checksum, and upper_bound.

if np.issubdtype(df['patient_id'].dtype, np.floating):
    raise ValueError("Numeric IDs coerced to float - may lose precision")

# Heuristic: a non-integer mean signals ordinal data averaged as interval.
if 'Likert_scale' in df.columns and np.mean(df['Likert_scale']) not in [1, 2, 3, 4, 5]:
    print("Warning: Ordinal data treated as interval")

assert 'input_mean' in transformation_metadata, "No record of pre-normalization stats"

# Hash the raw bytes so the digest is independent of memory layout.
assert hashlib.md5(input_array.tobytes()).hexdigest() == saved_checksum, \
    "Input altered before processing"

if np.max(output_array) == upper_bound:
    print("Warning: Values may have been silently clipped")

if input_array.dtype == np.int32 and np.max(input_array) > 2**30:
    print("Warning: Potential int32 overflow risk")
```
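The float-coercion check above matters because float64 represents integers exactly only up to 2**53. A minimal sketch with synthetic IDs (not drawn from the audited pipeline) shows two distinct identifiers collapsing into one after a round trip through float:

```python
import numpy as np

# Two distinct 16-digit IDs; the second lies past 2**53, the largest
# range in which float64 can represent every integer exactly.
ids = np.array([9007199254740992, 9007199254740993], dtype=np.int64)

coerced = ids.astype(np.float64)   # e.g. what a merge introducing NaN can trigger
recovered = coerced.astype(np.int64)

print(ids[0] != ids[1])                # True: the originals differ
print(recovered[0] == recovered[1])    # True: both now map to 9007199254740992
```

This is why the audit raises rather than warns on float-typed ID columns: the damage is irreversible once the cast has happened.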
| Risk | Test Performed | Evidence Location | Result |
|---|---|---|---|
| Semantic unawareness | Ordinal data averaged | normalize.py line 22 | FAIL |
| No audit trail | Missing input checksums | preprocess_data.ipynb | FAIL |
| Silent value clipping | 5% of values at upper bound | scale_features() output | WARNING |
| Type coercion | ID field converted to float64 | Git commit #a1b3d4 | FAIL |
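The 5%-at-bound finding generalizes: rather than checking only the maximum, measure the fraction of values sitting exactly on a bound. A sketch on synthetic data (`scale_features()` itself is not reproduced here; the threshold of 1% is an assumed tolerance, not from the audit):

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(loc=0.0, scale=2.0, size=10_000)
upper_bound = 3.0

scaled = np.clip(raw, -upper_bound, upper_bound)

# With continuous data, any mass sitting exactly on the bound is
# evidence of clipping; here roughly 6-7% of draws exceed 3 sigma/2.
frac_at_bound = np.mean(scaled == upper_bound)
if frac_at_bound > 0.01:
    print(f"Warning: {frac_at_bound:.1%} of values at upper bound")
```

A fraction-based check also catches the case where the single maximum happens to fall below the bound while many interior values were still pinned to it on earlier runs.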
```python
import hashlib
import warnings

import numpy as np

def safe_convert_to_int(arr):
    """Refuse float-to-int casts that would drop fractional parts."""
    if not np.all(np.modf(arr)[0] == 0):
        raise ValueError("Float-to-int conversion would lose precision")
    return arr.astype(np.int64)

def logged_operation(arr, op_name):
    """Record a content hash so inputs can be audited later."""
    print(f"{op_name} - Input hash: {hashlib.sha256(arr.tobytes()).hexdigest()}")
    return arr

def safe_clip(arr, min_val, max_val):
    """Clip, but never silently: report how many values were affected."""
    n_clipped = int(np.sum((arr < min_val) | (arr > max_val)))
    if n_clipped > 0:
        warnings.warn(f"Clipped {n_clipped} values")
    return np.clip(arr, min_val, max_val)
```
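To show how the helpers compose, a minimal end-to-end run on toy data (the helper bodies are repeated so the sketch runs standalone; the toy scores and the 0-5 range are illustrative, not from the audited pipeline):

```python
import hashlib
import warnings

import numpy as np

# Helpers as defined above, repeated so this sketch is self-contained.
def logged_operation(arr, op_name):
    print(f"{op_name} - Input hash: {hashlib.sha256(arr.tobytes()).hexdigest()}")
    return arr

def safe_clip(arr, min_val, max_val):
    n_clipped = int(np.sum((arr < min_val) | (arr > max_val)))
    if n_clipped > 0:
        warnings.warn(f"Clipped {n_clipped} values")
    return np.clip(arr, min_val, max_val)

def safe_convert_to_int(arr):
    if not np.all(np.modf(arr)[0] == 0):
        raise ValueError("Float-to-int conversion would lose precision")
    return arr.astype(np.int64)

# Miniature pipeline: hash the input, clip loudly, then cast only
# because every clipped value is a whole number.
scores = np.array([1.0, 2.0, 7.0, 3.0])
scores = logged_operation(scores, "load_scores")
scores = safe_clip(scores, 0.0, 5.0)   # warns: 1 value clipped
as_int = safe_convert_to_int(scores)
print(as_int)  # [1 2 5 3]
```

Each stage either leaves an audit record (the hash, the warning) or refuses to proceed, which is exactly the property the FAIL rows in the table above were missing.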
Attach:
Sign-off:

| Auditor | Date | Reviewer |
|---|---|---|
| [Your Name] | [Date] | [AI Governance Lead] |