This audit evaluates potential data integrity risks in NumPy-based preprocessing pipelines: type coercion, semantic misuse of ordinal data, missing audit trails, silent value clipping, and integer overflow. The following checks probe for each of these failure modes:
```python
import hashlib

import numpy as np

# Assumes the audited pipeline defines df, input_array, output_array,
# transformation_metadata, saved_checksum, and upper_bound.

if np.issubdtype(df['patient_id'].dtype, np.floating):
    raise ValueError("Numeric IDs coerced to float - may lose precision")

# Heuristic: a non-integer mean signals ordinal data averaged as interval.
if 'Likert_scale' in df.columns and np.mean(df['Likert_scale']) not in [1, 2, 3, 4, 5]:
    print("Warning: Ordinal data treated as interval")

assert 'input_mean' in transformation_metadata, "No record of pre-normalization stats"

# Hash the raw bytes so the digest is independent of memory layout.
assert hashlib.md5(input_array.tobytes()).hexdigest() == saved_checksum, \
    "Input altered before processing"

if np.max(output_array) == upper_bound:
    print("Warning: Values may have been silently clipped")

if input_array.dtype == np.int32 and np.max(input_array) > 2**30:
    print("Warning: Potential int32 overflow risk")
```
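The float-coercion check above matters because float64 represents integers exactly only up to 2**53. A minimal sketch with synthetic IDs (not drawn from the audited pipeline) shows two distinct identifiers collapsing into one after a round trip through float:

```python
import numpy as np

# Two distinct 16-digit IDs; the second lies past 2**53, the largest
# range in which float64 can represent every integer exactly.
ids = np.array([9007199254740992, 9007199254740993], dtype=np.int64)

coerced = ids.astype(np.float64)   # e.g. what a merge introducing NaN can trigger
recovered = coerced.astype(np.int64)

print(ids[0] != ids[1])                # True: the originals differ
print(recovered[0] == recovered[1])    # True: both now map to 9007199254740992
```

This is why the audit raises rather than warns on float-typed ID columns: the damage is irreversible once the cast has happened.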
| Risk | Test Performed | Evidence Location | Result |
|---|---|---|---|
| Semantic unawareness | Ordinal data averaged | normalize.py line 22 | FAIL |
| No audit trail | Missing input checksums | preprocess_data.ipynb | FAIL |
| Silent value clipping | 5% of values at upper bound | scale_features() output | WARNING |
| Type coercion | ID field converted to float64 | Git commit #a1b3d4 | FAIL |
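The 5%-at-bound finding generalizes: rather than checking only the maximum, measure the fraction of values sitting exactly on a bound. A sketch on synthetic data (`scale_features()` itself is not reproduced here; the threshold of 1% is an assumed tolerance, not from the audit):

```python
import numpy as np

rng = np.random.default_rng(42)
raw = rng.normal(loc=0.0, scale=2.0, size=10_000)
upper_bound = 3.0

scaled = np.clip(raw, -upper_bound, upper_bound)

# With continuous data, any mass sitting exactly on the bound is
# evidence of clipping; here roughly 6-7% of draws exceed 3 sigma/2.
frac_at_bound = np.mean(scaled == upper_bound)
if frac_at_bound > 0.01:
    print(f"Warning: {frac_at_bound:.1%} of values at upper bound")
```

A fraction-based check also catches the case where the single maximum happens to fall below the bound while many interior values were still pinned to it on earlier runs.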
```python
import hashlib
import warnings

import numpy as np

def safe_convert_to_int(arr):
    """Refuse float-to-int casts that would drop fractional parts."""
    if not np.all(np.modf(arr)[0] == 0):
        raise ValueError("Float-to-int conversion would lose precision")
    return arr.astype(np.int64)

def logged_operation(arr, op_name):
    """Record a content hash so inputs can be audited later."""
    print(f"{op_name} - Input hash: {hashlib.sha256(arr.tobytes()).hexdigest()}")
    return arr

def safe_clip(arr, min_val, max_val):
    """Clip, but never silently: report how many values were affected."""
    n_clipped = int(np.sum((arr < min_val) | (arr > max_val)))
    if n_clipped > 0:
        warnings.warn(f"Clipped {n_clipped} values")
    return np.clip(arr, min_val, max_val)
```
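To show how the helpers compose, a minimal end-to-end run on toy data (the helper bodies are repeated so the sketch runs standalone; the toy scores and the 0-5 range are illustrative, not from the audited pipeline):

```python
import hashlib
import warnings

import numpy as np

# Helpers as defined above, repeated so this sketch is self-contained.
def logged_operation(arr, op_name):
    print(f"{op_name} - Input hash: {hashlib.sha256(arr.tobytes()).hexdigest()}")
    return arr

def safe_clip(arr, min_val, max_val):
    n_clipped = int(np.sum((arr < min_val) | (arr > max_val)))
    if n_clipped > 0:
        warnings.warn(f"Clipped {n_clipped} values")
    return np.clip(arr, min_val, max_val)

def safe_convert_to_int(arr):
    if not np.all(np.modf(arr)[0] == 0):
        raise ValueError("Float-to-int conversion would lose precision")
    return arr.astype(np.int64)

# Miniature pipeline: hash the input, clip loudly, then cast only
# because every clipped value is a whole number.
scores = np.array([1.0, 2.0, 7.0, 3.0])
scores = logged_operation(scores, "load_scores")
scores = safe_clip(scores, 0.0, 5.0)   # warns: 1 value clipped
as_int = safe_convert_to_int(scores)
print(as_int)  # [1 2 5 3]
```

Each stage either leaves an audit record (the hash, the warning) or refuses to proceed, which is exactly the property the FAIL rows in the table above were missing.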
Attach:
Sign-off:

| Auditor | Date | Reviewer |
|---|---|---|
| [Your Name] | [Date] | [AI Governance Lead] |