Functional Quality
"Does it work?"
This is the pillar closest to traditional QE, but it requires a fundamental shift in thinking. With AI, "works" isn't binary. You're not checking whether outputs match expected values; you're evaluating whether outputs are fit for purpose.
A language model that produces grammatically correct but misleading text has "worked" in one sense and failed in another. A recommendation engine that technically functions but consistently makes poor suggestions isn't providing value. Functional quality for AI means evaluating outputs against real-world fitness, not just technical correctness.
What this looks like in practice
- Evaluation rubrics that define "good enough" for your specific use case, with clear criteria for human reviewers (a minimal rubric sketch follows this list)
- Representative test sets that cover edge cases, adversarial inputs, and realistic variation, not just happy paths
- Consistency testing to understand how much outputs vary and whether that variation is acceptable (see the consistency sketch below)
- Baseline comparisons to measure whether the AI actually improves on non-AI alternatives (see the baseline sketch below)
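To make "good enough" concrete, here is a minimal sketch of a rubric-based scoring harness. The criteria, weights, and pass threshold are hypothetical examples, not a standard; define your own against your use case.

```python
# A minimal rubric-scoring sketch. All criteria, weights, and the
# PASS_THRESHOLD are illustrative assumptions, not prescribed values.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str         # what the reviewer is judging
    description: str  # guidance shown to the human reviewer
    weight: float     # relative importance of this criterion

RUBRIC = [
    Criterion("accuracy", "Output is factually correct and not misleading", 0.5),
    Criterion("relevance", "Output addresses the user's actual request", 0.3),
    Criterion("tone", "Output is clear and appropriately worded", 0.2),
]

PASS_THRESHOLD = 0.8  # weighted score at or above this counts as fit for purpose

def weighted_score(reviewer_scores: dict[str, float]) -> float:
    """Combine per-criterion reviewer scores (each 0.0-1.0) into one number."""
    return sum(c.weight * reviewer_scores[c.name] for c in RUBRIC)

def fit_for_purpose(reviewer_scores: dict[str, float]) -> bool:
    return weighted_score(reviewer_scores) >= PASS_THRESHOLD

# Example: one reviewer's scores for a single model output.
scores = {"accuracy": 1.0, "relevance": 0.5, "tone": 1.0}
print(weighted_score(scores), fit_for_purpose(scores))  # 0.85 True
```

The point of encoding the rubric is less the arithmetic than the discipline: reviewers score against the same written criteria, and "fit for purpose" becomes a number you can track across model versions.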
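Consistency testing can be as simple as calling the system repeatedly on the same input and measuring how much outputs vary. In this sketch, `generate` is a stand-in for your model call, and word-set Jaccard similarity is one illustrative measure of variation among many.

```python
# A consistency-testing sketch: repeated runs on one prompt, mean pairwise
# similarity as the variation metric. `generate` is a placeholder.
from itertools import combinations

def generate(prompt: str) -> str:
    """Placeholder for the real (nondeterministic) model call."""
    raise NotImplementedError

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over word sets: 1.0 means identical vocabulary."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency(prompt: str, n_runs: int = 10) -> float:
    """Mean pairwise similarity across repeated runs on the same prompt."""
    outputs = [generate(prompt) for _ in range(n_runs)]
    pairs = list(combinations(outputs, 2))
    return sum(token_overlap(a, b) for a, b in pairs) / len(pairs)

# Flag prompts whose outputs drift more than your use case tolerates;
# the 0.7 threshold here is an arbitrary example.
# if consistency("Summarize this refund policy...") < 0.7:
#     print("Output variation exceeds acceptable threshold")
```

What counts as acceptable variation depends on the task: paraphrase-level drift may be fine for a writing assistant and unacceptable for a policy summarizer.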
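For baseline comparisons, score the AI system and a simpler non-AI alternative on the same labeled test set. The functions and test cases below are hypothetical placeholders for your own system and data.

```python
# A baseline-comparison sketch. `ai_recommend`, `rule_based_recommend`,
# and TEST_CASES are hypothetical stand-ins.

def ai_recommend(user: str) -> str:
    """Placeholder for the AI system under test."""
    raise NotImplementedError

def rule_based_recommend(user: str) -> str:
    """Placeholder for the non-AI baseline, e.g. a popularity heuristic."""
    raise NotImplementedError

# Each case pairs an input with the outputs reviewers judged acceptable.
# Include edge cases and adversarial inputs alongside happy paths.
TEST_CASES = [
    ("new user, no history", {"starter-bundle", "top-seller"}),
    ("power user, niche tastes", {"niche-pick-a", "niche-pick-b"}),
]

def pass_rate(system, cases) -> float:
    """Fraction of cases where the system's output is judged acceptable."""
    hits = sum(1 for inp, acceptable in cases if system(inp) in acceptable)
    return hits / len(cases)

# ai = pass_rate(ai_recommend, TEST_CASES)
# baseline = pass_rate(rule_based_recommend, TEST_CASES)
# print(f"AI: {ai:.0%}  Baseline: {baseline:.0%}  Lift: {ai - baseline:+.0%}")
```

If the AI can't beat a cheap heuristic on a representative test set, it isn't providing value, no matter how impressive its outputs look in isolation.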
Standards Alignment
Maps to the NIST AI RMF MEASURE function (MEASURE 2.1-2.4) and to ISO/IEC 42001 requirements for AI system performance evaluation and validation.