Evals (evaluations) are how we should be testing AI (and other stochastic systems).
Gen AI has brought stochastic algorithms into the spotlight, but using them in production systems is not a new thing.
We've been doing ML in production for over a decade.
There are three main types of evals:
- Code-based: write code that automatically validates model/AI outputs, e.g. with regex checks or by looking for particular keywords (see the sketch after this list).
- Human-based: use human judgement. This is best suited when the output has qualitative aspects that need to be graded.
- LLM-based: ask another AI system to grade the outputs. This can serve as a proxy for human-based grading and is likely more cost-effective.
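As a rough illustration of a code-based eval, here is a minimal sketch in Python. The keyword and regex criteria, the `check_output` and `run_eval` helpers, and the sample outputs are all assumptions made up for this example; in practice you would plug in your own model outputs and pass/fail rules.

```python
import re

def check_output(output: str) -> dict:
    """Code-based eval: validate a single model output with simple rules."""
    checks = {
        # Assumed criterion: the answer must mention a refund.
        "mentions_refund": "refund" in output.lower(),
        # Assumed criterion: the answer must contain an order ID like ORD-12345.
        "has_order_id": bool(re.search(r"\bORD-\d{5}\b", output)),
        # Assumed criterion: the answer must stay under 100 words.
        "under_100_words": len(output.split()) < 100,
    }
    checks["passed"] = all(checks.values())
    return checks

def run_eval(outputs: list[str]) -> float:
    """Run the checks over a batch of outputs and return the pass rate."""
    results = [check_output(o) for o in outputs]
    return sum(r["passed"] for r in results) / len(results)

if __name__ == "__main__":
    sample_outputs = [
        "Your refund for order ORD-12345 has been processed.",
        "I cannot help with that request.",
    ]
    print(f"Pass rate: {run_eval(sample_outputs):.0%}")
```

The same harness shape could carry an LLM-based eval: swap the rule checks inside `check_output` for a call to a grader model with a rubric prompt, and keep the pass-rate aggregation as-is.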
More coming soon; I will update this page with examples of evals and criteria that can be used.