Evals

May 19 2025

Evals (evaluations) are how we should be testing AI (and other stochastic systems).

Gen AI has brought stochastic algorithms into the spotlight, but using them in production systems is nothing new.

We've been doing ML in production for over a decade.

There are three main types of evals:

  • Code-based: write code that automatically validates model / AI outputs, e.g. matching a regex or checking for particular keywords (see the first sketch below).
  • Human-based: use human judgement. This is best suited for situations where qualitative aspects of the output need to be graded.
  • LLM-based: ask another AI system to grade the outputs. This can serve as a proxy for human grading and is likely more cost-effective (see the second sketch below).
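
Here is a minimal sketch of a code-based eval. The checks and thresholds are illustrative examples I've made up, not from any particular library:

```python
import re

def eval_output(output: str) -> dict:
    """Code-based eval: deterministic checks against a model output."""
    checks = {
        # Illustrative check: output mentions a date in YYYY-MM-DD format.
        "contains_iso_date": bool(re.search(r"\b\d{4}-\d{2}-\d{2}\b", output)),
        # Illustrative check: output avoids boilerplate refusal phrases.
        "no_forbidden_phrases": not any(
            phrase in output.lower() for phrase in ("as an ai", "i cannot")
        ),
        # Illustrative check: output stays within a length budget.
        "within_length_limit": len(output) <= 500,
    }
    checks["passed"] = all(checks.values())
    return checks

if __name__ == "__main__":
    print(eval_output("The release shipped on 2025-05-19 as planned."))
```

Because these checks are plain code, they are cheap, deterministic, and easy to run on every output.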

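And a minimal sketch of an LLM-based eval, assuming the OpenAI Python SDK is installed; the model name, rubric, and prompt are placeholders rather than recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADER_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) for accuracy and clarity.
Reply with the score only."""

def llm_grade(question: str, answer: str) -> int:
    """LLM-based eval: ask another model to grade an output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(question=question, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    print(llm_grade("What is 2 + 2?", "2 + 2 equals 4."))
```

In practice you would run the grader over a dataset of question/answer pairs and spot-check its scores against human judgement.
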
More coming soon: I will update this page with examples of evals and grading criteria that can be used.