Beyond benchmark pass rates
Many agent evals ask only whether a task eventually passes. Production teams need richer questions: Did the solution match the intended requirement? Were the tests meaningful? Did the agent introduce hidden risk? Could reviewers trust the completion claim?
FeelGoot turns those questions into a structured, evidence-oriented evaluation model.
Agent score dimensions
Intent match: the diff addresses the right requirement.
Evidence strength: tests and checks prove real behavior.
Shortcut risk: the diff relies on suspicious mocks, stubs, skipped tests, hardcoded values, or incomplete code paths.
Human review leverage: the agent leaves a useful trail for reviewers.
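As a rough illustration only, the four dimensions above could be recorded per agent run as a small data structure, with a naive scan for shortcut signals in a diff. Everything here is an assumption made for the sketch and is not FeelGoot's actual API: the names AgentScore and shortcut_signals, the 0.0 to 1.0 scale, and the regex patterns are all hypothetical.

```python
import re
from dataclasses import dataclass, fields

# Hypothetical patterns for shortcut signals; illustrative, not FeelGoot's rules.
SHORTCUT_PATTERNS = {
    "mock": re.compile(r"mock", re.IGNORECASE),
    "skip": re.compile(r"@pytest\.mark\.skip|\bskipTest\b|\.skip\("),
    "stub": re.compile(r"\bNotImplementedError\b|\bTODO\b"),
}


def shortcut_signals(diff_text: str) -> dict[str, int]:
    """Count added diff lines that match each shortcut pattern."""
    counts = {name: 0 for name in SHORTCUT_PATTERNS}
    for line in diff_text.splitlines():
        # Look only at added lines and ignore the '+++' file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SHORTCUT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


@dataclass(frozen=True)
class AgentScore:
    """One score per dimension, on an assumed 0.0 to 1.0 scale."""

    intent_match: float        # the diff addresses the right requirement
    evidence_strength: float   # tests and checks prove real behavior
    shortcut_risk: float       # higher means more suspicious shortcuts
    review_leverage: float     # usefulness of the trail left for reviewers

    def __post_init__(self) -> None:
        # Reject values outside the assumed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")
```

Keeping the dimensions as separate fields, rather than collapsing them into one number, lets a reviewer see which question a run failed on. A pattern count like this is only a prompt for human inspection: a mock in a unit test is often legitimate.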
Who uses this
Teams choosing between coding agents, platform teams building internal agents, engineering leaders measuring adoption risk, and high-assurance teams evaluating whether agent-created work can enter production workflows.