-
Notifications
You must be signed in to change notification settings - Fork 8
feat(evaluator): Evidence handles, ATIF traces and trace normalisation #432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,19 +3,22 @@ | |
|
|
||
| """Reference metrics-over-evidence for this example (not SDK API). | ||
|
|
||
| These show how to score from the SDK's filesystem evidence handle instead of a | ||
| stamped verifier reward: | ||
| These show how to score from the SDK's evidence handles instead of a stamped | ||
| verifier reward: | ||
|
|
||
| * :class:`TestsPassMetric` runs a command against ``final_state`` filesystem | ||
| evidence (in a throwaway overlay) and scores on exit 0. | ||
| * :class:`NoTestCheatingMetric` diffs ``initial_state`` against ``final_state`` | ||
| and fails if the agent touched protected (e.g. test) paths. | ||
| * :class:`InefficientRetryLoopMetric` reads the normalized ``trace`` and fails | ||
| when the same tool call repeats past a threshold. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| from collections.abc import Sequence | ||
|
|
||
| from nemo_evaluator_sdk.agent_eval.trials import EVIDENCE_FINAL_STATE, EVIDENCE_INITIAL_STATE, EVIDENCE_TRACE | ||
| from nemo_evaluator_sdk.metrics.protocol import MetricInput, MetricOutput, MetricOutputSpec, MetricResult | ||
|
|
||
|
|
||
|
|
@@ -26,7 +29,7 @@ def __init__( | |
| self, | ||
| command: Sequence[str], | ||
| *, | ||
| evidence_name: str = "final_state", | ||
| evidence_name: str = EVIDENCE_FINAL_STATE, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. would it make sense to ad hint for well known literals |
||
| cwd: str = ".", | ||
| timeout_s: float = 300.0, | ||
| ) -> None: | ||
|
|
@@ -60,8 +63,8 @@ def __init__( | |
| *, | ||
| protected: Sequence[str] = ("tests/",), | ||
| change_types: Sequence[str] = ("added", "modified", "deleted"), | ||
| initial_name: str = "initial_state", | ||
| final_name: str = "final_state", | ||
| initial_name: str = EVIDENCE_INITIAL_STATE, | ||
| final_name: str = EVIDENCE_FINAL_STATE, | ||
|
Comment on lines
+66
to
+67
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question, do you think these need to be injected here? When would the initial/final states have a different name? |
||
| ) -> None: | ||
| self._protected = tuple(protected) | ||
| self._change_types = set(change_types) | ||
|
|
@@ -87,3 +90,38 @@ async def compute_scores(self, input: MetricInput) -> MetricResult: | |
| ] | ||
| clean = not violations | ||
| return MetricResult(outputs=[MetricOutput(name="no_test_cheating", value=clean)]) | ||
|
|
||
|
|
||
| class InefficientRetryLoopMetric: | ||
| """Score ``False`` when the same tool call repeats more than ``threshold`` times.""" | ||
|
|
||
| def __init__(self, *, threshold: int = 2, evidence_name: str = EVIDENCE_TRACE) -> None: | ||
| self._threshold = threshold | ||
| self._evidence_name = evidence_name | ||
|
|
||
| @property | ||
| def type(self) -> str: | ||
| return "inefficient_retry_loop" | ||
|
|
||
| def output_spec(self) -> list[MetricOutputSpec]: | ||
| return [ | ||
| MetricOutputSpec.boolean("efficient_tool_use"), | ||
| MetricOutputSpec.discrete_score("max_repeated_tool_calls"), | ||
| ] | ||
|
|
||
| async def compute_scores(self, input: MetricInput) -> MetricResult: | ||
| max_repeats = 0 | ||
| evidence = input.candidate.evidence | ||
| if evidence is not None and evidence.get(self._evidence_name) is not None: | ||
| calls = await (await evidence.trace(self._evidence_name)).tool_calls() | ||
|
Comment on lines
+115
to
+116
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why would the user choose to get different evidence types in the metric? It seems like this always relies on it being |
||
| counts: dict[str, int] = {} | ||
| for call in calls: | ||
| key = f"{call.function_name}:{sorted((call.arguments or {}).items())}" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do we sort call arguments? wouldn't it change the order of args which is important? |
||
| counts[key] = counts.get(key, 0) + 1 | ||
| max_repeats = max(counts.values(), default=0) | ||
| return MetricResult( | ||
| outputs=[ | ||
| MetricOutput(name="efficient_tool_use", value=max_repeats <= self._threshold), | ||
| MetricOutput(name="max_repeated_tool_calls", value=max_repeats), | ||
| ] | ||
| ) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -44,7 +44,7 @@ | |
| from nemo_evaluator_sdk.metrics.protocol import Metric, validate_metric_result | ||
| from nemo_evaluator_sdk.metrics.utils import metric_type_name | ||
| from nemo_evaluator_sdk.values import Agent, Model, RunConfig, RunConfigOnline, RunConfigOnlineModel | ||
| from nemo_evaluator_sdk.values.evidence import CandidateEvidence, EvidenceDescriptor | ||
| from nemo_evaluator_sdk.values.evidence import CandidateEvidence, EvidenceDescriptor, normalize_trace_descriptor | ||
| from openai import AsyncOpenAI | ||
|
|
||
| log = getLogger(__name__) | ||
|
|
@@ -327,7 +327,9 @@ def _trial_from_sample(task: AgentEvalTask, target: Model | Agent, sample: dict[ | |
| # trial stays scorable instead of being dropped as empty output. | ||
| output_text = _reasoning_content_fallback(sample.get("response")) | ||
| if "trajectory" in sample: | ||
| trace = EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"]) | ||
| # Normalize to ATIF before the trial is persisted so the stored shape is | ||
| # source-agnostic (sources in, ATIF out); TraceHandle then reads it uniformly. | ||
| trace = normalize_trace_descriptor(EvidenceDescriptor(kind="trace", format="json", data=sample["trajectory"])) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this is the right approach tbh. I think persisting the raw trace and convert on metric usage is the right approach. The problem with converting before persisting is that if there's a bug in our conversion code or we don't handle the source format fully, the conversion may be lossy. At least if we convert on read, users can improve the conversion function without loss. |
||
| else: | ||
| trace = EvidenceDescriptor(kind="sdk_online_generation", data={"task_id": task.id, "target": target.name}) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Count retry streaks, not global frequency.
Lines 118-121 count identical calls across the entire trace. That flags legitimate reuse separated by other work as a retry loop, and the current key still splits semantically identical nested args when dict insertion order differs. Track the longest consecutive run from a canonicalized call payload instead.
Suggested fix
Also applies to: 112-125
🤖 Prompt for AI Agents