Code evaluators are functions that take a dataset example and the resulting application output, and return one or more metrics. These functions can be passed directly into the evaluate() or aevaluate() functions.
To define code evaluators in the LangSmith UI, refer to How to define a code evaluator (UI).

Basic example

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)

Evaluator args

Code evaluator functions must have specific argument names. They can take any subset of the following arguments:
  • run: Run: The full Run object generated by the application on the given example.
  • example: Example: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
  • inputs: dict: A dictionary of the inputs corresponding to a single example in a dataset.
  • outputs: dict: A dictionary of the outputs generated by the application on the given inputs.
  • reference_outputs/referenceOutputs: dict: A dictionary of the reference outputs associated with the example, if available.
For most use cases you’ll only need inputs, outputs, and reference_outputs. run and example are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application. When using JS/TS these should all be passed in as part of a single object argument.
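For illustration, here is a minimal sketch (with hypothetical metric names) of evaluators that take different subsets of these arguments:

# A minimal sketch of the accepted argument subsets; metric names are hypothetical.
from langsmith.schemas import Run

def has_answer(outputs: dict) -> bool:
    """Only needs the application output."""
    return bool(outputs.get("answer"))

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    """Compares the output against the example's reference output."""
    return outputs["answer"] == reference_outputs["answer"]

def completed_quickly(run: Run) -> bool:
    """Uses the full Run object for trace info beyond the inputs/outputs."""
    return run.end_time is not None and (run.end_time - run.start_time).total_seconds() < 5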

Evaluator output

Code evaluators are expected to return one of the following types:

Python and JS/TS
  • dict: dicts of the form {"score" | "value": ..., "key": ...} allow you to customize the metric type (“score” for numerical and “value” for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Python only
  • int | float | bool: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
  • str: this is interpreted as a categorical metric. The function name is used as the name of the metric.
  • list[dict]: return multiple metrics using a single function.
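As an illustrative sketch (metric names are hypothetical), the evaluators below use the dict and list[dict] return forms:

# A sketch of the dict and list[dict] return forms; metric names are hypothetical.
def exactness(outputs: dict, reference_outputs: dict) -> dict:
    """Log an integer as a categorical metric named 'exactness'."""
    match = int(outputs["answer"] == reference_outputs["answer"])
    return {"key": "exactness", "value": match}

def answer_stats(outputs: dict) -> list[dict]:
    """Log two metrics from a single evaluator."""
    answer = outputs["answer"]
    return [
        {"key": "answer_length", "score": len(answer)},
        {"key": "mentions_unsure", "value": "yes" if "not sure" in answer else "no"},
    ]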

Additional examples

Requires langsmith>=0.2.0
from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
    """Check if the answer exactly matches the expected answer."""
    return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """
Given the following question, answer, and reasoning, determine if the reasoning for the
answer is logically valid and consistent with the question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct_old_signature, concision, valid_reasoning]
)
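
Because valid_reasoning is an async evaluator, the same evaluation can also be run with aevaluate(). A minimal sketch, assuming the same dataset name and a hypothetical async variant of the application:

import asyncio
from langsmith import aevaluate

async def dummy_app_async(inputs: dict) -> dict:
    # Hypothetical async variant of dummy_app above.
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

async def main():
    return await aevaluate(
        dummy_app_async,
        data="dataset_name",
        evaluators=[correct_old_signature, concision, valid_reasoning],
    )

results = asyncio.run(main())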
