Custom Scores via API/SDKs

Langfuse lets you ingest custom scores via the Langfuse SDKs or API. If scores must follow a specific schema, you can define a score configuration (config) in the Langfuse UI or via the API and reference it when ingesting scores. Custom scores let you run quality checks on the output of your workflows at runtime, or build custom human evaluation workflows.

Example use cases:

  • Deterministic rules at runtime: e.g. check if output contains a certain keyword, adheres to a specified structure/format or if the output is longer than a certain length.
  • Custom internal workflow tooling: build custom internal tooling that helps you manage human-in-the-loop workflows. Ingest scores back into Langfuse, optionally following your custom schema by referencing a config.
  • Automated data pipeline: continuously monitor the quality by fetching traces from Langfuse, running custom evaluations, and ingesting scores back into Langfuse.
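
As a rough illustration of the first use case, deterministic checks can be computed in plain Python before they are sent to Langfuse. This is a minimal sketch: the function name `rule_based_scores`, the score names and the thresholds are hypothetical and not part of Langfuse.

```python
import json


def rule_based_scores(output: str, keyword: str, max_chars: int) -> dict[str, int]:
    """Return 0/1 values for three deterministic checks on a model output."""

    def is_json(text: str) -> bool:
        # Structure check: does the output parse as JSON?
        try:
            json.loads(text)
            return True
        except ValueError:
            return False

    return {
        "contains_keyword": int(keyword.lower() in output.lower()),
        "is_valid_json": int(is_json(output)),
        "within_length": int(len(output) <= max_chars),
    }


scores = rule_based_scores('{"answer": "Paris"}', keyword="paris", max_chars=200)
```

Each entry in `scores` could then be ingested as its own score, with the dictionary key as the score name and the 0/1 result as the value.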

How to add scores

You can add scores via the Langfuse SDKs or API. When ingesting scores, you can define them to be of data type numeric, categorical or boolean. To ensure that your scores follow a specific schema, define a score config in the Langfuse UI or via the API.

When ingesting scores, you may provide:

  • Value: required, of type string or float. Please provide string values for categorical scores and float values for numeric and boolean scores.
  • Data Type: optional, of type NUMERIC, CATEGORICAL or BOOLEAN
  • Config Id: optional, of type string

Some score properties are inferred from your input. If you don't provide a score data type, it is always inferred; see the tables below for details. For boolean and categorical scores, Langfuse provides the score value in both numerical and string format where possible. The representation that was not provided as input, i.e. the translated value, is referred to as the inferred value in the tables below. On read, boolean scores return both the numerical and the string representation of the score value, e.g. both 1 and True. For categorical scores, the string representation is always provided, and a numerical mapping of the category is produced only if a score config was provided.

Score ingestion: provide value only

| Value | Data Type | Config Id | Description | Inferred Data Type | Inferred Value representation | Valid |
|---|---|---|---|---|---|---|
| 1 | Null | Null | Data type is inferred | NUMERIC | Null | Yes |
| depth | Null | Null | Data type is inferred | CATEGORICAL | Null | Yes |

Score ingestion: provide data type and no config

| Value | Data Type | Config Id | Description | Inferred Value representation | Valid |
|---|---|---|---|---|---|
| 1 | NUMERIC | Null | No properties inferred | Null | Yes |
| depth | CATEGORICAL | Null | No properties inferred | Null | Yes |
| 1 | BOOLEAN | Null | String value representation inferred | True | Yes |
| depth | NUMERIC | Null | Error: data type of value does not match provided data type | | No |
| 1 | CATEGORICAL | Null | Error: data type of value does not match provided data type | | No |
| true | BOOLEAN | Null | Error: boolean data type expects numeric input value | | No |
| 3 | BOOLEAN | Null | Error: boolean data type expects 0 or 1 as input value | | No |
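
The rules in the two tables above (no config provided) can be sketched as a small function. This is only an illustration of the documented behavior, not Langfuse's implementation: the names `infer_data_type` and `check_score` are hypothetical, and returning "False" for a boolean value of 0 is an assumption, since the tables only show the 1 to True case.

```python
from typing import Optional, Union

ScoreValue = Union[str, float, int]


def infer_data_type(value: ScoreValue) -> str:
    """Strings are treated as categorical, numbers as numeric."""
    return "CATEGORICAL" if isinstance(value, str) else "NUMERIC"


def check_score(value: ScoreValue, data_type: Optional[str] = None) -> dict:
    """Return the data type and inferred string value, or raise ValueError."""
    if data_type is None:
        # Value only: the data type is inferred, no value translation happens.
        return {"data_type": infer_data_type(value), "inferred_value": None}
    if data_type == "BOOLEAN":
        if isinstance(value, str):
            raise ValueError("boolean data type expects numeric input value")
        if value not in (0, 1):
            raise ValueError("boolean data type expects 0 or 1 as input value")
        # Assumption: 0 translates to "False", mirroring 1 -> "True".
        return {"data_type": "BOOLEAN", "inferred_value": "True" if value == 1 else "False"}
    if data_type != infer_data_type(value):
        raise ValueError("data type of value does not match provided data type")
    return {"data_type": data_type, "inferred_value": None}
```

For example, `check_score(1, "BOOLEAN")` yields the inferred string value "True", while `check_score(3, "BOOLEAN")` raises an error, matching the last row of the table.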

Score ingestion: provide config

Whenever you provide a config, the score data will be validated against the config. The following rules apply:

  • Score Name: Must equal the config's name
  • Score Data Type: When provided, must match the config's data type
  • Score Value: Must match the config's data type and be within the config's value range:
    • Numeric: Value must be within the min and max values defined in the config (if provided, min and max are optional and otherwise are assumed as -∞ and +∞ respectively)
    • Categorical: Value must map to one of the categories defined in the config
    • Boolean: Value must equal 0 or 1

| Value | Data Type | Config Id | Description | Inferred Data Type | Inferred Value representation | Valid |
|---|---|---|---|---|---|---|
| 1 | Null | 78545 | Data type inferred | NUMERIC | Null | Conditional on config validation |
| depth | Null | 12345 | Data type and numeric value inferred | CATEGORICAL | 4 (numeric config category mapping) | Conditional on config validation |
| 1 | NUMERIC | 78545 | No properties inferred | | Null | Conditional on config validation |
| depth | CATEGORICAL | 12345 | Numeric value inferred | | 4 (numeric config category mapping) | Conditional on config validation |
| 1 | BOOLEAN | 93547 | String value inferred | | True | Conditional on config validation |
| depth | NUMERIC | 78545 | Error: data type of value does not match provided data type | | | No |
| 1 | CATEGORICAL | 12345 | Error: data type of value does not match provided data type | | | No |
| true | BOOLEAN | 93547 | Error: boolean data type expects numeric input value | | | No |

from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment
langfuse = Langfuse()

# `message` stands for your own object carrying the trace and generation ids
langfuse.score(
    trace_id=message.trace_id,
    observation_id=message.generation_id,  # optional
    name="depth",
    value="Good",  # required, can be passed as string or float
    comment="Factually correct",  # optional
    id="unique_id",  # optional, can be used as an idempotency key to update the score subsequently
    config_id="12345-6565-3453654-43543",  # optional, to ensure that the score follows a specific schema
    data_type="CATEGORICAL",  # optional, possibly inferred
)

→ More details in Python SDK docs

Creating Score Config object in Langfuse

A score config includes the desired score name, data type, and constraints on the score value range, such as min and max values for numerical data types and custom categories for categorical data types. See the API reference for more details on the POST and GET score config endpoints. Configs ensure that scores comply with a specific schema, which standardizes them for future analysis.

| Attribute | Type | Description |
|---|---|---|
| id | string | Unique identifier of the score config. |
| name | string | Name of the score config, e.g. user_feedback, hallucination_eval |
| dataType | string | Can be either NUMERIC, CATEGORICAL or BOOLEAN |
| isArchived | boolean | Whether the score config is archived. Defaults to false |
| minValue | number | Optional: Sets minimum value for numerical scores. If not set, the minimum value defaults to -∞ |
| maxValue | number | Optional: Sets maximum value for numerical scores. If not set, the maximum value defaults to +∞ |
| categories | list | Optional: Defines categories for categorical scores. List of objects with label value pairs |
| description | string | Optional: Provides further description of the score configuration |
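
To make the config validation rules concrete, the sketch below models a config locally and checks a score against it. It is a simplified illustration, not Langfuse's implementation: the `ScoreConfig` dataclass and `validate_against_config` are hypothetical names, categories are flattened into a label-to-value dictionary rather than a list of objects, and the "Bad"/"Good" categories are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class ScoreConfig:
    """Simplified, hypothetical mirror of the attributes listed above."""
    id: str
    name: str
    data_type: str  # "NUMERIC", "CATEGORICAL" or "BOOLEAN"
    min_value: float = float("-inf")  # unset minimum is treated as -inf
    max_value: float = float("inf")   # unset maximum is treated as +inf
    categories: dict[str, float] = field(default_factory=dict)  # label -> value


def validate_against_config(
    name: str,
    value: Union[str, float],
    config: ScoreConfig,
    data_type: Optional[str] = None,
) -> Union[str, float, None]:
    """Raise ValueError on a rule violation, else return the inferred counterpart value."""
    if name != config.name:
        raise ValueError("score name must equal the config's name")
    if data_type is not None and data_type != config.data_type:
        raise ValueError("score data type must match the config's data type")
    if config.data_type == "CATEGORICAL":
        if value not in config.categories:
            raise ValueError("value must map to one of the config's categories")
        return config.categories[value]  # numeric mapping of the category
    if isinstance(value, str):
        raise ValueError("numeric and boolean configs expect a numeric value")
    if config.data_type == "BOOLEAN":
        if value not in (0, 1):
            raise ValueError("boolean value must equal 0 or 1")
        return "True" if value == 1 else "False"
    if not config.min_value <= value <= config.max_value:
        raise ValueError("value must be within the config's min and max")
    return None  # nothing is inferred for numeric scores


depth = ScoreConfig(id="12345", name="depth", data_type="CATEGORICAL",
                    categories={"Bad": 0, "Good": 1})
mapped = validate_against_config("depth", "Good", depth)
```

Here `mapped` holds the numeric value of the "Good" category, which corresponds to the inferred numeric representation that a categorical score receives when a config is provided.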

Data pipeline example

You can run custom evaluations on data in Langfuse by fetching traces from Langfuse (e.g. via the Python SDK) and then adding evaluation results as scores back to the traces in Langfuse.
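
A pipeline of this kind might look roughly like the sketch below. It makes several assumptions: the client is expected to offer a `fetch_traces(limit=...)` method returning an object with a `.data` list, which depends on the SDK version you use, and a `score(...)` method accepting the keyword arguments shown earlier on this page. `score_traces` and `StubClient` are hypothetical names; the stub only exists so the example runs without a Langfuse connection.

```python
from types import SimpleNamespace
from typing import Callable


def score_traces(client, evaluate: Callable[[object], float], name: str, limit: int = 50) -> int:
    """Fetch traces, evaluate each one, and write the result back as a score."""
    traces = client.fetch_traces(limit=limit).data  # assumed SDK call, check your SDK version
    for trace in traces:
        client.score(trace_id=trace.id, name=name, value=evaluate(trace))
    return len(traces)


class StubClient:
    """Stand-in for the Langfuse client so the sketch runs offline."""

    def __init__(self, traces):
        self._traces = traces
        self.scores = []

    def fetch_traces(self, limit):
        return SimpleNamespace(data=self._traces[:limit])

    def score(self, **kwargs):
        self.scores.append(kwargs)


stub = StubClient([
    SimpleNamespace(id="t1", output="short"),
    SimpleNamespace(id="t2", output="x" * 500),
])
# Custom evaluation: 1.0 if the trace output is at most 100 characters, else 0.0.
n = score_traces(stub, lambda t: float(len(t.output) <= 100), name="within_length")
```

In a real pipeline you would pass an initialized Langfuse client instead of `stub` and run the function on a schedule, so that newly created traces are scored continuously.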
