Model-based Evaluations in Langfuse
Model-based evaluations are a powerful tool to automate the evaluation of LLM applications integrated with Langfuse. With model-based evaluations, an LLM is used to score a specific session, trace, or LLM call in Langfuse on criteria such as correctness, toxicity, or hallucinations.
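For illustration, here is a minimal sketch of what writing such an evaluation result back to Langfuse can look like. It assumes the v2 Python SDK's `Langfuse` client and its `score` method (method names differ across SDK versions); the trace id, score name, value, and comment are placeholders.

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST from environment variables
langfuse = Langfuse()

# Attach a model-based evaluation result to an existing trace.
# Name, value, and comment are hypothetical; the value is typically numeric (e.g. 0-1).
langfuse.score(
    trace_id="my-trace-id",  # id of the trace you want to evaluate
    name="hallucination",    # evaluation criterion
    value=0.0,               # score produced by the evaluator LLM
    comment="No unsupported claims found in the answer.",
)

langfuse.flush()  # ensure the score is sent before the script exits
```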
Via Python SDK
You can run model-based evals on data in Langfuse via the Python SDK. This gives you full flexibility to run various eval libraries on your production data and discover which work well for your use case; a sketch of this workflow follows the list below. Popular libraries include:
- OpenAI Evals
- Langchain Evaluators (Cookbook)
- RAGAS for RAG applications (Cookbook)
- UpTrain evals (Cookbook)
- Whylabs Langkit
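As referenced above, a minimal sketch of this workflow: fetch recent production traces from Langfuse, grade each one with an LLM-based evaluator (here LangChain's criteria evaluator as an example), and write the result back as a score. It assumes the v2 Python SDK (`fetch_traces`, `score`) and `langchain.evaluation.load_evaluator`; adapt method names, the criterion, the model, and the trace filters to your SDK/library versions and data.

```python
from langfuse import Langfuse
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

langfuse = Langfuse()

# LLM-based evaluator grading outputs on a single criterion (here: conciseness)
evaluator = load_evaluator(
    "criteria",
    criteria="conciseness",
    llm=ChatOpenAI(model="gpt-4o-mini"),  # assumed evaluator model, swap for your own
)

# Fetch a batch of recent traces; filter by name, user, timestamp, etc. as needed
traces = langfuse.fetch_traces(limit=50).data

for trace in traces:
    # Skip traces without input/output to evaluate
    if trace.input is None or trace.output is None:
        continue

    result = evaluator.evaluate_strings(
        prediction=str(trace.output),
        input=str(trace.input),
    )

    # Write the evaluation back to the trace as a score
    langfuse.score(
        trace_id=trace.id,
        name="conciseness",
        value=result["score"],          # 0 or 1 for criteria evaluators
        comment=result.get("reasoning"),
    )

langfuse.flush()
```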
Via Langfuse UI
Coming soon: a Langfuse evaluation service to run model-based evals directly from the Langfuse UI/Server. Ping us if you are interested in joining the beta.