Model-based Evaluations in Langfuse
Model-based evaluations are a powerful tool to automate the evaluation of LLM applications integrated with Langfuse. With model-based evaluations, an LLM is used to score a specific session, trace, or LLM call in Langfuse on criteria such as correctness, toxicity, or hallucinations.
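For illustration, here is a minimal sketch of what writing such an evaluation result back to Langfuse can look like. It assumes the v2 Python SDK's `Langfuse` client and its `score` method (method names differ across SDK versions); the trace id, score name, value, and comment are placeholders.

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST from environment variables
langfuse = Langfuse()

# Attach a model-based evaluation result to an existing trace.
# Name, value, and comment are hypothetical; the value is typically numeric (e.g. 0-1).
langfuse.score(
    trace_id="my-trace-id",  # id of the trace you want to evaluate
    name="hallucination",    # evaluation criterion
    value=0.0,               # score produced by the evaluator LLM
    comment="No unsupported claims found in the answer.",
)

langfuse.flush()  # ensure the score is sent before the script exits
```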
Via Python SDK
You can run model-based evals on data in Langfuse via the Python SDK. This gives you full flexibility to run various eval libraries on your production data and discover which work well for your use case; a sketch of this workflow follows the list below. Popular libraries include:
- OpenAI Evals
- Langchain Evaluators (Cookbook)
- RAGAS for RAG applications (Cookbook)
- UpTrain evals (Cookbook)
- Whylabs Langkit
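As referenced above, a minimal sketch of this workflow: fetch recent production traces from Langfuse, grade each one with an LLM-based evaluator (here LangChain's criteria evaluator as an example), and write the result back as a score. It assumes the v2 Python SDK (`fetch_traces`, `score`) and `langchain.evaluation.load_evaluator`; adapt method names, the criterion, the model, and the trace filters to your SDK/library versions and data.

```python
from langfuse import Langfuse
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

langfuse = Langfuse()

# LLM-based evaluator grading outputs on a single criterion (here: conciseness)
evaluator = load_evaluator(
    "criteria",
    criteria="conciseness",
    llm=ChatOpenAI(model="gpt-4o-mini"),  # assumed evaluator model, swap for your own
)

# Fetch a batch of recent traces; filter by name, user, timestamp, etc. as needed
traces = langfuse.fetch_traces(limit=50).data

for trace in traces:
    # Skip traces without input/output to evaluate
    if trace.input is None or trace.output is None:
        continue

    result = evaluator.evaluate_strings(
        prediction=str(trace.output),
        input=str(trace.input),
    )

    # Write the evaluation back to the trace as a score
    langfuse.score(
        trace_id=trace.id,
        name="conciseness",
        value=result["score"],          # 0 or 1 for criteria evaluators
        comment=result.get("reasoning"),
    )

langfuse.flush()
```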
Via Langfuse UI
Coming soon: a Langfuse evaluation service to run model-based evals directly from the Langfuse UI/Server. Ping us if you are interested in joining the beta.