MLflow is a FOSS MLOps platform for logging model experiments, tracking training and evaluation metrics, and managing model lifecycles.
MLflow Evaluations
MLflow provides functionality for evaluating LLM workflows through its evaluation module.
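In its simplest mode, mlflow.evaluate can score a table of pre-computed outputs against ground truth without logging a model at all. A minimal sketch (the column names and answer text here are illustrative, not from the example below):

import mlflow
import pandas as pd

# Evaluate a static dataset: "outputs" holds pre-computed model answers.
static_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "outputs": ["MLflow is an open-source MLOps platform."],
        "ground_truth": ["MLflow is an open-source platform for managing the ML lifecycle."],
    }
)

results = mlflow.evaluate(
    data=static_data,
    predictions="outputs",
    targets="ground_truth",
    model_type="question-answering",
)
print(results.metrics)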
MLflow LLM Evaluations and Ollama
Here are some code snippets for evaluating Ollama models, based on the examples in the MLflow docs. You'll need to make sure you have Ollama installed and running:
First run:
export OPENAI_API_KEY="dummy"
export OPENAI_BASE_URL="http://localhost:11434/v1"
ollama pull llama3.2:latest
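To confirm that Ollama is serving its OpenAI-compatible API, you can list the available models with the OpenAI client (a quick sanity-check sketch; Ollama ignores the API key value, and llama3.2:latest should appear after the pull):

from openai import OpenAI

# Point the client at Ollama's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="dummy")
print([m.id for m in client.models.list().data])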
In Python:
import mlflow
import openai
import pandas as pd

# Enable auto-tracing for OpenAI-compatible calls (Ollama in this case)
mlflow.openai.autolog()

# Optional: set a tracking URI and an experiment
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("Ollama")

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, allowing data scientists and engineers to track, manage, and deploy machine learning projects from development to production. It provides a suite of tools, including experimentation management, model versioning, deployment, and monitoring, to streamline the machine learning workflow and ensure reproducibility and scalability.",
            "Apache Spark is an open-source unified analytics engine for large-scale data processing, providing high-performance computing capabilities for a wide range of data types and applications. It offers various functionalities, such as batch and streaming data processing, machine learning algorithms, and graph processing, making it a popular choice for big data analytics, data warehousing, and other industries.",
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"

    # Wrap "llama3.2:latest" as an MLflow model.
    logged_model_info = mlflow.openai.log_model(
        model="llama3.2:latest",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    # Evaluation results for each data record are available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")
RAG Example
Using uv:
uv add langchain langchain-community langchain-ollama beautifulsoup4 chromadb
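To give a sense of how these packages fit together, here is a minimal sketch of a LangChain + Ollama + Chroma pipeline (the URL, model choice, and chunking parameters are illustrative, not part of the original example):

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load a web page and split it into chunks.
docs = WebBaseLoader("https://mlflow.org/docs/latest/index.html").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# Index the chunks in Chroma using Ollama embeddings
# (a dedicated embedding model such as nomic-embed-text may work better).
vectorstore = Chroma.from_documents(chunks, embedding=OllamaEmbeddings(model="llama3.2:latest"))
retriever = vectorstore.as_retriever()

# Answer a question using the retrieved context.
llm = ChatOllama(model="llama3.2:latest")
question = "What is MLflow?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)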