Paid GitHub Issues
[FR] Token budget enforcement for mlflow.genai.evaluate() runs
mlflow/mlflow
Description
<!-- issue-warning -->
> [!WARNING]
> Before submitting a PR, please make sure that:
> - A maintainer has triaged this issue and applied the `ready` label
> - This issue has no assignee
> - No duplicate PR exists
>
> PRs not meeting these requirements may be automatically closed.

### Willingness to contribute

Yes. I can contribute this feature independently.

### Proposal Summary

Add `TokenBudgetEvaluator`, a thread-safe wrapper for `mlflow.genai.evaluate()`'s `predict_fn` that tracks token usage across evaluation rows, enforces a configurable budget ceiling (raise or warn), and exposes the accounting as both `Scorer` results (via `budget_scorers()`) and run-level metrics (via `log_to_mlflow()`). The implementation is complete and tested (35 tests); I'm happy to open a PR once there's alignment on direction.

### Motivation

> #### What is the use case for this feature?

Running `mlflow.genai.evaluate()` against a paid LLM API currently offers no way to cap or track cumulative token spend across the run. A 10k-row eval at ~1k tokens/row is ~10M tokens, which costs ~$300 at $0.03 per 1k tokens. One misconfigured run can exhaust a budget silently, with no warning and no logged record.

> #### Why is this use case valuable to support for MLflow users in general?

Every team running GenAI evaluation against a metered API has this problem. It's not a niche workflow; it's the default cost-control gap in the current eval loop.

> #### Why is this use case valuable to support for your project(s) or organization?

We run recurring LLM evals as part of CI/regression testing. We need a hard ceiling to prevent a bad test run from generating an unexpected bill, plus a logged record of spend per run for cost attribution.

> #### Why is it currently difficult to achieve this use case?

`mlflow.genai.evaluate()`'s `predict_fn` is a plain callable. There is no hook for cumulative usage tracking across rows, no budget enforcement, and no standard way to surface token spend as a scorer or run metric.
Teams currently write one-off wrappers around `predict_fn` that are untested, invisible in the MLflow UI, and not reusable across projects.

### Details

Proposed addition: `mlflow/genai/evaluation/token_budget.py`

```python
import mlflow
from mlflow.genai.evaluation.token_budget import TokenBudgetEvaluator

evaluator = TokenBudgetEvaluator(
    model=lambda question: my_llm(question),
    max_tokens=50_000,
    cost_per_1k_tokens=0.03,  # USD, for cost reporting only
    on_exceed="raise",  # or "warn" to continue past budget
)

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=evaluator,
    scorers=evaluator.budget_scorers(),  # adds token_budget_* scores per row
)

evaluator.log_to_mlflow()  # flush final accounting as run metrics
```

`budget_scorers()` returns three `Scorer` objects (via `mlflow.genai.scorers.scorer`), each reading the evaluator's live, thread-safe accumulator:

- `token_budget_used`
- `token_budget_remaining`
- `token_budget_exceeded`

`log_to_mlflow()` logs run-level metrics:

- `token_budget_used`
- `token_budget_input_tokens`
- `token_budget_output_tokens`
- `token_budget_remaining`
- `token_budget_calls`
- `token_budget_exceeded`
- `token_budget_estimated_cost_usd`

(Note: metric and scorer names use underscores, not slashes.)

The token counter is pluggable: it defaults to a whitespace approximation and accepts tiktoken or any model-specific counter.

Implementation status:

- Complete, ~310 LOC, single file, no new required dependencies
- Thread-safe: an `RLock` guards the accumulator, because `predict_fn` runs in a background thread pool
- Graceful degradation when mlflow is absent
- 35 tests passing (construction, call tracking, budget enforcement, cost computation, thread safety with 20 concurrent workers, scorer live-state reading, metric name validation, edge cases)
- End-to-end integration contract verified via a simulation of `mlflow.genai.evaluate()`'s exact calling convention

Open questions for maintainers:

1. Is `mlflow/genai/evaluation/token_budget.py` the right location, or would you prefer a different submodule (e.g. `mlflow/genai/scorers/`)?
2. Any concerns about the `on_exceed="raise"` default? Would `"warn"` be safer for a feature that can halt a run mid-flight?
3. Should `budget_scorers()` emit a `Feedback` object (with rationale) instead of a primitive float, for richer display in the eval results table?

Related: `mlflow.genai.evaluate()` docs (https://mlflow.org/docs/latest/genai/eval-monitor/), custom scorers docs (https://mlflow.org/docs/latest/genai/eval-monitor/scorers/custom/). No existing issues found on token budget tracking as of 2026-06-30.

### What machine learning domain(s) is this feature request about?

- [x] `domain/genai`: LLMs, Agents, and other GenAI-related use cases
- [ ] `domain/classical-ml`: Traditional machine learning, such as linear regression.
- [ ] `domain/deep-learning`: Deep learning and neural networks.
- [ ] `domain/platform`: MLflow platform foundation, not specific to a particular machine learning domain.

### What area(s) of MLflow is this feature request about?

- [ ] `area/tracking`: Tracking Service, tracking client APIs, autologging
- [ ] `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] `area/scoring`: MLflow model serving, deployment tools, Spark UDFs
- [x] `area/evaluation`: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- [ ] `area/prompt`: MLflow prompt engineering features, prompt templates, and prompt management
- [ ] `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- [ ] `area/gateway`: MLflow AI Gateway client APIs, server, and third-party integrations
- [ ] `area/projects`: MLproject format, project running backends
- [ ] `area/uiux`: Front-end, user experience, plotting
- [ ] `area/docs`: MLflow documentation pages
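
### Appendix: illustrative sketch of the accounting

To make the mechanism described under Details easier to discuss, the core accounting can be sketched roughly as below. This is a simplified illustration, not the ~310 LOC implementation: the names `TokenBudgetSketch`, `BudgetExceededError`, and `summary()` are hypothetical, clamping `token_budget_remaining` at zero is an assumption, and it assumes `predict_fn` receives each row's inputs as keyword arguments. It has no MLflow dependency.

```python
import threading
import warnings


class BudgetExceededError(RuntimeError):
    """Hypothetical exception used when on_exceed="raise"."""


class TokenBudgetSketch:
    """Simplified, hypothetical stand-in for the proposed TokenBudgetEvaluator."""

    def __init__(self, model, max_tokens, cost_per_1k_tokens=0.0,
                 on_exceed="raise", token_counter=None):
        if on_exceed not in ("raise", "warn"):
            raise ValueError("on_exceed must be 'raise' or 'warn'")
        self._model = model
        self.max_tokens = max_tokens
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.on_exceed = on_exceed
        # Default counter is a whitespace approximation; pass tiktoken or a
        # model-specific counter for real accounting.
        self._count = token_counter or (lambda text: len(str(text).split()))
        # Guards the counters; predict_fn may be called from several threads.
        self._lock = threading.RLock()
        self._input_tokens = 0
        self._output_tokens = 0
        self._calls = 0

    def __call__(self, **inputs):
        # Assumption: each row's inputs arrive as keyword arguments.
        output = self._model(**inputs)
        n_in = sum(self._count(value) for value in inputs.values())
        n_out = self._count(output)
        with self._lock:
            self._input_tokens += n_in
            self._output_tokens += n_out
            self._calls += 1
            used = self._input_tokens + self._output_tokens
        # The check runs after accounting, so the call that crosses the
        # ceiling has already been made (and paid for).
        if used > self.max_tokens:
            message = f"Token budget exceeded: {used} > {self.max_tokens}"
            if self.on_exceed == "raise":
                raise BudgetExceededError(message)
            warnings.warn(message)
        return output

    def summary(self):
        """Return the run-level numbers under the metric names listed above."""
        with self._lock:
            used = self._input_tokens + self._output_tokens
            return {
                "token_budget_used": used,
                "token_budget_input_tokens": self._input_tokens,
                "token_budget_output_tokens": self._output_tokens,
                # Assumption: remaining is clamped at zero once over budget.
                "token_budget_remaining": max(self.max_tokens - used, 0),
                "token_budget_calls": self._calls,
                "token_budget_exceeded": float(used > self.max_tokens),
                "token_budget_estimated_cost_usd":
                    used / 1000 * self.cost_per_1k_tokens,
            }
```

With `on_exceed="warn"` the same wrapper emits a `UserWarning` and keeps going, which is the trade-off open question 2 asks about.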