Evaluation: Gemini batch integration (text)#867
Conversation
- Introduced `run_evaluation_batch_submission` task to handle evaluation batch submissions. - Created `start_evaluation_batch_submission` utility to enqueue evaluation tasks. - Updated `start_evaluation` to utilize the new Celery task for batch submissions. - Enhanced error handling in evaluation run creation and processing. - Added support for Google Gemini in evaluation batch processing. - Refactored model configuration to support multiple providers and completion types. - Improved logging for better traceability during evaluation processes. - Updated tests to reflect changes in evaluation batch submission logic.
|
Warning Review limit reached
More reviews will be available in 43 minutes and 53 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds Google AI Studio (Gemini) as a second supported provider for evaluation batch jobs alongside OpenAI. Rewrites ChangesMulti-provider Evaluation Batch (Gemini + Async Celery)
Sequence Diagram(s)sequenceDiagram
participant API as API Route
participant EvalSvc as validate_and_start_batch_evaluation
participant DB as Database
participant CeleryQ as start_evaluation_batch_submission
participant Worker as run_evaluation_batch_submission
participant BatchSvc as execute_evaluation_batch_submission
participant BatchCRUD as start_evaluation_batch
API->>EvalSvc: POST /evaluations (provider, config_id)
EvalSvc->>EvalSvc: check provider in _SUPPORTED_BATCH_PROVIDERS
EvalSvc->>DB: create EvaluationRun (status=pending)
EvalSvc->>CeleryQ: enqueue(project_id, job_id, config_id, trace_id)
CeleryQ-->>EvalSvc: celery task_id
EvalSvc-->>API: return eval_run
Note over Worker,BatchSvc: Async Celery worker picks up task
Worker->>BatchSvc: execute_evaluation_batch_submission(project_id, job_id, ...)
BatchSvc->>DB: fetch EvaluationRun
BatchSvc->>DB: resolve config by UUID
BatchSvc->>BatchCRUD: start_evaluation_batch(langfuse, session, eval_run, params, provider)
BatchCRUD->>BatchCRUD: normalize provider, build JSONL, create batch provider
BatchCRUD->>DB: update eval_run (batch_job_id, status, total_items)
BatchSvc-->>Worker: {"success": true, "batch_job_id": ...}
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
OpenAPI changes ⚪ No API surface changesNote This PR does not modify the API contract.
|
…xtLLMParams class
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
… logic in evaluation batch processing
There was a problem hiding this comment.
🧹 Nitpick comments (3)
backend/app/crud/evaluations/langfuse.py (1)
158-170: 💤 Low valueConsider using WARNING level for rate-limit logs.
Rate-limit responses (429) are expected transient conditions, not errors in the application. Using
logger.errormay cause alert fatigue.logger.warningwould be more appropriate since the code handles this gracefully by continuing to the next item.💡 Suggested change
if getattr(e, "status_code", None) == 429: - logger.error( + logger.warning( f"[create_langfuse_dataset_run] Langfuse rate limit (429) | " f"item_id={item_id}" )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/crud/evaluations/langfuse.py` around lines 158 - 170, In the exception handler of the create_langfuse_dataset_run function, the rate-limit (429) response is being logged at the ERROR level, but this is an expected transient condition that should not trigger alerts. Change the logger.error call in the if block that checks for status_code == 429 to logger.warning instead, since the code handles this gracefully by continuing to the next item.backend/app/tests/api/routes/test_evaluation.py (1)
611-818: 💤 Low valueConsider adding type hints to mock parameters for full guideline compliance.
The coding guideline requires type hints on all function parameters. While the existing test suite doesn't type-hint mock parameters, adding them would align with the guideline. For example:
from unittest.mock import MagicMock def test_example(self, mock_fetch: MagicMock, mock_map: MagicMock) -> None:As per coding guidelines:
**/*.py: Always add type hints to all function parameters and return values in Python🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/tests/api/routes/test_evaluation.py` around lines 611 - 818, All test methods in the TestBatchEvaluationJSONLBuilding class should comply with the coding guideline requiring type hints on all function parameters. Add type hints to the self parameter for all test methods (test_build_batch_jsonl_basic, test_build_batch_jsonl_with_tools, test_build_batch_jsonl_minimal_config, test_build_batch_jsonl_skips_empty_questions, test_build_batch_jsonl_multiple_items, test_build_batch_jsonl_temperature_included_when_explicitly_set, test_build_batch_jsonl_temperature_excluded_when_not_set, and test_build_batch_jsonl_temperature_zero_included_when_explicitly_set). If any of these test methods use mock parameters (injected via pytest fixtures or as function arguments), add appropriate type hints such as MagicMock from unittest.mock to those parameters as well.Source: Coding guidelines
backend/app/services/evaluations/batch_job.py (1)
21-30: ⚡ Quick winAdd type hints to function parameters and return value.
Per coding guidelines, all function parameters and return values need type hints. The
task_instanceparameter and**kwargslack type annotations. Consider usingAnyfortask_instance(matching the Celery task pattern) and aTypedDictor more specific return type.+from typing import Any + def execute_evaluation_batch_submission( project_id: int, job_id: str, task_id: str, - task_instance, + task_instance: Any, organization_id: int, config_id: str, config_version: int, **kwargs, -) -> dict: +) -> dict[str, Any]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/services/evaluations/batch_job.py` around lines 21 - 30, Add missing type hints to the execute_evaluation_batch_submission function. Annotate the task_instance parameter with Any type to match Celery task patterns, add type hints to the **kwargs parameter (also using Any), and consider defining the return type more specifically than just dict by using a TypedDict or a more precise type annotation that reflects what the function actually returns.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@backend/app/crud/evaluations/langfuse.py`:
- Around line 158-170: In the exception handler of the
create_langfuse_dataset_run function, the rate-limit (429) response is being
logged at the ERROR level, but this is an expected transient condition that
should not trigger alerts. Change the logger.error call in the if block that
checks for status_code == 429 to logger.warning instead, since the code handles
this gracefully by continuing to the next item.
In `@backend/app/services/evaluations/batch_job.py`:
- Around line 21-30: Add missing type hints to the
execute_evaluation_batch_submission function. Annotate the task_instance
parameter with Any type to match Celery task patterns, add type hints to the
**kwargs parameter (also using Any), and consider defining the return type more
specifically than just dict by using a TypedDict or a more precise type
annotation that reflects what the function actually returns.
In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 611-818: All test methods in the TestBatchEvaluationJSONLBuilding
class should comply with the coding guideline requiring type hints on all
function parameters. Add type hints to the self parameter for all test methods
(test_build_batch_jsonl_basic, test_build_batch_jsonl_with_tools,
test_build_batch_jsonl_minimal_config,
test_build_batch_jsonl_skips_empty_questions,
test_build_batch_jsonl_multiple_items,
test_build_batch_jsonl_temperature_included_when_explicitly_set,
test_build_batch_jsonl_temperature_excluded_when_not_set, and
test_build_batch_jsonl_temperature_zero_included_when_explicitly_set). If any of
these test methods use mock parameters (injected via pytest fixtures or as
function arguments), add appropriate type hints such as MagicMock from
unittest.mock to those parameters as well.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d78ee05c-da68-4a57-8a2c-a6e4fe2ac00b
📒 Files selected for processing (13)
backend/app/celery/tasks/job_execution.pybackend/app/celery/utils.pybackend/app/core/batch/gemini.pybackend/app/core/batch/operations.pybackend/app/crud/evaluations/batch.pybackend/app/crud/evaluations/langfuse.pybackend/app/crud/evaluations/processing.pybackend/app/services/evaluations/batch_job.pybackend/app/services/evaluations/evaluation.pybackend/app/tests/api/routes/test_evaluation.pybackend/app/tests/crud/evaluations/test_langfuse.pybackend/app/tests/crud/evaluations/test_processing.pybackend/app/tests/services/evaluations/test_evaluation_service_s3.py
Target issue is #879
Summary
Checklist
Before submitting a pull request, please ensure that you mark these task.
fastapi run --reload app/main.pyordocker compose upin the repository root and test.Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Improvements