Skip to content

Evaluation: Gemini batch integration (text)#867

Open
vprashrex wants to merge 9 commits into
mainfrom
feat/eval-gemini-integration
Open

Evaluation: Gemini batch integration (text)#867
vprashrex wants to merge 9 commits into
mainfrom
feat/eval-gemini-integration

Conversation

@vprashrex

@vprashrex vprashrex commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Target issue is #879

Summary

  • Moved evaluation batch submission from the API request (sync, blocking) to a Celery worker (async).
  • Added Celery task run_evaluation_batch_submission, enqueue helper start_evaluation_batch_submission, and worker entrypoint execute_evaluation_batch_submission; API now only creates the EvaluationRun and queues the job.
  • Run is marked failed with an error message on timeout or any exception during submission.
  • Added Google Gemini support to evaluation batches (was OpenAI-only).
  • start_evaluation_batch now takes a params dict + provider and branches: OpenAI builds Responses JSONL via build_openai_evaluation_jsonl, Gemini builds contents/systemInstruction/generationConfig JSONL via build_google_evaluation_jsonl.
  • Provider validation widened to accept openai, openai-native, google-aistudio, google-aistudio-native.
  • Result polling picks the provider via _get_batch_provider; parse_evaluation_output now reads Gemini responses (key id, extract_text_from_response_dict, usageMetadata → usage for cost).
  • Moved blocking I/O in polling (download, dataset fetch, langfuse run, embeddings) to asyncio.to_thread.
  • Fixed Gemini batch upload mime type: jsonl → application/jsonl.
  • Added/updated tests for OpenAI + Google JSONL builders, evaluation processing, and Langfuse dataset-run handling.

Checklist

Before submitting a pull request, please ensure that you mark these task.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.
  • If you've fixed a bug or added code that is tested and has test cases.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Google Gemini as a batch evaluation provider
    • Batch evaluations now submit asynchronously via background task queue
  • Bug Fixes

    • Improved error handling for batch job creation failures with proper status tracking
    • Enhanced rate-limit detection in batch item processing
  • Improvements

    • Optimized long-running batch operations with background processing
    • Better cost tracking for Gemini evaluations

- Introduced `run_evaluation_batch_submission` task to handle evaluation batch submissions.
- Created `start_evaluation_batch_submission` utility to enqueue evaluation tasks.
- Updated `start_evaluation` to utilize the new Celery task for batch submissions.
- Enhanced error handling in evaluation run creation and processing.
- Added support for Google Gemini in evaluation batch processing.
- Refactored model configuration to support multiple providers and completion types.
- Improved logging for better traceability during evaluation processes.
- Updated tests to reflect changes in evaluation batch submission logic.
@coderabbitai

coderabbitai Bot commented May 20, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@vprashrex, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 43 minutes and 53 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02f9ffc9-b9a6-4928-bde3-84cee10683db

📥 Commits

Reviewing files that changed from the base of the PR and between 5c2435b and 5ced362.

📒 Files selected for processing (2)
  • backend/app/crud/evaluations/processing.py
  • backend/app/tests/crud/evaluations/test_processing.py
📝 Walkthrough

Walkthrough

Adds Google AI Studio (Gemini) as a second supported provider for evaluation batch jobs alongside OpenAI. Rewrites start_evaluation_batch and JSONL builders in batch.py to dispatch per-provider. Moves batch submission off the request path into a new low-priority Celery task (run_evaluation_batch_submission) backed by a service function (execute_evaluation_batch_submission) with timeout and error handling. Refactors processing.py to support multi-provider output parsing and wraps synchronous I/O in asyncio.to_thread. Fixes Gemini MIME type, batch failure status propagation, and Langfuse 429 rate-limit handling.

Changes

Multi-provider Evaluation Batch (Gemini + Async Celery)

Layer / File(s) Summary
Gemini MIME type, batch failure status, and Langfuse rate-limit handling
backend/app/core/batch/gemini.py, backend/app/core/batch/operations.py, backend/app/crud/evaluations/langfuse.py
Fixes JSONL MIME type to "application/jsonl" for Gemini uploads, adds provider_status="failed" to batch creation error payloads, and adds a special-case log branch for HTTP 429 in Langfuse trace creation.
Multi-provider JSONL builders and start_evaluation_batch rewrite
backend/app/crud/evaluations/batch.py
Removes the OpenAI-only build_evaluation_jsonl and old start_evaluation_batch signature; adds build_openai_evaluation_jsonl, build_google_evaluation_jsonl, and a provider-driven start_evaluation_batch that normalizes the provider string, dispatches to the correct JSONL builder and client, and raises on empty JSONL.
Multi-provider output parsing and asyncio.to_thread refactor
backend/app/crud/evaluations/processing.py
Adds _get_batch_provider, _extract_gemini_usage, and extends parse_evaluation_output with a provider_name parameter for Gemini response parsing. Wraps all long-running synchronous calls in asyncio.to_thread, broadens failure state detection, and adds langfuse.flush() per project.
execute_evaluation_batch_submission Celery service
backend/app/services/evaluations/batch_job.py
New service function that fetches an EvaluationRun, resolves config, calls start_evaluation_batch, and handles gevent/Celery timeouts and general exceptions by marking the run failed and re-raising.
Celery task registration and enqueue helper
backend/app/celery/tasks/job_execution.py, backend/app/celery/utils.py
Registers run_evaluation_batch_submission as a low-priority bound task with gevent timeout, and adds start_evaluation_batch_submission to enqueue it with propagated trace context.
validate_and_start_batch_evaluation: Celery queuing and multi-provider gating
backend/app/services/evaluations/evaluation.py
Introduces _SUPPORTED_BATCH_PROVIDERS, switches provider validation to set membership, replaces synchronous inline batch start with Celery queuing via start_evaluation_batch_submission, and marks the run failed without raising on queue errors.
Tests: JSONL builders, processing, orchestration, service
backend/app/tests/api/routes/test_evaluation.py, backend/app/tests/crud/evaluations/test_processing.py, backend/app/tests/crud/evaluations/test_langfuse.py, backend/app/tests/services/evaluations/test_evaluation_service_s3.py
Refactors OpenAI JSONL tests to use plain dicts; adds Gemini JSONL, processing, and provider-dispatch test suites; adds 429 rate-limit skip test; and adds TestValidateAndStartBatchEvaluation for gating, queue failure, and success.

Sequence Diagram(s)

sequenceDiagram
    participant API as API Route
    participant EvalSvc as validate_and_start_batch_evaluation
    participant DB as Database
    participant CeleryQ as start_evaluation_batch_submission
    participant Worker as run_evaluation_batch_submission
    participant BatchSvc as execute_evaluation_batch_submission
    participant BatchCRUD as start_evaluation_batch

    API->>EvalSvc: POST /evaluations (provider, config_id)
    EvalSvc->>EvalSvc: check provider in _SUPPORTED_BATCH_PROVIDERS
    EvalSvc->>DB: create EvaluationRun (status=pending)
    EvalSvc->>CeleryQ: enqueue(project_id, job_id, config_id, trace_id)
    CeleryQ-->>EvalSvc: celery task_id
    EvalSvc-->>API: return eval_run

    Note over Worker,BatchSvc: Async Celery worker picks up task
    Worker->>BatchSvc: execute_evaluation_batch_submission(project_id, job_id, ...)
    BatchSvc->>DB: fetch EvaluationRun
    BatchSvc->>DB: resolve config by UUID
    BatchSvc->>BatchCRUD: start_evaluation_batch(langfuse, session, eval_run, params, provider)
    BatchCRUD->>BatchCRUD: normalize provider, build JSONL, create batch provider
    BatchCRUD->>DB: update eval_run (batch_job_id, status, total_items)
    BatchSvc-->>Worker: {"success": true, "batch_job_id": ...}
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • ProjectTech4DevAI/kaapi-backend#714: Introduced the same Celery task + start_* helper pattern in job_execution.py and celery/utils.py that this PR extends with run_evaluation_batch_submission and start_evaluation_batch_submission.
  • ProjectTech4DevAI/kaapi-backend#685: Both PRs touch GeminiBatchProvider.upload_file MIME handling in backend/app/core/batch/gemini.py.
  • ProjectTech4DevAI/kaapi-backend#800: Introduced the gevent_timeout decorator consumed by the new run_evaluation_batch_submission Celery task added in this PR.

Suggested labels

enhancement

Suggested reviewers

  • AkhileshNegi
  • Prajna1999

Poem

🐇 Hop hop, the batch queue grows,
Two providers now, the data flows!
Gemini joins the OpenAI crew,
Async tasks dispatch the JSONL queue.
asyncio.to_thread — no blocking in sight,
This bunny submits batches left and right! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding Gemini batch integration to the evaluation system, which is the primary objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/eval-gemini-integration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@vprashrex vprashrex changed the title feat: Add evaluation batch submission functionality Evaluation: Gemini batch integration (text) May 22, 2026
@vprashrex vprashrex linked an issue May 22, 2026 that may be closed by this pull request
@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown

OpenAPI changes   ⚪ No API surface changes

Note

This PR does not modify the API contract.

mainb6c873e1 · generated by oasdiff

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.72897% with 14 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
backend/app/crud/evaluations/processing.py 94.04% 5 Missing ⚠️
backend/app/celery/utils.py 20.00% 4 Missing ⚠️
backend/app/celery/tasks/job_execution.py 50.00% 3 Missing ⚠️
backend/app/crud/evaluations/batch.py 96.49% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@vprashrex vprashrex self-assigned this Jun 17, 2026
@vprashrex vprashrex requested review from AkhileshNegi and Prajna1999 and removed request for AkhileshNegi June 17, 2026 05:32
@vprashrex vprashrex added ready-for-review enhancement New feature or request labels Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
backend/app/crud/evaluations/langfuse.py (1)

158-170: 💤 Low value

Consider using WARNING level for rate-limit logs.

Rate-limit responses (429) are expected transient conditions, not errors in the application. Using logger.error may cause alert fatigue. logger.warning would be more appropriate since the code handles this gracefully by continuing to the next item.

💡 Suggested change
             if getattr(e, "status_code", None) == 429:
-                    logger.error(
+                    logger.warning(
                         f"[create_langfuse_dataset_run] Langfuse rate limit (429) | "
                         f"item_id={item_id}"
                     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/evaluations/langfuse.py` around lines 158 - 170, In the
exception handler of the create_langfuse_dataset_run function, the rate-limit
(429) response is being logged at the ERROR level, but this is an expected
transient condition that should not trigger alerts. Change the logger.error call
in the if block that checks for status_code == 429 to logger.warning instead,
since the code handles this gracefully by continuing to the next item.
backend/app/tests/api/routes/test_evaluation.py (1)

611-818: 💤 Low value

Consider adding type hints to mock parameters for full guideline compliance.

The coding guideline requires type hints on all function parameters. While the existing test suite doesn't type-hint mock parameters, adding them would align with the guideline. For example:

from unittest.mock import MagicMock

def test_example(self, mock_fetch: MagicMock, mock_map: MagicMock) -> None:

As per coding guidelines: **/*.py: Always add type hints to all function parameters and return values in Python

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/api/routes/test_evaluation.py` around lines 611 - 818, All
test methods in the TestBatchEvaluationJSONLBuilding class should comply with
the coding guideline requiring type hints on all function parameters. Add type
hints to the self parameter for all test methods (test_build_batch_jsonl_basic,
test_build_batch_jsonl_with_tools, test_build_batch_jsonl_minimal_config,
test_build_batch_jsonl_skips_empty_questions,
test_build_batch_jsonl_multiple_items,
test_build_batch_jsonl_temperature_included_when_explicitly_set,
test_build_batch_jsonl_temperature_excluded_when_not_set, and
test_build_batch_jsonl_temperature_zero_included_when_explicitly_set). If any of
these test methods use mock parameters (injected via pytest fixtures or as
function arguments), add appropriate type hints such as MagicMock from
unittest.mock to those parameters as well.

Source: Coding guidelines

backend/app/services/evaluations/batch_job.py (1)

21-30: ⚡ Quick win

Add type hints to function parameters and return value.

Per coding guidelines, all function parameters and return values need type hints. The task_instance parameter and **kwargs lack type annotations. Consider using Any for task_instance (matching the Celery task pattern) and a TypedDict or more specific return type.

+from typing import Any
+
 def execute_evaluation_batch_submission(
     project_id: int,
     job_id: str,
     task_id: str,
-    task_instance,
+    task_instance: Any,
     organization_id: int,
     config_id: str,
     config_version: int,
     **kwargs,
-) -> dict:
+) -> dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/evaluations/batch_job.py` around lines 21 - 30, Add
missing type hints to the execute_evaluation_batch_submission function. Annotate
the task_instance parameter with Any type to match Celery task patterns, add
type hints to the **kwargs parameter (also using Any), and consider defining the
return type more specifically than just dict by using a TypedDict or a more
precise type annotation that reflects what the function actually returns.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@backend/app/crud/evaluations/langfuse.py`:
- Around line 158-170: In the exception handler of the
create_langfuse_dataset_run function, the rate-limit (429) response is being
logged at the ERROR level, but this is an expected transient condition that
should not trigger alerts. Change the logger.error call in the if block that
checks for status_code == 429 to logger.warning instead, since the code handles
this gracefully by continuing to the next item.

In `@backend/app/services/evaluations/batch_job.py`:
- Around line 21-30: Add missing type hints to the
execute_evaluation_batch_submission function. Annotate the task_instance
parameter with Any type to match Celery task patterns, add type hints to the
**kwargs parameter (also using Any), and consider defining the return type more
specifically than just dict by using a TypedDict or a more precise type
annotation that reflects what the function actually returns.

In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 611-818: All test methods in the TestBatchEvaluationJSONLBuilding
class should comply with the coding guideline requiring type hints on all
function parameters. Add type hints to the self parameter for all test methods
(test_build_batch_jsonl_basic, test_build_batch_jsonl_with_tools,
test_build_batch_jsonl_minimal_config,
test_build_batch_jsonl_skips_empty_questions,
test_build_batch_jsonl_multiple_items,
test_build_batch_jsonl_temperature_included_when_explicitly_set,
test_build_batch_jsonl_temperature_excluded_when_not_set, and
test_build_batch_jsonl_temperature_zero_included_when_explicitly_set). If any of
these test methods use mock parameters (injected via pytest fixtures or as
function arguments), add appropriate type hints such as MagicMock from
unittest.mock to those parameters as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d78ee05c-da68-4a57-8a2c-a6e4fe2ac00b

📥 Commits

Reviewing files that changed from the base of the PR and between fbe5e56 and 5c2435b.

📒 Files selected for processing (13)
  • backend/app/celery/tasks/job_execution.py
  • backend/app/celery/utils.py
  • backend/app/core/batch/gemini.py
  • backend/app/core/batch/operations.py
  • backend/app/crud/evaluations/batch.py
  • backend/app/crud/evaluations/langfuse.py
  • backend/app/crud/evaluations/processing.py
  • backend/app/services/evaluations/batch_job.py
  • backend/app/services/evaluations/evaluation.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/crud/evaluations/test_langfuse.py
  • backend/app/tests/crud/evaluations/test_processing.py
  • backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: Add Gemini model support

1 participant