Evaluation: Gemini batch integration (text) by vprashrex · Pull Request #867 · ProjectTech4DevAI/kaapi-backend

vprashrex · 2026-05-20T09:57:11Z

Target issue is #879

Summary

Moved evaluation batch submission from the API request (sync, blocking) to a Celery worker (async).
Added Celery task run_evaluation_batch_submission, enqueue helper start_evaluation_batch_submission, and worker entrypoint execute_evaluation_batch_submission; API now only creates the EvaluationRun and queues the job.
Run is marked failed with an error message on timeout or any exception during submission.
Added Google Gemini support to evaluation batches (was OpenAI-only).
start_evaluation_batch now takes a params dict + provider and branches: OpenAI builds Responses JSONL via build_openai_evaluation_jsonl, Gemini builds contents/systemInstruction/generationConfig JSONL via build_google_evaluation_jsonl.
Provider validation widened to accept openai, openai-native, google-aistudio, google-aistudio-native.
Result polling picks the provider via _get_batch_provider; parse_evaluation_output now reads Gemini responses (key id, extract_text_from_response_dict, usageMetadata → usage for cost).
Moved blocking I/O in polling (download, dataset fetch, langfuse run, embeddings) to asyncio.to_thread.
Fixed Gemini batch upload mime type: jsonl → application/jsonl.
Added/updated tests for OpenAI + Google JSONL builders, evaluation processing, and Langfuse dataset-run handling.

Checklist

Before submitting a pull request, please ensure that you mark these task.

Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.
If you've fixed a bug or added code that is tested and has test cases.

Summary by CodeRabbit

Release Notes

New Features
- Added support for Google Gemini as a batch evaluation provider
- Batch evaluations now submit asynchronously via background task queue
Bug Fixes
- Improved error handling for batch job creation failures with proper status tracking
- Enhanced rate-limit detection in batch item processing
Improvements
- Optimized long-running batch operations with background processing
- Better cost tracking for Gemini evaluations

- Introduced `run_evaluation_batch_submission` task to handle evaluation batch submissions. - Created `start_evaluation_batch_submission` utility to enqueue evaluation tasks. - Updated `start_evaluation` to utilize the new Celery task for batch submissions. - Enhanced error handling in evaluation run creation and processing. - Added support for Google Gemini in evaluation batch processing. - Refactored model configuration to support multiple providers and completion types. - Improved logging for better traceability during evaluation processes. - Updated tests to reflect changes in evaluation batch submission logic.

coderabbitai · 2026-05-20T09:57:18Z

Warning

Review limit reached

@vprashrex, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 43 minutes and 53 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02f9ffc9-b9a6-4928-bde3-84cee10683db

📥 Commits

Reviewing files that changed from the base of the PR and between 5c2435b and 5ced362.

📒 Files selected for processing (2)

backend/app/crud/evaluations/processing.py
backend/app/tests/crud/evaluations/test_processing.py

📝 Walkthrough

Walkthrough

Adds Google AI Studio (Gemini) as a second supported provider for evaluation batch jobs alongside OpenAI. Rewrites start_evaluation_batch and JSONL builders in batch.py to dispatch per-provider. Moves batch submission off the request path into a new low-priority Celery task (run_evaluation_batch_submission) backed by a service function (execute_evaluation_batch_submission) with timeout and error handling. Refactors processing.py to support multi-provider output parsing and wraps synchronous I/O in asyncio.to_thread. Fixes Gemini MIME type, batch failure status propagation, and Langfuse 429 rate-limit handling.

Changes

Multi-provider Evaluation Batch (Gemini + Async Celery)

Layer / File(s)	Summary
Gemini MIME type, batch failure status, and Langfuse rate-limit handling `backend/app/core/batch/gemini.py`, `backend/app/core/batch/operations.py`, `backend/app/crud/evaluations/langfuse.py`	Fixes JSONL MIME type to `"application/jsonl"` for Gemini uploads, adds `provider_status="failed"` to batch creation error payloads, and adds a special-case log branch for HTTP 429 in Langfuse trace creation.
Multi-provider JSONL builders and `start_evaluation_batch` rewrite `backend/app/crud/evaluations/batch.py`	Removes the OpenAI-only `build_evaluation_jsonl` and old `start_evaluation_batch` signature; adds `build_openai_evaluation_jsonl`, `build_google_evaluation_jsonl`, and a provider-driven `start_evaluation_batch` that normalizes the provider string, dispatches to the correct JSONL builder and client, and raises on empty JSONL.
Multi-provider output parsing and `asyncio.to_thread` refactor `backend/app/crud/evaluations/processing.py`	Adds `_get_batch_provider`, `_extract_gemini_usage`, and extends `parse_evaluation_output` with a `provider_name` parameter for Gemini response parsing. Wraps all long-running synchronous calls in `asyncio.to_thread`, broadens failure state detection, and adds `langfuse.flush()` per project.
`execute_evaluation_batch_submission` Celery service `backend/app/services/evaluations/batch_job.py`	New service function that fetches an EvaluationRun, resolves config, calls `start_evaluation_batch`, and handles gevent/Celery timeouts and general exceptions by marking the run failed and re-raising.
Celery task registration and enqueue helper `backend/app/celery/tasks/job_execution.py`, `backend/app/celery/utils.py`	Registers `run_evaluation_batch_submission` as a low-priority bound task with gevent timeout, and adds `start_evaluation_batch_submission` to enqueue it with propagated trace context.
`validate_and_start_batch_evaluation`: Celery queuing and multi-provider gating `backend/app/services/evaluations/evaluation.py`	Introduces `_SUPPORTED_BATCH_PROVIDERS`, switches provider validation to set membership, replaces synchronous inline batch start with Celery queuing via `start_evaluation_batch_submission`, and marks the run failed without raising on queue errors.
Tests: JSONL builders, processing, orchestration, service `backend/app/tests/api/routes/test_evaluation.py`, `backend/app/tests/crud/evaluations/test_processing.py`, `backend/app/tests/crud/evaluations/test_langfuse.py`, `backend/app/tests/services/evaluations/test_evaluation_service_s3.py`	Refactors OpenAI JSONL tests to use plain dicts; adds Gemini JSONL, processing, and provider-dispatch test suites; adds 429 rate-limit skip test; and adds `TestValidateAndStartBatchEvaluation` for gating, queue failure, and success.

Sequence Diagram(s)

sequenceDiagram
    participant API as API Route
    participant EvalSvc as validate_and_start_batch_evaluation
    participant DB as Database
    participant CeleryQ as start_evaluation_batch_submission
    participant Worker as run_evaluation_batch_submission
    participant BatchSvc as execute_evaluation_batch_submission
    participant BatchCRUD as start_evaluation_batch

    API->>EvalSvc: POST /evaluations (provider, config_id)
    EvalSvc->>EvalSvc: check provider in _SUPPORTED_BATCH_PROVIDERS
    EvalSvc->>DB: create EvaluationRun (status=pending)
    EvalSvc->>CeleryQ: enqueue(project_id, job_id, config_id, trace_id)
    CeleryQ-->>EvalSvc: celery task_id
    EvalSvc-->>API: return eval_run

    Note over Worker,BatchSvc: Async Celery worker picks up task
    Worker->>BatchSvc: execute_evaluation_batch_submission(project_id, job_id, ...)
    BatchSvc->>DB: fetch EvaluationRun
    BatchSvc->>DB: resolve config by UUID
    BatchSvc->>BatchCRUD: start_evaluation_batch(langfuse, session, eval_run, params, provider)
    BatchCRUD->>BatchCRUD: normalize provider, build JSONL, create batch provider
    BatchCRUD->>DB: update eval_run (batch_job_id, status, total_items)
    BatchSvc-->>Worker: {"success": true, "batch_job_id": ...}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

ProjectTech4DevAI/kaapi-backend#714: Introduced the same Celery task + start_* helper pattern in job_execution.py and celery/utils.py that this PR extends with run_evaluation_batch_submission and start_evaluation_batch_submission.
ProjectTech4DevAI/kaapi-backend#685: Both PRs touch GeminiBatchProvider.upload_file MIME handling in backend/app/core/batch/gemini.py.
ProjectTech4DevAI/kaapi-backend#800: Introduced the gevent_timeout decorator consumed by the new run_evaluation_batch_submission Celery task added in this PR.

Suggested labels

enhancement

Suggested reviewers

AkhileshNegi
Prajna1999

Poem

🐇 Hop hop, the batch queue grows,
Two providers now, the data flows!
Gemini joins the OpenAI crew,
Async tasks dispatch the JSONL queue.
asyncio.to_thread — no blocking in sight,
This bunny submits batches left and right! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding Gemini batch integration to the evaluation system, which is the primary objective of this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/eval-gemini-integration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

… polling

github-actions · 2026-06-15T06:28:36Z

OpenAPI changes ⚪ No API surface changes

Note

This PR does not modify the API contract.

_{main ↔ b6c873e1 · generated by oasdiff}

…xtLLMParams class

codecov · 2026-06-15T08:57:37Z

Codecov Report

❌ Patch coverage is 96.72897% with 14 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
backend/app/crud/evaluations/processing.py	94.04%	5 Missing ⚠️
backend/app/celery/utils.py	20.00%	4 Missing ⚠️
backend/app/celery/tasks/job_execution.py	50.00%	3 Missing ⚠️
backend/app/crud/evaluations/batch.py	96.49%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

… logic in evaluation batch processing

…un handling

coderabbitai

🧹 Nitpick comments (3)

backend/app/crud/evaluations/langfuse.py (1)

158-170: 💤 Low value

Consider using WARNING level for rate-limit logs.

Rate-limit responses (429) are expected transient conditions, not errors in the application. Using logger.error may cause alert fatigue. logger.warning would be more appropriate since the code handles this gracefully by continuing to the next item.

💡 Suggested change

             if getattr(e, "status_code", None) == 429:
-                    logger.error(
+                    logger.warning(
                         f"[create_langfuse_dataset_run] Langfuse rate limit (429) | "
                         f"item_id={item_id}"
                     )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/evaluations/langfuse.py` around lines 158 - 170, In the
exception handler of the create_langfuse_dataset_run function, the rate-limit
(429) response is being logged at the ERROR level, but this is an expected
transient condition that should not trigger alerts. Change the logger.error call
in the if block that checks for status_code == 429 to logger.warning instead,
since the code handles this gracefully by continuing to the next item.

backend/app/tests/api/routes/test_evaluation.py (1)

611-818: 💤 Low value

Consider adding type hints to mock parameters for full guideline compliance.

The coding guideline requires type hints on all function parameters. While the existing test suite doesn't type-hint mock parameters, adding them would align with the guideline. For example:

from unittest.mock import MagicMock

def test_example(self, mock_fetch: MagicMock, mock_map: MagicMock) -> None:

As per coding guidelines: **/*.py: Always add type hints to all function parameters and return values in Python

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/api/routes/test_evaluation.py` around lines 611 - 818, All
test methods in the TestBatchEvaluationJSONLBuilding class should comply with
the coding guideline requiring type hints on all function parameters. Add type
hints to the self parameter for all test methods (test_build_batch_jsonl_basic,
test_build_batch_jsonl_with_tools, test_build_batch_jsonl_minimal_config,
test_build_batch_jsonl_skips_empty_questions,
test_build_batch_jsonl_multiple_items,
test_build_batch_jsonl_temperature_included_when_explicitly_set,
test_build_batch_jsonl_temperature_excluded_when_not_set, and
test_build_batch_jsonl_temperature_zero_included_when_explicitly_set). If any of
these test methods use mock parameters (injected via pytest fixtures or as
function arguments), add appropriate type hints such as MagicMock from
unittest.mock to those parameters as well.

Source: Coding guidelines

backend/app/services/evaluations/batch_job.py (1)

21-30: ⚡ Quick win

Add type hints to function parameters and return value.

Per coding guidelines, all function parameters and return values need type hints. The task_instance parameter and **kwargs lack type annotations. Consider using Any for task_instance (matching the Celery task pattern) and a TypedDict or more specific return type.

+from typing import Any
+
 def execute_evaluation_batch_submission(
     project_id: int,
     job_id: str,
     task_id: str,
-    task_instance,
+    task_instance: Any,
     organization_id: int,
     config_id: str,
     config_version: int,
     **kwargs,
-) -> dict:
+) -> dict[str, Any]:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/evaluations/batch_job.py` around lines 21 - 30, Add
missing type hints to the execute_evaluation_batch_submission function. Annotate
the task_instance parameter with Any type to match Celery task patterns, add
type hints to the **kwargs parameter (also using Any), and consider defining the
return type more specifically than just dict by using a TypedDict or a more
precise type annotation that reflects what the function actually returns.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@backend/app/crud/evaluations/langfuse.py`:
- Around line 158-170: In the exception handler of the
create_langfuse_dataset_run function, the rate-limit (429) response is being
logged at the ERROR level, but this is an expected transient condition that
should not trigger alerts. Change the logger.error call in the if block that
checks for status_code == 429 to logger.warning instead, since the code handles
this gracefully by continuing to the next item.

In `@backend/app/services/evaluations/batch_job.py`:
- Around line 21-30: Add missing type hints to the
execute_evaluation_batch_submission function. Annotate the task_instance
parameter with Any type to match Celery task patterns, add type hints to the
**kwargs parameter (also using Any), and consider defining the return type more
specifically than just dict by using a TypedDict or a more precise type
annotation that reflects what the function actually returns.

In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 611-818: All test methods in the TestBatchEvaluationJSONLBuilding
class should comply with the coding guideline requiring type hints on all
function parameters. Add type hints to the self parameter for all test methods
(test_build_batch_jsonl_basic, test_build_batch_jsonl_with_tools,
test_build_batch_jsonl_minimal_config,
test_build_batch_jsonl_skips_empty_questions,
test_build_batch_jsonl_multiple_items,
test_build_batch_jsonl_temperature_included_when_explicitly_set,
test_build_batch_jsonl_temperature_excluded_when_not_set, and
test_build_batch_jsonl_temperature_zero_included_when_explicitly_set). If any of
these test methods use mock parameters (injected via pytest fixtures or as
function arguments), add appropriate type hints such as MagicMock from
unittest.mock to those parameters as well.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d78ee05c-da68-4a57-8a2c-a6e4fe2ac00b

📥 Commits

Reviewing files that changed from the base of the PR and between fbe5e56 and 5c2435b.

📒 Files selected for processing (13)

backend/app/celery/tasks/job_execution.py
backend/app/celery/utils.py
backend/app/core/batch/gemini.py
backend/app/core/batch/operations.py
backend/app/crud/evaluations/batch.py
backend/app/crud/evaluations/langfuse.py
backend/app/crud/evaluations/processing.py
backend/app/services/evaluations/batch_job.py
backend/app/services/evaluations/evaluation.py
backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/crud/evaluations/test_langfuse.py
backend/app/tests/crud/evaluations/test_processing.py
backend/app/tests/services/evaluations/test_evaluation_service_s3.py

feat: Add Langfuse flush to prevent thread accumulation in evaluation…

ad191d3

… polling

vprashrex changed the title ~~feat: Add evaluation batch submission functionality~~ Evaluation: Gemini batch integration (text) May 22, 2026

vprashrex linked an issue May 22, 2026 that may be closed by this pull request

Evaluation: Add Gemini model support #879

Open

Merge branch 'main' into feat/eval-gemini-integration

859e5ad

Refactor JSONL building tests to use openai_params dict instead of Te…

783643a

…xtLLMParams class

vprashrex added 4 commits June 15, 2026 14:40

Update provider names to include "google-aistudio" and adjust related…

1603867

… logic in evaluation batch processing

Merge branch 'main' into feat/eval-gemini-integration

3f59523

Add tests for Google evaluation JSONL building and Langfuse dataset r…

53d54bc

…un handling

Refactor test data formatting in Google evaluation JSONL builder

5c2435b

vprashrex self-assigned this Jun 17, 2026

vprashrex requested review from AkhileshNegi and Prajna1999 and removed request for AkhileshNegi June 17, 2026 05:32

vprashrex added ready-for-review enhancement New feature or request labels Jun 17, 2026

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Merge branch 'main' into feat/eval-gemini-integration

5ced362

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluation: Gemini batch integration (text)#867

Evaluation: Gemini batch integration (text)#867
vprashrex wants to merge 9 commits into
mainfrom
feat/eval-gemini-integration

vprashrex commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vprashrex commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Target issue is #879

Summary

Checklist

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OpenAPI changes ⚪ No API surface changes

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vprashrex commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading

github-actions Bot commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading