[Experiment] code-review: BCQuality integration arm (live skills)#715
Draft
gggdttt wants to merge 18 commits into
Adds a bcquality config section (default disabled) and a Python module that clones BCQuality at a pinned SHA, filters it per enabled-layers/knowledge globs, writes the task-context, and builds a bootstrap prompt that routes through skills/entry.md, replicating how microsoft/BCApps consumes microsoft/BCQuality today. Not yet wired into the agent; no effect on existing categories.
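A minimal sketch of the allow/deny glob filtering step described above. The function name `filter_paths`, the example paths, and the rule "kept if it matches any allow glob and no deny glob" are all assumptions for illustration; the actual module's semantics are not shown in this PR text.

```python
from fnmatch import fnmatchcase


def filter_paths(paths, allow_globs, deny_globs):
    """Keep paths that match any allow glob and no deny glob.

    Hypothetical semantics: deny always wins over allow. The real filter
    may order or combine its rules differently.
    """
    kept = []
    for path in paths:
        allowed = any(fnmatchcase(path, glob) for glob in allow_globs)
        denied = any(fnmatchcase(path, glob) for glob in deny_globs)
        if allowed and not denied:
            kept.append(path)
    return kept


# Illustrative relative paths inside a clone (not taken from BCQuality).
example_paths = [
    "knowledge/security/secrets.md",
    "knowledge/style/naming.md",
    "knowledge/drafts/wip.md",
]

kept = filter_paths(
    example_paths,
    allow_globs=["knowledge/*"],
    deny_globs=["knowledge/drafts/*"],
)
print(kept)
```

Note that `fnmatchcase` lets `*` match across `/`, so `knowledge/*` covers nested files; a filter that needs path-segment-aware globs would need a different matcher.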
- ExperimentConfiguration: add bcquality flag
- copilot agent: live BCQuality branch (clone CWD, --add-dir repo, skip static injection)
- add 23 unit tests for codereview_bcquality module
…line arm
- Extract the 6 faithful domain checklists (accessibility/performance/privacy/security/style/upgrade) verbatim from BCApps 30e2b18ca3^ (the version BCApps shipped before adopting BCQuality), NOT the benchmark-tuned experiment snapshot
- AGENTS.md: add review section routing /review through the 6 domain checklists
- Enables a faithful before/after comparison: vanilla < old inline < live BCQuality
- Inert by default (instructions.enabled=false); arm activated via config toggle
…/ BCQuality arms)
…iment Leaderboard
…nistic severity mapping, relocate bcquality module to agent/shared)
…er entry); surface git stderr on failure
…to BCQuality bootstrap prompt
Experiment Description
Enable the live BCQuality integration arm for the code-review category, the "after" side of the BCApps #8700 change. Instead of static in-repo checklists, the agent consumes microsoft/BCQuality at runtime: clone (pinned SHA) → filter → route through skills/entry.md, then emit the BC-Bench review.json schema.

This is the counterpart to the inline-knowledge (pre-#8700) arm (experiment/code-review/inline-knowledge). Together they let us compare: vanilla < inline knowledge < live BCQuality.

Configuration Changes
- (instructions.enabled: true)
- (skills.enabled: true)
- bcquality.enabled: true, a code-review-only switch. The filtered BCQuality clone becomes the Copilot CWD (knowledge read before the diff); the repo under review is granted via --add-dir; static instruction injection is skipped. No effect on bug-fix / test-generation.

Key pieces:
- config.yaml: bcquality: section (repo + pinned ref SHA, enabled-layers, disabled-skills, knowledge allow/deny globs, task-context dimensions). enabled: true on this branch.
- agent/shared/codereview_bcquality.py: clone_bcquality (pinned SHA, shallow), filter_clone (mirrors Invoke-BCQualityFilter.ps1, writes _filter-report.json), task-context writer, bootstrap prompt routing through skills/entry.md.
- copilot/agent.py: live branch wiring (clone as CWD, --add-dir, hooks into the clone).
- types.py: ExperimentConfiguration.bcquality flag → routes results to the Experiment Leaderboard.

Agent & Model
Hypothesis / Expected Outcome
Consuming BCQuality's live knowledge base should match or exceed the pre-#8700 inline checklists on finding quality (precision/recall/F1 vs gold), since it carries the same domain knowledge plus ongoing BCQuality updates and explicit knowledge-backed routing. Expected ordering: vanilla < inline knowledge < live BCQuality.
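For reference, a minimal sketch of how precision/recall/F1 against a gold set can be computed. It assumes each finding is reduced to a hashable key (here a hypothetical `(file, line)` tuple) and that a prediction counts as correct only on an exact key match; the benchmark's actual matching rule is not described in this PR and may be looser.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, f1) for two collections of finding keys.

    Assumption: exact-match scoring on hashable keys; duplicates collapse.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative data only: four predicted findings, two of which are in gold.
predicted_findings = [("a.al", 10), ("a.al", 20), ("b.al", 5), ("c.al", 1)]
gold_findings = [("a.al", 10), ("b.al", 5)]

p, r, f1 = precision_recall_f1(predicted_findings, gold_findings)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With this made-up data the arm finds every gold issue (recall 1.0) but half of its findings are spurious (precision 0.5), which is the kind of trade-off the three-arm ordering is meant to expose.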
Notes
- codereview.jsonl entries target microsoft/BCApps.
- experiment/code-review/* scheme (supersedes the former [Code-review]: live BCQuality consumption + faithful pre-#8700 old-inline baseline arm #696).