feat: surface unrecoverable container errors during pod wait #13

Open
Longwt123 wants to merge 4 commits into main from feature/surface-container-errors

@Longwt123

Collaborator

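Background

In K8s mode, when a CI job references an image that does not exist or that it lacks permission to pull, the Pod gets stuck in Pending. Previously, waitForPodPhases simply kept backing off until it timed out and then reported only a generic phase error, so GitHub Actions users could not see the real cause of the failure.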

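Changes

  • While polling Pod status, detect unrecoverable waiting reasons on both init containers and regular containers (ImagePullBackOff, ErrImagePull, InvalidImageName, CreateContainerConfigError, CreateContainerError).
  • On a match, fail fast immediately and include the container name, the reason, and the original K8s message, so the specific error is passed through to the Actions log.
  • Refactor getPodPhase into readPod + parsePodPhase so the Pod object can be reused for the container checks.
  • Add unit tests covering parsePodPhase, getContainerErrors, and waitForPodPhases.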
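
The container check in the list above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the local `ContainerStatus` and `PodStatus` types are simplified stand-ins for the Kubernetes client types, and the return shape of `getContainerErrors` (a list of formatted strings) is an assumption made for illustration.

```typescript
// Simplified stand-ins for the parts of a Kubernetes Pod status used here.
interface ContainerStatus {
  name: string
  state?: { waiting?: { reason?: string; message?: string } }
}

interface PodStatus {
  initContainerStatuses?: ContainerStatus[]
  containerStatuses?: ContainerStatus[]
}

// Waiting reasons that this PR lists as unrecoverable.
const UNRECOVERABLE_WAITING_REASONS = new Set<string>([
  'ImagePullBackOff',
  'ErrImagePull',
  'InvalidImageName',
  'CreateContainerConfigError',
  'CreateContainerError'
])

// Assumed signature: returns one formatted message per failing container.
function getContainerErrors(status: PodStatus | undefined): string[] {
  // Init containers are checked together with regular containers.
  const all: ContainerStatus[] = [
    ...(status?.initContainerStatuses ?? []),
    ...(status?.containerStatuses ?? [])
  ]
  const errors: string[] = []
  for (const container of all) {
    const waiting = container.state?.waiting
    if (waiting?.reason && UNRECOVERABLE_WAITING_REASONS.has(waiting.reason)) {
      // Container name + reason + original Kubernetes message.
      const message = waiting.message ?? '(no message)'
      errors.push(`container "${container.name}": ${waiting.reason}: ${message}`)
    }
  }
  return errors
}
```

A container waiting with a transient reason such as PodInitializing produces no entry, so the caller can keep backing off in that case.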

Testing

  • tsc --noEmit passes
  • All 12 new unit tests pass

Note: the test "should return object with containerPath and runnerPath" in tests/k8s-utils-test.ts is a pre-existing Windows path/regex issue; it also fails on a clean main and is unrelated to this change.

🤖 Generated with Claude Code

Longwt123 and others added 2 commits June 22, 2026 16:02
When a CI job references a non-existent image or one it lacks permission
to pull, the pod stays in Pending and waitForPodPhases previously timed
out with only a generic phase-status message. GitHub Actions users had
no indication of the real cause.

Detect unrecoverable container waiting reasons (ImagePullBackOff,
ErrImagePull, InvalidImageName, CreateContainerConfigError,
CreateContainerError) on both init and regular containers, and fail fast
with the container name, reason, and Kubernetes message so the error is
visible in the Actions log.

Refactor getPodPhase into readPod + parsePodPhase so the pod object can
be inspected for container errors, and add unit tests covering
parsePodPhase, getContainerErrors, and waitForPodPhases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
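
To illustrate the fail-fast behaviour this commit message describes, here is a rough sketch of a polling loop. It is not the real waitForPodPhases: the `waitForPhases` name, the `PodSnapshot` type, the injected `read` callback, and the fixed delay (instead of a true backoff) are all assumptions chosen to keep the example self-contained.

```typescript
type PodPhase = 'Pending' | 'Running' | 'Succeeded' | 'Failed' | 'Unknown'

// Hypothetical snapshot: the phase plus any unrecoverable container errors.
interface PodSnapshot {
  phase: PodPhase
  containerErrors: string[]
}

async function waitForPhases(
  read: () => Promise<PodSnapshot>, // stands in for reading and parsing the pod
  awaiting: Set<PodPhase>, // phases that count as success
  backoff: Set<PodPhase>, // phases that are worth retrying
  maxAttempts = 10,
  delayMs = 1000
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const pod = await read()
    if (awaiting.has(pod.phase)) {
      return
    }
    // Fail fast: an unrecoverable container error will not fix itself.
    if (pod.containerErrors.length > 0) {
      throw new Error(
        `Pod has unrecoverable container errors: ${pod.containerErrors.join('; ')}`
      )
    }
    // A phase outside the backoff set is also a hard failure.
    if (!backoff.has(pod.phase)) {
      throw new Error(`Pod is in unexpected phase ${pod.phase}`)
    }
    await new Promise<void>(resolve => setTimeout(resolve, delayMs))
  }
  throw new Error(
    `Timed out waiting for pod phases: ${Array.from(awaiting).join(', ')}`
  )
}
```

With this ordering, a pod stuck in Pending because of ImagePullBackOff is rejected on the first poll rather than after the full timeout.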
…r reasons

Extend the error feedback mechanism introduced in the previous commit with:

- describePodFailure(): aggregates pod phase, conditions, container statuses
  and Warning events into a single human-readable diagnostic string.
  Never throws — safe to call from any error path.

- describePodWarningEvents(): best-effort retrieval of recent Warning K8s
  events (requires optional "events" RBAC permission). Degrades gracefully
  when the permission is missing.

- getUnrecoverableWaitingReasons(): allows operators to extend the built-in
  fast-fail whitelist via the ACTIONS_RUNNER_K8S_UNRECOVERABLE_WAITING_REASONS
  environment variable without a code change. Built-in defaults cannot be
  removed.

- All three failure paths in waitForPodPhases now attach full diagnostics:
  1. Non-backoff phase (e.g. Failed) — includes pod details + events
  2. Unrecoverable container error (e.g. ImagePullBackOff) — fail-fast with
     diagnostics instead of waiting for timeout
  3. Timeout — includes pod details so the user can see WHY the pod never
     became ready

- README: document the optional "events" permission and the new env var.

- Tests: 8 new test cases (19 total) covering describePodFailure,
  describePodWarningEvents, getUnrecoverableWaitingReasons, and edge cases
  such as forbidden events API and unreadable pods.
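
The environment-variable extension described for getUnrecoverableWaitingReasons() could be sketched as below. The commit message does not state the variable's format, so the comma-separated parsing here is an assumption, as is passing the raw value in as a parameter (done only to make the sketch easy to test).

```typescript
// Built-in defaults named in this PR; these are always included.
const BUILT_IN_REASONS: readonly string[] = [
  'ImagePullBackOff',
  'ErrImagePull',
  'InvalidImageName',
  'CreateContainerConfigError',
  'CreateContainerError'
]

// Assumption: the variable holds a comma-separated list of extra reasons.
function getUnrecoverableWaitingReasons(raw: string | undefined): Set<string> {
  const extra = (raw ?? '')
    .split(',')
    .map(reason => reason.trim())
    .filter(reason => reason.length > 0)
  // Union only: operators can add reasons, but never remove the defaults.
  return new Set<string>([...BUILT_IN_REASONS, ...extra])
}
```

A caller would presumably pass in the value of ACTIONS_RUNNER_K8S_UNRECOVERABLE_WAITING_REASONS. Because the result is a union with the defaults, an empty or unset variable yields exactly the built-in list.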
@opensourceways-bot

Welcome To opensourceways Community

Hey @Longwt123, thanks for your contribution to the community.

Bot Usage Manual

I'm the Bot here serving you. You can find the instructions on how to interact with me Here. You can comment on any pull request or issue to trigger Bot Commands.

@opensourceways-bot

CLA Signature Pass

Longwt123, thanks for your pull request. All authors of the commits have signed the CLA. 👍

The project pins prettier@2.6.2 in package-lock.json, but the previous
commit was formatted with prettier 3.x which has different line-wrapping
rules for template literals. Re-format with prettier 2.6.2 to pass CI.

v2.329.0 was deprecated by GitHub and rejected at the broker level with
"Runner version v2.329.0 is deprecated and cannot receive messages.",
causing runner pods to crash-loop immediately after connecting.

Also fix Dockerfile layer ordering: switch to root before COPY so that
the subsequent chown is not run as the unprivileged runner user.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
