design-proposal: per-tenant Keycloak realms for tenant Kubernetes cluster OIDC #24
IvanHunters wants to merge 4 commits into
…ster OIDC

Companion design proposal to cozystack/cozystack#3044. Central decision: per-tenant Keycloak realm (tenant-<ns>) as the identity unit for tenant Kubernetes cluster OIDC, with per-cluster public clients providing audience-bound token isolation. The platform-admin cozy realm and tenant user directories serve different populations and trust models; this proposal keeps them separate while delivering per-cluster isolation through audience binding rather than a separate directory.

Covers realm provisioning via the EDP Keycloak operator, propagation through the cozystack-basics namespace-values channel, per-cluster KeycloakClient + view/admin KeycloakRealmGroups, KamajiControlPlane oidc flag wiring, an OIDC kubeconfig Secret exposed via cozyrds, plus lifecycle, rollout, security, failure modes, testing, and alternatives considered (including the flat cozy + per-cluster-client path).

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
Code Review
This pull request introduces a design proposal for implementing per-tenant Keycloak realms to back OIDC authentication for tenant Kubernetes clusters in Cozystack. The feedback suggests clarifying that the bootstrap Job runs in the management cluster rather than inside the tenant cluster itself. Additionally, it warns that attempting to delete ClusterRoleBindings inside the tenant cluster during a pre-delete hook is redundant and could block Helm release deletion if the control plane becomes unreachable.
Grafana OIDC for tenants is a separate, unrelated initiative and was incorrectly listed here as an in-flight companion. The Rollout section is shortened accordingly. Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
…hentication config

The chart in cozystack#3044 was reworked to wire the apiserver via a mounted AuthenticationConfiguration (apiserver.config.k8s.io/v1beta1) rather than the legacy four --oidc-* flags. The proposal is updated to match:

- Layer 5 now describes the three-piece wiring: --authentication-config flag + extraVolumeMounts on apiServer + extraVolumes on deployment, sourced from a per-cluster Secret carrying the structured config.
- Goals gain a bullet covering structured (not flag-based) authn so the forward path to BYO-OIDC and private-CA issuers is explicit up front.
- Context > KamajiControlPlane reflects that extraVolumes / extraVolumeMounts are no longer 'matters for the follow-up', they are used by this proposal.
- Open questions > BYO-OIDC narrows to the remaining design surface (chart API for a BYO JWTAuthenticator, audience composition).
- Alternatives considered gains the flag-based path entry with the rationale for rejecting it (one-issuer cap, no private-CA, future rewrite cost).

Signed-off-by: IvanHunters <xorokhotnikov@gmail.com>
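As a rough illustration of the structured wiring described in the commit above, the following Python sketch builds a minimal `AuthenticationConfiguration` document with one JWT issuer and then appends a second one, which is the multi-issuer case the legacy `--oidc-*` flags cannot express. The issuer URLs, the client IDs, the `email` username claim, and the helper names `build_authentication_config` and `add_issuer` are illustrative assumptions, not values taken from the chart in cozystack#3044.

```python
import json


def _jwt_authenticator(issuer_url, client_id):
    """One entry of the `jwt` list: an issuer plus its claim mappings."""
    return {
        "issuer": {"url": issuer_url, "audiences": [client_id]},
        # Assumption: usernames come from the `email` claim with no prefix.
        "claimMappings": {"username": {"claim": "email", "prefix": ""}},
    }


def build_authentication_config(issuer_url, client_id):
    """Build a minimal structured authn config trusting a single JWT issuer."""
    return {
        "apiVersion": "apiserver.config.k8s.io/v1beta1",
        "kind": "AuthenticationConfiguration",
        "jwt": [_jwt_authenticator(issuer_url, client_id)],
    }


def add_issuer(config, issuer_url, client_id):
    """Append another issuer, refusing duplicate issuer URLs."""
    known = [entry["issuer"]["url"] for entry in config["jwt"]]
    if issuer_url in known:
        raise ValueError("issuer URL already configured: " + issuer_url)
    config["jwt"].append(_jwt_authenticator(issuer_url, client_id))
    return config


if __name__ == "__main__":
    # Hypothetical platform issuer and per-cluster client ID.
    cfg = build_authentication_config(
        "https://keycloak.example.org/realms/cozy", "kubernetes-demo"
    )
    # Hypothetical tenant-supplied (BYO) issuer added alongside it.
    add_issuer(cfg, "https://idp.customer.example/", "customer-client")
    print(json.dumps(cfg, indent=2))
```

The resulting document is what would be stored in the per-cluster Secret and mounted into the apiserver; how the chart actually renders it is not shown here.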
Andrei Kvapil (kvaps)
left a comment
Thanks for writing this up — the doc is thorough and the "Alternatives considered" section is genuinely useful as the record for the per-tenant-realm direction. The structured --authentication-config delivery mechanism (Layer 5) and its rationale (multi-issuer, private-CA, additive BYO) are the right call and match where we want to land.
My main concern is alignment with today's two OIDC syncs, which settled the phasing after this doc was written. The agreed shape was:
- Phase 1: authenticate tenant-cluster users via the existing platform `cozy` realm + per-cluster public client for audience isolation, plus the ability to pass a BYO OIDC/authentication config into the cluster. No groups, no auto-provisioning into Keycloak. API as a selector, e.g. `oidc: System | CustomConfig | None`.
- Phase 2 (deferred, may not be needed): per-tenant Keycloak realm / "managed IdP", as a separate, fully-designed product (`oidc: ... | KeycloakRealm | ...`).
As written, the proposal does the inverse of that split, so I'd ask to realign it before the #3044 implementation lands:
- Identity unit (Overview / Rollout). The doc picks per-tenant realm `tenant-<ns>` as the unit to implement now. Per the syncs that's Phase 2. Phase 1 is the shared `cozy` realm + per-cluster client.
- "Flat `cozy` realm + per-cluster public client" (Alternatives, L278) is rejected, but that's the option we chose for Phase 1. Note the rejection somewhat misframes it: Phase 1 does not put tenant end-users into `cozy`. It authenticates the platform users that already exist there, and BYO covers a company's external users. Could you re-evaluate it as the Phase-1 baseline rather than a rejected alternative?
- BYO-OIDC is marked non-goal/deferred (L18, Non-goals), but it's part of Phase 1. Passing a custom OIDC/authn config into the cluster is one of the two Phase-1 deliverables, not a follow-up. Layer 5 already mounts a structured config, so this is mostly a framing/scope change.
- Groups + auto-RBAC (Layer 4). Two `KeycloakRealmGroup`s per cluster + auto `ClusterRoleBinding`s is exactly the group/auto-provisioning machinery the syncs put out of Phase 1 ("no groups"; a `users:` map with a role is enough). Auto-creating clients/groups in the IdP at cluster-creation is the "deep integration = a product that needs serious design" we flagged, and it breaks the customer-Keycloak case where the tenant does not want us spawning groups in their realm. Suggest dropping it from Phase 1.
- Billing / central-Keycloak load is unaddressed. All `tenant-<ns>` realms live on the platform master `ClusterKeycloak` (L72). That's the load/billing blocker raised on the call. Please add a section covering it and the alternative of running Keycloak as an External App inside the tenant namespace (billing becomes trivial, it's just API + DB pods), which was the proposed mitigation.
Factual fixes about the current codebase (verified against main):
- L81: it's `packages/system/cozystack-basics/`, not `packages/core/...`, and the emission lives in `templates/cozystack-values-secret.yaml`; there is no `_namespace.tpl`.
- L85: `_namespace.dns-suffix` / `_namespace.kubelet-image-credential-provider-config` are not propagated today. The actual `_namespace.*` keys are `host`, `etcd`, `ingress`, `gateway`, `monitoring`, `seaweedfs`, `schedulingClass`. The real precedent is `host` / `ingress`.
- L82: there is no tenant-hierarchy walk-up today; propagation is single-level parent lookup (`packages/apps/tenant/templates/namespace.yaml`). Recursive inheritance (Layer 2) is new mechanism, not "the same channel that already exists"; worth calling out as new work.
- L46, L278: the `cozy` groups are not `<ns>-admins` / `<ns>-users`. They are `<tenant>-view`, `<tenant>-use`, `<tenant>-admin`, `<tenant>-super-admin` (`packages/apps/tenant/templates/keycloakgroups.yaml`). This also affects the "group collision" argument.
- L35: `KamajiControlPlane` in cozystack is `controlplane.cluster.x-k8s.io/v1alpha1` (`packages/apps/kubernetes/templates/cluster.yaml`). `tcp.kamaji.clastix.io` is the `TenantControlPlane` group, a different CRD.
- L33: `apps/kubernetes` has no `oidc` value at all today (not a "no-op field").
- L17 vs Layer 5: Scope says `--oidc-*` wiring, Design uses `--authentication-config`. Please make these consistent.
None of this is a knock on the analysis — it's solid and I'd keep it as the Phase-2 design record. The ask is to re-frame the doc around the Phase-1/Phase-2 split we agreed on, fix the codebase references, and add the billing section. Happy to pair on the Phase-1 scope if useful.
|
Strong +1 to the Phase-1 / Phase-2 split Andrei Kvapil (@kvaps) laid out. Adding the reasoning underneath it, because I think the split follows from a few distinctions the doc currently collapses, and naming them makes clear what belongs in Phase 1 and why Phase 2 "may not be needed."

1. Two orthogonal tasks, and "BYO for what"

Treat every consumer of identity as choosing its own issuer: the Cozystack platform (console/API) is one consumer; each managed service (managed Kubernetes, managed Grafana, …) is another. BYO then splits by target, and the two are not the same mechanism:

This also dissolves single-org-vs-multi-org as the deciding factor; it isn't one. A single enterprise running Cozystack next to an existing MS AD / LDAP may not want

For this proposal that means Phase 1 is precisely: let a managed cluster select its issuer

2. The per-tenant hosted realm is a third, distinct thing: managed-IdP-as-a-service

Standing up a realm per tenant that the platform provisions, hosts, and that tenants self-administer means Cozystack is offering managed IdP-as-a-service, an entire product category (Cognito, Auth0, WorkOS exist solely to host an IdP for you). It is neither of the Phase-1 tasks above, and bundling it into the kube-apiserver OIDC PR under-scopes and under-designs it. That's the case for a separate, fully-specified Phase 2.

And the population it's meant to serve doesn't require it. The relevant split is within the tenant's workforce:

This is the everyday pattern: only ops has cloud-console access, while a larger slice of the workforce uses the systems hosted on top. Both are Phase-1 cases with no new realm. Ops are already in 3.
|
|
Sharing how we run multi-tenancy on our side, since the realm-per-tenant question is directly relevant to us.

Our tenancy shape

We run a multi-level tenant tree:

We already rely on the shared

On realm-per-tenant: what does it actually buy?

From our seat, a dedicated realm per tenant looks too heavy for the value it returns:

If the real driver behind a dedicated realm is BYO-OIDC / per-customer IdP brokering, then Keycloak Organizations are more than sufficient and a much lighter fit:

Honest caveat: Organizations are membership + brokering + attributes, not a hard security boundary (clients/roles stay realm-level). So this complements the RBAC-group model rather than replacing it. For our use case that trade-off is fine: we don't need realm-level config isolation, we need per-customer identity + IdP. Organizations is a per-realm feature flag, which we've started exposing upstream in cozystack/cozystack#3031.

One design point for nested tenancy

The single-level parent lookup works for a 2-level tree (project -> organization) but is fragile if nesting goes deeper. Could the realm/Organization owner be an explicit marker on the owning tenant, with inheritance resolving to that ancestor, rather than relying on walk-up depth?

Net: phase 1 (issuer selector + per-cluster client) fits us well. For phase 2, it's worth weighing whether Organizations cover the same goals at a fraction of the operational cost before committing to realm-per-tenant, especially if BYO-OIDC is the main driver.
Apply OIDC design-sync decisions and review feedback (@lllamnyp, @mattia-eleuteri): make Phase 1 a per-cluster issuer selector (System|CustomConfig|None) with a per-cluster client and structured authentication config, no auto-provisioned Keycloak groups. Move the per-tenant realm to a deferred Phase 2 and add Keycloak Organizations as the lighter Phase-2 option to evaluate first. Add the design principles (BYO-for-what; couple at provisioning, decouple at ownership; ops/dev is authorization not directory; identity boundary is the organization). Correct codebase references (cozystack-basics path and emitted _namespace keys, single-level vs recursive inheritance, cozy group names, KamajiControlPlane apiVersion, authentication-config vs --oidc-* flags). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Andrei Kvapil (kvaps)
left a comment
Thanks Timofei Larkin (@lllamnyp) and mattia-eleuteri — exactly the input we needed, and the proposal has been updated to fold it in (just pushed).
What changed:
- Reframed around the agreed phasing. Phase 1 is now a per-cluster issuer selector `oidc: System | CustomConfig | None` + per-cluster client + structured `AuthenticationConfiguration`, with no auto-provisioned Keycloak groups (RBAC via a `users:` map → cluster-local RoleBindings). The per-tenant realm moved to a deferred Phase 2.
- Timofei Larkin (@lllamnyp)'s distinctions are first-class now: the "BYO for what" split (managed-service direct trust = `CustomConfig`, `cozy` not in path; vs federating into `cozy`), "couple at provisioning, decouple at ownership", and ops-vs-dev being authorization rather than directory.
- mattia-eleuteri's points are in: the identity boundary is the organization, not every tenant; Phase 2 now leads with Keycloak Organizations (one shared `cozy` realm + per-org brokering, cozystack/cozystack#3031) as the lighter option to evaluate before a per-tenant realm; and an explicit owner marker + inheritance instead of walk-up depth. The per-tenant realm is retained as Phase 2 Option B with its costs written down.
- Also corrected the codebase references (cozystack-basics path + emitted `_namespace` keys, single-level vs recursive inheritance, `cozy` group names, `KamajiControlPlane` apiVersion, `--authentication-config` vs `--oidc-*`).
Phase 1 (#3044) is the scope now. Phase 2 is explicitly not a finalized design — the doc records it as a problem space with candidate options (Organizations vs per-tenant realm), to be settled in its own proposal with a dedicated review round. Approving the Phase-1 record; shout if anything still reads wrong.
|
mattia-eleuteri, good catch on pointing out Organizations; that's exactly the missing AWS-style per-account boundary I didn't consider enough. They look like a much lighter fit than realm-per-tenant for the BYO-OIDC driver. +1 on the explicit owner marker for nested trees too.
|
Re-reviewed. The phased reframing matches how this should be operated, so the direction LGTM from our side. Phase 1 scope (issuer selector + per-cluster client + structured authn-config) and Phase 2 being deferred with Organizations-first both fit. Two things on the RBAC layer I'd like to open up rather than block on.

1. Phase 1 RBAC: reuse the existing group concept rather than a new per-user model? The

Neither is obviously right, hence raising it. Our lean is (b) for the ops/dev separation, but the trade-off is real and we'd value your read.

2. Granular RBAC is its own proposal. The fine-grained, per-resource control the

3. Net: 👍 on the direction and Phase 1 scope. The RBAC unit (group vs. enumerated users, and which groups) is the main thing I'd like to converge on.
Introduce a flat selector — `mode: System | CustomConfig | None` — plus a `users[]` list and a `customConfig` block for tenant-supplied AuthenticationConfiguration. Default `None` preserves today's behavior. The `System` mode trusts the platform `cozy` realm via a per-cluster public client; `CustomConfig` accepts an inline `AuthenticationConfiguration` or a Secret reference (mutually exclusive). The `users[]` list drives per-user ClusterRoleBindings inside the tenant cluster, with two roles: `admin` and `view`. Regenerates deepcopy, JSON schema, README, and the kubernetes-rd cozyrds embedded schema. Implements Phase 1 API of the per-tenant OIDC design proposal (cozystack/community#24). Signed-off-by: Ivan Okhotnikov <ivan.okhotnikov@aenix.io>
Add the operator-facing guide for the new selector — what the three modes mean, what the chart provisions, how a user logs in, what the failure modes are, and what is intentionally out of scope (per-tenant realms, federation, cross-cluster SSO). The architectural rationale points back to cozystack/community#24. Update the chart NOTES.txt to print the OIDC kubeconfig retrieval snippet, the per-cluster client/issuer values, and the listed `users[]` roster — only when OIDC is on, otherwise the previous admin-kubeconfig hint is unchanged. Signed-off-by: Ivan Okhotnikov <ivan.okhotnikov@aenix.io>
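The selector semantics described in these two commits can be sketched as a small validator plus a renderer for the per-user bindings. This is only a sketch under stated assumptions: the field names `inline`, `secretRef`, `name`, and `role`, the binding name pattern, and the mapping of the `admin` and `view` roles onto the upstream `cluster-admin` and `view` ClusterRoles are hypothetical and may differ from the actual chart schema.

```python
MODES = ("System", "CustomConfig", "None")

# Assumption: the chart's two roles map onto these upstream ClusterRoles.
ROLE_TO_CLUSTERROLE = {"admin": "cluster-admin", "view": "view"}


def validate_oidc(values):
    """Return a list of human-readable errors for an `oidc` values block."""
    errors = []
    mode = values.get("mode", "None")  # default keeps today's behavior
    if mode not in MODES:
        errors.append("mode must be one of " + ", ".join(MODES))
    custom = values.get("customConfig") or {}
    has_inline = bool(custom.get("inline"))      # hypothetical field name
    has_ref = bool(custom.get("secretRef"))      # hypothetical field name
    # Inline config and Secret reference are mutually exclusive.
    if mode == "CustomConfig" and has_inline == has_ref:
        errors.append("customConfig needs exactly one of inline or secretRef")
    for user in values.get("users", []):
        if user.get("role") not in ROLE_TO_CLUSTERROLE:
            errors.append("unknown role for user " + str(user.get("name")))
    return errors


def cluster_role_bindings(values):
    """Render one ClusterRoleBinding manifest per entry in `users[]`."""
    manifests = []
    for index, user in enumerate(values.get("users", [])):
        manifests.append({
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "ClusterRoleBinding",
            # Hypothetical naming scheme; indexed to avoid unsafe characters.
            "metadata": {"name": "oidc-%s-%d" % (user["role"], index)},
            "roleRef": {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "ClusterRole",
                "name": ROLE_TO_CLUSTERROLE[user["role"]],
            },
            "subjects": [{
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "User",
                "name": user["name"],
            }],
        })
    return manifests


if __name__ == "__main__":
    values = {"mode": "System",
              "users": [{"name": "alice@example.org", "role": "admin"}]}
    print(validate_oidc(values))
    print(cluster_role_bindings(values))
```

Validating before rendering keeps a malformed `users[]` entry from producing a binding to a role that does not exist.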
Summary
Design proposal companion to cozystack#3044.
The proposal answers the architectural decision raised in that PR's review: what is the identity unit for tenant Kubernetes cluster OIDC. This document picks per-tenant Keycloak realm (`tenant-<ns>`), with per-cluster public clients providing audience-bound token isolation, and explains why that is structurally different from using the flat `cozy` realm. It is intentionally written to compare both options head-on under Alternatives considered, so the trade-off is on the record before the implementation PR lands.
What the proposal covers
- `tenant-<ns>` owned by the tenant admin, separate from the platform-admin `cozy` realm. Provisioned declaratively via `Tenant.spec.oidc=true`.
- `KeycloakClient` and a `KeycloakClientScope` per tenant cluster inside the tenant's realm; `id_token.aud` matches the per-cluster `--oidc-client-id`, so tokens are non-replayable across clusters.
- `view` + `admin` `KeycloakRealmGroup` per cluster, bound inside the tenant cluster to upstream `ClusterRole/view` and `ClusterRole/cluster-admin` respectively. No blanket cluster-admin grant; the static admin kubeconfig remains as the documented break-glass path.
- `cozystack-basics`'s static `_namespace.oidc-realm` emission, the same channel that already propagates DNS suffix and ingress class. Explicitly no Helm `lookup` is used.
- `kubernetes-<cluster>-oidc-kubeconfig` Secret in the management cluster's tenant namespace, exposed via `cozyrds` `spec.secrets.include`, carrying a ready-to-use kubeconfig with a `kubectl oidc-login` exec auth block.
- `--authentication-config` on `KamajiControlPlane`, RFC 8693 token exchange via a custom credential plugin, and a `cozystack` CLI helper. All called out as separate future proposals.
- `cozy` realm + per-cluster client, BYO-OIDC only, EKS-style flat IAM + webhook, RFC 8693 token exchange plugin, embedding into the existing admin-kubeconfig Secret, single global client. Each with a short rationale for the rejection.
Status
Draft. Looking for review on the identity-unit decision and on the open questions before the cozystack#3044 implementation lands.
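The audience-binding argument in the summary can be made concrete with a small sketch: a token whose `aud` claim names one cluster's client ID fails a check configured for another cluster's client ID. The claim values and the helper name `audience_matches` are hypothetical, and the sketch deliberately skips signature, issuer, and expiry validation, which a real verifier must also perform.

```python
def audience_matches(claims, cluster_client_id):
    """Return True when the token's aud claim includes this cluster's client ID."""
    aud = claims.get("aud")
    # RFC 7519 allows aud to be a single string or a list of strings.
    if isinstance(aud, str):
        audiences = [aud]
    else:
        audiences = list(aud or [])
    return cluster_client_id in audiences


if __name__ == "__main__":
    # Hypothetical token minted for cluster A's per-cluster client.
    token_for_a = {
        "iss": "https://keycloak.example.org/realms/cozy",
        "aud": "kubernetes-a",
    }
    print(audience_matches(token_for_a, "kubernetes-a"))
    print(audience_matches(token_for_a, "kubernetes-b"))
```

Because each cluster only accepts its own client ID as audience, a token obtained for one cluster cannot be replayed against another even when both trust the same issuer.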