Add Environment Validator TSG: AzStackHci_Connectivity_Test_Dns (Connectivity Test DNS)#307

Open
1008covingtonlane wants to merge 9 commits into
Azure:mainfrom
1008covingtonlane:tsg-connectivity-test-dns
@1008covingtonlane

Collaborator

What

Adds a public remediation TSG for the Environment Validator check AzStackHci_Connectivity_Test_Dns (display name Test DNS), and indexes it in the EnvironmentValidator README.

The check

On every node, the check queries each DNS server configured on every network adapter that is up for an external public name (the A record for microsoft.com) and expects at least one record back. Its severity is Critical: when it fails and no proxy is configured, the validator also stops the remaining connectivity tests, so a failure commonly blocks a pending deployment or update. It runs during pre-deployment readiness, deployment, add-node, and the pre-update health check.
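
The probe logic described above can be sketched roughly as follows. This is an illustration only, not the validator's code: `Adapter`, `check_node_dns`, `should_stop_remaining_tests`, and `fake_resolver` are hypothetical names, the resolver is injected so the sketch runs without network access, and emitting `No DNS server configured` when no up adapter has a DNS server is an assumption based on the Detail strings listed in this PR.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Adapter:
    """Hypothetical stand-in for one network adapter on a node."""
    name: str
    is_up: bool
    dns_servers: List[str] = field(default_factory=list)


def check_node_dns(
    node: str,
    adapters: List[Adapter],
    resolve_a: Callable[[str, str], List[str]],
    hostname: str = "microsoft.com",
) -> List[str]:
    """Return failure Detail strings for one node; an empty list means pass."""
    # Only DNS servers configured on adapters that are up are probed.
    servers = [s for a in adapters if a.is_up for s in a.dns_servers]
    if not servers:
        # Assumption: this Detail is emitted when nothing is configured.
        return ["No DNS server configured"]

    failures = []
    for server in servers:
        records = resolve_a(server, hostname)
        if len(records) < 1:
            failures.append(
                f"Queried dns server {server} for {hostname} on {node}. "
                f"Result returned {len(records)} A records. Expected at least 1."
            )
    return failures


def should_stop_remaining_tests(failures: List[str], proxy_configured: bool) -> bool:
    """A DNS failure halts later connectivity tests only when no proxy is set."""
    return bool(failures) and not proxy_configured


def fake_resolver(server: str, hostname: str) -> List[str]:
    """Hypothetical resolver: only 10.0.0.53 answers (with a documentation IP)."""
    return ["203.0.113.10"] if server == "10.0.0.53" else []
```

Injecting the resolver keeps the sketch testable offline; a real check would query each server directly, for example with `Resolve-DnsName` on the node.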

What the guide covers

The six required sections:

  1. Where it appears / confirm it - result files on the cluster share, Get-SolutionUpdate, the AzStackHciEnvironmentChecker event log (Event ID 17205), and the Azure portal Updates tab.
  2. Failure signatures - the real Detail strings (Queried dns server <ip> for microsoft.com on <node>. Result returned 0 A records. Expected at least 1. and No DNS server configured), plus the proxy self-skip behavior.
  3. Identify the affected nodes - an all-nodes one-liner.
  4. Consequences - blocks updates/deployment; impairs cloud-managed lifecycle.
  5. Remediation - a per-node fix: verify the node's configured DNS servers, re-point them if wrong, or fix the upstream DNS server (forwarder / reachability / port 53). Risk-labeled, with rollback.
  6. Verification - re-run Invoke-SolutionUpdatePrecheck and confirm resolution.
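
For readers who want to pull the affected server and node out of a failure line, the Detail strings in item 2 can be matched with a regular expression. The sketch below is a minimal illustration: `parse_detail` and the returned dictionary keys are hypothetical, and the optional `(Attempt: n/3)` group anticipates the dedicated-validator variant raised later in this thread.

```python
import re
from typing import Optional

# Matches the zero-record signature; the attempt suffix is optional.
ZERO_RECORDS = re.compile(
    r"Queried dns server (?P<server>\S+) for (?P<hostname>\S+) on (?P<node>\S+?)"
    r"(?: \(Attempt: (?P<attempt>\d+)/(?P<max_attempts>\d+)\))?"
    r"\. Result returned (?P<returned>\d+) A records\. "
    r"Expected at least (?P<expected>\d+)\."
)
NO_SERVER = "No DNS server configured"


def parse_detail(detail: str) -> Optional[dict]:
    """Classify one Detail string; return None if it is not a known signature."""
    if NO_SERVER in detail:
        return {"kind": "no_dns_server"}
    match = ZERO_RECORDS.search(detail)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "kind": "zero_records",
        "server": fields["server"],
        "hostname": fields["hostname"],
        "node": fields["node"],
        "returned": int(fields["returned"]),
        "expected": int(fields["expected"]),
        # None on the legacy signature, which carries no attempt counter.
        "attempt": int(fields["attempt"]) if fields["attempt"] else None,
    }
```

The node group is matched lazily so that a fully qualified node name containing dots still parses.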

It leads with an at-a-glance box (owner, impact, effort, downtime) and frames ownership as a customer network/DNS task, explicitly not a Microsoft or OEM issue.

Notes for reviewers

  • On newer Azure Local builds this external-DNS test moved into a dedicated DNS validator reported as AzStackHci_DNS_Test_External_Hostname_Resolution; the cause and fix are identical, only the validator name differs. The guide notes this. This TSG covers the AzStackHci_Connectivity_Test_Dns name still emitted by deployed clusters.
  • The failure signatures, severity, probe behavior, and remediation were validated against the validator behavior and observed failures on real Azure Local clusters.
  • The first H1 is the canonical validator name so the validator-to-TSG map picks it up.

…ectivity Test DNS)

Adds a public remediation guide for the AzStackHci_Connectivity_Test_Dns
Environment Validator check, and indexes it in the EnvironmentValidator README.

The check verifies that each node can resolve an external public name (microsoft.com,
an A record) against every DNS server configured on its up adapters. Severity is
Critical, and a DNS failure also halts the remaining connectivity tests, so it
commonly blocks a pending deployment or update. The guide covers how to confirm the
failure (result files, Get-SolutionUpdate, event log, portal), the real failure
signatures, how to identify the affected nodes, the consequences, a per-node
remediation (verify and re-point the node DNS client, or fix the upstream DNS server
forwarder/reachability), and how to re-validate. It frames ownership as a customer
network/DNS task, not a Microsoft or OEM issue, and leads with an at-a-glance
summary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 27, 2026 10:54

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Environment Validator remediation Troubleshooting Guide (TSG) for the AzStackHci_Connectivity_Test_Dns check (Test DNS) and links it from the EnvironmentValidator index so it can be discovered from the component README.

Changes:

  • Added a new TSG describing how to confirm, diagnose, remediate, and verify failures of AzStackHci_Connectivity_Test_Dns.
  • Updated the EnvironmentValidator README to include the new TSG link.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| TSG/EnvironmentValidator/Troubleshooting-Connectivity-Test-Dns.md | New TSG covering failure signatures, remediation steps, and verification for the DNS connectivity validator. |
| TSG/EnvironmentValidator/README.md | Adds an index entry pointing to the new DNS TSG. |

Comment thread TSG/EnvironmentValidator/Troubleshooting-Connectivity-Test-Dns.md Outdated
Comment thread TSG/EnvironmentValidator/Troubleshooting-Connectivity-Test-Dns.md
1008covingtonlane and others added 2 commits June 27, 2026 07:52
…s a row in Step 3)

Two robustness fixes from the bot review on PR Azure#307:
- Option A: guard $base and $latest for null (HealthCheck folder missing after the
  fallback, or no result JSON yet) and print a friendly Warning instead of throwing on
  the common 'no results found' case.
- Step 3: reuse the same ClusterStorage $base fallback, and emit an explicit PASS or
  NO DATA row per node so a node is never silently treated as passing when its result
  simply could not be read; soften the conclusion to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
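
The defensive shape described in this commit, where a node whose result cannot be read is reported as NO DATA rather than silently passing, can be sketched as below. This is not the guide's one-liner: the `Name` and `Status` keys and the `SUCCESS` value are assumed field names for the result JSON, and `classify_node` and `summarize` are hypothetical helpers.

```python
from typing import Dict, List, Optional

CHECK_NAME = "AzStackHci_Connectivity_Test_Dns"


def classify_node(results: Optional[List[dict]]) -> str:
    """Return PASS, FAIL, or NO DATA for one node's parsed result entries."""
    # A missing or empty result file must never be treated as a pass.
    if not results:
        return "NO DATA"
    rows = [r for r in results if r.get("Name") == CHECK_NAME]
    if not rows:
        # The file was readable but holds no entry for this check.
        return "NO DATA"
    # Assumption: any status other than SUCCESS counts as a failure.
    if any(r.get("Status") != "SUCCESS" for r in rows):
        return "FAIL"
    return "PASS"


def summarize(per_node: Dict[str, Optional[List[dict]]]) -> Dict[str, str]:
    """Emit exactly one explicit row per node."""
    return {node: classify_node(results) for node, results in per_node.items()}
```

Returning an explicit value for every node is the point of the design: the absence of a failing row is never used as evidence of a pass.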

@1008covingtonlane 1008covingtonlane left a comment

Collaborator Author

Reviewed this end to end from the customer's standpoint (the network/DNS admin who hits it on a blocked update). This is a strong, well-built guide and is close to ready. A few notes, one worth correcting before merge.

What works well (customer lens)

  • The at-a-glance box (owner, impact, effort/downtime, "do not guess DNS IPs") sets expectations in four lines and correctly frames this as a customer network/DNS task, not a Microsoft or OEM issue.
  • Four discovery entry points (result files, Get-SolutionUpdate, the 17205 event log, the portal Updates tab) all converge on the same Detail, so whatever made the customer notice it, they land in the right place.
  • The all-nodes one-liner in Step 3 emits explicit PASS / NO DATA / failing rows, so a node is never silently treated as passing when its result file could not be read. That is exactly the right defensive shape.
  • Remediation is risk-labeled (LOW per-node re-point, MEDIUM upstream-server change) with rollback called out, and it correctly separates the node-client fix from the upstream-forwarder fix and the proxy self-skip path.

1. [Should fix] The "newer builds" successor name does not match the name the validator actually emits.
The guide says the test on recent builds is reported as AzStackHci_DNS_Test_External_Hostname_Resolution. Live telemetry on a current build (2607) reports the successor result name as AzStackHci_DNS_ExternalDnsResolution (validator Invoke-AzStackHciDNSValidation, include Test-ExternalDnsResolution). A customer on a newer build will grep for the exact name this note gives them, so an incorrect name sends them looking for a result that does not exist. Recommend changing the cited name to AzStackHci_DNS_ExternalDnsResolution. (The Test_External_Hostname_Resolution form looks like it was conflated with the include/test name Test-ExternalDnsResolution.)

2. [Minor] The successor signature has a retry suffix.
On the newer DNS validator the same failure line carries an attempt counter, for example ...for microsoft.com on V-HOST1 (Attempt: 3/3). Result returned 0 A records. Expected at least 1. The legacy connectivity name this TSG documents does not. Worth one clause in the Step 2 signatures note so a newer-build customer still recognizes the string (the guide already says the IP, node, and count vary; the (Attempt: n/3) suffix is the one extra variation).

3. [Optional] Link the network-requirements doc.
The at-a-glance and Step 5 reference "the public deployment network-requirements documentation" (and the validator's own Remediation field points there), but the guide does not link it. Adding the Learn link inline would save the customer a search, consistent with the sibling connectivity TSGs.

H1 leads with the canonical AzStackHci_Connectivity_Test_Dns so the validator-to-TSG map resolves it; failure signature, severity, proxy self-skip, and the port-53 reachability step all match the observed validator behavior. With finding 1 corrected this is good to merge.

…+ (Attempt: n/3), link network requirements

Review feedback on Azure#307, grounded in live EnvironmentValidatorResult telemetry:

- The newer-build successor is reported under TWO names that both appear on current
  builds: AzStackHci_DNS_ExternalDnsResolution (3,348 nodes, sol-builds up to
  12.2607) and AzStackHci_DNS_Test_External_Hostname_Resolution (5,366 nodes, up to
  12.2606). Name both so a customer greps the right one for their build, instead of a
  single name that is a dead end on the other builds.
- The dedicated DNS validator resolves management.azure.com (not microsoft.com),
  retries, and stamps an (Attempt: n/3) suffix with per-node bullets. Documented in
  the overview and added as a failure-signature variant (placeholder IP/node).
- Made the network-requirements reference a concrete inline link to the Azure Local
  firewall/network requirements doc (verified 200).

Lint grade A; fences balanced.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@1008covingtonlane

Collaborator Author

Addressed the review in c4fdb39, grounded in live EnvironmentValidatorResult telemetry (last 10 days):

  • Successor name (material): verified against telemetry before editing. There are actually two dedicated-DNS-validator names live on current builds, not one: AzStackHci_DNS_ExternalDnsResolution (3,348 nodes, solution builds up to 12.2607) and AzStackHci_DNS_Test_External_Hostname_Resolution (5,366 nodes, up to 12.2606). They coexist (different validator paths), so rather than swap one name for the other I now list both and tell the reader to search for either. Thanks for the catch, it was the right pull even though the resolution landed on "name both."
  • (Attempt: n/3) suffix (minor): confirmed. The dedicated validator also resolves management.azure.com (not microsoft.com), retries, and lists each failing node as its own bullet. Documented in the overview and added as a failure-signature variant (placeholder IP/node).
  • Network-requirements link (minor): made it a concrete inline link to the Azure Local network/firewall requirements doc (verified 200).
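
Searching for either successor name, alongside the legacy one, amounts to filtering results against a small set of names. A minimal sketch, assuming each parsed result entry exposes the check name under a `Name` key (an assumed field name) and using the hypothetical helper `find_dns_results`:

```python
from typing import List

# The legacy name plus the two dedicated-DNS-validator names cited in this PR.
DNS_CHECK_NAMES = frozenset({
    "AzStackHci_Connectivity_Test_Dns",
    "AzStackHci_DNS_ExternalDnsResolution",
    "AzStackHci_DNS_Test_External_Hostname_Resolution",
})


def find_dns_results(results: List[dict]) -> List[dict]:
    """Keep only the entries reported under any known external-DNS check name."""
    return [r for r in results if r.get("Name") in DNS_CHECK_NAMES]
```

Matching on exact names rather than a substring avoids picking up unrelated checks that merely contain "DNS" in their name.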

Static lint still grade A, fences balanced.

@1008covingtonlane 1008covingtonlane left a comment

Collaborator Author

Re-reviewed at c4fdb39. All three points are addressed, and the resolution on the material one is better than what I suggested.

  • Successor name: the telemetry-grounded "name both" is the right call. I had recommended swapping to AzStackHci_DNS_ExternalDnsResolution, but since both AzStackHci_DNS_ExternalDnsResolution and AzStackHci_DNS_Test_External_Hostname_Resolution are live across current builds on different validator paths, listing both and telling the reader to search for either is more complete than a single swap. Good pull on verifying against telemetry before editing.
  • Dedicated-validator signature: nicely done. The overview now calls out the management.azure.com hostname, the retry, and the per-node bullet layout, and Step 2 carries a dedicated fenced variant with the (Attempt: n/3) suffix, so a newer-build customer recognizes their exact string.
  • Network-requirements link: now a concrete inline link (confirmed it resolves 200).

Markdown is sound (fences balanced), prose stays clean. From the reviewer side this reads ready to merge.

1008covingtonlane and others added 2 commits June 27, 2026 12:53
Persona-panel feedback (13-reader usability panel, avg 4.5/5): the Remediation
Step 2 fix used <ManagementAdapter>/<dns1>/<dns2> placeholders a low-context
follower or beginner could not resolve. Add a short note + a Get-NetIPConfiguration
one-liner to identify the management adapter (match the node's management IP to an
InterfaceAlias), and state where the correct DNS values come from (the deployment's
documented management DNS, the same the healthy nodes use). The Set command itself
is unchanged. Resolves the panel's single recommended change (4 personas).

Lint grade A; fences balanced.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…izon cause, portal-lag note)

The tsg-validation-harness 13-persona usability panel (overall 4.6/5) surfaced a
3-way tie of top-overlap improvements; all are applied here, plus the live grade
card's own evidence-backed finding:

- Front-load the fix (generalist, partner-SI, field lens): a 'Most common fix (start
  here)' lead-in at the top of Remediation so a repeat professional sees the usual
  fix before the full diagnosis ladder.
- Glossary (new-grad, CSS engineer, accessibility lens): a short 'Terms used here'
  note defining A record, forwarder, and WinHTTP proxy for the least-expert reader.
- Add a cause (network engineer, OEM, deep-systems lens): name internal-only /
  split-horizon DNS zones shadowing external resolution as a distinct failure mode in
  step 4.
- Verify-the-fix portal lag (from the live run: the HealthCheckResult JSON was 64 min
  old): note that the portal and HealthCheckResult refresh only on a full health check
  / Invoke-SolutionUpdatePrecheck, so confirm on-node with Resolve-DnsName.

Lint grade A; fences balanced; no placeholder leaks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@1008covingtonlane

Collaborator Author

Applied usability feedback from a structured 13-reader persona review of this TSG (overall 4.6/5) in 35aaaf5. The review produced a 3-way tie of most-requested changes; all three are applied, plus a finding from a live validation run:

  • Front-load the fix (generalist sysadmin, partner/SI, field-urgency lens): a Most common fix (start here) lead-in at the top of Remediation, so a repeat professional sees the usual fix before the full diagnosis ladder.
  • Glossary (new-grad, CSS engineer, accessibility lens): a short Terms used here note defining A record, forwarder, and WinHTTP proxy for the least-expert reader.
  • Name an additional cause (network engineer, OEM field, deep-systems lens): internal-only / split-horizon DNS zones shadowing external resolution, added as a distinct option in step 4.
  • Portal-lag note (from a live run where the HealthCheckResult JSON was 64 min old): the portal and the cluster-wide health file refresh only on a full health check / Invoke-SolutionUpdatePrecheck, so confirm the fix on-node with Resolve-DnsName rather than waiting on the portal.

Lint grade A, fences balanced, no placeholder leaks.

1008covingtonlane and others added 3 commits June 27, 2026 14:37
Persona-panel feedback (IT director + CSAM, the planning-focused readers): add a
rough time-to-resolve so a leader can set an SLA. ~15-30 min per affected node for a
DNS-client fix; longer if an upstream DNS server change must be coordinated with its
owner. Lint A.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ean)

Persona-panel work item (the new-grad reader, the only below-5 row): the inline
'Terms used here' block sat in the Remediation flow and the reader also wanted
split-horizon defined. Move the definitions to a '## Glossary' at the end of the
guide (out of power users' way) and replace the inline block with a one-line pointer.
Expanded: adds DNS server/resolver, conditional forwarder, and split-horizon, alongside
A record, forwarder, and WinHTTP proxy. Lint A; anchor link resolves.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ory image)

Persona-panel work item (the OEM/field-engineer reader, the last below-5 row): add a
short 'Where a node's DNS comes from' note after the ownership paragraph, so an OEM can
check their imaging process. Clarifies that node DNS is set at deployment from the
management network settings and is not baked into the factory image, so the image should
leave DNS to deployment rather than pin a stale value. Lint A.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2 participants