Add Environment Validator TSG: AzStackHci_Connectivity_Test_Dns (Connectivity Test DNS) #307
…ectivity Test DNS) Adds a public remediation guide for the AzStackHci_Connectivity_Test_Dns Environment Validator check, and indexes it in the EnvironmentValidator README. The check verifies that each node can resolve an external public name (microsoft.com, an A record) against every DNS server configured on its up adapters. Severity is Critical, and a DNS failure also halts the remaining connectivity tests, so it commonly blocks a pending deployment or update. The guide covers how to confirm the failure (result files, Get-SolutionUpdate, event log, portal), the real failure signatures, how to identify the affected nodes, the consequences, a per-node remediation (verify and re-point the node DNS client, or fix the upstream DNS server forwarder/reachability), and how to re-validate. It frames ownership as a customer network/DNS task, not a Microsoft or OEM issue, and leads with an at-a-glance summary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a new Environment Validator remediation Troubleshooting Guide (TSG) for the AzStackHci_Connectivity_Test_Dns check (Test DNS) and links it from the EnvironmentValidator index so it can be discovered from the component README.
Changes:
- Added a new TSG describing how to confirm, diagnose, remediate, and verify failures of `AzStackHci_Connectivity_Test_Dns`.
- Updated the EnvironmentValidator README to include the new TSG link.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TSG/EnvironmentValidator/Troubleshooting-Connectivity-Test-Dns.md | New TSG covering failure signatures, remediation steps, and verification for the DNS connectivity validator. |
| TSG/EnvironmentValidator/README.md | Adds an index entry pointing to the new DNS TSG. |
…s a row in Step 3) Two robustness fixes from the bot review on PR Azure#307: - Option A: guard $base and $latest for null (HealthCheck folder missing after the fallback, or no result JSON yet) and print a friendly Warning instead of throwing on the common 'no results found' case. - Step 3: reuse the same ClusterStorage $base fallback, and emit an explicit PASS or NO DATA row per node so a node is never silently treated as passing when its result simply could not be read; soften the conclusion to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1008covingtonlane left a comment
Reviewed this end to end from the customer's standpoint (the network/DNS admin who hits it on a blocked update). This is a strong, well-built guide and is close to ready. A few notes, one worth correcting before merge.
What works well (customer lens)
- The at-a-glance box (owner, impact, effort/downtime, "do not guess DNS IPs") sets expectations in four lines and correctly frames this as a customer network/DNS task, not a Microsoft or OEM issue.
- Four discovery entry points (result files, `Get-SolutionUpdate`, the 17205 event log, the portal Updates tab) all converge on the same `Detail`, so whatever made the customer notice it, they land in the right place.
- The all-nodes one-liner in Step 3 emits explicit `PASS` / `NO DATA` / failing rows, so a node is never silently treated as passing when its result file could not be read. That is exactly the right defensive shape.
- Remediation is risk-labeled (LOW per-node re-point, MEDIUM upstream-server change) with rollback called out, and it correctly separates the node-client fix from the upstream-forwarder fix and the proxy self-skip path.
1. [Should fix] The "newer builds" successor name does not match the name the validator actually emits.
The guide says the test on recent builds is reported as AzStackHci_DNS_Test_External_Hostname_Resolution. Live telemetry on a current build (2607) reports the successor result name as AzStackHci_DNS_ExternalDnsResolution (validator Invoke-AzStackHciDNSValidation, include Test-ExternalDnsResolution). A customer on a newer build will grep for the exact name this note gives them, so an incorrect name sends them looking for a result that does not exist. Recommend changing the cited name to AzStackHci_DNS_ExternalDnsResolution. (The Test_External_Hostname_Resolution form looks like it was conflated with the include/test name Test-ExternalDnsResolution.)
2. [Minor] The successor signature has a retry suffix.
On the newer DNS validator the same failure line carries an attempt counter, for example ...for microsoft.com on V-HOST1 (Attempt: 3/3). Result returned 0 A records. Expected at least 1. The legacy connectivity name this TSG documents does not. Worth one clause in the Step 2 signatures note so a newer-build customer still recognizes the string (the guide already says the IP, node, and count vary; the (Attempt: n/3) suffix is the one extra variation).
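To make that variation concrete, here is a minimal sketch of a pattern that recognizes the failure line with or without the attempt suffix. The function name `Test-DnsFailureSignature` and the regular expression are illustrative assumptions derived only from the example strings quoted in this thread; they are not taken from the validator source, and the leading part of the newer string (elided above) is deliberately not matched.

```powershell
# Hypothetical helper: the pattern is built from the example strings in this
# review, not from the validator's own code.
function Test-DnsFailureSignature {
    param([Parameter(Mandatory)][string]$Detail)

    # Host name, node, and record count vary; '(Attempt: n/m)' is optional.
    $pattern = 'for (?<Name>\S+) on (?<Node>\S+?)(?: \(Attempt: (?<Try>\d+)/(?<Max>\d+)\))?\. ' +
               'Result returned (?<Count>\d+) A records\. Expected at least 1\.'

    if ($Detail -match $pattern) {
        [pscustomobject]@{
            Name    = $Matches['Name']
            Node    = $Matches['Node']
            Attempt = $Matches['Try']          # $null on the legacy signature
            Count   = [int]$Matches['Count']
        }
    }
}
```

The function returns nothing for strings that do not match, such as `No DNS server configured`, so that case still needs its own check.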
3. [Optional] Link the network-requirements doc.
The at-a-glance and Step 5 reference "the public deployment network-requirements documentation" (and the validator's own Remediation field points there), but the guide does not link it. Adding the Learn link inline would save the customer a search, consistent with the sibling connectivity TSGs.
H1 leads with the canonical AzStackHci_Connectivity_Test_Dns so the validator-to-TSG map resolves it; failure signature, severity, proxy self-skip, and the port-53 reachability step all match the observed validator behavior. With finding 1 corrected this is good to merge.
…+ (Attempt: n/3), link network requirements Review feedback on Azure#307, grounded in live EnvironmentValidatorResult telemetry: - The newer-build successor is reported under TWO names that both appear on current builds: AzStackHci_DNS_ExternalDnsResolution (3,348 nodes, sol-builds up to 12.2607) and AzStackHci_DNS_Test_External_Hostname_Resolution (5,366 nodes, up to 12.2606). Name both so a customer greps the right one for their build, instead of a single name that is a dead end on the other builds. - The dedicated DNS validator resolves management.azure.com (not microsoft.com), retries, and stamps an (Attempt: n/3) suffix with per-node bullets. Documented in the overview and added as a failure-signature variant (placeholder IP/node). - Made the network-requirements reference a concrete inline link to the Azure Local firewall/network requirements doc (verified 200). Lint grade A; fences balanced. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addressed the review in c4fdb39, grounded in live `EnvironmentValidatorResult` telemetry.
Static lint still grade A, fences balanced.
1008covingtonlane left a comment
Re-reviewed at c4fdb39. All three points are addressed, and the resolution on the material one is better than what I suggested.
- Successor name: the telemetry-grounded "name both" is the right call. I had recommended swapping to `AzStackHci_DNS_ExternalDnsResolution`, but since both `AzStackHci_DNS_ExternalDnsResolution` and `AzStackHci_DNS_Test_External_Hostname_Resolution` are live across current builds on different validator paths, listing both and telling the reader to search for either is more complete than a single swap. Good pull on verifying against telemetry before editing.
- Dedicated-validator signature: nicely done. The overview now calls out the `management.azure.com` hostname, the retry, and the per-node bullet layout, and Step 2 carries a dedicated fenced variant with the `(Attempt: n/3)` suffix, so a newer-build customer recognizes their exact string.
- Network-requirements link: now a concrete inline link (confirmed it resolves 200).
Markdown is sound (fences balanced), prose stays clean. From the reviewer side this reads ready to merge.
Persona-panel feedback (13-reader usability panel, avg 4.5/5): the Remediation Step 2 fix used <ManagementAdapter>/<dns1>/<dns2> placeholders a low-context follower or beginner could not resolve. Add a short note + a Get-NetIPConfiguration one-liner to identify the management adapter (match the node's management IP to an InterfaceAlias), and state where the correct DNS values come from (the deployment's documented management DNS, the same the healthy nodes use). The Set command itself is unchanged. Resolves the panel's single recommended change (4 personas). Lint grade A; fences balanced. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
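The adapter lookup described in this commit can be sketched roughly as follows. This is not the one-liner from the guide itself: `Find-AdapterByIPv4` is a hypothetical helper name, and it only assumes that each configuration object exposes `InterfaceAlias` and `IPv4Address.IPAddress`, as the objects returned by `Get-NetIPConfiguration` do.

```powershell
# Hypothetical helper: returns the configuration entries whose IPv4 address
# matches the node's management IP.
function Find-AdapterByIPv4 {
    param(
        [Parameter(Mandatory)][string]$IPv4Address,
        # Typically the output of Get-NetIPConfiguration on the node.
        [Parameter(Mandatory)][object[]]$Configuration
    )

    $Configuration | Where-Object { $_.IPv4Address.IPAddress -contains $IPv4Address }
}
```

On a node, it would be called as `Find-AdapterByIPv4 -IPv4Address <management IP> -Configuration (Get-NetIPConfiguration)`. The `InterfaceAlias` of the result can then be passed to `Get-DnsClientServerAddress -InterfaceAlias <alias> -AddressFamily IPv4` to see the DNS servers currently set on that adapter before changing anything.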
…izon cause, portal-lag note) The tsg-validation-harness 13-persona usability panel (overall 4.6/5) surfaced a 3-way tie of top-overlap improvements; all are applied here, plus the live grade card's own evidence-backed finding: - Front-load the fix (generalist, partner-SI, field lens): a 'Most common fix (start here)' lead-in at the top of Remediation so a repeat professional sees the usual fix before the full diagnosis ladder. - Glossary (new-grad, CSS engineer, accessibility lens): a short 'Terms used here' note defining A record, forwarder, and WinHTTP proxy for the least-expert reader. - Add a cause (network engineer, OEM, deep-systems lens): name internal-only / split-horizon DNS zones shadowing external resolution as a distinct failure mode in step 4. - Verify-the-fix portal lag (from the live run: the HealthCheckResult JSON was 64 min old): note that the portal and HealthCheckResult refresh only on a full health check / Invoke-SolutionUpdatePrecheck, so confirm on-node with Resolve-DnsName. Lint grade A; fences balanced; no placeholder leaks. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
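The on-node confirmation mentioned in this commit could look something like the sketch below, which queries every IPv4 DNS server on every up adapter and reports whether at least one A record came back. It mirrors the check as described in this PR but is not the validator's own code; the function name `Test-ExternalNamePerDnsServer`, its parameters, and the injectable `-Resolver` script block are assumptions made for illustration.

```powershell
# Sketch only: mirrors the check as described in this PR (each configured DNS
# server must return at least one A record). It is not the validator itself.
function Test-ExternalNamePerDnsServer {
    param(
        [string]$Name = 'microsoft.com',
        [string[]]$DnsServer,
        # Injectable for dry runs; the default issues a real DNS query.
        [scriptblock]$Resolver = {
            param($n, $s)
            Resolve-DnsName -Name $n -Type A -Server $s -DnsOnly -ErrorAction SilentlyContinue
        }
    )

    if (-not $DnsServer) {
        # Collect the IPv4 DNS servers of adapters that are currently up.
        $DnsServer = foreach ($adapter in (Get-NetAdapter | Where-Object Status -eq 'Up')) {
            (Get-DnsClientServerAddress -InterfaceIndex $adapter.ifIndex -AddressFamily IPv4).ServerAddresses
        }
    }
    if (-not $DnsServer) {
        Write-Warning 'No DNS server configured on any up adapter.'
        return
    }

    foreach ($server in ($DnsServer | Select-Object -Unique)) {
        # Count only A records; a reply may also carry CNAME entries.
        $records = @(& $Resolver $Name $server | Where-Object { $_.Type -eq 'A' })
        $result  = if ($records.Count -ge 1) { 'PASS' } else { 'FAIL' }

        [pscustomobject]@{
            Node      = $env:COMPUTERNAME
            DnsServer = $server
            ARecords  = $records.Count
            Result    = $result
        }
    }
}
```

Run with no arguments on the affected node, it emits one row per DNS server, so a single bad server stands out even when the others resolve. Because the portal and `HealthCheckResult` lag behind, a `PASS` on every row here is only an early signal; the next full health check still has to confirm it.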
Applied usability feedback from a structured 13-reader persona review of this TSG (overall 4.6/5) in 35aaaf5. The review produced a 3-way tie of most-requested changes; all three are applied, plus a finding from a live validation run:
Lint grade A, fences balanced, no placeholder leaks.
Persona-panel feedback (IT director + CSAM, the planning-focused readers): add a rough time-to-resolve so a leader can set an SLA. ~15-30 min per affected node for a DNS-client fix; longer if an upstream DNS server change must be coordinated with its owner. Lint A. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ean) Persona-panel work item (the new-grad reader, the only below-5 row): the inline 'Terms used here' block sat in the Remediation flow and the reader also wanted split-horizon defined. Move the definitions to a '## Glossary' at the end of the guide (out of power users' way) and replace the inline block with a one-line pointer. Expanded: adds DNS server/resolver, conditional forwarder, and split-horizon, alongside A record, forwarder, and WinHTTP proxy. Lint A; anchor link resolves. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ory image) Persona-panel work item (the OEM/field-engineer reader, the last below-5 row): add a short 'Where a node's DNS comes from' note after the ownership paragraph, so an OEM can check their imaging process. Clarifies that node DNS is set at deployment from the management network settings and is not baked into the factory image, so the image should leave DNS to deployment rather than pin a stale value. Lint A. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
Adds a public remediation TSG for the Environment Validator check `AzStackHci_Connectivity_Test_Dns` (display name Test DNS), and indexes it in the EnvironmentValidator README.
The check
On every node, for each DNS server configured on every up network adapter, the check resolves an external public name (`microsoft.com`, an A record) and expects at least one record back. It is Critical: when it fails and no proxy is configured, the validator also stops the remaining connectivity tests, so it commonly blocks a pending deployment or update. It runs at pre-deployment readiness, deployment, add-node, and the pre-update health check.
What the guide covers
The six required sections:
- … `Get-SolutionUpdate`, the `AzStackHciEnvironmentChecker` event log (Event ID 17205), and the Azure portal Updates tab.
- … `Detail` strings (`Queried dns server <ip> for microsoft.com on <node>. Result returned 0 A records. Expected at least 1.` and `No DNS server configured`), plus the proxy self-skip behavior.
- … `Invoke-SolutionUpdatePrecheck` and confirm resolution.

It leads with an at-a-glance box (owner, impact, effort, downtime) and frames ownership as a customer network/DNS task, explicitly not a Microsoft or OEM issue.
Notes for reviewers
- … `AzStackHci_DNS_Test_External_Hostname_Resolution`; the cause and fix are identical, only the validator name differs. The guide notes this. This TSG covers the `AzStackHci_Connectivity_Test_Dns` name still emitted by deployed clusters.