AI Data Center Optics: Validation & Testing Handbook 2026

Jun 29, 2026 · Rajesh Shekhawat

AI Data Center · Operator Handbook

What test equipment vendors don't write. What transceiver vendors don't write. What you need — from incoming sample to RMA evidence package.

For Tier-2 AI data center operators, neoclouds, and system integrators.

23 chapters · 5 parts · 19 reference tables · 15 reference cards

For the engineers commissioning and running optical interconnects in AI data center fabrics — 100G through 1.6T transceivers, AOC, AEC, ACC, DAC.

Three different organizations test the modules in your fabric, and each one is doing a different job. The manufacturer tests every unit before shipment, against the spec it published. The platform owner — NVIDIA LinkX, your switch vendor's compatibility program, the hyperscaler reference design — qualifies a SKU on a representative host. And then your engineer plugs the module into your fabric, in your rack, with your cabling, under your workload.

Between those three layers is a gap. Manufacturer testing tells you the unit met spec at one moment in a tester's lab. Platform qualification tells you the SKU has been demonstrated on a representative platform. Neither tells you whether this module, in your fabric, under your workload, will run reliably for the next four to five years. That last question is yours, and it's the question this handbook answers.

Closing the operator's gap is not optional work. It is the work.

Standards-aware, practice-grounded, honest about the gap between them. Where IEEE, OIF, OCP, ANSI, IEC, and Telcordia specify a procedure, we cite it. Where the procedure is impractical at AI-fabric scale or budget, we say so plainly — and we describe what experienced operators actually do instead. Those passages are marked.

Why this handbook exists

A lot is published in this space. Test-equipment vendors (Keysight, VIAVI, Anritsu, MultiLane) publish for the engineer running a $200,000 BERT in a manufacturing line, validating modules against the IEEE specification. Transceiver vendors (FS, NADDOD, ourselves included) publish product catalogs and 100G-tier troubleshooting guides. Switch vendors (Cisco, Juniper, Arista, NVIDIA) publish their own qualified compatibility lists, which tell you which optics they've tested on their switches.

None of that is written for you. The Tier-2 AI data center, the neocloud, the system integrator delivering a 1,024-GPU SuperPOD reference, the enterprise running production inference — there is no working playbook for what testing looks like from the operator's side. The validation procedures, the sample-size math, the bring-up checks, the production telemetry, the triage discipline — operators figure it out from scratch, share it informally between teams, and carry the institutional knowledge in senior engineers' heads.

This handbook is what we wrote because no one else did. Written from the operator's side of the conversation. AI-data-center-specific throughout — SuperPOD references, ConnectX NICs, NCCL acceptance, neocloud reproducibility, the form-factor traps that catch deployments at receiving. Practical, honest about what the standards say versus what works at scale, structured to be flipped open at any chapter when you need that day's answer.

A note on the numbers: this handbook contains specific figures throughout — equipment prices, lead times, defect-rate benchmarks, fleet AFR data, named vendor models. They are indicative, not authoritative. Pricing varies with configuration, region, and channel. Lead times shift with supply-chain conditions. Confirm current details with vendors before any procurement decision.

What’s inside

Part I · Foundation · Ch 1–2
Just enough grounding to make every validation and testing decision in the rest of the handbook make sense: no more, no less.

Part II · Sample testing & incoming inspection · Ch 3–7
The operator-side testing program: why operators run their own, with what, and what the data actually justifies.

Part III · Validation at bring-up · Ch 8–15
From the moment a module arrives at the dock to the moment the fabric is handed to the workload team: the practical playbook for commissioning AI fabric interconnects.

Part IV · Validation in production · Ch 16–19
From the moment a link enters production until the moment it is retired: the telemetry, alerting, triage, and vendor-management discipline that keeps a fabric healthy.

Part V · Reference · Ch 20–23
The reference you'll come back to during commissioning, triage, and procurement. Four chapters of consolidated tables, standards citations, definitions, and sources.

Part One

Foundation

Just enough grounding to make every validation and testing decision in the rest of the handbook make sense — no more, no less.

Chapter One

What you're testing

Before any validation or testing decision makes sense, you need a tight working map of the territory: the platforms you'll encounter, the protocols that shape every test you'll run, the interconnect types whose failure modes you're trying to catch, and the form factors that determine what plugs into what. This chapter is that map. It is intentionally compressed; the reader who needs the broader landscape can find it in NVIDIA's reference designs, OCP papers, and any number of vendor handbooks. The reader here came for validation and testing depth, and that begins in Chapter 3.

1.1 · The deployment landscape in 2026

An AI fabric is not just a set of switches. It is a stack: GPU systems at one end, NICs and host SerDes connecting compute to fabric, switches in the middle, optics and cables stitching it all together, and a reference design or deployment pattern shaping the whole thing. Every layer in this stack imposes constraints on what testing looks like. We start with the switching layer because it shapes most validation decisions downstream, but the rest of the stack is just as relevant — and we cover it in this section.

The switches

An AI fabric in 2026 sits on top of one of a handful of switching platforms, and the platform shapes every operational decision downstream. The brands you'll encounter:

NVIDIA Quantum and Quantum-X. InfiniBand-native. Quantum-2 (NDR, 400G) shipped in volume from 2022; Quantum-X800 (XDR, 800G) shipped in 2024 and is the current platform for most large GPU clusters. Quantum-X Photonics, NVIDIA's first co-packaged-optics InfiniBand switch, is shipping to a small set of named hyperscalers as of late 2025. If your fabric is InfiniBand, you're almost certainly on NVIDIA Quantum.

NVIDIA Spectrum-X. NVIDIA's Ethernet-for-AI line. Spectrum-4 (51.2 Tbps, 400G/800G) shipped from 2023 and is the volume product for AI Ethernet today. Spectrum-X800 began shipping in 2024 with adaptive routing and packet-spraying tuned for AI traffic patterns. Spectrum-X Photonics — the Ethernet CPO version — is announced for second half of 2026.

Arista 7060X and 7800R. The 7060X6 (51.2 Tbps, 64×800G) is the volume Arista AI-fabric switch in 2026. EOS is the operating system across the line; Arista's compatibility programs are documented, and their published positions on linear pluggable optics (LPO) and on third-party transceiver support are operator-friendly relative to most peers.

Cisco Nexus and Silicon One. Silicon One G200 powered the Nexus 9300 generation through 2025; G300 launched in early 2026 with 200G-per-lane SerDes and explicit LPO and CPO support. Cisco's qualification programs are stricter than Arista's by default; expect more pairwise certification work if you're building on Cisco.

White-box SONiC. The hyperscaler choice — Microsoft, Alibaba, Meta, ByteDance — and increasingly the choice of well-engineered Tier 2 operators. SONiC runs on Broadcom Tomahawk silicon (Tomahawk 4, 5, and 6) and on Cisco Silicon One-based white boxes. Switch hardware comes from Celestica, Edgecore, Quanta, Inventec, Wiwynn, and others. SONiC is the most flexible platform for multi-vendor interconnect deployments and the one whose validation story depends most on what you build yourself.

One thing more: Broadcom Tomahawk silicon underlies a substantial fraction of every brand listed above. Tomahawk 4 (25.6 Tbps), Tomahawk 5 (51.2 Tbps), and Tomahawk 6 (102.4 Tbps, with the Davisson CPO variant shipping in October 2025) are inside Arista 7060X, many SONiC white boxes, and historically many Cisco Nexus designs. When two switches "from different vendors" share a Tomahawk core, their per-port behavior is more similar than the brand difference suggests.

Table 1 · Platform Landscape

The five platform families you'll meet in 2026

Compressed reference. Detailed compatibility matrices and validated configurations are vendor-specific and change frequently — confirm the current matrix at procurement time.

Platform family	Volume product (2026)	Native protocol	Operator-relevant traits
NVIDIA Quantum-X	Quantum-X800 (XDR 800G)	InfiniBand	Tightest qualification; LinkX is NVIDIA's first-party validation list, but properly validated third-party optics interoperate; CPO variant for 1H-2026 hyperscalers only
NVIDIA Spectrum-X	Spectrum-4 (51.2 Tbps)	Ethernet (RoCEv2)	Adaptive routing & packet spraying tuned for AI; SuperNIC pairing with ConnectX-7/8 mandatory for spec performance
Arista	7060X6 (51.2 Tbps)	Ethernet	Permissive third-party optics policy; documented LPO support; EOS minima per SKU published
Cisco Nexus / Silicon One	Nexus 9364E-SG2X (G200)	Ethernet	Stricter qualification gates by default; %SFP-4-UNSUPPORTED_SENSE warnings unless override configured
White-box SONiC	Tomahawk-5/6 boxes (multi-vendor)	Ethernet (RoCEv2)	Most operator flexibility; least vendor handholding; CMIS support varies by sonic-platform package

The GPU systems and reference designs

The other side of every fabric link is a GPU system or a NIC in a host. The shape of that system constrains how you test what plugs into it. The reference designs and platforms an operator encounters in 2026:

NVIDIA SuperPOD reference architectures. The DGX SuperPOD is the canonical deployment pattern most enterprise and Tier-2 buyers ask system integrators to deliver. SuperPOD reference designs are released for each GPU generation: the DGX H100 SuperPOD (8-GPU H100 nodes, 256 to 1,024 GPUs typically), the DGX H200 SuperPOD, the DGX B200 / GB200 SuperPOD, and the GB300 NVL72 system. Each reference defines the InfiniBand topology (rail-optimized leaf-spine), the cable lengths, the optic SKUs validated by NVIDIA LinkX, and the management network. If your customer says "we're deploying a SuperPOD," they mean a specific NVIDIA reference architecture with specific validated optics and cables — not a free-for-all.

NVIDIA HGX baseboards in OEM systems. HGX is the GPU baseboard that goes into Dell PowerEdge XE9680 / XE9712, Supermicro SuperBlade and 8U SuperServer, HPE Cray XD, Lenovo ThinkSystem SR685a, and many others. The system vendor adds chassis, cooling, NICs, and BMC; the GPU baseboard is the same NVIDIA-supplied module. Validation expectations sit between SuperPOD (highly prescribed) and white-box (free-for-all) — the system vendor publishes validated configurations, and customers expect testing against those.

Custom GPU servers. A growing share of large operators design their own GPU systems — Meta's Grand Teton, Microsoft's custom AI servers, ByteDance's designs. These don't use HGX baseboards; they integrate GPU silicon directly with custom backplanes. The validation surface is whatever the operator defines internally, which means the testing program is more bespoke and there's less industry-shared knowledge to lean on.

Neocloud and AI-cloud reference patterns. Operators like CoreWeave, Lambda, Crusoe, Together AI, Voltage Park, Nebius, and a long tail of regional providers buy GPU systems (mostly DGX or HGX-based), assemble them into clusters, and rent capacity. Most neoclouds are deploying SuperPOD or near-SuperPOD reference architectures because customer expectations are set by NVIDIA reference benchmarks. Their testing concern is reproducibility: every cluster they bring up has to produce the same MLPerf-style numbers a customer can compare against. That puts testing emphasis on per-rail parity and fabric-readiness validation.

System integrator (sysint) deployments. Many enterprise AI buyers do not buy direct from NVIDIA or assemble their own. They buy through a system integrator — World Wide Technology, Penguin Computing, Mark III, ePlus, Connection — that delivers a turnkey cluster against a reference design. The sysint owns bring-up validation; the enterprise owns production validation; the reference design owns the test plan. If you're a sysint, the validation in this handbook is what you're selling.

The NICs and host SerDes

Optics and cables don't plug into switches in isolation; on the GPU side they plug into NICs (or SuperNICs, in NVIDIA's vocabulary). The NIC matters for testing because the host-side electrical channel — the SerDes between NIC and module — is half the link, and host-side problems present as module problems on every diagnostic an operator runs.

The NICs you'll meet:

NVIDIA ConnectX-7 — 400G InfiniBand and Ethernet, the volume NIC for H100 / H200 / B100 generations. Single-port and dual-port variants. Used in DGX systems and HGX-based OEM servers.

NVIDIA ConnectX-8 SuperNIC — 800G, paired with B200 / GB200 / GB300 systems. The "SuperNIC" branding indicates Spectrum-X-aware features (adaptive routing acknowledgment, performance isolation primitives). For 800G Spectrum-X fabrics, ConnectX-8 SuperNIC pairing is required for spec performance, not optional.

NVIDIA BlueField-3 / BlueField-4 DPUs — data-processing units that combine NIC functionality with on-board ARM compute. Used where workload offload (storage protocols, security, telemetry) lives on the NIC rather than the host CPU. Validation surface includes the DPU's own firmware and OS.

Broadcom Thor / Thor 2 NICs — Broadcom's Ethernet AI NIC line, used in many non-NVIDIA Ethernet AI fabrics and in some hyperscaler builds.

AMD Pensando NICs — AMD's NIC offering, used in some enterprise AI and increasingly in MI300-series accelerator deployments.

For testing purposes, the most important fact about any NIC is its host-side SerDes specification — the electrical interface between NIC ASIC and pluggable module. Mismatches between what the module expects on its host-side electrical input and what the NIC drives are responsible for a substantial fraction of "bad module" reports that turn out to be host configuration issues. Section 1.4 returns to this when we cover the form-factor traps; Chapters 6 and 12 cover what you actually do about it during testing and bring-up.

Putting the layers together

The rest of this handbook covers fiber, optics, cables, switches, and validation procedures. That coverage assumes one of the deployment patterns above. Where the testing emphasis differs by deployment type — and it differs more than most validation references admit — we say so explicitly. Section 1.5 below is the cross-reference: testing emphasis by deployment scenario.

1.2 · Ethernet vs InfiniBand — and what the choice means for testing

Two fabric protocols dominate AI deployments in 2026. The choice has real implications for what testing emphasizes.

InfiniBand (NVIDIA Quantum-X, Quantum-2). Lossless by design. Tightly coupled with NVIDIA NICs (ConnectX-7, ConnectX-8). Most SuperPOD reference deployments. Testing emphasizes: NCCL all-reduce throughput, ibdiagnet topology validation, congestion behavior under sustained collective operations. Compatibility is narrow but well-documented — NVIDIA's LinkX list is the first-party qualification reference, and if your switch + NIC + optic combination appears on it you can lean on qualification work already done; third-party optics validated for the same combination interoperate equally well, the validation work just falls to you.

Ethernet with RoCEv2 (NVIDIA Spectrum-X, Arista 7060X / 7800R, Cisco Nexus, white-box SONiC). Lossless via PFC and ECN — but only if PFC and ECN are configured correctly. Testing emphasizes: PFC frame counters under load, ECN mark rates, tail-latency stability, bandwidth utilization fairness across ECMP. Compatibility is broader (more switch and optic vendors qualify) but requires more configuration discipline. The fabric is only as lossless as your PFC tuning.

Within either protocol, the validation discipline in this handbook is the same. The differences appear in the fabric-readiness phase (Chapter 14) where the test workloads, counter sets, and pass/fail criteria are protocol-specific.
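The RoCEv2 emphasis above (PFC frame counters under load, ECN mark rates) usually comes down to comparing two counter snapshots taken around a load test. The following is a minimal sketch of that arithmetic only. The `PortSnapshot` fields are hypothetical names, not any switch vendor's counter names, and how you collect the snapshots depends on your platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSnapshot:
    """One reading of a port's cumulative counters (field names are illustrative)."""
    timestamp_s: float
    tx_packets: int
    ecn_marked_packets: int
    pfc_pause_frames_rx: int


def counter_rates(before: PortSnapshot, after: PortSnapshot) -> dict:
    """Turn two cumulative-counter snapshots into rates over the interval."""
    elapsed = after.timestamp_s - before.timestamp_s
    if elapsed <= 0:
        raise ValueError("snapshots must be in chronological order")
    tx = after.tx_packets - before.tx_packets
    marked = after.ecn_marked_packets - before.ecn_marked_packets
    pauses = after.pfc_pause_frames_rx - before.pfc_pause_frames_rx
    return {
        # PFC pause frames received per second during the test window
        "pfc_pause_per_s": pauses / elapsed,
        # Fraction of transmitted packets that carried an ECN mark
        "ecn_mark_fraction": marked / tx if tx > 0 else 0.0,
    }


before = PortSnapshot(0.0, tx_packets=1000, ecn_marked_packets=10, pfc_pause_frames_rx=5)
after = PortSnapshot(10.0, tx_packets=2000, ecn_marked_packets=60, pfc_pause_frames_rx=25)
print(counter_rates(before, after))
```

Working from deltas rather than raw counter values keeps the result independent of how long the port has been up. What counts as an acceptable rate is a pass/fail decision for the fabric-readiness phase, not something this sketch sets.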

1.3 · Interconnect types at the depth needed to test them

An AI fabric uses three categories of interconnect, and they have different validation profiles:

Optical transceivers — pluggable modules with electrical input, optical output. Reach 100m–10km depending on the optic class. Most AI fabrics use SR8 or DR8 in 800G OSFP for short-reach (rack and row), and 2xFR4 for the longer hops. Validation: incoming inspection, DOM baseline, FEC behavior, thermal margin, mating cycle.

DAC / passive copper — direct-attach copper cables. Reach 0.5–2m at 800G. No electronics, no DOM telemetry. Validation: visual inspection, link-up confirmation, BER under PRBS. Less can go wrong; less to monitor in production.

AEC / ACC / AOC — active electrical and active optical cables. AEC carries the longest reach in copper at 800G (3–7m), making it the right choice for the dense GPU-to-NIC short links where the cable bundle would otherwise block airflow. ACC is similar at lower power. AOC is short-reach optical inside a fixed-length assembly. Validation: DOM telemetry from the active electronics, link integrity under thermal stress, mating-cycle limits at the connector.

For most Tier-2 AI deployments, validation effort is dominated by transceivers (most defects, most variability) followed by AEC for the dense GPU-side bundles. Pure DAC validation is the lightest because there's the least electronics to fail.

1.4 · Form factors and what they imply for testing

The form factor — OSFP, QSFP-DD, QSFP-DD800, OSFP1600, OSFP-XD — determines mechanical compatibility, thermal envelope, electrical interface, and which switches a module fits in. For 2026 AI deployments at 800G the volume choice is OSFP (NVIDIA Quantum-X / Spectrum-X) with QSFP-DD800 in QSFP-DD-legacy environments. At 1.6T it's OSFP1600 cage-compatible with OSFP800, with OSFP-XD in hyperscaler-only Ethernet builds.

Reference Card 13 · Form factor selection

Form factor decision matrix — 2026

Eight form factors that matter in 2026 across 100G–1.6T. Where each one wins, where each one loses, and which ones the market is actually consolidating around for new deployments.

2026 form factor matrix

Form factor	Lanes · power	Speed	Status (2026)	Where it wins · where it loses
QSFP28	4 × 25G NRZ · 3.5–4.5 W	100G	Volume, mature	The 100G workhorse · still huge installed base · breakouts from QSFP-DD remain common
QSFP-DD	8 × 50G or 4 × 100G PAM4 · 7–14 W	200–400G	Volume, mature	The 400G volume form factor · cage-compatible with QSFP-DD800 for in-place upgrade
QSFP-DD800	8 × 100G PAM4 · 14–17 W	800G	Volume	800G in QSFP form · backward-compatible cages · lighter thermal envelope than OSFP
OSFP (IHS)	8 × 100G PAM4 · 15–18 W	800G	NVIDIA standard	NVIDIA Quantum-X · Spectrum-X choice · larger thermal envelope · LinkX-validated
OSFP-RHS	8 × 100G PAM4 · 15–18 W	800G	Limited	For switches with cage-mounted heat sinks · NOT interchangeable with IHS
OSFP1600	8 × 200G PAM4 · ~30 W	1.6T	NVIDIA 1.6T	Cage-compatible with OSFP800 · NVIDIA InfiniBand 1.6T standard
OSFP-XD	16 × 100G or 16 × 200G · up to 40 W	1.6T now, 3.2T future	Hyperscaler	Hyperscaler 1.6T Ethernet choice · 3.2T-ready · NOT interchangeable with OSFP
SFP56 / SFP112	1 lane PAM4 · 1.5–3.5 W	50G / 100G	Niche	Storage fabric · breakout legs · single-lane uplinks · low-power applications

Consolidating · OSFP for 800G, OSFP1600 for 1.6T

NVIDIA's choice for both Quantum-X and Spectrum-X — making it the volume choice for AI fabrics broadly. OSFP1600 is cage-compatible with OSFP800, smoothing in-place 800G→1.6T upgrades.

Second-place · QSFP-DD800 for QSFP-stack continuity

Where existing QSFP-DD 400G deployments need to upgrade in place without re-cabling. Real but minority share at 800G; expected to fade as 1.6T deployments use OSFP1600.

Hyperscaler-only · OSFP-XD at 1.6T Ethernet

Most large hyperscaler 1.6T contracts of late 2025 went OSFP-XD. Larger envelope · 16 lanes · 3.2T-ready. Outside hyperscalers, OSFP1600 dominates 1.6T procurement.

The principle

For Tier 2/3 AI deployments in 2026: OSFP at 800G, OSFP1600 at 1.6T. For QSFP-DD legacy: QSFP-DD800 in place.

The form-factor question is upstream of every other procurement decision because it determines what cages exist in your switch and what's mechanically compatible. Get this wrong and the rest of the procurement matrix is irrelevant. Confirm switch cage compatibility, IHS vs RHS heat-sink expectation, and 1.6T form-factor target (OSFP1600 vs OSFP-XD) before a single optic is ordered. The traps live here.

Vitex Validation Handbook · Card 13/15 · vitextech.com

The IHS / RHS distinction within OSFP catches deployments out repeatedly. OSFP IHS (integrated heat sink, finned top) is for switches that don't have cage-mounted thermal solutions, which covers most TOR and spine switches. OSFP-RHS (riding heat sink, flat top) is for systems with cage-mounted heat sinks, such as DGX H100 and similar GPU systems. The two are physically incompatible in each other's cages. A breakout cable connecting an OSFP IHS switch port to an OSFP-RHS GPU port needs the right connector type at each end.

For testing purposes: confirm cage type (IHS / RHS) before any optic ships, confirm power-class enablement on the host, and confirm the firmware on both ends supports the data-path mode you intend to run (single port 800G vs dual 2x400G breakout vs other configurations).
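The first two confirmations above (cage type and power-class enablement) are simple enough to automate against a procurement list before anything is ordered. The following is a minimal sketch under stated assumptions: `HostPort`, `ModuleSku`, the part number, and the wattages are all hypothetical, and the firmware data-path check is left out because it depends on vendor-specific information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostPort:
    name: str
    cage: str           # "OSFP-IHS" or "OSFP-RHS": the module style the cage accepts
    max_power_w: float  # highest module power the host is configured to enable


@dataclass(frozen=True)
class ModuleSku:
    part_number: str
    form_factor: str    # "OSFP-IHS" or "OSFP-RHS"
    power_w: float      # module power draw taken from its datasheet


def preorder_issues(port: HostPort, sku: ModuleSku) -> list[str]:
    """Return human-readable problems; an empty list means no mismatch was found."""
    issues = []
    if port.cage != sku.form_factor:
        issues.append(f"{port.name}: cage is {port.cage}, module is {sku.form_factor}")
    if sku.power_w > port.max_power_w:
        issues.append(
            f"{port.name}: module draws {sku.power_w} W, host enables up to {port.max_power_w} W"
        )
    return issues


# Illustrative values only: a flat-top (RHS) GPU-side cage and a finned (IHS) module.
gpu_port = HostPort("gpu-node-01/nic0", cage="OSFP-RHS", max_power_w=17.0)
finned = ModuleSku("EXAMPLE-800G-IHS", form_factor="OSFP-IHS", power_w=16.0)
for problem in preorder_issues(gpu_port, finned):
    print(problem)
```

Returning a list of findings rather than a single boolean lets one pass over the order sheet surface every mismatch at once.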

Card 13 in the reference card portfolio, reproduced above, carries the full 2026 form-factor decision matrix. The validation point: form factor selection is upstream of every other compatibility decision. Get this wrong and the rest of validation is wasted effort.

1.5 · Testing by deployment scenario

The single most useful question to ask before sizing a validation program: what kind of deployment is this? Six common scenarios cover most of what an operator, system integrator, or neocloud will encounter. The validation discipline is the same across all of them; what changes is what you emphasize, what the customer expects to see, and where the time goes.

Reference Card 15 · By deployment scenario

Validation, by deployment scenario

SuperPOD, HGX-OEM, Neocloud, Hyperscaler, Enterprise inference, Cluster expansion. Same discipline; different emphasis. The cross-reference operators use to scope their program.

Scenario 01 · NVIDIA SuperPOD reference
Who deploys it: Sysints, neoclouds, large enterprises
Emphasis: NVIDIA reference conformance · LinkX-validated optics only · Rail-pair parity · NCCL all-reduce sustained
Customer expects: SuperPOD validation report · MLPerf-comparable benchmark · ibdiagnet topology pass

Scenario 02 · HGX-based OEM cluster
Who deploys it: Dell · Supermicro · HPE · Lenovo
Emphasis: OEM compatibility list adherence · Pairwise interop testing · Thermal margin under workload · Per-rail bandwidth parity
Customer expects: OEM-stamped validation report · NCCL throughput · per-rail BW parity within 5%

Scenario 03 · Neocloud / GaaS
Who deploys it: CoreWeave-class · regional providers
Emphasis: Reproducibility across clusters · MLPerf-comparable results · Minimum-flap fabric for SLA · Per-cluster fabric-readiness
Customer expects: Customer-facing acceptance benchmark · fleet-wide telemetry baseline available

Scenario 04 · Hyperscaler custom build
Who deploys it: Microsoft · Meta · Alibaba · Google · ByteDance
Emphasis: Statistical AFR tracking · FW-update regression testing · Per-SKU long-term aging data · Internal vendor scorecards
Internal teams expect: SRE acceptance · internal AFR scorecard · vendor business-review data

Scenario 05 · Enterprise inference
Who deploys it: Banks · retailers · healthcare
Emphasis: Tail latency stability (p99.9) · QoS class isolation · Mixed workload behavior · Storage-fabric throughput
Customer expects: Application-level latency SLO compliance · security audit trail

Scenario 06 · Cluster expansion / retrofit
Who deploys it: Existing operators · mixed generations
Emphasis: Cross-generation interop · FEC negotiation across vendors · Firmware drift management · No-regression evidence
Customer expects: Existing workloads still meet baseline · mixed-gen reach demonstrated

The principle

Match your validation program to your scenario.

The single most common testing-program mistake is applying the wrong shape of program to the wrong scale of deployment. A hyperscaler-shaped program (statistical fleet management, AFR scorecards, FW regression suites) is overkill for a Tier-2 cluster bring-up — and underkill against a customer expecting an NVIDIA SuperPOD acceptance test. A SuperPOD-style program is overkill for an enterprise inference cluster where the real test is tail-latency stability under mixed workload. The validation discipline in this handbook is one program. The emphasis you put on each phase shifts based on which row of this card you're in.

Table 2 · Testing Emphasis by Deployment Scenario

Six common deployment patterns and what each one means for the validation program

For operators choosing where to invest validation hours; for system integrators scoping validation deliverables; for neoclouds setting customer expectations on cluster bring-up.

Scenario	Who deploys it	Validation emphasis	What the customer expects to see
NVIDIA SuperPOD reference	Sysints (WWT, ePlus, Mark III), neoclouds, large enterprises	Conformance to NVIDIA reference; optics validated for the reference switch/NIC/cable combination; rail-pair parity	NVIDIA SuperPOD validation report; MLPerf-comparable benchmark; NCCL test pass; ibdiagnet topology validation
HGX-based OEM cluster	Sysints delivering Dell PowerEdge XE / Supermicro / HPE Cray clusters	OEM-vendor compatibility list adherence; pairwise interop testing; thermal margin under sustained workload	OEM-stamped validation report; NCCL all-reduce sustained throughput; per-rail bandwidth parity within 5%
Neocloud / GPU-as-a-service	CoreWeave-class providers, regional neoclouds, AI-cloud startups	Reproducibility across clusters; MLPerf-comparable results; minimum-flap fabric for billable SLA	Customer-facing acceptance benchmark (NCCL, MLPerf, customer's own model); fleet-wide telemetry baseline
Hyperscaler custom build	Microsoft, Meta, Alibaba, Google, ByteDance — internal teams	Statistical AFR tracking against fleet baseline; firmware-update regression testing; per-SKU long-term aging	Internal SRE acceptance; not externally visible. Internal AFR scorecard the dominant metric.
Enterprise inference cluster	Banks, retailers, healthcare deploying AI inference workloads	Latency-tail stability (p99.9, p99.99); QoS class isolation; mixed-workload behavior	Application-level latency SLO compliance; storage-fabric throughput evidence; security audit trail
Cluster expansion / retrofit	Existing operators adding capacity; mixed generations of GPU and switch	Cross-generation interop (e.g. ConnectX-7 with new ConnectX-8 fabric); FEC negotiation across vendors; firmware drift management	No-regression evidence: existing workloads still meet baseline after expansion; mixed-generation reach demonstrated
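The "per-rail bandwidth parity within 5%" expectation in the table can be checked mechanically once per-rail throughput has been measured. The sketch below is one possible reading, not a definition from any reference design: it assumes parity means each rail sits within the tolerance of the median rail, and the rail names and bandwidth figures are made up for illustration.

```python
from statistics import median


def rail_parity(rail_gbps: dict[str, float], tolerance: float = 0.05):
    """Compare each rail's measured bandwidth with the median rail.

    Returns (deviations, failing): the fractional deviation per rail, and the
    sorted names of rails whose deviation exceeds the tolerance either way.
    """
    if not rail_gbps:
        raise ValueError("need at least one rail measurement")
    reference = median(rail_gbps.values())
    deviations = {rail: (bw - reference) / reference for rail, bw in rail_gbps.items()}
    failing = sorted(rail for rail, dev in deviations.items() if abs(dev) > tolerance)
    return deviations, failing


# Illustrative measurements only (Gb/s per rail from a sustained test run).
measured = {"rail0": 390.0, "rail1": 392.0, "rail2": 388.0, "rail3": 360.0}
deviations, failing = rail_parity(measured)
print(failing)
```

Using the median as the reference keeps one slow rail from dragging the baseline down and hiding itself. The same comparison applies one level up, with clusters in place of rails, when the concern is reproducibility across a fleet.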

A few specific notes for each scenario

If you're delivering a SuperPOD reference: the NCCL test suite is not optional — your customer will run it on day one, and if your fabric doesn't pass all-reduce at expected throughput, you have a delivery problem. NVIDIA's LinkX list is the first-party validation reference, and a customer may ask where your optics sit relative to it. That is a question to be ready for, not a requirement that the optics carry NVIDIA's own part numbers: properly validated third-party optics interoperate in Quantum-X and Spectrum-X fabrics when they are tested for the specific switch, NIC, and cable-length combination they are deployed in, with thermal and interop behavior confirmed. The discipline that makes this defensible is the same discipline in this handbook — sample-test your optics on the actual switch + NIC pair before deployment, not just on a reference platform, and have the validation record ready when the customer asks.

If you're a neocloud: reproducibility is your product. A customer renting a 1,024-GPU partition expects the throughput their model achieved on their last partition. Variance between clusters in your fleet is a customer-facing problem, not just an SRE problem. That makes per-cluster fabric-readiness validation (Chapter 14) the most leveraged validation activity in your program — because it's what catches the cluster that's 8% slower than its neighbors before the customer does.

If you're a system integrator: the validation report is your deliverable. Customers don't see the work, they see the report. The handbook's validation-report schema in Chapter 15 is structured deliberately for this case — every field is something a customer wants to see signed off, and the format is something a sysint can hand to a customer at acceptance.

If you're running enterprise inference: your fabric stress-test is not all-reduce; it's thousands of concurrent small flows under PFC. The validation emphasis shifts away from peak training throughput and toward tail-latency stability and QoS-class isolation. Storage fabric becomes a first-class concern, not a footnote.

If you're expanding an existing cluster: the dominant risk is cross-generation interop. A new switch generation may negotiate FEC differently with your existing NIC generation; a new optic SKU may negotiate at a different power class than your switches expect. Allocate validation time specifically to these boundary cases. The Cisco / Arista / SONiC compatibility matrices change between major releases — re-confirm every time.

In practice

The single most common testing-program mistake we see in the field: applying a hyperscaler-shaped validation program to a Tier-2 deployment, or applying a Tier-2 program to a SuperPOD delivery. The hyperscaler program is built around statistical fleet management, which is overkill at small scale and underkill against a customer expecting an NVIDIA-shaped acceptance test. The Tier-2 program is built around bring-up and per-link health, which doesn't cover the fleet-wide AFR tracking a hyperscaler needs. Match your program to your scenario. Section 1.5 is the right starting point for that decision.

Chapter Two

Reading the module

Every test you run on a transceiver, every validation check you perform on a deployed link, and every triage decision you make in production rests on one foundation: the data the module reports about itself. That data lives in the module's EEPROM and, on modern modules, in the CMIS data model layered on top. Knowing how to read it — and what to ignore — is the single most leveraged skill in this handbook. It also separates engineers who can diagnose a failing link from engineers who can only swap modules and hope.

Reference Card 08 · Module anatomy

800G OSFP, component by failure mode

What's inside an 800G OSFP transceiver, what each component does, and which ones tend to fail in production. Failure ranking is directional, drawn from field FFA disclosures and vendor RMA patterns.

Anatomy of an 800G OSFP DR8 module

Failure-mode breakdown · relative frequency · field FFA + RMA

Component	Relative frequency	Pattern
Connector / endface	Highest	Contamination, wear
Laser (EML / VCSEL)	High	Aging, thermal stress
TIA / driver	Moderate	Lane-imbalance defects
Optical sub-assembly	Moderate	Alignment drift
Firmware	Moderate	Mismatch, regression
DSP	Lower	Rare; usually NFF
PCB / assembly	Lower	Rare; manufacturing
Edge connector	Lowest	Mating-cycle wear

Relative ordering, not measured shares — directional, drawn from field FFA disclosures and vendor RMA patterns. Connector and endface issues consistently rank first.

Most common · Endface contamination

The single most common module-related incident category. Usually presents as RX power drop with the link still up. Inspect, clean, re-mate before any module swap — most "failed modules" are channel issues.

Slow-aging · Laser bias rising

The laser ages — bias current rises to maintain output power as the device degrades. Bias slope is the leading indicator; it shifts 3–6 months before TX power drops measurably.

Misdiagnosed · Firmware / host config

DPInit-stuck modules look like hardware failures but are usually firmware mismatch or host SerDes config. Check before RMA — vendors NFF a substantial fraction of returns and operators eat the freight.

Vitex Validation Handbook · Card 08/15 · vitextech.com

2.1 · The CMIS data model#

CMIS organizes module data into a memory map with a familiar two-byte addressing scheme: a Page selector, then a byte offset within the page. The Page selector distinguishes the kind of data you're reading; the offset addresses a specific field. Pages are organized into Banks for modules whose lane count exceeds what fits in a single page.

The pages an operator interacts with most often, and what each contains:

Table 3 · CMIS Pages — what each contains

The pages an operator reads, writes, or diagnoses against

CMIS defines many more pages than appear here. This is the operator's working set — the pages you'll hit during sample testing, bring-up, and triage. Page numbers are CMIS Rev 5.x.

Page	Name	What it contains, and why you read it
`00h` Lower	Module identifier & status	The first thing you read on any module. Vendor name, part number, serial number, date code, module state, firmware version, alarm/warning flags. If this page reads anything unexpected, the rest doesn't matter.
`00h` Upper	Module advertising	What the module says it can do: media types, lane configurations, supported applications, rated max power. The ApSel (Application Selection) table lives here — the menu of configurations the module advertises.
`01h`	Application advertising (extended)	For modules with many supported applications (LPO modules, modules with multiple FEC modes), the additional Application Descriptor entries that didn't fit on Page 00h.
`02h`	Module thresholds & calibration	Per-unit calibration coefficients for the DOM telemetry, plus alarm and warning thresholds for temperature, voltage, TX/RX power, and laser bias. If two "different" units have identical Page 02h calibration, the report is fake.
`10h`	Lane control	Active control of per-lane state — datapath enable/disable, output enable, tx_disable. Where you go to take individual lanes up and down for testing.
`11h`	Per-lane state & flags	Per-lane datapath state, per-lane fault flags, per-lane CDR lock. The page that tells you which lane has the problem.
`13h`	PRBS generator & checker	The operator-portable PRBS test interface. Configure pattern, enable generator, enable checker, read error counts. This page is how you run a Bit Error Rate Test from a switch CLI without owning a $100K BERT instrument.
`14h–17h`	VDM (Versatile Diagnostic Monitoring)	The diagnostic gold mine. Per-lane Pre-FEC BER statistics (current, min, max, average), per-lane SNR, per-lane historical bins, per-host-lane and per-media-lane telemetry. This is where modern diagnostics live.
`9Fh, A0h–AFh`	CDB (Command Data Block)	The command-and-control surface. Firmware update commands, vendor extensions, vendor-specific message exchanges. Where firmware-mismatch problems become visible.
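
As a rough illustration of how a script might label a page number against this working set, here is a minimal Python sketch. The page numbers and names are copied from Table 3 as printed, not checked against the CMIS specification, and `describe_page` is a hypothetical helper, not a function from any NOS or library.

```python
# Operator working set from Table 3 (copied from the table, not from the spec).
# All names are illustrative; nothing here talks to real hardware.
SINGLE_PAGES = {
    0x00: "Module identifier, status, and advertising",
    0x01: "Application advertising (extended)",
    0x02: "Module thresholds & calibration",
    0x10: "Lane control",
    0x11: "Per-lane state & flags",
    0x13: "PRBS generator & checker",
    0x9F: "CDB (Command Data Block)",
}
PAGE_RANGES = [
    (0x14, 0x17, "VDM (Versatile Diagnostic Monitoring)"),
    (0xA0, 0xAF, "CDB (Command Data Block)"),
]


def describe_page(page: int) -> str:
    """Return the Table 3 label for a page number, or a fallback string."""
    if page in SINGLE_PAGES:
        return SINGLE_PAGES[page]
    for low, high, name in PAGE_RANGES:
        if low <= page <= high:
            return name
    return "outside the operator working set"


for p in (0x00, 0x11, 0x15, 0xA3, 0x40):
    print(f"{p:02X}h: {describe_page(p)}")
```

A lookup like this is mostly useful for annotating raw register dumps so that a reader can see at a glance which page a byte came from.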

2.2 · The Datapath State Machine#

A CMIS module is not a stateless device. It moves through a defined state machine from the moment power is applied to the moment it carries traffic at full rate. Understanding the state machine matters because the most common module failure mode in 2026 is a module that is stuck in an intermediate state — link doesn't come up, no error message that obviously identifies why, and the engineer ends up replacing modules and cables in sequence trying to find the problem.

The CMIS Datapath State Machine has five operational states an operator should recognize:

The five operational states. A module that won't advance from DPInit to DPActivated is the most common module-side failure pattern in 2026 — and almost always reveals firmware mismatch, ApSel mismatch, or host SerDes misconfiguration before it ever reveals a defective module.

DPDeactivated: The module is powered but the datapath is intentionally disabled. This is the resting state after power-on, before the host instructs the module to come up.

DPInit: The module has been instructed to bring up the datapath and is initializing: firmware booting on the lane, CDR locking, equalization converging. This is where stuck modules get stuck.

DPDeinit: Mirror of DPInit, used during graceful shutdown. Less commonly observed during operations.

DPTxOn: Initialization completed; transmitter is enabled but the receiver is still in early-state operation. Transient.

DPActivated: The operational target. Transmitter and receiver both running at line rate, datapath fully enabled, telemetry valid. If your module is in DPActivated and you still have problems, the problem is downstream of the module.

A module that fails to advance from DPInit to DPActivated almost always indicates one of three causes: (1) a CMIS firmware mismatch where the module's firmware advertises a feature or behavior the switch's CMIS parser doesn't recognize, (2) an Application Selection (ApSel) request from the host that the module's advertising doesn't actually support — host trying to configure an 8×100G mode on a module that only advertises 4×200G, for instance, or (3) host-side SerDes configuration that doesn't match what the module expects on its electrical input.

None of these will tell you, in a clean error message, what is wrong. The engineer who knows the state machine reads Page 11h, sees the DPInit state, and immediately knows where to look. The engineer who doesn't replaces three modules before figuring it out.
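
The triage logic described above can be written down as a small lookup. This is a sketch only: the state names follow the list in this section, the hint wording paraphrases the three causes above, and `triage` is a hypothetical helper that assumes the state string has already been decoded from Page 11h by your platform's tooling.

```python
# Hints keyed by datapath state. Wording paraphrases this section;
# the function and constant names are illustrative.
DPINIT_CAUSES = [
    "CMIS firmware mismatch between the module and the switch's CMIS parser",
    "ApSel request the module does not advertise (e.g. 8x100G on a 4x200G-only module)",
    "host SerDes configuration that does not match the module's electrical input",
]


def triage(state: str) -> list[str]:
    """Map a per-lane datapath state to the next things worth checking."""
    if state == "DPInit":
        return DPINIT_CAUSES
    if state == "DPActivated":
        return ["datapath is up; look downstream of the module"]
    if state == "DPDeactivated":
        return ["host has not instructed the datapath to come up"]
    if state in ("DPTxOn", "DPDeinit"):
        return ["transient state; re-read before drawing conclusions"]
    return [f"unrecognized state {state!r}; check the raw register value"]


for hint in triage("DPInit"):
    print("-", hint)
```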

In practice

The CMIS spec describes the DPSM as a clean state machine with well-defined transitions and timing. Modules are expected to advance from DPDeactivated to DPActivated within a few seconds of host instruction.

Real modules in the field do not always do this. A small but non-zero fraction of new SKU rollouts — particularly first-production-lot AECs and the early lots of any new LPO or 1.6T module — exhibit DPInit transitions that take 20–30 seconds, or that occasionally fail and retry. The host's typical bring-up timeout is 10–15 seconds. The result is a module that the host reports as faulty, that works correctly when the host's timeout is increased, and that is not actually defective. Before you RMA modules from a new SKU rollout, check the bring-up timeout on the host and look at the DPSM transitions over a longer window. Most platforms expose this via syslog with debug enabled.
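
One way to separate a slow-initializing module from a stuck one is to compare the observed DPInit duration against the host timeout. The sketch below assumes you have already extracted the duration from syslog or a polling script; `classify_bringup` and its return labels are hypothetical names, not platform output.

```python
from typing import Optional


def classify_bringup(init_seconds: Optional[float], host_timeout_s: float) -> str:
    """Classify one DPInit -> DPActivated attempt.

    init_seconds: time the module took to reach DPActivated when watched
        over a long window, or None if it never got there.
    host_timeout_s: the host's configured bring-up timeout.
    """
    if init_seconds is None:
        return "stuck"        # never activated: firmware, ApSel, or SerDes checks apply
    if init_seconds <= host_timeout_s:
        return "normal"       # host would have seen a clean bring-up
    return "slow-init"        # works, but the host gives up first: revisit the timeout


# A module that needs 25 s on a host that waits 15 s is slow, not defective.
print(classify_bringup(25.0, 15.0))
print(classify_bringup(None, 15.0))
```

Only the "stuck" case is a candidate for the firmware, ApSel, and SerDes checks from Section 2.2; "slow-init" points at the host timeout first.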

2.3 · VDM — the diagnostic gold mine#

If you read only one CMIS page during ongoing operations, read the VDM pages. Versatile Diagnostic Monitoring is where modern modules expose the data that lets you diagnose problems before they become outages.

The fields VDM exposes that you cannot get anywhere else:

Per-lane Pre-FEC BER statistics — not just the current value, but the running minimum, maximum, and average over a configurable window. The trend in these fields is what catches a degrading link weeks before it crosses a threshold.
Per-lane SNR — for PAM4 modules, the signal-to-noise ratio per lane is a leading indicator of channel margin. SNR drops typically precede BER rises by hours to days.
Historical bins of pre-FEC BER — the module accumulates how often the BER has fallen into each of several bins over time, giving you a distribution rather than a snapshot.
Host-side and media-side per-lane telemetry separated — letting you isolate problems to one side of the channel.
Lane equalizer settings — what the DSP is doing to compensate for channel impairments. A lane whose equalizer is at extreme settings is a lane that's working hard.

The combination of VDM and the standard DOM telemetry is the operator's diagnostic toolkit. Without VDM you have aggregate counters and a single-point BER sample. With VDM you have per-lane trends and statistical distributions. The difference between a fabric that catches problems and a fabric that surprises you is largely the difference between operators who use VDM and operators who don't know it's there.
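
To make "per-lane trends" concrete, one simple approach is to fit a straight line to log10 of the pre-FEC BER over time and look at the slope. The sketch below is a plain least-squares fit in standard-library Python; the sample values are invented for illustration, and any alert threshold you put on the slope is a local engineering choice that this handbook does not specify.

```python
import math


def log_ber_slope(days, ber):
    """Least-squares slope of log10(BER) against time, in decades per day.

    days: sample times in days; ber: matching pre-FEC BER samples (all > 0).
    """
    if len(days) != len(ber) or len(days) < 2:
        raise ValueError("need at least two paired samples")
    y = [math.log10(b) for b in ber]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(days, y))
    return sxy / sxx


# Invented example: a lane whose BER doubles every day.
days = [0, 1, 2, 3, 4]
ber = [1e-8, 2e-8, 4e-8, 8e-8, 1.6e-7]
print(f"{log_ber_slope(days, ber):.3f} decades/day")
```

Working in log space matters because BER spans many orders of magnitude; a linear fit on raw values would be dominated by the largest sample.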

2.4 · Firmware as a validation event#

One last note before we close out the foundation. Module firmware is updateable in the field on every modern CMIS module, via the CDB pages. This is mostly a feature — vendors fix bugs, improve performance, and address newly-identified compatibility issues by shipping firmware updates. But it has implications for operations.

A module's behavior under your validation tests is specific to its firmware version at the time of testing. A module that passes sample testing on firmware version 1.2.4 may behave differently on firmware 1.3.0. The implication: every firmware update is, in principle, a re-validation event. Most of the time the new firmware behaves equivalently or better; occasionally it doesn't.

In practice

Nobody re-runs full sample testing every time a vendor pushes a firmware update. The realistic discipline: maintain a small re-test sample (5–10 modules) and run an abbreviated validation pass against each new firmware version before deploying it to the production fleet. Watch DOM, FEC counter behavior, and DPSM transitions over a 24-hour window. If anything looks different from baseline, hold the rollout. We come back to firmware as a re-validation trigger in Chapter 19.
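
The "hold the rollout if anything looks different from baseline" rule can be expressed as a tolerance check. This is a sketch under stated assumptions: the metric names, values, and tolerances below are invented placeholders, and `rollout_decision` is a hypothetical helper. The real metrics would come from your DOM, FEC, and DPSM observations over the 24-hour window.

```python
def rollout_decision(baseline, candidate, tolerances):
    """Compare candidate-firmware metrics against baseline.

    Returns ("deploy" or "hold", list of metric names out of tolerance).
    """
    drifted = [
        name
        for name, tol in tolerances.items()
        if abs(candidate[name] - baseline[name]) > tol
    ]
    return ("hold" if drifted else "deploy", drifted)


# Placeholder numbers for illustration only.
baseline = {"temp_c": 52.0, "log10_pre_fec_ber": -7.5, "dpinit_s": 3.0}
candidate = {"temp_c": 53.0, "log10_pre_fec_ber": -6.2, "dpinit_s": 3.5}
tolerances = {"temp_c": 5.0, "log10_pre_fec_ber": 0.5, "dpinit_s": 5.0}

print(rollout_decision(baseline, candidate, tolerances))
```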


Part Two

Sample testing & incoming inspection

The operator-side testing program — why operators run their own, with what, and what the data actually justifies.

Chapter Three

Why operators run their own testing#

Sample testing is the program that closes the operator's gap before procurement commits. This chapter is about the why — the four real questions a sample-testing program has to answer, the decision between an in-house program and a contract lab, and the moment when running your own testing pays for itself versus the moment it doesn't. The how comes in Chapters 4 through 8.

3.1 · When sample testing pays for itself#

Running a serious sample-testing program costs money: equipment, engineer time, lab space, the modules themselves. The investment makes sense in some contexts and not others.

Reference Card 14 · AFR at scale

What 0.68% AFR means at scale

Alibaba's published per-link AFR translated to wall-clock time-between-incidents at four cluster scales. The math operators feel in their gut once they see it.

Per-link AFR translated by cluster scale

assumes 2 access links per GPU · 0.68% AFR per link (Alibaba HPN, SIGCOMM 2024)

Small cluster

1K GPUs

2,048 access links

~14 incidents / yr

One link down roughly every 26 days

Mid cluster

4K GPUs

8,192 access links

~56 incidents / yr

One link down roughly every 6.5 days

Large cluster

16K GPUs

32,768 access links

~223 incidents / yr

One link down roughly every 39 hours

Hyperscale

100K GPUs

204,800 access links

~1,393 incidents / yr

One link down every ~6 hours

What this means operationally

1K GPU clusters can handle link incidents manually — one engineer, ticket per event, replace within a day
4K–16K GPU clusters need automated alerting, tier-1 NOC coverage, and a routinized RMA / spare-parts workflow
100K+ GPU clusters need fleet-management software, statistical AFR tracking against vendor baselines, and proactive retirement programs
Workload impact — at 16K GPUs, a link flap during all-reduce stalls 256 GPUs for the duration; the workload-level cost of these incidents is non-trivial

The principle

Incident count scales linearly with link count. Validation effort doesn't.

Doubling cluster scale doubles annual link incidents, because the per-link AFR stays the same while the link count doubles. It does not double the validation effort: that scales sub-linearly, because automated 5-check protocols, statistical sampling, and trend-based alerting all amortize across the larger fleet. The economic case for serious validation strengthens as cluster size grows. At 100K GPUs, the question is not whether to invest in validation; it's how much you'd be willing to spend to avoid roughly 1,400 unscheduled link incidents per year.
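
The arithmetic behind the card is short enough to reproduce. The sketch below uses the card's stated assumptions (2 access links per GPU, 0.68% AFR per link); the function name is illustrative, and the GPU counts are the binary-rounded figures implied by the card's link counts.

```python
AFR_PER_LINK = 0.0068   # annual failure rate per link, from the card
LINKS_PER_GPU = 2       # access links per GPU, from the card


def link_incidents(gpus, afr=AFR_PER_LINK, links_per_gpu=LINKS_PER_GPU):
    """Return (link count, expected incidents per year, hours between incidents)."""
    links = gpus * links_per_gpu
    per_year = links * afr
    hours_between = 365 * 24 / per_year
    return links, per_year, hours_between


for gpus in (1024, 4096, 16384, 102400):
    links, per_year, hours = link_incidents(gpus)
    print(f"{gpus:>7} GPUs  {links:>7} links  {per_year:7.0f} incidents/yr  "
          f"one every {hours:6.1f} h ({hours / 24:5.1f} days)")
```

Substituting your own link count and the AFR you actually observe gives the expected incident cadence for your fabric.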

Vitex Validation Handbook · Card 14/15 · vitextech.com

Sample testing pays for itself when:

You're evaluating a new vendor. The cost of a bad vendor decision is measured in production failures and emergency replacement spend; the cost of testing 30 modules to learn whether to make the bet is small by comparison.
You're evaluating a new SKU from a known vendor. A vendor whose 100G product was excellent does not automatically have a great 800G product. Sample-test every new SKU.
You're deploying at scale. The break-even calculation runs in your favor as deployment size grows. A 50-module deployment can absorb mistakes; a 5,000-module deployment cannot.
You're at a step-change in technology. First 800G deployment, first LPO deployment, first 1.6T deployment. Step-change technology has higher first-lot variance; sample testing is how you discover it before production.
The application is high-stakes. An AI training fabric where a fabric-induced training-job interruption costs tens of thousands of dollars per hour justifies more aggressive testing than a development environment.

Sample testing does not pay for itself when:

The deployment is small and the SKU is mature. Buying twenty 100G QSFP28 modules for a storage fabric expansion, from a vendor you've used for three years, on a platform you know well — sample testing is overhead.
You have an accepted contract lab partner. If a contract lab tests on your behalf and produces reports you trust, replicating their work in-house is cost without benefit.
You're under a vendor support agreement that explicitly covers your use case. If the vendor has tested your specific configuration and contractually warrants it, your testing program is verifying their claims rather than doing the original work.

3.2 · In-house, contract lab, or trust-but-verify#

The three structural options for an operator-side testing program:

Full in-house program. You buy the equipment, hire or assign engineers, build the protocols, and run the testing yourself. Highest fixed cost; lowest marginal cost per test; full control over methodology; full ownership of the data. The right choice for operators above roughly 5,000 ports of fabric or with multiple concurrent vendor evaluations per year.

Contract lab. You ship samples to a third party — there are several reputable optical test labs in North America and Asia — who runs an agreed protocol and delivers a report. Lower fixed cost; higher per-test cost; methodology limited to what the lab offers; data ownership negotiated. The right choice for operators who need testing rigor occasionally rather than continuously.

Trust-but-verify. You rely on the vendor's factory test reports as primary evidence and run a lightweight verification on a small subset — switch-resident BERT, basic DOM read, a brief burn-in — to confirm the report's claims are not fabricated. Lowest cost; lowest assurance; the right choice when you have established trust with a vendor and the deployment doesn't justify a full program.

Most operators end up running a hybrid: full in-house testing for new vendors and step-change SKUs, contract-lab testing for occasional rigor on important decisions, trust-but-verify for routine reorders from established suppliers. The breakdown depends on operator size and risk tolerance, not on a universal answer.

In practice

The published methodology assumes you have a rigorously equipped optical lab with TDECQ-capable scopes, environmental chambers, and a dedicated test bench.

Most Tier 2 and Tier 3 operators don't. The realistic posture: own enough in-house capability to run switch-resident BERT, DOM reads, and ambient-condition burn-in — which catches roughly 70% of vendor-quality issues — and retain a contract-lab relationship for the remaining 30% when something requires TDECQ measurement, stressed-receiver testing, or environmental-chamber work. The in-house lab pays for itself on volume; the contract lab pays for itself on rigor when it matters. Operators who try to build a full hyperscaler-grade lab from scratch usually under-invest in the equipment that catches most defects and over-invest in equipment they use twice a year.


Chapter Four

Sample size — the math, and why "five units" doesn't pass#

Every sample-testing program comes back to one number: how many units do you actually need to test? The answer most operators converge on without thinking — five or ten — is a number with no statistical meaning. This chapter walks through what the standards actually require, what hyperscalers actually do, and what a defensible Tier 2/3 program looks like.

If you observe zero defects in a sample of size n, the upper 95% confidence bound on the true defect rate is approximately 3/n. Sampling five units and seeing them pass tells you very little. Sampling thirty tells you something. Sampling one hundred tells you something useful.

4.1 · The standard everyone should know#

ANSI/ASQ Z1.4 — formerly MIL-STD-105 — is the sampling standard the rest of the manufacturing world uses for incoming-inspection decisions on lot acceptance. The General Inspection Level II tables map lot size to sample size to accept/reject criteria at a given Acceptable Quality Level (AQL). At AQL 1.0, a 3,200-unit lot requires inspecting 125 units; you accept if 3 or fewer fail and reject at 4 or more. Reference Card 02 shows the relevant AQL tables visually.


Reference Card 02 · Sample-testing statistics

Why "five units" doesn't pass. And "thirty" is the floor.

If you observe zero defects in a sample of size n, the upper 95% confidence bound on the true defect rate is approximately 3/n. Apply it.

The 3-over-n rule: upper 95% CI ≈ 3 / n (Hanley & Lippman-Hand, 1983)

Useless · 5 units

Cannot defend any vendor claim

95% CI on defect rate: ≤ 45.1%
Z1.4 AQL 2.5% verdict: N/A (too small)
Cpk confidence: ±0.7
Engineer time: ~15 hours

Five passes in a row are statistically consistent with a vendor whose true defect rate is 30–40%. You cannot tell the difference.

Defensible floor · 30 units

First-article qualification

95% CI on defect rate: ≤ 9.5%
Z1.4 AQL 2.5% verdict: Pass at 0 fail
Cpk confidence: ±0.24
Engineer time: ~90 hours

The minimum sample for a statistically defensible vendor decision. Industry-standard floor for AQL 2.5%.

Strong evidence · 100 units

First-production-lot 100% inspect

95% CI on defect rate: ≤ 2.95%
Z1.4 AQL 1.0% verdict: Pass at ≤ 2 fail
Cpk confidence: ±0.13
Engineer time: ~50 hr (lite protocol)

Hyperscaler-aligned coverage. Statistically supports a sub-3% defect-rate claim with high confidence.

Z1.4 General Inspection II · AQL 1.0 · Sample sizes by lot size

Lot size	Sample size	Accept ≤	Reject ≥
501 — 1,200	80	2 fail	3 fail
1,201 — 3,200	125	3 fail	4 fail
3,201 — 10,000	200	5 fail	6 fail
10,001 — 35,000	315	7 fail	8 fail

The principle

"Five units" is not a sample size. It's a vibe.

A vendor whose true defect rate is 30% has a 17% probability of giving you five passes in a row. The cost of a bad vendor decision dwarfs the cost of testing 25 more modules. Run the floor — 30 units, full first-article protocol — and you have a defensible decision. Anything less is guessing dressed in a number.

Vitex Validation Handbook · Card 02/15 · vitextech.com

The reason Z1.4 matters: it's defensible. A vendor can't argue "you only tested 5 units of my 3,200-unit lot, that's not statistically meaningful" if you tested 125 — that's the standard. The same vendor will accept the rejection at 4 failures because they understand the statistical floor. Operators who size sample-testing programs against Z1.4 produce decisions that survive vendor escalation. Operators who run "five units we had handy" produce decisions that don't.
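
The accept/reject rule is mechanical once the plan is fixed. The sketch below encodes only the four AQL 1.0 rows reproduced on Reference Card 02; it is not a full Z1.4 implementation, and the function names are illustrative.

```python
# (lot size low, lot size high, sample size, accept number), General Inspection
# Level II at AQL 1.0, copied from the rows shown on Reference Card 02.
PLAN_AQL_1_0 = [
    (501, 1200, 80, 2),
    (1201, 3200, 125, 3),
    (3201, 10000, 200, 5),
    (10001, 35000, 315, 7),
]


def sampling_plan(lot_size):
    """Return (sample size, accept number) for a lot size covered by the rows above."""
    for low, high, sample, accept in PLAN_AQL_1_0:
        if low <= lot_size <= high:
            return sample, accept
    raise ValueError("lot size outside the rows reproduced here")


def lot_verdict(lot_size, failures):
    """Accept the lot if observed failures do not exceed the accept number."""
    _, accept = sampling_plan(lot_size)
    return "accept" if failures <= accept else "reject"


print(sampling_plan(3200))      # the 3,200-unit example from Section 4.1
print(lot_verdict(3200, 3))
print(lot_verdict(3200, 4))
```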

4.2 · The brutal arithmetic at n=5#

Let's translate the standard's reluctance into something concrete. Suppose you order 5 sample modules from a new vendor and all 5 pass your tests. What does that tell you about the vendor's actual defect rate?

The exact binomial 95% confidence interval for "0 failures in 5 trials" admits population defect rates from 0% up to roughly 45%. That is, your data is statistically consistent with a vendor whose actual production has a 45% defect rate. Five passes in a row is not improbable from such a vendor — it has roughly a 5% chance of giving you exactly that result. You cannot tell the difference between a vendor with a 1% defect rate and a vendor with a 30% defect rate from a sample of 5.
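
The chance of a clean five-unit sample at a given true defect rate is just the pass probability raised to the fifth power, assuming each unit fails independently. A minimal sketch (the function name is illustrative):

```python
def prob_all_pass(defect_rate, n):
    """Probability that n independently drawn units all pass."""
    return (1 - defect_rate) ** n


for rate in (0.01, 0.10, 0.30, 0.45):
    print(f"true defect rate {rate:4.0%}: P(5 of 5 pass) = {prob_all_pass(rate, 5):.3f}")
```

Running it shows why five passes carry so little information: a 1% vendor and a 30% vendor both produce a clean five-unit sample often enough that one clean sample cannot distinguish them.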

The same arithmetic, presented differently for sample sizes that show up in operator practice:

Table 4 · 95% confidence on defect rate by sample size

What "all units passed" actually tells you, statistically

Upper bound of the exact binomial 95% confidence interval for the population defect rate, given zero failures observed in the sample. The statistical claim "the vendor's defect rate is below X" is supportable only at the sample sizes that make X small.

Sample size	Upper 95% CI on defect rate	Operator interpretation
5	45.1%	Tells you almost nothing
10	25.9%	Still consistent with a bad vendor
20	13.9%	Not yet defensible at any meaningful AQL
30	9.5%	Beginning to be defensible at AQL 2.5
50	5.8%	Defensible at AQL 2.5; not at AQL 1.0
100	2.9%	Defensible at AQL 2.5; marginally at AQL 1.0
200	1.5%	Defensible at AQL 1.0
300	1.0%	Defensible at AQL 0.65
3,000	0.10%	Defensible at hyperscaler-grade levels

The implication is straightforward: if you want to make any defensible statement about a vendor's defect rate, you need at least 30 samples, and ideally 100. Below that, you are running a smoke test and calling it qualification.
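
The figures in Table 4 follow from solving (1 − p)^n = 0.05 for p, which gives p = 1 − 0.05^(1/n). Since −ln(0.05) ≈ 3.0, this is close to 3/n once n is reasonably large. The sketch below computes both so the approximation error is visible; the function name is illustrative.

```python
def upper_bound_zero_failures(n, confidence=0.95):
    """Upper bound on the defect rate after n units with zero failures.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    return 1 - (1 - confidence) ** (1 / n)


print(f"{'n':>6} {'exact':>8} {'3/n':>8}")
for n in (5, 10, 20, 30, 50, 100, 200, 300, 3000):
    exact = upper_bound_zero_failures(n)
    print(f"{n:>6} {exact:>8.2%} {3 / n:>8.2%}")
```

At n = 5 the 3/n shortcut overstates the bound noticeably; from about n = 30 upward the two agree closely, which is why the shortcut is safe for the sample sizes this chapter recommends.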

4.3 · What hyperscalers actually do#

Hyperscaler optical-procurement programs run statistical sampling at scale. The pattern: AQL 1.0 sampling per Z1.4 General Inspection II at every receiving lot, with full first-article protocols at vendor onboarding (typically 200–500 units across multiple lots) and ongoing Cpk tracking against published baselines. Their published failure data — Alibaba, Microsoft, Meta — derives from this discipline; the data exists because the discipline exists. Tier-2 operators don't need the same depth, but the pattern is the model: structured sampling, statistical pass/fail criteria, baseline tracking. The next section describes the Tier-2 adaptation.

4.4 · The realistic Tier 2/3 sampling pattern#

You are not Microsoft. You don't have a 50,000-module fleet, you don't have an in-house optical lab, and you don't have engineers whose full-time job is qualification. What does a defensible sample-testing program look like at your scale?

The realistic pattern, which we have seen converge across Tier 2 cloud operators and well-engineered Tier 3 AI startups:

For each new vendor or new SKU: a first-article qualification of 30 units. Thirty units is the smallest sample size that gives statistically defensible coverage at AQL 2.5%, the minimum acceptable AQL for production AI fabric. Test all 30 units to a documented protocol covering identification, DOM, FEC, thermal corner, and 72-hour burn-in. Zero failures across 30 units gives you a defensible 9.5% upper-bound on defect rate and an actionable distribution to compare against future lots.

For first production lots (the first 100–500 units after qualification): 100% inspection on a lightweight protocol. Identification, DOM read, link-up verification, FEC baseline. No burn-in, no thermal corner. Catches gross failures before they reach deployment.

For follow-on lots (ongoing): sampling at AQL 4.0 General Inspection Level II. This is looser than what hyperscalers run, and it is honest about your scale. At AQL 4.0, a 1,200-unit lot falls in the same 501–1,200 bracket shown on Reference Card 02, so it requires an 80-unit sample; you accept at 7 or fewer failures and reject at 8. The lightweight protocol from first-production-lot inspection is the right test set.

Annually: re-qualification of representative units from the active fleet. Pull 10–20 units that have been deployed for 12 months, re-test them against the original first-article protocol, and trend the deltas. This is your check on whether the vendor's quality is drifting over time.

This program is not hyperscaler-grade. It does not catch every issue a million-port fleet would. It does, however, provide statistically defensible vendor evaluation, scales with operator size, and fits into a budget that small ops teams can defend. The full implementation guidance — what equipment, what tests, what protocols — is the rest of Part 2.
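
When choosing between sampling plans, it helps to see how likely a given plan is to accept a lot at different true defect rates (the plan's operating characteristic). The sketch below computes that from the binomial distribution, assuming independent units and a lot large relative to the sample. The example plan is the 125-unit, accept-at-3 plan from Section 4.1, and the function name is illustrative.

```python
from math import comb


def prob_accept(sample_size, accept_number, defect_rate):
    """P(at most accept_number failures in sample_size units): binomial CDF."""
    p = defect_rate
    return sum(
        comb(sample_size, k) * p**k * (1 - p) ** (sample_size - k)
        for k in range(accept_number + 1)
    )


# Plan from Section 4.1: inspect 125 units, accept at 3 or fewer failures.
for rate in (0.005, 0.01, 0.03, 0.05, 0.10):
    print(f"true defect rate {rate:5.1%}: P(accept) = {prob_accept(125, 3, rate):.3f}")
```

The same function can be pointed at any (sample size, accept number) pair, which makes it easy to compare a looser follow-on plan against a tighter qualification plan before committing to one.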

In practice

Most Tier 2/3 ops teams cannot do 30-unit first-article qualification on every new SKU. The reality is that there are too many SKUs and too few engineers. The compromise that we see actually work: treat first-article qualification as a per-vendor activity, not a per-SKU activity. Run the full 30-unit protocol on the first SKU you buy from a new vendor. For subsequent SKUs from the same vendor, run a 10-unit abbreviated protocol — a "trust the vendor's baseline, verify the SKU" sample. If you've built confidence in the vendor's quality system, you can spread the testing budget across more SKUs without sacrificing defensibility on the ones that matter.

4.5 · The decision: how many samples for your program#

Bringing the whole chapter together:

5 samples is engineering smoke testing. Useful for quick-and-dirty checks; not a procurement decision.
10 samples is informal supplier evaluation. Useful for ranking vendors against each other on a quick screen; not for absolute quality claims.
30 samples is the floor for defensible Tier 2/3 first-article qualification at AQL 2.5%.
100 samples is the comfortable floor for confident Cpk-based decisions, and what we'd recommend for any high-stakes vendor evaluation.
300+ samples approaches hyperscaler-grade coverage at AQL 0.65%. Generally only practical for first-production-lot 100% inspection rather than first-article evaluation.

The next chapter answers the obvious follow-up: what equipment do you need to run the testing on those 30 or 100 modules?


Chapter Five

Test equipment — what you need, what you can skip#

The honest answer to "what equipment do I need" is "less than the test-equipment vendors will tell you, and more than the budget owner will want to hear." This chapter is the equipment hierarchy: the minimum that lets you run a real testing program, the recommended tier for serious Tier 2/3 operations, and the equipment that earns its place only at hyperscaler scale or in a contract lab.

Note on the numbers in this chapter

This chapter names specific instruments, vendors, and price points. The instruments are real and the prices reflect operator-reported acquisition cost as of mid-2026. They are indicative, not authoritative: pricing varies with configuration, region, channel, and used vs. new market. Treat the figures as ballparks for budget planning. Confirm current pricing with the manufacturer or an authorized reseller before any procurement decision.

5.1 · Equipment by test purpose#

Before the tier breakdown, here is the equipment landscape mapped to what each instrument actually catches. Most operators inherit equipment recommendations from vendors with an interest in selling them. The honest mapping is below.

Reference Card 03 · Equipment

Test equipment by purpose

What each instrument actually catches, what you can substitute, and what to skip. Indicative pricing — verify with manufacturer.

Equipment decision tree · Start here, before you spec

Will you test more than 200 modules per year?

No · < 200 / yr

Contract Lab

Higher per-test cost, no fixed overhead. NIST-traceable reports.

Yes · ≥ 200 / yr

In-House Lab

Higher fixed cost, lowest marginal. Full data ownership.

Either way, fiber inspection scope is never optional.

Instrument matrix · 8 instruments · 4 columns of truth

Instrument	Cost (USD)	Catches	Skip / substitute when
Fiber inspection scope	$1.5–3K	Endface contamination & damage	Never skip. Highest defect-catch ROI.
Optical power meter	$1–3K	Independent TX power verification	Module DOM is acceptable for routine work
OTDR	$8–25K	Fiber-plant integrity, splice loss	One-time deployment validation: borrow or contract
BERT instrument	$80–150K	True PRBS BER at PHY	CMIS Page 13h switch-resident PRBS catches 90%+ for free
TDECQ oscilloscope	$80–200K	PAM4 transmitter quality, eye	Contract-lab access is cost-effective for most operators
Environmental chamber	$30–80K	Programmable thermal cycling	Switch fan-control creates a 60–70°C corner for free
Mating-cycle fixture	$5–10K	Connector durability over insertions	In-house actuator + linear stage builds an equivalent for < $1K
Optical spectrum analyzer	$30–60K	Wavelength verification on WDM	Skip unless deploying CWDM/DWDM at volume

The three tiers

Tier 1 · ~$5–15K · Minimum Viable

Catches the majority of vendor defects. Skips precision optical measurement.

Tier 2 · ~$30–60K · Serious Operator

Adds OTDR, thermal control, mating fixture. Tier 2/3 floor.

Tier 3 · $100K+ · Hyperscaler / Contract Lab

BERT, TDECQ, full chamber, OSA. Most operators access via contract.

Build · Buy · Contract

Fiber inspection scope: Buy

Power meter, OTDR: Buy

Mating-cycle fixture: Build

Thermal corner setup: Build

BERT, TDECQ scope: Contract

Environmental chamber: Contract

Optical spectrum analyzer: Skip*

The principle

A sub-$50K lab does 90% of what a serious validation program needs.

CMIS Page 13h switch-resident PRBS catches the same defects as a $100K BERT for almost every case. The difference is for the 10% of marginal cases — that's what contract-lab access exists for. Operators who try to build a full hyperscaler-grade lab from scratch usually under-invest in the equipment that catches most defects and over-invest in equipment they use twice a year.

~$48K · Total budget for the practical Tier-2 lab. See Chapter 5 for the line items.

Vitex Validation Handbook · Card 03/15 · vitextech.com

Table 5 · Equipment × Test Purpose Matrix

What each instrument enables, what it doesn't, what you can substitute

Marketing positions every instrument as essential. Operationally, only some are. The right column is the operator's leverage — knowing what you can substitute saves money without sacrificing rigor.

Instrument	Primary purpose	Indicative cost (USD)	Substitute or skip when…
Fiber inspection scope	Endface contamination & damage detection	$1.5K – 3K	Never skip. Cheapest instrument with the highest defect-catch ROI.
Optical power meter	Independent TX power verification	$1K – 3K	Module DOM is acceptable for routine work; meter is for vendor disputes.
OTDR	Fiber-plant integrity, splice loss, structural fault	$8K – 25K	Owned cabling — can borrow/contract for one-time validation.
BERT instrument	True PRBS bit error rate at PHY layer	$80K – 150K	CMIS Page 13h switch-resident PRBS catches 90%+ of BER defects free.
TDECQ oscilloscope	PAM4 transmitter quality, eye diagram	$80K – 200K	Contract-lab access is more cost-effective for most operators.
Environmental chamber	Programmable thermal cycling −40 to +85°C	$30K – 80K	Switch fan-speed control creates 60–70°C corner without chamber.
Mating-cycle fixture	Connector durability over repeated insertions	$5K – 10K commercial	In-house actuator + linear stage builds equivalent for under $1K.
Optical spectrum analyzer	Wavelength verification on WDM/CWDM	$30K – 60K	Skip unless deploying CWDM/DWDM at meaningful volume.

5.2 · What you can do with just a switch and hosts#

Before you buy any optical test equipment, understand what your existing infrastructure can already do. Modern switches and the modules themselves expose a remarkable amount of test capability at zero additional equipment cost.

Identification and EEPROM verification. Every NOS exposes commands to read the module's EEPROM. You can verify vendor, part number, serial number, calibration data, and Application Selection table without any external equipment.

DOM telemetry. The module reports TX power, RX power, temperature, voltage, and laser bias current per lane. The reading is the module's own measurement, calibrated by the vendor — but cross-checking against an external power meter (Tier 1 equipment) catches the rare case where the module's calibration is wrong.

FEC counters and pre-FEC BER. Modern NOS exposes per-lane FEC counter data. Run line-rate traffic for several minutes and you have a real BER measurement, accumulated over enough bits to be statistically meaningful.
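The arithmetic behind that measurement is simple enough to sketch. Pre-FEC BER over a window is the number of bits the FEC corrected divided by the number of bits transferred. The function name and the idea of a "corrected bits" counter delta are illustrative assumptions here; how a given NOS exposes FEC counters varies by platform, and the nominal 100 Gb/s lane rate ignores FEC and encoding overhead.

```python
def pre_fec_ber(corrected_bits: int, lane_rate_bps: float, seconds: float) -> float:
    """Estimate pre-FEC BER from a corrected-bit counter delta over a window.

    corrected_bits: increase in the lane's corrected-bit counter (hypothetical input)
    lane_rate_bps:  nominal lane rate in bits per second
    seconds:        length of the measurement window
    """
    if lane_rate_bps <= 0 or seconds <= 0:
        raise ValueError("rate and duration must be positive")
    bits_transferred = lane_rate_bps * seconds
    return corrected_bits / bits_transferred


# Illustrative numbers only: one 100G lane, a 5-minute window,
# and a made-up counter delta of 1.2 million corrected bits.
ber = pre_fec_ber(1_200_000, 100e9, 300)
print(f"pre-FEC BER ~ {ber:.1e}")
```

A longer window does not change the estimate much for a healthy link, but it tightens the confidence around it, which is why the text asks for several minutes rather than several seconds.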

PRBS testing via CMIS Page 13h. Every CMIS-compliant module supports a generator and checker on Page 13h. From the switch CLI, you can configure a PRBS pattern (typically PRBS31Q for 100G/lane PAM4), enable generation on the local lane and checking on the remote lane, and read error counts directly from the module. This is functionally equivalent to a $100,000 BERT instrument for most use cases — the difference being that an instrumented BERT can capture eye diagrams and run stressed-eye analyses that the in-module checker can't.

RDMA bandwidth testing. If your test hosts have ConnectX or compatible NICs, the perftest suite (ib_send_bw, ib_write_bw, ib_read_bw) drives line-rate traffic and reports throughput, latency, and per-flow consistency. This is the test that confirms a link will perform under workload — not just survive.

Burn-in. A test bench running PRBS traffic for 72 hours is a complete burn-in setup. The 72 hours is the hard part; the equipment is incidental.

Together, these capabilities — using only a switch, two hosts, basic patch cables, and a fiber inspection scope — catch what we estimate to be roughly 70% of vendor-quality issues an instrumented lab would catch. The remaining 30% — TDECQ measurement, stressed-receiver testing, accelerated aging in a controlled chamber, eye-diagram analysis at the physical layer — are real but bounded. They are also exactly the cases where a contract-lab relationship pays for itself.

5.3 · The realistic Tier 2/3 lab#

Putting the budget reality together: a sub-$50K lab does 90% of what a serious operator-side testing program needs to do. That's a real number, not aspirational. The lab consists of:

2 switches matching deployment platform (~$10K each, used or refurbished from the secondary market; $20K total)
2 hosts with NICs (~$5K each — $10K)
VIAVI FBP-P5000i fiber inspection scope ($2K)
VIAVI MP-60 power meter ($2K)
OTDR — VIAVI OTDR-1 entry-level ($8K)
Cleaning supplies and references ($1K)
Patch cables — known-good characterized set ($2K)
Mating-cycle fixture, in-house built (~$500)
Test rack, environmental sensors, cabling ($3K)

Total: roughly $48,500. Plus the engineer time to set it up and operate it.
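As a quick check on that total, the line items can be summed directly. The dictionary below simply restates the per-item totals from the list above; the key names are ours and carry no special meaning.

```python
# Per-item totals in USD, restated from the list above.
line_items_usd = {
    "switches (2, used/refurbished)": 20_000,
    "hosts with NICs (2)": 10_000,
    "fiber inspection scope": 2_000,
    "power meter": 2_000,
    "OTDR (entry-level)": 8_000,
    "cleaning supplies and references": 1_000,
    "characterized patch cables": 2_000,
    "mating-cycle fixture (in-house)": 500,
    "test rack, sensors, cabling": 3_000,
}

total_usd = sum(line_items_usd.values())
print(f"Total: ${total_usd:,}")
```

Keeping the budget in a form like this makes it easy to swap in your own quotes and see which items dominate; in this list the switches and hosts account for well over half the spend.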

What this lab does not do: TDECQ measurement (send to contract lab when needed), stressed-receiver eye analysis (same), full environmental chamber thermal cycling (same), formal qualification certification documents (no operator-side lab provides these — you go to a NIST-traceable lab if a customer demands it).

What this lab does do: every sample-testing protocol described in Chapter 7. Every bring-up validation procedure described in Part 3. Field failure analysis on the modules that matter. Cross-vendor interop testing in a controlled environment. The bulk of what an operator-side testing program is for.

5.4 · When to use a contract lab#

A contract-lab relationship complements rather than replaces in-house capability. You use the contract lab for the cases where:

The test requires equipment you don't have. TDECQ measurement is the canonical example — almost no Tier 2/3 operator owns a TDECQ-capable oscilloscope, but specific procurement decisions sometimes require it.
You need third-party documentation. A NIST-traceable test report carries weight in vendor disputes that an in-house report does not. If you anticipate contesting an RMA or escalating a vendor quality issue, contract-lab documentation is worth its cost.
You need a reference comparison. Some contract labs maintain historical databases of measured modules across vendors; sending your samples for comparison against their database can yield insights your own data cannot.
The deployment is high-stakes and you want defense-in-depth. First 800G deployment, first 1.6T deployment, first deployment of a vendor with no track record. Contract-lab validation supplements your own testing.

Reputable optical test labs in North America include those operated by major test-equipment vendors (VIAVI, Keysight have services arms), academic labs at universities with photonics programs, and specialized commercial labs. Costs run from a few thousand dollars for a quick characterization to tens of thousands for a full qualification battery on a sample lot.


Chapter Six

Reading factory test reports critically#

A factory test report is the vendor's claim about a unit. It is not third-party verification; it is the vendor grading their own homework. A serious operator reads factory test reports the way an experienced editor reads a press release — looking for what the document is and is not saying, and looking for the tells that indicate the document was generated rather than measured. This chapter is the field guide.

6.1 · What a serious factory test report contains#

The minimum content a factory test report must contain to be useful:


Reference Card 06 · Vendor vocabulary

Four words on every spec sheet, decoded

Compliant · Supported · Tested · Qualified. The same module can be all four — or only one — depending on what the vendor means. Here's what each one actually claims, and what to ask to verify.

Word 01 · Weakest

Compliant

The unit measures inside the spec window — TDECQ, OMA, ER, BER — at one point in time on one tester.

Doesn't mean

Tested with your switch
Tested with your NIC
Tested at your temperature
Reproducible across lots

Word 02 · Mixed

Supported

The vendor lists the module on a compatibility document. Could be an active interop test, could be a paperwork exercise.

Ask

What test plan?
Sample size?
Last verified when?
Firmware combinations?

Word 03 · Stronger

Tested

A specific test was run. Whether it covers your conditions depends entirely on what the test plan was.

Ask

Tested for what?
Under what conditions?
How many units?
Failure rate observed?

Word 04 · Strongest

Qualified

Listed on a formal qualification list (NVIDIA LinkX, Cisco Compatibility Matrix, Arista TOI) with documented test methodology.

Verify

Document version
Specific FW combination
Specific cable length
Specific switch SKU

Strength of claim — at a glance

If the vendor says…	You can rely on…	You should still…
Compliant with IEEE 802.3-2022	Spec compliance at one tester	Sample-test on your platform
Supported on Cisco Nexus 9364	Vendor lists the combination	Verify firmware versions match
Interop-tested with ConnectX-7	Pairwise test happened	Confirm conditions match yours
Qualified on NVIDIA LinkX	Formal NVIDIA test passed	Verify cable length / FW row
Drop-in compatible	Marketing language only	Treat as untested
Field-proven	Marketing language only	Ask for fleet data

The principle

"Compatible" is what marketing says. "Qualified on the specific configuration you're deploying" is what engineering needs.

The four words exist on a strength gradient. Compliant is the floor — the unit met spec once, on one tester. Qualified on a formal list, against your specific switch + NIC + cable + firmware combination, is what protects you from the failure modes that show up at scale. Anything in between, ask.


Per-unit serial number binding. Every measurement in the report must be traceable to a specific unit by its serial number. A report that gives "batch averages" or "lot summaries" without per-unit data is worthless for operator-side verification. The serial number on the physical module must match the serial number on the report exactly, character-for-character.

The full set of IEEE-spec measurements for the variant. For a 400GBASE-DR4 module, that means TDECQ per lane, OMA outer per lane, SECQ per lane, stressed receiver sensitivity per lane, extinction ratio per lane, and pre-FEC BER per lane at the relevant test points. For 800G, the equivalent set at 100G/lane. Missing any of these is a red flag.

Measurement conditions. Temperature at which the measurement was taken (case temperature, not ambient — the spec is at the module's case), supply voltage, traffic pattern (PRBS31Q is standard for PAM4), and any reference equipment used. A report that omits the conditions is not a verifiable measurement.

Calibration provenance. The test equipment used must be calibrated, and the report must reference the calibration. A report from a tester whose last calibration was six years ago is not trustworthy regardless of the numbers it contains.

Pass/fail criteria for each measurement. What did the vendor compare the measurement against? IEEE 802.3 spec? OCP spec? Their own internal spec? The criteria matter because the same TDECQ value can be a pass against one criterion and a fail against another.

Date of test. Manufacturing date is on the module; test date should be on the report. They should be close together; a unit tested months before shipping is suspicious.

6.2 · The OCP Optics Telemetry Specification as a procurement floor#

In 2025, the Open Compute Project published the Optics Telemetry Software Specification Rev 0.9, which defines what factory test data should be embedded in the module itself for in-deployment retrieval. The specification mandates:

Non-volatile circular event buffers in the module that retain operational history
Summary flags accessible to the host for remote-end diagnostics
CDB-based remote register reads for vendor and host integration
Per-unit factory test data accessible via standardized CDB commands

The OCP Optics Telemetry Spec is operator-friendly in a way that previous specs were not — it puts operator-relevant test data in a place the operator can read it. We recommend operators specify OCP Optics Telemetry Spec Rev 0.9 (or later) compliance as a procurement requirement. This single contractual line item resolves multiple problems: it standardizes what factory data is available, it ensures the vendor's test program is rigorous enough to populate the spec's required fields, and it makes per-unit verification straightforward post-deployment.

6.3 · The red flags#

Reference Card 07 enumerates seven tells of fabricated factory test reports — identical timestamps across "different" units, identical calibration coefficients, suspiciously round numbers, zero variance, missing metadata, pass-only results, and absent test conditions. The card is the bench artifact; engineers reviewing 30 reports for procurement decisions can apply it directly. The principle: real reports have realistic noise and varied timestamps; fabricated reports look uncannily clean.

Reference Card 07 · Vendor documents

Reading a factory test report critically

What a real per-unit test report contains, what a fabricated one looks like, and the seven tells experienced operators check first when reviewing a batch of vendor reports.

Real report · what to expect

// 800G OSFP DR8 · Unit Test Report
Serial: ABCD2406A47291
Vendor P/N: V0-8DR8-FA
Test date: 2026-04-18 14:23:07 UTC
Tester ID: ANR-MP1900A-S/N 3471
Operator: emp-2841
Cal date: 2026-03-12

// Per-lane measurements
TX_pwr_L0: -1.42 dBm
TX_pwr_L1: -1.51 dBm
TX_pwr_L2: -1.38 dBm
... (8 lanes)
TDECQ_L0: 1.84 dB
OMA_L0: 1.21 dBm
BER_L0: 4.7e-7 @ 28°C

Result: PASS

Signs it's real

Per-unit data, not summarized
Timestamps vary unit-to-unit
Tester S/N + cal date present
Per-lane variance is realistic
Test conditions documented

Fabricated report · seven tells

Tell	What to look for
Identical timestamps	50 units "tested" within 30 seconds — physically impossible
Identical calibration coefficients	Page 02h thresholds copied across "different" units
Round numbers	BER values like 1.0e-7 across many units (real data is noisy)
Zero variance	TX power exactly identical lane-to-lane and unit-to-unit
Missing metadata	No tester S/N, no operator ID, no cal date
Pass-only	Vendor's report shows 100% pass; their RMA rate suggests otherwise
No test conditions	Temperature, humidity, supply voltage all absent

If two reports from the same lot have identical timestamps to the second, you're looking at a copy.

The 60-second scan

A trained operator can spot most fabricated reports in under a minute, scanning 30 reports for the patterns above. The exercise is worth the time before any procurement commitment of meaningful volume.

Open 30 reports from one batch in tabs / a CSV

Sort by timestamp — flag any cluster within seconds of each other

Check Page 02h calibration: should be unique per unit

Look at TX power distribution: variance < 0.05 dB across 30 units is a tell

Confirm tester S/N is present and consistent (real labs use 1–3 testers per line)

Spot-check 5 units for full metadata population

If anything looks wrong, ask for raw oscilloscope captures on 5 random units
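The scan steps above lend themselves to a small script once the reports are in a CSV or similar. The sketch below assumes each report has already been parsed into a dictionary; the key names (`test_time`, `page02h_cal`, `tx_power_dbm`, `tester_sn`, `cal_date`) are hypothetical, and the card's "variance < 0.05 dB" is interpreted here as a standard deviation, which is our assumption.

```python
from datetime import datetime
from statistics import pstdev

REQUIRED_METADATA = ("tester_sn", "cal_date")  # hypothetical field names


def scan_reports(reports, cluster_seconds=30.0, min_tx_stdev_db=0.05):
    """Return a list of red-flag descriptions for one batch of per-unit reports."""
    flags = []

    # Timestamps clustered within seconds of each other.
    times = sorted(datetime.fromisoformat(r["test_time"]) for r in reports)
    tight = sum(
        1 for earlier, later in zip(times, times[1:])
        if (later - earlier).total_seconds() < cluster_seconds
    )
    if tight:
        flags.append(f"{tight} consecutive reports under {cluster_seconds:g}s apart")

    # Calibration data that should be unique per unit but is repeated.
    cal = [r["page02h_cal"] for r in reports]
    if len(set(cal)) < len(cal):
        flags.append("Page 02h calibration data repeated across units")

    # TX power spread that is too small to be real measurement noise.
    tx = [r["tx_power_dbm"] for r in reports]
    if len(tx) > 1 and pstdev(tx) < min_tx_stdev_db:
        flags.append(f"TX power spread below {min_tx_stdev_db} dB across units")

    # Missing tester serial number or calibration date.
    missing = [r["serial"] for r in reports
               if not all(r.get(key) for key in REQUIRED_METADATA)]
    if missing:
        flags.append(f"{len(missing)} reports missing tester S/N or cal date")

    return flags
```

An empty list does not prove the reports are genuine; it only means none of these particular patterns appeared. Any flag is a reason to ask the vendor for raw captures, as the last step of the scan suggests.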

The principle

A factory test report is evidence of the testing process — not a substitute for your own.

Real reports have realistic noise, varied timestamps, complete metadata. Fabricated reports look uncannily clean. Either way, the report is information about the vendor's QA discipline; it's not a replacement for sample-testing on your own platform. Operators who skip sample testing because "the factory report passed" are trusting documents over evidence — and the evidence costs roughly 30 modules to gather.


6.4 · The 60-second scan#

Real operators do not study factory test reports in detail; they scan them in roughly 60 seconds. The scan looks for:

Reference Card 05 · Reliability

Vendor MTBF claims, decoded

What "5,000,000 hours MTBF" means, what it doesn't, and what observed fleet AFR actually looks like in published data from operators running these modules at scale.

What MTBF actually measures

MTBF is computed from accelerated-life-test (ALT) data on a sample population, extrapolated using an Arrhenius model that assumes failure rates double for every 10°C of operating temperature.

A claim of "5,000,000 hours MTBF" derives from a few hundred units run for a few thousand hours each, with the result projected outward. Nobody tests an optic for 570 years.

The model is internally consistent. It just doesn't predict the failure modes that dominate real fleet AFR — endface contamination, mating-cycle damage, firmware regressions, host-side SerDes interactions, thermal hot-spots in specific rack rows.

What fleet AFR actually shows

Source	Observed AFR	Implied MTBF
Alibaba HPN (SIGCOMM 2024), per access link	0.68 % / yr	~1.3 M hours
Microsoft (Azure paper, 2023), 100× optical-to-copper ratio	~0.3–1 % / yr	~0.9–3 M hours
Meta LLaMA 3 (paper Table 5), 35 of 419 interruptions network/cable	~8 % of incidents	n/a
Vendor MTBF claim (typical), datasheet boilerplate	0.175 % / yr	5 M hours

In the published operator data above, observed fleet AFR runs roughly 2× to 6× higher than a typical 5M-hour vendor MTBF claim implies.

How to read a vendor MTBF claim

What to ask	Why it matters
Sample size in the ALT	30 units is typical; 200+ is meaningful. Ask.
Test duration	1,000 hr at 85°C is industry floor. Less = less.
Failure-mode breakdown	Vendors who run real FFA can tell you. Most can't.
Field-return data, last 12 months	NFF rate, root-cause mix, vintage analysis
FIT (failures in time)	1 FIT = 1 failure per 10⁹ hours; convert to AFR
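The conversions between MTBF, AFR, and FIT in the tables above are one-liners. The sketch below uses the common linear approximation AFR ≈ 8,760 / MTBF, which is accurate when AFR is small (a few percent or less); the function names are ours.

```python
HOURS_PER_YEAR = 8760


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate (as a fraction) implied by an MTBF in hours."""
    return HOURS_PER_YEAR / mtbf_hours


def mtbf_from_afr(afr: float) -> float:
    """MTBF in hours implied by an observed AFR expressed as a fraction."""
    return HOURS_PER_YEAR / afr


def afr_from_fit(fit: float) -> float:
    """AFR (fraction) from FIT, where 1 FIT = 1 failure per 1e9 device-hours."""
    return fit * 1e-9 * HOURS_PER_YEAR


def expected_failures(modules: int, afr: float, years: float) -> float:
    """Expected failure count for a fleet, ignoring replacement effects."""
    return modules * afr * years


print(f"5M-hour MTBF -> AFR {afr_from_mtbf(5e6):.3%}")
print(f"0.68%/yr AFR -> MTBF {mtbf_from_afr(0.0068):,.0f} hours")
```

Applied to the 16,000-module, four-year example in the card, a 5M-hour MTBF implies roughly 112 failures, while a 0.68 %/yr observed AFR implies roughly 435. That difference is what drives sparing and RMA planning.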

The principle

Use vendor MTBF for relative comparison only. Use your own observed fleet AFR for absolute reliability planning.

A vendor claiming 5M-hour MTBF is not lying — they computed correctly against their model. The model just doesn't tell you what 16,000 modules in your fabric will do over four years. Your fleet's observed AFR will. Track it, share it back to the vendor at quarterly business reviews, and make procurement decisions on it — not on the spec sheet.


Per-unit serial binding present? If the report doesn't show serial-by-serial data, stop. Request a per-unit report or treat the lot as untested.

Measurements include the spec-required set? For 400G/800G PAM4: TDECQ, OMA, ER, stressed Rx sensitivity, pre-FEC BER. If the set is incomplete, the report is incomplete.

Visible spread across units? Look at three to five values across different serials for the same parameter. Real spread is visible; identical values across units is a flag.

Conditions documented? Test temperature, voltage, traffic pattern. Missing conditions means unverifiable measurements.

Pass criteria explicit? What was the report comparing against? IEEE? OCP? Vendor internal? If the criterion isn't stated, "pass" is meaningless.

A report that fails any of those five checks needs a follow-up conversation with the vendor before the lot is accepted. A report that passes all five gets the assumption of good faith — though in-house verification of a small sample is still appropriate.

6.5 · Challenging a suspicious report#

What do you do when a report has multiple red flags? The professional path:

First, assume good faith and ask. Vendors occasionally have legitimate reasons for unusual report formats — testing was done by a contractor whose templates are different, the vendor's QA system was migrated, the specific measurement was made on a sample basis rather than per-unit. Ask. The answer either resolves your concern or makes the situation clearer.

Second, request the underlying data. A vendor whose factory test reports are real has the underlying measurement data — instrument log files, calibration records, raw test results. Asking to see it is reasonable. Vendors who refuse to share underlying data while continuing to claim the reports are accurate are signaling something.

Third, run your own sample testing. The whole point of operator-side testing is that you don't have to take the vendor's word. If a report is suspicious and you've ordered samples, your sample data is the truth-finder. Run the tests, compare the results to the report's claims, and let the data speak.

Fourth, escalate or walk away. If the data confirms the report was misleading, you have a vendor quality issue that goes beyond a single shipment. Most operators have a documented vendor management process for these situations; if you don't, this is the case that motivates building one.

In practice

The standards-aligned procurement path is to require ISO 9001 certification, demand audit rights, and maintain a formal supplier quality management process documented per ISO/TS 16949 or equivalent.

Most Tier 2/3 operators don't run formal supplier-quality programs. The realistic alternative: maintain a per-vendor working file with examples of past test reports, baseline data from your own sample testing, and a noted history of any issues. When a suspicious report arrives, you have a baseline to compare against. When a vendor's report quality changes — gets thinner, gets vaguer, starts showing red flags — you notice. The discipline doesn't require certifications; it requires keeping notes.


Chapter Seven

Running the tests#

The actual test protocols — what to run, in what order, with what equipment. This chapter is the procedural core of Part 2. The protocols here are designed for the realistic Tier 2/3 lab described in Chapter 5, with notes on what changes if you have a contract lab or full instrumentation. Every test is calibrated against the standards it relates to; where the standards-prescribed procedure is impractical, we describe the alternative most operators actually use.

Reference Card 01 · Validation

The 8 categories of operator-side testing

A complete validation program runs across this taxonomy. Each test answers a different question, at a different point in the module's lifecycle.

01 · Module Identification (Sample · Bring-up)

Provenance and conformance to order. Catches mis-shipped SKUs and unexpected firmware.

Reads: CMIS Page 00h
Captures: Vendor · P/N · S/N · FW

02 · DOM Baseline (Sample · Bring-up)

Optical and electrical operating point. The reference for every future telemetry alert.

Captures: TX · RX · bias · temp
Resolution: per lane

03 · BER Measurement (Sample · Bring-up)

Physical-layer bit error rate. The truth-test for every claim about link quality.

Method: CMIS Page 13h PRBS
Target: < 1×10⁻⁶ pre-FEC

04 · Thermal Corner (Sample only)

Stability across the operating envelope. Catches modules that pass at room temp but fail under load.

Method: 70°C case via fan ctrl
Watch: BER drift · DOM drift

05 · Mating Cycle (Sample only)

Connector durability over 50+ insertions. Catches connector wear and contamination resilience.

Pass: IL creep < 0.3 dB
Cycles: 50× minimum

06 · Burn-In (Sample · Bring-up)

Infant-mortality detection. 72 hours of sustained traffic catches what a snapshot can't.

Duration: 72 hr line-rate
Watch: FEC accumulation

07 · Fabric-Level (Bring-up)

Multi-link interaction in deployment. ECMP, RDMA, PFC behavior under real workload.

Metrics: CV · BW · pause
Targets: CV < 0.10

08 · Continuous Telemetry (Production)

Drift and aging across the deployment lifetime. Trend, not snapshot, is what catches failures.

Cadence: 30 s to 5 min
Retention: 13 months

Lifecycle phase legend

Sample

Pre-procurement — first-article and lot testing

Bring-up

Every link, before workload

Production

Every link, every day

First-article only

Tests 04, 05 run during initial vendor qualification only.

Always-on

Test 08 runs continuously for the link's production lifetime.

Standard floor

Tests 01, 02, 03, 06 apply to every module, every time.

The principle

Skipping a test category moves the cost of failure forward in time.

Each phase you skip costs 10–100× more to recover from.


7.1 · The first-article qualification protocol#

For a new vendor or new SKU, this is the full 30-unit protocol that produces defensible data for a procurement decision. Total elapsed time: 5–7 days, including the 72-hour burn-in. Engineer time: 2–3 days of active work spread across the elapsed window.

Phase 1: Receiving and identification (Day 1, ~2 hours)

For each of the 30 units:

Visual inspection — packaging intact, module body undamaged, latches functional
Read EEPROM via switch CLI — vendor name, part number, serial number, date code
Verify identification matches the order (vendor and part number) and the unit (serial number on label matches serial in EEPROM)
Read CMIS Page 02h calibration data — store for later red-flag analysis
Read Application Selection table from Page 00h — record advertised configurations
Endface inspection at 200× via fiber inspection scope — record any contamination found before first mate

Outcome: a per-unit identification record, a clean baseline endface state, and any red-flag detections logged.

Phase 2: Initial DOM and link characterization (Day 1, ~3 hours)

Set up the test bench: two switches matching deployment platform, characterized fiber jumpers between them, both modules of a link from the same lot. For each pair (15 pairs from 30 units):

Insert modules, confirm Datapath State Machine reaches DPActivated
Read DOM — temperature, TX power per lane, RX power per lane, voltage, laser bias per lane
Compare DOM-reported values to external power meter measurement on a representative subset of lanes (catches calibration faults)
Configure traffic — line-rate PRBS31Q from one host to the other, both directions
After 5 minutes of sustained traffic, read FEC counters — pre-FEC BER per lane, corrected codewords, uncorrectable codewords
Read the VDM pages (20h–2Fh in CMIS) for per-lane SNR and BER statistics
Stop traffic, read interface counters — CRC, FCS, alignment errors should be zero

Outcome: a per-pair characterization record covering identification, initial DOM at room temperature, and a 5-minute FEC baseline. Any pair that fails to reach DPActivated, shows DOM outside expected ranges, or accumulates uncorrectable codewords in the 5-minute window is flagged for individual investigation.
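The DOM-versus-meter cross-check in Phase 2 reduces to a per-lane comparison. The sketch below assumes both readings are already in dBm and refer to the same measurement point (jumper loss accounted for). The 1.0 dB default tolerance is a placeholder of ours, not a figure from this handbook or from a standard; set it from the accuracy your module vendor and meter actually specify.

```python
def dom_calibration_outliers(dom_dbm, meter_dbm, tolerance_db=1.0):
    """Return {lane: delta_db} for lanes where DOM and the external meter disagree.

    dom_dbm / meter_dbm: dicts mapping lane index to TX power in dBm.
    tolerance_db: placeholder threshold (assumption); lanes measured by the
    meter but absent from DOM, or vice versa, are ignored.
    """
    return {
        lane: round(dom_dbm[lane] - meter_dbm[lane], 2)
        for lane in dom_dbm
        if lane in meter_dbm
        and abs(dom_dbm[lane] - meter_dbm[lane]) > tolerance_db
    }


# Illustrative readings for three lanes of one module.
dom = {0: -1.42, 1: -1.51, 2: -1.38}
meter = {0: -1.60, 1: -3.10, 2: -1.45}
print(dom_calibration_outliers(dom, meter))
```

A lane that shows up here is not automatically a bad module; a dirty jumper produces the same signature. Re-inspect and re-clean before logging it as a calibration fault.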

Phase 3: Thermal corner testing (Day 2, ~6 hours)

Test the same 15 pairs at three thermal corners:

Cold corner — module case temperature in the 0–5°C range. Achievable by reducing rack ambient or by running the modules at idle (they self-cool to near ambient).
Hot corner — module case temperature in the 65–70°C range. Achievable by reducing switch fan speed via NOS commands and elevating ambient toward 30–35°C.
Nominal — module case temperature in the 40–50°C range. The "Day 1" condition.

At each corner, allow the modules to thermally stabilize (15–20 minutes), then re-run the DOM and FEC characterization from Phase 2. The acceptance criterion is not that the values are identical across corners — they won't be, and shouldn't be — but that they remain inside spec at every corner and that the per-lane spread doesn't widen unacceptably.

Failed corners include: a module that drops to a degraded link state, a lane whose pre-FEC BER crosses the action threshold, a temperature reading that doesn't track the corner condition (suggests internal thermal interface fault), or a unit that fails to recover when conditions return to nominal.

Phase 4: Burn-in (Days 3–5, 72 hours, automated)

Configure all 15 pairs with sustained line-rate PRBS31Q traffic. Configure telemetry to log DOM, FEC counters, and interface error counters at one-minute cadence. Allow the burn-in to run unattended for 72 hours.

At the end of 72 hours, the per-pair acceptance is:

Zero uncorrectable codewords across the entire window
Zero CRC or FCS errors
Zero link flaps
Pre-FEC BER trend flat or improving (rising trend ≥ 0.5 decades is a fail even if the absolute value is in spec)
DOM stable within ±0.3 dB on power, ±3°C on temperature at constant ambient

Outcome: 15 pairs each with 72 hours of telemetry, scored against the acceptance criteria. Pairs that fail any criterion are pulled for individual investigation; the rest pass to Phase 5.
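Scoring 15 pairs by hand against those criteria is tedious, so a small scoring function helps. The sketch below assumes the one-minute telemetry has been loaded as a chronological list of dictionaries. The key names are hypothetical, the counters are assumed to be cumulative, and the BER trend is approximated by comparing the last sample to the first, which is cruder than fitting a slope across the window.

```python
import math


def score_burn_in(samples, max_ber_rise_decades=0.5,
                  max_power_drift_db=0.3, max_temp_drift_c=3.0):
    """Return the list of failed criteria for one pair; an empty list means pass."""
    first, last = samples[0], samples[-1]
    failures = []

    # Counter criteria: any increase over the window is a fail.
    for key in ("uncorrectable", "crc_fcs_errors", "link_flaps"):
        if last[key] - first[key] > 0:
            failures.append(f"{key} increased during burn-in")

    # Trend criterion, measured in decades of pre-FEC BER.
    floor = 1e-15  # assumed floor so a zero reading does not break log10
    rise = math.log10(max(last["pre_fec_ber"], floor)
                      / max(first["pre_fec_ber"], floor))
    if rise >= max_ber_rise_decades:
        failures.append(f"pre-FEC BER rose {rise:.2f} decades")

    # Stability criteria: largest excursion from the first reading.
    power_drift = max(abs(s["tx_power_dbm"] - first["tx_power_dbm"]) for s in samples)
    if power_drift > max_power_drift_db:
        failures.append(f"TX power drifted {power_drift:.2f} dB")

    temp_drift = max(abs(s["temp_c"] - first["temp_c"]) for s in samples)
    if temp_drift > max_temp_drift_c:
        failures.append(f"temperature drifted {temp_drift:.1f} C")

    return failures
```

In a real run the same check would be applied per lane rather than per pair, and a median over the first and last hour would be less sensitive to a single noisy reading than the endpoint comparison used here.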

Phase 5: Mating cycle and recovery (Day 6, ~3 hours)

For a subset (typically 5 of the 15 pairs, randomly selected): subject the optical interfaces to a controlled number of mating cycles using a fixture or careful manual procedure. IEC 61300-2-2 specifies 500 cycles in the original specification and 200 cycles in the 2023 single-mode patchcord revision. Most operators run 100 cycles as a practical compromise: enough to expose poor mechanical quality, not enough to consume excessive engineer time.

After mating cycles, re-inspect the endface for damage and re-run the DOM and FEC characterization. Modules whose endfaces show damage, whose insertion loss has increased meaningfully, or whose DOM values have shifted are flagged.

Phase 6: Statistical analysis and decision (Day 7, ~4 hours)

Compile the data from Phases 1–5 into a per-unit data record and a population summary. Compute Cpk for each measurement against the relevant spec limit. Identify outliers, assess their cause, and produce the procurement decision document — which is what the §4.5 decision framework in Chapter 4 covers.
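For readers who have not computed Cpk before, it is the distance from the population mean to the nearest spec limit, measured in units of three standard deviations. A minimal sketch follows, assuming the measurements are roughly normally distributed. The TDECQ values and the 3.4 dB upper limit in the example are illustrative placeholders; take the real limit from the governing spec for the variant.

```python
from statistics import mean, stdev


def cpk(values, lsl=None, usl=None):
    """Process capability index against a lower and/or upper spec limit."""
    if lsl is None and usl is None:
        raise ValueError("at least one spec limit is required")
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        # Zero spread makes Cpk undefined, and is itself a red flag.
        raise ValueError("zero spread in measurements")
    candidates = []
    if usl is not None:
        candidates.append((usl - mu) / (3 * sigma))
    if lsl is not None:
        candidates.append((mu - lsl) / (3 * sigma))
    return min(candidates)


# Illustrative per-unit TDECQ values in dB, one-sided upper limit.
tdecq_db = [1.84, 1.92, 2.05, 1.77, 2.10, 1.95, 1.88, 2.01]
print(f"Cpk = {cpk(tdecq_db, usl=3.4):.2f}")
```

A common rule of thumb treats Cpk of about 1.33 or higher as a capable process, but the acceptance threshold is a procurement decision and should come from your own decision framework.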

7.2 · The follow-on lot inspection protocol#

Once a vendor and SKU have passed first-article qualification, follow-on lots are inspected on a lighter protocol.

Per Z1.4 General Inspection Level I at AQL 4.0 (the realistic Tier 2/3 level) for typical lot sizes: sample 32 units from a 1,200-unit lot, 50 units from a 3,200-unit lot. From each sampled unit:

Identification (5 minutes per unit)
DOM read at room temperature (5 minutes)
Link-up and 5-minute FEC baseline (10 minutes)

Total per-unit time: 20 minutes. Per-lot total: ~10–17 hours of engineer-attended time. Catches gross defects, batch-level issues, and any drift from the qualified vendor baseline. Does not catch the marginal cases that first-article qualification catches with its longer protocols, which is why first-article testing comes first.
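The per-lot hours follow directly from the sample size and the three per-unit steps. A minimal sketch of that arithmetic, using the step times from the list above (the dictionary keys are ours):

```python
# Minutes per sampled unit, from the three steps listed above.
PER_UNIT_MINUTES = {
    "identification": 5,
    "dom_read": 5,
    "link_up_and_fec_baseline": 10,
}


def lot_inspection_hours(sample_size: int) -> float:
    """Engineer-attended hours to inspect one lot at the given sample size."""
    return sample_size * sum(PER_UNIT_MINUTES.values()) / 60


for n in (32, 50):
    print(f"{n} units -> {lot_inspection_hours(n):.1f} hours")
```

This assumes units are tested one at a time. Testing several links in parallel on a multi-port switch shortens the elapsed time, though not necessarily the attended time.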

7.3 · Switch-resident PRBS — the protocol details#

The switch-resident BERT capability deserves its own section because it is the most under-utilized capability in operator-side testing.

Every CMIS-compliant module supports PRBS generation and checking via Page 13h. The host-side commands to drive it vary by NOS, but the underlying capability is universal:

SONiC: sfputil show prbs reads current state; sfputil set prbs configures pattern and enables generator/checker.

NVOS (NVIDIA): nv show interface eth1 transceiver prbs reads state; nv set interface eth1 transceiver prbs pattern PRBS31Q configures.

Cumulus: PRBS support via ethtool extensions and platform-specific utilities.

Arista EOS: platform fap prbs commands on Broadcom-based platforms; equivalent on Cisco-based.

Pattern selection matters. The patterns most relevant to optical testing:

PRBS31Q — 2³¹-1 sequence on PAM4. Worst-case modulation for testing 100G-per-lane and 200G-per-lane PAM4 channels. The standard test pattern for 400G/800G optical link characterization.
PRBS13Q — Shorter sequence, easier for some BERT instruments. Allowed by IEEE 802.3ck for 100G/lane PAM4 BERT.
SSPRQ — Short Stress Pattern Random Quaternary. The official TDECQ test pattern. Stresses the channel differently than PRBS31Q.
PRBS31 (NRZ) — For 25G NRZ links (100G QSFP28). The PAM4 patterns don't apply.

Test duration matters. A PRBS test running for 1 second on an 800G link covers 8×10¹¹ bits — enough to characterize a BER of 10⁻¹¹ with reasonable confidence. Running for 60 seconds covers 4.8×10¹³ bits — enough to characterize 10⁻¹³. For first-article testing, run PRBS for at least 5 minutes per condition; for production verification, 60 seconds is usually sufficient.
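The relationship between test duration and the BER you can claim can be made explicit. With zero errors observed over n bits, the standard zero-error bound says BER is below a target with confidence CL when n ≥ −ln(1 − CL) / BER, which is about 3 / BER at 95 %. The sketch below applies that bound; the function names are ours, and the rates are nominal payload rates.

```python
import math


def bits_observed(rate_bps: float, seconds: float) -> float:
    """Bits covered by a test of the given duration."""
    return rate_bps * seconds


def bits_for_confidence(target_ber: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim BER < target_ber at the given confidence."""
    return -math.log(1.0 - confidence) / target_ber


def seconds_for_confidence(target_ber: float, rate_bps: float,
                           confidence: float = 0.95) -> float:
    """Error-free test time needed at a given bit rate."""
    return bits_for_confidence(target_ber, confidence) / rate_bps


print(f"60 s at 800G covers {bits_observed(800e9, 60):.1e} bits")
print(f"1e-13 at 95%: {seconds_for_confidence(1e-13, 800e9):.0f} s aggregate, "
      f"{seconds_for_confidence(1e-13, 100e9):.0f} s per 100G lane")
```

Note that the module's checker counts errors per lane, and each 100G lane sees only one eighth of the aggregate bits. Under this bound, a per-lane claim of 10⁻¹³ at 95 % confidence needs roughly five error-free minutes, which lines up with the first-article guidance above.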

7.4 · Thermal corner testing without an environmental chamber#

An environmental chamber costs $30,000–80,000 and runs for years. Most Tier 2/3 operators don't have one. The practical alternatives that catch most thermal defects:

The fan-modulation method

Modern switches expose fan control via NOS commands. Reducing fan speed elevates module case temperature — a 32-port 800G switch with full population at 17W per module produces enough heat that reducing fans to 50% drives module case temperature into the 65–70°C range within 15 minutes.

The procedure:

Populate test switch with modules under test
Run line-rate traffic to ensure modules dissipate full power
Reduce switch fan speed to a controlled level (specific NOS command varies)
Monitor module case temperature via DOM
When target temperature is reached and stable, run the test protocol
After test, restore normal fan operation and verify recovery
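Step 5 of the procedure depends on deciding when the temperature is "reached and stable." One way to make that decision repeatable is sketched below. The window length and tolerance are illustrative assumptions, not handbook values, and the samples are assumed to be module case temperatures polled from DOM at a fixed interval.

```python
def reached_and_stable(samples_c, target_c, window=5, tolerance_c=1.0):
    """Return True when the last `window` samples are all at or above
    target_c and differ from each other by no more than tolerance_c.

    window and tolerance_c are placeholder values; tune them to the
    polling interval and the thermal time constant of the switch.
    """
    if len(samples_c) < window:
        return False
    recent = samples_c[-window:]
    at_target = min(recent) >= target_c
    settled = (max(recent) - min(recent)) <= tolerance_c
    return at_target and settled

# Example ramp after reducing fan speed (degrees C, one sample per minute)
ramp = [48, 55, 61, 64, 66, 67, 67.5, 67.4, 67.6, 67.5]
print(reached_and_stable(ramp[:6], target_c=65.0))  # still climbing
print(reached_and_stable(ramp, target_c=65.0))      # settled above target
```

Gating the test protocol on a check like this keeps runs comparable from one lot to the next, since every module is tested at a similar thermal state.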

This catches: thermal interface defects, lane-imbalance issues that worsen at high temperature, FEC margin reductions at temperature, and unit-to-unit variation in thermal behavior. Does not catch: cold-corner issues (the switch can't drive temperatures below ambient), thermal cycling fatigue (you're not cycling repeatedly), and humidity effects.

The HVAC method

If the test rack can be moved to a controlled environment — a closet that runs cold, an outdoor area that runs hot — you can test cold corner and very hot corner without a chamber. Practical, occasionally necessary, but operationally awkward.

When the chamber is necessary

Some tests genuinely require an environmental chamber: programmable thermal cycling (Telcordia GR-468 §3.3.2.2 specifies -40°C to +85°C cycling with ≥10 minute dwell), accelerated aging tests (running modules at elevated temperature for weeks to compress lifetime), and the standards-compliant version of corner-condition validation. For these, a contract lab is the right choice.

In practice

Telcordia GR-468 specifies thermal cycling between -40°C and +85°C with ≥10 minute dwell at each extreme and ≤1 minute transition, repeated for 100 cycles, with continuous monitoring throughout. The procedure requires a programmable environmental chamber and roughly two weeks of unattended test time.

Realistically, almost no Tier 2/3 operator runs full GR-468 thermal cycling on samples. The compromise that catches a meaningful fraction of cycling-related defects: run 10 thermal cycles via the fan-modulation method, between cold (~20°C, limited by ambient) and hot (~70°C), with 30-minute dwell at each extreme. The temperature range is narrower than GR-468 and the cycle count is lower, but the test stresses the same physical mechanisms: solder joint fatigue, thermal interface degradation, and lane mismatch under temperature change. Units that survive 10 narrow cycles will probably survive deployment; units that fail 10 narrow cycles would almost certainly fail GR-468. The narrow-cycle test is not a substitute for full GR-468 when GR-468 is required (vendor disputes, regulatory submissions, certification bodies). It is a substitute for "no thermal cycling at all," which is what most operators currently do.

7.5 · Mating cycle testing#

Most optical and copper module failures are not measurement failures — they're mechanical. A mating cycle test exposes mechanical quality: the strength of the latch, the precision of the connector alignment, the durability of the optical endface against repeated mating, and the integrity of the EEPROM contact pins.

The standards: IEC 61300-2-2 defines mating durability testing for optical connectors, with a typical 500-cycle severity for data-center applications. IEC 61753-021-2:2023 reduces this to 200 cycles for single-mode patchcords specifically. EIA-364-1000 covers the controlled-environment mechanical assessment for connector-to-cage mating.

The realistic operator-side test: a fixture that mates and unmates a module 100 times, with periodic verification of:

Endface inspection at 200× (look for damage progression)
Insertion loss measurement (look for IL drift)
EEPROM read consistency (look for contact-pin wear)
Latch retention force (look for mechanical fatigue)

Fixtures can be built in-house from servo motors and a programmable microcontroller for under $1,000, or purchased as commercial test fixtures for $5,000–10,000.


Part Three

Validation at bring-up

From the moment a module arrives at the dock to the moment the fabric is handed to the workload team — the practical playbook for commissioning AI fabric interconnects.

Chapter Eight

Receiving inspection and physical hygiene#

Most production link incidents trace to physical-layer issues that should have been caught at receiving. The economics: a fiber inspection scope costs $1,500, and connector and endface contamination is consistently among the largest single categories of avoidable link issues — catching it at the connector, before a module reaches a switch port, is one of the highest-return checks in the program. The discipline is structured but not complicated.

8.1 · The ten-minute receiving check#

For every shipment of optics or cables, the receiving check before stocking:

Box and ESD bag integrity. Damaged bags or absent ESD packaging is grounds for refusing the shipment — modern modules can be ESD-damaged in handling without showing visible defects.
Spot-quantity check. Count units against the packing list. Note serial-number ranges. Flag any out-of-sequence serials that suggest mixed lots.
Visual inspection of each module. Housing damage, bent connector pins, foreign material on the optical port. Set aside any unit that looks anomalous for closer inspection.
Endface inspection on a 10% sample of modules and 100% of any shipped cables. Use a fiber inspection scope; pass/fail criteria per IEC 61300-3-35 by zone. Reject any unit with visible scratches in the core zone or failure in any zone above the standard's pass criteria.
Read the EEPROM (Page 00h) on a sample of 5–10 modules to confirm vendor, P/N, S/N, and firmware version match the packing list and the order. Mis-shipped SKUs are caught here, before they reach a deployment.
Tag and store with the receiving record (date, lot, vendor, P/N, S/N range). The tag follows the unit through bring-up and into the validation record.

The full ten-minute discipline is the sample-protocol floor for every lot regardless of vendor or relationship. It catches the obvious problems (damaged in transit, mis-shipped, wrong firmware) before any time is invested in deeper testing.

8.2 · Endface inspection, cleaning, and the first-mate rule#

Endface inspection is the highest-ROI receiving activity. A $1,500 fiber scope catches contamination, scratches, and damage that produce 0.1–0.5 dB additional attenuation per affected lane — enough to push a marginal link into post-FEC errors. The discipline: inspect before first mate, period.

IEC 61300-3-35 defines pass/fail by zone. Core zone (the fiber core itself) — no defects allowed. Cladding zone — limited defect count and size. Contact zone (the area that physically mates) — moderate tolerance. Adhesive and outer zones — wide tolerance. Modern fiber scopes apply this automatically and produce a pass/fail with the captured image.

The cleaning program. When inspection finds contamination — and it will, on shipped units that have been in storage and on connectors that have been mated repeatedly — clean with a click cleaner (cassette type for MPO, pen type for LC) or with optical-grade wipes and isopropyl. Re-inspect after cleaning. Most contamination cleans on the first pass; persistent contamination after cleaning indicates damage and the unit goes for replacement.

The first-mate rule. Inspect every endface before it touches another endface, every time. Once two endfaces have mated, they have exchanged contamination — even if both started clean. The discipline pays back at the first link bring-up that goes cleanly because the connectors were genuinely clean rather than presumed clean.

8.3 · Polarity, APC vs UPC, and IHS vs RHS#

Three physical-compatibility traps catch deployments at receiving repeatedly. None of them require depth to avoid; all require attention.

Polarity. MPO trunks come in three polarity types (A, B, and the legacy C). Type A is straight-through: fiber position 1 at one end arrives at position 1 at the other. Type B reverses the positions end to end, so position 1 arrives at position 12 on a 12-fiber MPO. The wrong polarity at one end of a link won't physically prevent the connection, but the TX lanes will no longer land on the far-end RX lanes and the link won't come up. Confirm the polarity of every cable matches the deployment design before stocking.

APC vs UPC. Angled physical contact (APC, 8° polish, green connector housing) vs ultra physical contact (UPC, flat polish, blue connector housing on single-mode). They are not interchangeable. Mating an APC connector to a UPC port produces 1–2 dB additional insertion loss and visible mechanical damage to the ferrule on the first mate. Modern AI fabrics use APC across the board, but legacy fiber plants and some breakout assemblies are UPC. Confirm at receiving.

IHS vs RHS within OSFP. Integrated heat sink (finned top) vs riding heat sink (flat top). Mechanically incompatible in each other's cages. IHS is for switches; RHS is for GPU systems with cage-mounted thermal solutions. Receiving inspection confirms the unit matches the deployment target. Cable assemblies that bridge a switch (IHS) to a GPU port (RHS) need different connectors at each end — verify the cable matches the link spec, not just the speed.

The receiving check that catches all three of these takes the same ten minutes as the basic procedure — the discipline is to actually look, not to add more steps.


Chapter Nine

Cabling and fiber plant validation#

An operator who validates modules but not the cabling they plug into is solving half the problem. The fiber plant is the channel; the modules are the endpoints. A degraded plant produces problems indistinguishable from degraded modules — and replacing modules to fix a cabling problem is one of the most expensive ways to learn about your fabric. This chapter covers what to validate before populating modules.

9.1 · Why validate the plant separately#

Most operators inherit cabling. The data center was built years ago, the patch panels were installed by someone who's no longer here, and the fiber plant has been used for previous generations of equipment. Before installing modern 400G or 800G modules, the plant must be validated against the requirements of those modules — not against the requirements of whatever ran on it last.

The risks of skipping plant validation:

OM3 multimode where you assume OM4 — reach is shorter than your design
G.652.D fiber where you need G.657 bend-insensitive — bends will cause loss
Patch panels with degraded connectors — accumulated insertion loss eats your link budget
Splices you didn't know existed — adding 0.3 dB each to your loss budget
Wrong polarity in places — links won't come up
Mixed APC/UPC in places — same problem

The validation program below addresses all of these.

9.2 · OTDR traces — what they tell you#

An Optical Time-Domain Reflectometer (OTDR) sends an optical pulse down a fiber and analyzes the reflections that return. From the reflection pattern, the OTDR reports:

Total fiber length
Per-event insertion loss (each connector, each splice, each patch panel)
Reflectance at each event
Macrobend-induced loss
Localized fiber damage

An OTDR trace looks like a sloped line punctuated by step-changes at events. The slope is fiber attenuation (typical 0.35 dB/km for OS2 at 1310nm, 2.3 dB/km for OM4 at 850nm); the steps are losses at events. A clean trace has small, predictable steps; a problematic trace has unexpectedly large steps, anomalous reflectance peaks, or non-linear slopes that indicate fiber damage.

For each link in your fabric, the OTDR trace should show:

Total length matching design (within reasonable tolerance — fiber routes have actual lengths, not theoretical)
Per-connector loss ≤ 0.5 dB (≤ 0.35 dB for high-grade installations)
Per-splice loss ≤ 0.3 dB (fusion splices)
Total link loss within IEEE-spec budget for the variant you'll deploy
No anomalous reflectance peaks indicating damaged or mismatched connectors

9.3 · Insertion loss budget — the math#

For each interconnect variant, IEEE 802.3 publishes a total channel insertion loss budget. The link must fit inside that budget with margin. Typical budgets:

Table 6 · Channel insertion loss budgets

IEEE 802.3 total link budget by variant

Total channel loss the variant tolerates from end to end, including fiber attenuation, connectors, splices, and patch panels. Operators should reserve 1.0 dB of margin against aging and installation variability.

Variant	Reach	Total budget	Recommended planning
100GBASE-SR4	100m OM4	1.9 dB	One mated pair max with 1.0 dB margin
400GBASE-SR8	100m OM4	1.9 dB	Same as 100G-SR4
800GBASE-SR8	100m OM4	1.8 dB	Tight — 1 dB fiber + 0.3 dB connector + 0.5 dB margin
400GBASE-DR4	500m OS2	3.0 dB	Comfortable for typical structured cabling
800GBASE-DR8	500m OS2	3.0 dB	Same budget as 400G-DR4 per channel
400GBASE-FR4	2km OS2	4.0 dB	Fiber attenuation dominates; install with care
400GBASE-LR4	10km OS2	6.3 dB	Long-reach budget for inter-DC distance

Per-element typical loss values:

OS2 single-mode fiber at 1310nm: 0.35 dB/km
OS2 single-mode fiber at 1550nm: 0.22 dB/km
OM4 multimode fiber at 850nm: 2.3 dB/km
Standard MPO/LC mated connector: 0.35 dB
High-grade mated connector (Panduit Ultra, etc.): 0.10–0.15 dB
Fusion splice: 0.10–0.30 dB
Mechanical splice: 0.30–0.70 dB
Patch panel pass (connector + jumper + connector): 0.75 dB total
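The budget arithmetic is simple enough to script per link. The sketch below uses the typical per-element values listed above and the 4.0 dB 400GBASE-FR4 budget from Table 6; the `Link` class and function names are hypothetical, and the splice figure takes the upper end of the fusion-splice range as a deliberately pessimistic assumption.

```python
from dataclasses import dataclass

# Typical per-element losses from the list above (dB)
FIBER_DB_PER_KM = {"OS2@1310": 0.35, "OS2@1550": 0.22, "OM4@850": 2.3}
CONNECTOR_DB = 0.35       # standard MPO/LC mated connector
FUSION_SPLICE_DB = 0.30   # upper end of the 0.10-0.30 dB range

@dataclass
class Link:
    fiber: str        # key into FIBER_DB_PER_KM
    length_m: float
    connectors: int   # mated connector pairs in the channel
    splices: int = 0

def link_loss_db(link: Link) -> float:
    """Estimated end-to-end channel loss from typical element values."""
    fiber_loss = FIBER_DB_PER_KM[link.fiber] * link.length_m / 1000.0
    return (fiber_loss
            + link.connectors * CONNECTOR_DB
            + link.splices * FUSION_SPLICE_DB)

def margin_db(link: Link, budget_db: float) -> float:
    """Budget minus estimated loss; compare against the reserve you plan for."""
    return budget_db - link_loss_db(link)

# 400GBASE-FR4: 2 km of OS2, four mated pairs, two splices, 4.0 dB budget
example = Link("OS2@1310", length_m=2000, connectors=4, splices=2)
print(f"loss {link_loss_db(example):.2f} dB, "
      f"margin {margin_db(example, 4.0):.2f} dB")
```

For this example the estimate is 2.70 dB of loss and 1.30 dB of margin, which clears the 1.0 dB reserve recommended above. An estimate like this is a planning check only; the OTDR and insertion-loss measurements remain the acceptance evidence.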

9.4 · The plant acceptance protocol#

Before populating a fabric with modules, accept the plant against criteria:

OTDR trace every trunk and every backbone fiber. Verify length, per-event loss, and total loss. For 1,000+ fiber deployments, this is automated tooling work; for smaller, an engineer with a portable OTDR.

Polarity verify a sample of trunks. 20–30 representative trunks across the layout. Document the polarity scheme (Type A trunks, Type B trunks, where each is used).

Inspect a sample of patch panel ports. 5–10% of ports on each patch panel, at 200×. Catches degraded panel connections that have accumulated contamination from previous use.

Connector type verification. Sample a representative number of patch panel ports and verify APC vs UPC matches expectation. For SM parallel optics, APC throughout; for MM, UPC throughout.

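Fiber type verification on uncertain runs. If you're not sure whether a particular run is OM3, OM4, or OM5, check the cable jacket markings on accessible spans and the installer's as-built records. An OTDR trace will not reliably separate these grades, since all three are 50 µm fiber with similar backscatter; they differ in modal bandwidth.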

9.5 · Common findings and what to do#

What plant validation typically catches:

Patch panel ports with high insertion loss. Sometimes a single port out of dozens shows 0.8 dB loss when neighbors show 0.3 dB. Cleaning may fix it; if not, the panel needs that port repaired or a different port assigned.

Mixed fiber types in a trunk. Inherited cabling sometimes splices OM3 and OM4 in the same run. The OM3 segment limits performance; the workaround is to route around the bad segment if possible.

Polarity mismatches. Catch them in plant validation rather than during commissioning. Pulling 20 trunks because they're polarity-wrong is faster than diagnosing 200 link-up failures one at a time.

Bend losses. A run that shows higher-than-expected attenuation on OTDR often has a tight bend somewhere. Tracing the run physically and relieving the bend resolves it; if the bend is in concealed cable, sometimes the only fix is to re-pull.

In practice

For greenfield deployments where the cable plant is being installed at the same time as the equipment, plant validation can be folded into the installer's acceptance procedure — the cabling contractor provides OTDR traces and polarity reports as part of project closeout. For brownfield deployments where you're adding equipment to existing cabling, plant validation is your job, and it's worth a week of an engineer's time before you start populating modules. Whatever you save by skipping it, you'll spend several times over in commissioning delays.


Chapter Ten

Power, thermal, and infrastructure validation#

Modules don't operate in isolation; they operate in racks, drawing power and producing heat that the rack and the data center hall must absorb. Before a fabric goes into production, the infrastructure validation confirms that the rack can handle what you're about to plug in. Skip this and you discover at week three that a rack is over its thermal design and modules are throttling, at which point fixing it requires either re-spreading equipment across more racks or upgrading the room cooling.

10.1 · The PDU headroom check#

Each rack has a power budget defined by its PDU. The compute and switching equipment in the rack must fit inside that budget, with margin for transient loads.

Power contributions for a fully populated 32-port 800G switch:

Switch ASIC: 400–600 W
32 modules at 17 W each (DSP optical): 544 W
Fans: 100–200 W
Power supply overhead: 100 W
Total switch: ~1,150–1,450 W

For a rack with 4 such switches plus compute, the switch contribution alone is roughly 5 kW. Add GPUs, CPUs, and storage, and even moderate-density AI racks reach 15–25 kW. Many older racks were designed for 8–12 kW; populating them with modern 800G fabric exceeds the design.

The validation is straightforward: total your projected per-rack draw, compare to PDU rating, ensure 25% margin (transients, future expansion, derating for elevated temperature). Racks that don't have margin need either equipment redistribution or PDU upgrade before commissioning.
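The headroom check can be expressed directly. The sketch below reproduces the per-switch figures from the list above and applies the 25% margin; the function names are hypothetical and the 20 kW rack in the example is an illustrative number, not a measurement.

```python
def switch_power_range_w(asic=(400, 600), modules=32, module_w=17,
                         fans=(100, 200), psu_overhead=100):
    """Low and high estimate for one fully populated switch, in watts."""
    optics = modules * module_w
    low = asic[0] + optics + fans[0] + psu_overhead
    high = asic[1] + optics + fans[1] + psu_overhead
    return low, high

def required_capacity_w(projected_load_w, margin=0.25):
    """PDU capacity needed to carry the load with the stated margin."""
    return projected_load_w * (1 + margin)

def pdu_ok(projected_load_w, pdu_rating_w, margin=0.25):
    """True when the PDU rating covers projected load plus margin."""
    return pdu_rating_w >= required_capacity_w(projected_load_w, margin)

low, high = switch_power_range_w()
print(f"per switch: {low}-{high} W, four switches: {4 * low}-{4 * high} W")
print(f"20 kW rack needs {required_capacity_w(20000):.0f} W of PDU capacity")
print(pdu_ok(20000, 24000))  # 24 kW PDU falls short of the 25% margin
```

Running the check against the high end of each range, rather than the midpoint, keeps the margin honest when the ASIC and fans are both at their worst case.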

10.2 · Thermal validation under load#

Power becomes heat. Heat is removed by airflow. The validation question: at full populated load, does the rack stay within thermal design?

Module case temperature is the proximate measure. Per Chapter 12's DOM baselines, the module's own thermal limits are typically 70–75°C case temperature. The rack must keep modules well inside that limit even at full load.

The validation procedure:

Populate the rack with target equipment
Run line-rate traffic across all interconnects (PRBS or representative workload)
Monitor module case temperatures via DOM, sampled every minute
Run for at least 30 minutes after temperatures stabilize
Verify no module exceeds 65°C case temperature at sustained load (10°C below limit gives margin for ambient excursions)

If modules exceed the threshold, the issues are usually one of:

Rack airflow obstruction — missing blanking panels, cable bundles blocking exhaust, equipment mounted backward (front-to-back airflow vs back-to-front mismatch)
Room cooling insufficient — the CRAC is undersized for the load you've added
Hot spot at the rack — the rack is in a hot aisle that's recirculating warm exhaust
Module thermal interface defect — single module runs hotter than its peers; replace and re-test

10.3 · Airflow shadowing and cable density at 800G#

At GB200-class rack densities the thermal failure mode is rarely a whole-rack cooling shortfall — the room CRAC is usually sized correctly. It is local. Two effects dominate, and both produce a signature that is easy to misread as a module defect.

Airflow shadowing. A module is not cooled by the rack's average airflow; it is cooled by the air that actually reaches its cage. In a fully populated 800G switch, a dense cable bundle emerging from the faceplate, a thick AEC assembly draped across an exhaust path, or a taller adjacent component can leave a module sitting in a still-air pocket while the rack-level airflow numbers look fine. The shadowed module runs hotter than its neighbors with no rack-level cause, and a thermal-margin reading taken at the rack inlet will not show it. The only way to catch shadowing is to compare per-module case temperature slot-by-slot across the populated switch: a single module 8–12 °C above its row peers, with healthy rack ambient, is an airflow-shadow signature.

Cable density. 800G AEC and AOC bundles at GB200 GPU-side densities are physically thick, and a faceplate fully populated with them becomes a partial air dam. The effect is cumulative: each cable is minor, the full bundle is not. Dense bundles raise the case temperature of the modules behind them and, just as important, the modules whose exhaust they block. This is why cable routing is a thermal-validation concern, not only a serviceability one.

Why this matters for link health: thermal-induced BER drift. A module running 10–15 °C hotter than baseline because of shadowing or a cable dam does not usually fail outright. It drifts. Pre-FEC BER on the hot module climbs with case temperature, often enough to cross an action threshold under sustained workload and recede when the workload pauses — a BER excursion that correlates with temperature, not with anything wrong in the optics. An operator who reads that excursion as a marginal module replaces a healthy part, and the replacement, dropped into the same shadowed slot, drifts the same way. At 800G densities this is one of the more common sources of no-fault-found RMAs. The triage rule: when a module shows temperature-correlated BER drift, validate the airflow to that specific slot — clear the cable shadow, confirm the case temperature falls, confirm the BER follows — before considering the module suspect.

Add to the thermal-validation procedure in §10.2: capture per-module case temperature across every populated slot, not a rack aggregate, and flag any module more than roughly 8 °C above its slot-row peers for an airflow-routing check.
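The peer comparison can be automated once per-slot case temperatures are collected. The sketch below is one possible implementation, assuming temperatures have already been read from DOM into a dictionary keyed by slot. Comparing each module against the median of the other modules in its row is a design choice made here, not something the handbook prescribes; the median keeps one hot module from dragging the reference upward.

```python
from statistics import median

def shadow_suspects(case_temps_c, delta_c=8.0):
    """Return {slot: excess} for modules running more than delta_c
    above the median case temperature of the other slots in the row."""
    suspects = {}
    for slot, temp in case_temps_c.items():
        peers = [t for s, t in case_temps_c.items() if s != slot]
        if not peers:
            continue  # a single module has no peers to compare against
        excess = temp - median(peers)
        if excess > delta_c:
            suspects[slot] = round(excess, 1)
    return suspects

# One row of a populated switch, slot -> case temperature in degrees C
row = {1: 52.0, 2: 53.5, 3: 51.0, 4: 63.0, 5: 52.5, 6: 54.0}
print(shadow_suspects(row))
```

In this example slot 4 sits 10.5 °C above its row peers and is the only slot returned, which makes it a candidate for an airflow-routing check before anyone suspects the module itself.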

10.4 · Ambient conditions#

Ambient temperature and humidity at the rack intake set the floor for everything else. ASHRAE specifies allowable envelope classes (A1 through A4) for data center conditions, plus a narrower recommended range inside them; AI deployments should generally hold intake within that recommended range (18–27°C), with humidity at 30–60% RH.

Out-of-spec ambient makes every threshold in this handbook unreliable. A module running 30°C above ambient is fine when ambient is 22°C; the same delta is a thermal alarm when ambient is 32°C. Validation that proves a fabric works at one ambient doesn't prove it works at another.

Verify before commissioning:

Rack intake air temperature within 18–27°C (the ASHRAE recommended range), measured during peak load conditions
Humidity 30–60% RH (avoid below 30% — static discharge risk; avoid above 70% — condensation risk)
No visible dust accumulation on intake surfaces
HVAC system has 25% capacity margin against current load

10.5 · The infrastructure pre-flight checklist#

Before installing modules into a rack:

☐ PDU capacity verified ≥ projected load + 25% margin
☐ All blanking panels in place; no recirculation paths
☐ Cable management does not obstruct airflow exhaust
☐ Cable bundles routed clear of module faceplates; no bundle shadowing a populated cage
☐ Switch airflow direction matches rack layout (front-to-back in cold-aisle/hot-aisle layouts)
☐ Rack intake air temperature at peak load measured and within envelope
☐ Per-module case temperature captured across all populated slots; no module >8 °C above its slot-row peers
☐ Rack-level monitoring active (PDU draw, intake/exhaust temperatures)
☐ HVAC capacity confirmed for added thermal load

Each item that fails the check needs to be remediated before module installation, not after.


Chapter Eleven

The five-check bring-up procedure#

Each module that enters production should pass five checks, in order, with each check gating the next. The procedure is the same across all NOS stacks; the commands differ. This chapter is the master procedure with side-by-side NOS commands. The experienced engineer runs all five in a few minutes of hands-on time per module, about 8 minutes end to end once the traffic wait is included (§11.8).

Reference Card 12 · Project timeline

Bring-up timeline — 3 weeks vs 8 weeks

A 1,024-GPU cluster bring-up runs three weeks done well, eight weeks done badly. Here's where the time goes in each scenario, and what determines which path you're on.

✓ 3 weeks · validation done well (D0 → D21)

Receiving (3 days) → Cabling validation (4 days) → 5-check bring-up (5 days, automated) → Burn-in, 72 hr (3 days) → Fabric validation (3 days) → D21

Cluster utilization (cumulative): D0 · 0 GPU-days → D21 · cluster live

21 days · ~$0 in unplanned cluster-idle cost

Why it works: validation runs in parallel where possible, automated where structured (5-check bring-up), and gates the next phase only when the previous phase passes. Customer accepts on D21.

✗ 8 weeks · validation deferred (D0 → D56)

Receiving (3 days) → Rush bring-up (skipped checks) → Workload runs (incidents start) → Incident chase (10 incidents, 18 days) → Re-validate (14 days)

Cluster utilization: intermittent, ~40% effective. D0 · 0 GPU-days → D56 · cluster live

56 days · ~$1.4–4M in cluster-idle cost @ 1,024 GPUs

Why it doesn't: bring-up was rushed — checks 3 and 4 deferred. Workload starts; incidents start the same day. Engineering chases each one in production, with the customer watching. Re-validation eventually catches up to where it should have started.

Determinant 01 · Cabling validated up front

A bad fiber plant produces problems that look like module faults. Validating cabling before optics go in saves 60% of bring-up incident-chase time on average.

Determinant 02 · 5-check protocol automated

A scripted 5-check protocol against the platform CLI lets one engineer process 50+ links per day. Manual scan = 8 modules per day. The 6× speedup is the program.

Determinant 03 · Burn-in is non-negotiable

72 hours of sustained workload before customer acceptance catches the infant-mortality population that fails in week 1. Skipping it pushes those failures into customer time.

Vitex Validation Handbook · Card 12/15 · vitextech.com

11.1 · Why five checks, in this order#

The check sequence is designed so that each step proves a property the next step depends on. Skipping forward produces confusing diagnostics:

Check 1 (presence/identity) gates everything — if the module isn't recognized, nothing else matters
Check 2 (link state/rate) requires presence; if link doesn't come up, DOM and FEC don't apply
Check 3 (DOM) requires link-up; idle modules don't produce meaningful telemetry
Check 4 (FEC) requires DOM showing healthy levels; FEC errors on a link with out-of-range RX power reflect the power problem rather than the link's true FEC margin, so resolve the DOM finding first
Check 5 (interface counters) is the final gate; clean counters under traffic mean the link is production-ready

11.2 · Check 1 — Presence and identity#

Verify the module is detected, the EEPROM is readable, and the identification matches what was ordered.

Table 7 · Check 1 commands by NOS

Reading module presence and identification

NOS	Command
SONiC	`show interfaces transceiver eeprom Ethernet0`
NVOS	`nv show interface eth1 transceiver`
Cumulus	`ethtool -m swp1`
Arista EOS	`show interfaces Ethernet1 transceiver properties`
Cisco NX-OS	`show interface Ethernet1/1 transceiver details`
Juniper Junos	`show interfaces diagnostics optics et-0/0/0`

Verify in output:

Vendor name matches the order
Part number matches the SKU
Serial number is present and unique (not duplicating another module)
Module type matches expectation (e.g., "800G OSFP DR8" — not a 400G module mistakenly shipped)
Date code reasonable (not stock from years ago)

Failure modes: Module not detected (reseat; verify cage power; check for bent pins). Wrong part identified (CMIS parser update needed on switch). EEPROM corrupted/unreadable (faulty module — RMA).

11.3 · Check 2 — Link state and negotiated rate#

Confirm the module links up at the expected speed, FEC mode, and lane configuration.

Table 8 · Check 2 commands by NOS

Reading link state and negotiated parameters

NOS	Command
SONiC	`show interface status Ethernet0`
NVOS	`nv show interface eth1 link`
Cumulus	`ip link show swp1` + `ethtool swp1`
Arista EOS	`show interfaces Ethernet1 status` + `show interfaces Ethernet1 phy`
Cisco NX-OS	`show interface Ethernet1/1 status`
Juniper Junos	`show interfaces et-0/0/0 extensive`

Verify:

Link state: up
Operational speed matches configured
FEC mode correct (KP4 for 400G/800G PAM4; KR4 or none for 100G NRZ)
Duplex: full
Lane count matches the module type (8 host lanes for 800G OSFP; for 400G QSFP-DD, 8 host lanes at 50G with 4 optical lanes on DR4/FR4)

Failure modes: Link down despite presence (wrong cable polarity, fiber type, or far-end module off). Link up but wrong speed (auto-neg mismatch, firmware mismatch). Link up but wrong FEC (CMIS FEC negotiation failed — common on new AECs).

11.4 · Check 3 — DOM read, per-lane#

Verify optical/electrical telemetry is in healthy ranges per Chapter 12. Critically, per-lane, not aggregate.

Table 9 · Check 3 commands by NOS

Reading DOM per lane

NOS	Command
SONiC	`show interfaces transceiver dom Ethernet0`
NVOS	`nv show interface eth1 transceiver diagnostics`
Cumulus	`ethtool -m swp1 | grep -iE 'power|temperature|bias'`
Arista EOS	`show interfaces Ethernet1 transceiver detail`
Cisco NX-OS	`show interface Ethernet1/1 transceiver details`
Juniper Junos	`show interfaces diagnostics optics et-0/0/0`

Verify against Chapter 12 baselines:

Temperature in expected range for variant at current ambient
TX power per lane within IEEE min/max, within typical operating range
RX power per lane above sensitivity floor, within saturation ceiling
Lane-to-lane TX spread ≤ 1 dB
Lane-to-lane RX spread ≤ 2 dB
Laser bias within expected range for laser type
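The lane-to-lane spread limits above are easy to check mechanically once per-lane powers are parsed out of the DOM output. The sketch below assumes the values are already available as lists of dBm readings; the parsing step is NOS-specific and omitted, and the function names are hypothetical.

```python
def lane_spread_db(powers_dbm):
    """Spread between the strongest and weakest lane, in dB."""
    return max(powers_dbm) - min(powers_dbm)

def check_lane_spread(tx_dbm, rx_dbm, tx_limit_db=1.0, rx_limit_db=2.0):
    """Apply the Check 3 spread limits (TX <= 1 dB, RX <= 2 dB)."""
    tx_spread = lane_spread_db(tx_dbm)
    rx_spread = lane_spread_db(rx_dbm)
    return {
        "tx_spread_db": round(tx_spread, 2),
        "rx_spread_db": round(rx_spread, 2),
        "tx_ok": tx_spread <= tx_limit_db,
        "rx_ok": rx_spread <= rx_limit_db,
    }

# Illustrative 8-lane readings; lane 3 RX is noticeably weaker than its siblings
tx = [1.2, 1.0, 0.9, 1.4, 1.1, 1.3, 0.8, 1.0]
rx = [-2.1, -2.4, -1.9, -4.6, -2.2, -2.0, -2.3, -2.5]
result = check_lane_spread(tx, rx)
print(result)
```

Here the TX spread (0.6 dB) passes and the RX spread (2.7 dB) fails, which points at one receive lane or its fiber rather than at the module as a whole. An aggregate power reading would have hidden it.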

11.5 · Check 4 — FEC baseline#

Pre-FEC BER per lane below install target (Chapter 4 of Part 2 referenced these earlier; the operating thresholds for production are in Chapter 12 of this part). Critical: run Check 4 only after at least 5 minutes of line-rate traffic. An idle link shows zero errors which means nothing.

Table 10 · Check 4 commands by NOS

Reading FEC counters and pre-FEC BER

NOS	Command
SONiC	`show interfaces counters fec Ethernet0`
NVOS	`nv show interface eth1 counters fec`
Cumulus	`ethtool --show-fec swp1` + `ethtool -S swp1 | grep fec`
Arista EOS	`show interfaces Ethernet1 counters fec`
Cisco NX-OS	`show interface Ethernet1/1 counters fec-stats`
Juniper Junos	`show interfaces et-0/0/0 extensive | find "FEC"`

Verify:

Pre-FEC BER per lane below install target for variant (per Table 14 in Chapter 12)
Corrected codewords non-zero and accumulating — this is expected; zero means no traffic ran
Uncorrectable codewords zero absolute
Per-lane BER spread within 1 decade across lanes — a single lane far above its siblings is a lane-specific defect, not uniform aging
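The one-decade spread rule translates to a base-10 logarithm of the ratio between the worst and best lane. The sketch below assumes per-lane pre-FEC BER values have already been obtained from the platform counters; lanes reporting exactly zero are skipped in the ratio, which is a simplification chosen here so the logarithm stays defined.

```python
import math

def ber_spread_decades(lane_ber):
    """log10 of worst-lane BER over best-lane BER, ignoring zero readings."""
    nonzero = [b for b in lane_ber if b > 0]
    if len(nonzero) < 2:
        return 0.0
    return math.log10(max(nonzero) / min(nonzero))

def check_ber_spread(lane_ber, max_decades=1.0):
    """Return the spread, the index of the worst lane, and a pass flag."""
    spread = ber_spread_decades(lane_ber)
    worst_lane = max(range(len(lane_ber)), key=lambda i: lane_ber[i])
    return {
        "spread_decades": round(spread, 2),
        "worst_lane": worst_lane,
        "within_spec": spread <= max_decades,
    }

# Illustrative 8-lane pre-FEC BER readings; lane 3 is far above its siblings
lanes = [2e-8, 3e-8, 1.5e-8, 4e-6, 2.5e-8, 2e-8, 3e-8, 1e-8]
report = check_ber_spread(lanes)
print(report)
```

In this example the spread is about 2.6 decades with lane 3 as the outlier, the lane-specific pattern the last verification item describes.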

11.6 · Check 5 — Interface counter clean-start#

Final gate: confirm zero CRC and FCS errors, and symbol errors below threshold, over a 15-minute traffic window after counter clear.

Procedure: clear counters, run line-rate traffic for 15 minutes, read counters. Acceptance: zero CRC, zero FCS, zero alignment errors, symbol errors below threshold (Chapter 17 covers the exact thresholds).

A link that passes Checks 1–4 but accumulates CRC errors in Check 5 has a subtle channel issue — usually a marginal connector, an MPO with alignment problems, or a cable with internal defect.

11.7 · Breakout-specific procedure#

For breakout deployments (one parent port broken into multiple child legs), each leg is an independent link and gets independent checks.

The procedure: configure the parent port for breakout mode (NOS-specific commands), wait for the child interfaces to appear, and run Checks 1–5 on each child interface independently. Child naming is NOS-specific: SONiC typically assigns lane-offset names such as Ethernet0, Ethernet2, Ethernet4, Ethernet6 for a four-way split of an eight-lane port, while Junos uses channel suffixes such as et-0/0/0:0, et-0/0/0:1.

Common breakout-specific issues:

Parent port doesn't enter breakout mode: usually a NOS configuration issue, not a module fault
Some legs link, others don't: suspect the fan-out fiber assembly or a partial breakout firmware mode
FEC negotiation differs across legs: check that the same FEC mode is configured on every child interface and on the far-end equivalents
One leg accumulates errors while the others stay clean: single-lane defect in the parent module

11.8 · The four-minute scan#

The experienced engineer runs the five checks as a tight sequence:

Insert module; wait 30 seconds (Check 1)
Read EEPROM identification — verify or fail (30 seconds)
Wait for link-up; read link state (30 seconds, Check 2)
Read DOM per-lane; eyeball against expected ranges (60 seconds, Check 3)
Start traffic; wait 5 minutes; read FEC and interface counters (Checks 4 and 5 together)

Total: roughly 8 minutes including the 5-minute traffic wait, of which about 2.5 minutes is hands-on time. Because the wait is unattended, the engineer can run multiple modules in parallel (start one, walk to the next, return after 5 minutes) and process 5–10 modules per hour at pace.

Chapter Twelve

DOM baselines — what healthy looks like#

DOM (Digital Optical Monitoring) is the operational telemetry the module exposes: temperature, TX power, RX power, laser bias current. Healthy values vary by vendor and SKU; what matters operationally is establishing a baseline at install and alerting on excursion from that baseline rather than against absolute thresholds.

12.1 · Reading DOM output#

DOM lives on CMIS Page 11h (per-lane data) plus Page 00h temperature. Most switch CLIs surface it via a single command: Cisco `show interfaces transceiver detail`, Arista `show interfaces transceiver dom`, NVIDIA `mlxlink`, SONiC `sfputil show eeprom`. The structure is the same across platforms: per-lane TX power (dBm), RX power (dBm), bias current (mA), plus module-level case temperature (°C). Capture this output at bring-up for every link; that capture is the baseline.

Reference Card 04 · Bench-side

DOM Quick Reference

Healthy zones, warning bands, and alarm thresholds for the four DOM telemetry parameters every operator polls.

Reference module: 800G OSFP IHS · 8 lanes · PAM4 100G/lane · 14–17 W envelope · CMIS 5.x · Page 11h

Parameter	Source · poll	Alarm (low)	Warn (low)	Healthy	Warn (high)	Alarm (high)	Action on alarm
Module temperature (°C, case temp)	Page 00h · 1-min poll	< 0 °C	0–10 °C	10–65 °C · typical 35	65–75 °C	> 75 °C	Investigate airflow first. Sustained >70 °C = open thermal-interface ticket on the rack.
TX power per lane (dBm, optical out)	Page 11h · 5-min poll	< −8 dBm	−8 to −5	−5 to +3 dBm · typical −1.5	+3 to +5	> +5	Low TX = laser failure. Alert on Δ > 1 dB drift from install baseline.
RX power per lane (dBm, optical in)	Page 11h · 5-min poll	< −11 dBm	−11 to −8	−8 to +2 dBm · typical −3	+2 to +4	> +4	Channel first: inspect, clean, re-mate connector before any module swap.
Laser bias per lane (mA, drive current)	Page 11h · 5-min poll	< 15 mA	15–25 mA	25–80 mA · typical 42 at install	80–100 mA	> 100 mA	Bias rises before TX power drops: the leading aging indicator. Plan retirement.

PRINCIPLE 01

Baseline at install

Vendor thresholds vary. Always baseline at install; alert on excursion from baseline, not from absolute.

PRINCIPLE 02

RX low ≠ module fault

RX power drop usually means channel — endface contamination or connector wear, not module failure.

PRINCIPLE 03

Bias is the leading indicator

Laser bias slope tells you 3–6 months ahead of TX power degradation. Alert on the slope.

PRINCIPLE 04

Trend, not snapshot

A reading inside healthy is fine in isolation. A reading drifting toward warning is the alert.

Vitex Validation Handbook · Card 04/15 · vitextech.com

The four DOM parameters and their operational meaning:

Module temperature — case temperature of the module package. Healthy range typically 10–65 °C with typical operation around 35 °C in a properly cooled rack. Sustained >70 °C means investigate airflow first; modules don't usually fail thermally before their environment does.
TX power per lane: optical output power. Healthy range typically −5 to +3 dBm depending on optic class. The lagging aging indicator: TX power drops over time as the laser ages, but only after bias current has already risen. Alert on Δ > 1 dB drift from install baseline rather than on an absolute threshold.
RX power per lane — optical input power received from the far end. Healthy range typically −8 to +2 dBm. RX low does not mean module fault — it usually means channel: endface contamination, connector wear, fiber damage. Inspect / clean / re-mate before any module swap.
Laser bias current per lane — drive current the laser is consuming to maintain output. Healthy range typically 25–80 mA depending on laser type. The leading aging indicator: bias rises gradually over the module's lifetime as the laser device degrades. Bias slope shifts 3–6 months ahead of TX power dropping measurably. Alert on the slope, not the value.

For complete healthy / warning / alarm thresholds visualized as zones, see Reference Card 04 (DOM Quick Reference). The card is the bench-side artifact; this chapter is how to interpret what it shows.

12.2 · Establishing the install baseline#

The install baseline is captured immediately after bring-up validation passes: once the link is up, error-free under PRBS, with traffic flowing in steady state. The baseline is per-lane and per-link; a 32-port switch produces ~256 per-lane baseline records (32 modules × 8 lanes), plus one module-level temperature reading per module.

What to capture in the baseline record:

Module S/N, switch ID, port, lane index
Each of the four DOM parameters at the moment of capture
Ambient and rack inlet temperature at capture time
Workload state — should be steady-state for representative DOM, not bring-up transient
Capture timestamp

The baseline becomes the per-link reference for production telemetry. When a module's TX power drifts >1 dB from its baseline, that's the alert — not when it crosses a fleet-wide absolute threshold. Vendor thresholds vary; baseline-relative alerting catches per-module degradation that absolute thresholds miss.
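Baseline-relative alerting reduces to comparing each lane against its own install value. A minimal sketch follows, assuming the baseline and current readings are already available as per-lane lists in dBm; the function name is hypothetical and the sample values are illustrative only.

```python
def tx_drift_alerts(baseline_dbm, current_dbm, max_drift_db=1.0):
    """Return (lane_index, drift_db) for lanes that moved more than max_drift_db."""
    alerts = []
    for lane, (base, now) in enumerate(zip(baseline_dbm, current_dbm)):
        drift = now - base  # negative means the lane lost power since install
        if abs(drift) > max_drift_db:
            alerts.append((lane, round(drift, 2)))
    return alerts

baseline = [0.8, 0.9, 1.1, 0.8, 0.9, 1.0, 0.9, 0.8]   # captured at install
current = [0.7, 0.9, 1.0, -0.5, 0.8, 1.0, 0.9, 0.7]   # latest poll

print(tx_drift_alerts(baseline, current))  # [(3, -1.3)]
```

Lane 3 reads −0.5 dBm, which sits comfortably inside a typical healthy band, yet it has lost 1.3 dB against its own baseline. That is the case an absolute threshold misses.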

12.3 · Drift detection and the trend#

A reading inside the healthy zone is fine in isolation. A reading drifting toward warning over weeks is the alert. The two trend patterns that matter:

Bias rising. Laser is aging; the device is consuming more current to maintain output. Bias slope >0.5 mA/month sustained over a quarter is the typical alert threshold. Lead time: 3–6 months before TX power drops below alarm. Plan retirement; the link still works fine in the meantime.

RX dropping. Almost always channel — endface contamination accumulating, connector wear, fiber damage. Drop over days suggests contamination event (recent maintenance touched the connector); drop over months suggests gradual wear. Inspect and clean before any module action.

TX power drifting downward without bias drifting up is rare and indicates a module-internal problem worth raising with the vendor. Temperature drifting upward without environmental change usually indicates dust accumulation in airflow paths or fan degradation in the rack — investigate the rack before the module.
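The bias-slope rule can be sketched with an ordinary least-squares fit. This is one possible implementation, assuming one bias sample per lane at known day offsets; the 30-day month and the requirement of a 90-day observation span are my reading of "sustained over a quarter", and the sample series are invented for illustration.

```python
def slope_per_month(days, values):
    """Least-squares slope of values against time, scaled to a 30-day month."""
    n = len(days)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("all samples share one timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    return (sxy / sxx) * 30.0

def bias_aging_alert(days, bias_ma, threshold_ma_per_month=0.5, min_span_days=90):
    """True when the slope exceeds the threshold over at least min_span_days."""
    span = max(days) - min(days)
    return span >= min_span_days and slope_per_month(days, bias_ma) > threshold_ma_per_month

days = [0, 15, 30, 45, 60, 75, 90]
aging = [42.0, 42.4, 42.9, 43.3, 43.8, 44.2, 44.7]    # roughly 0.9 mA/month
stable = [42.0, 42.1, 41.9, 42.0, 42.1, 42.0, 42.1]   # measurement noise only

print(bias_aging_alert(days, aging))   # True
print(bias_aging_alert(days, stable))  # False
```

Fitting a line across the whole window, rather than differencing the first and last sample, keeps a single noisy reading from triggering or masking the alert.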

Chapter Thirteen

Burn-in for production deployment#

Burn-in exists because some defects manifest only on a timescale of hours to days. A module that passes Chapter 11's five checks at minute zero may begin failing at hour 50 — a thermal interface that needs heating cycles to settle, a marginal solder joint that requires sustained thermal expansion to fail, a firmware state-machine edge case that triggers only after specific event sequences. Burn-in is the discipline that catches these.

13.1 · Why 72 hours is the minimum#

Burn-in duration relates directly to what classes of defect you catch:

Table 11 · Burn-in duration and what each catches

Defect classes by exposure time

Duration	Catches
1 hour	Gross hardware faults that didn't show at link-up
6 hours	Thermal stabilization issues; initial seating settling
24 hours	Day/night thermal cycling; switch fan-speed transitions
48 hours	Slow thermal runaway; accumulated solder-joint stress
72 hours	Latent manufacturing defects — the minimum for AI workload
7 days	Early-life laser aging anomalies; firmware state-machine edge cases

Recommendation by deployment class:

Storage / management fabric: 24 hours minimum
Compute fabric (general): 48 hours minimum
AI training fabric: 72 hours minimum (the standard)
First deployment of new-generation technology (first 800G in your fabric): 7 days for a sample (10–20% of modules), 72 hours for the rest

13.2 · Traffic profile#

Burn-in at idle doesn't burn anything in. Sustained line-rate traffic is required.

For Ethernet fabrics:

PRBS31Q test pattern via switch-resident BERT (best stress for PAM4)
Synthetic iperf3 or TRex traffic at line rate
Real application traffic from a test load generator

For InfiniBand / RoCE fabrics:

ib_send_bw running bidirectionally
ib_write_bw for RDMA write stress
Full perftest suite — every test type

What does not count as burn-in: ping traffic, link-up idle, short benchmarks that complete in minutes.

13.3 · Telemetry during burn-in#

One-minute polling of the following, persisted to a time-series store for later analysis:

Per-lane TX power, RX power
Module temperature
Per-lane laser bias current
Per-lane pre-FEC BER
Per-lane corrected codewords (cumulative)
Uncorrectable codewords (cumulative)
Interface CRC, FCS errors (cumulative)
Symbol errors on copper (cumulative)
Link state changes (event-driven)
Ambient rack temperature

13.4 · Acceptance criteria#

At end of burn-in window, all of the following must hold:

Zero uncorrectable codewords across the entire window
Zero CRC / FCS errors
Zero link flaps
Zero temperature excursions above action threshold
Pre-FEC BER trend flat or improving — a link whose pre-FEC BER rose 0.5 decades or more during burn-in is a fail even if final value is in spec
DOM stable within ±0.3 dB on TX power, ±3 °C on temperature at constant ambient

A link failing any criterion does not ship to production. The process: remove, root-cause, replace, re-run burn-in on the replacement from hour zero. Do not partial-credit the burn-in clock.
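The acceptance criteria above lend themselves to an automated gate. The sketch below assumes the burn-in telemetry has already been reduced to a per-link summary; the `BurnInSummary` structure and its field names are hypothetical, and "stable within ±0.3 dB / ±3 °C" is interpreted here as the largest deviation from the starting value.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class BurnInSummary:
    # Hypothetical summary structure; field names are illustrative only.
    uncorrectable_total: int
    crc_fcs_total: int
    link_flaps: int
    temp_excursions: int               # samples above the temperature action threshold
    start_ber: List[float]             # per-lane pre-FEC BER at hour zero
    final_ber: List[float]             # per-lane pre-FEC BER at end of window
    tx_power_max_dev_db: List[float]   # per-lane largest |TX power - starting TX power|
    temp_max_dev_c: float              # largest |temperature - starting temperature|

def burn_in_failures(s: BurnInSummary) -> List[str]:
    """Return the list of failed criteria; an empty list means the link passes."""
    failures = []
    if s.uncorrectable_total > 0:
        failures.append("uncorrectable codewords recorded")
    if s.crc_fcs_total > 0:
        failures.append("CRC/FCS errors recorded")
    if s.link_flaps > 0:
        failures.append("link flapped")
    if s.temp_excursions > 0:
        failures.append("temperature exceeded action threshold")
    for lane, (b0, b1) in enumerate(zip(s.start_ber, s.final_ber)):
        if b0 <= 0 or b1 <= 0:
            failures.append(f"lane {lane}: no BER measured (was traffic running?)")
        elif math.log10(b1 / b0) >= 0.5:
            failures.append(f"lane {lane}: pre-FEC BER rose 0.5 decades or more")
    for lane, dev in enumerate(s.tx_power_max_dev_db):
        if dev > 0.3:
            failures.append(f"lane {lane}: TX power moved more than 0.3 dB")
    if s.temp_max_dev_c > 3.0:
        failures.append("temperature moved more than 3 °C")
    return failures

good = BurnInSummary(0, 0, 0, 0, [1.2e-9, 1.5e-9], [1.3e-9, 1.4e-9], [0.1, 0.2], 1.8)
bad = BurnInSummary(0, 0, 1, 0, [1.2e-9, 1.5e-9], [1.3e-9, 9.0e-9], [0.1, 0.4], 1.8)

print(burn_in_failures(good))  # []
for reason in burn_in_failures(bad):
    print(reason)
```

Returning the list of reasons, rather than a bare pass/fail, gives the root-cause step a starting point.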

13.5 · The textbook says one-minute polling. Reality says…#

In practice

Burn-in standards prescribe continuous monitoring at one-minute or finer cadence with all metrics logged to non-volatile storage and audited at end of window.

One-minute polling for 72 hours produces 4,320 sample points per metric per module. Across 200 modules and 10 metrics each, that's 8.6 million data points to review. Most operators reasonably summarize: trend graph for visual inspection, automated alert on threshold excursions, percentile statistics rather than raw point-by-point. The discipline is not "look at every sample" — it's "ensure your alerting catches excursions that matter and your summary statistics tell you whether the link drifted." The summary is what gets reviewed; the raw data is retained for forensic investigation if a link later fails.
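The data-volume arithmetic, and one way to reduce a raw series to reviewable statistics, can be sketched as follows. The choice of min / p50 / p99 / max is an assumption about what a reviewer wants to see, not a handbook requirement.

```python
import statistics

def samples_per_metric(hours, poll_seconds=60):
    """Number of samples one metric produces over the burn-in window."""
    return hours * 3600 // poll_seconds

def summarize(samples):
    """Reduce a raw series to the handful of numbers a reviewer reads."""
    ordered = sorted(samples)
    # quantiles(n=100) returns 99 cut points; index i holds percentile i + 1.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {"min": ordered[0], "p50": cuts[49], "p99": cuts[98], "max": ordered[-1]}

per_metric = samples_per_metric(72)
total = per_metric * 200 * 10   # 200 modules, 10 metrics each

print(per_metric)  # 4320
print(total)       # 8640000
print(summarize(list(range(101))))
```

The raw series still goes to storage; only the summary is put in front of a person.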

13.6 · The parallel-burn-in pattern#

Burn-in is wall-clock time, not engineer time. The optimization: run as many modules in parallel as your test infrastructure supports. A bring-up bench that can host 64 modules simultaneously processes 64 modules in 72 hours; sequential burn-in of the same 64 modules would take 192 days.

Parallel burn-in requires:

Sufficient test switches to host the modules — usually limited by switch port count
Telemetry infrastructure that can poll all modules simultaneously without overloading any switch's management plane
Process discipline to start, monitor, and accept many concurrent burn-ins

The realistic Tier 2/3 commissioning bench: 2–4 switches, supporting 64–256 modules in parallel, processing several thousand modules per quarter through 72-hour burn-ins.
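The throughput figures in this section follow from simple arithmetic, sketched below. The calculation assumes zero changeover time between batches and a 90-day quarter; both are simplifying assumptions of this sketch, so real throughput will be somewhat lower.

```python
import math

def sequential_days(modules, burn_in_hours=72):
    """Wall-clock days if modules are burned in one at a time."""
    return modules * burn_in_hours / 24

def parallel_days(modules, bench_slots, burn_in_hours=72):
    """Wall-clock days when bench_slots modules run concurrently."""
    batches = math.ceil(modules / bench_slots)
    return batches * burn_in_hours / 24

def modules_per_quarter(bench_slots, burn_in_hours=72, quarter_days=90):
    """Upper bound on modules processed in a quarter, ignoring changeover."""
    batches = (quarter_days * 24) // burn_in_hours
    return batches * bench_slots

print(sequential_days(64))        # 192.0
print(parallel_days(64, 64))      # 3.0
print(modules_per_quarter(64))    # 1920
print(modules_per_quarter(256))   # 7680
```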

Chapter Fourteen

Fabric-level validation#

Per-link validation proves each link is healthy. Fabric-level validation proves that the links, taken together, deliver what the workload needs: uniform path utilization, predictable performance under collective communication, no surprises when GPUs arrive. This chapter is the procedure for validating the whole fabric — ECMP hash quality, RDMA performance, fabric topology consistency, and the proof-without-GPUs problem.

ECMP hash polarization is the most common fabric-level defect on Ethernet AI fabrics. The aggregate counters look fine; per-next-hop counters tell the story. The coefficient of variation across your hash distribution is the single number that captures health.

14.1 · The CV measurement#

The fundamental ECMP health metric: coefficient of variation of per-path byte counts under representative traffic.

Procedure:

Run representative workload across the fabric (or synthetic traffic that mimics it) for 10–15 minutes
For each ECMP group (e.g., a leaf's uplinks to spines), collect per-link byte counters
Compute: CV = standard deviation of byte counts / mean of byte counts

Acceptance:

CV < 0.15 (15%): Healthy. Distribution is good.
CV 0.15 – 0.30: Warning. Investigate hashing.
CV > 0.30: Fail. ECMP is not distributing properly.
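The CV computation and its acceptance bands can be sketched in a few lines. The inputs are assumed to be per-link byte-count deltas over the measurement window (end counter minus start counter), and the population standard deviation is used; the procedure above does not say which variant to use, so that choice is an assumption.

```python
import statistics

def ecmp_cv(byte_counts):
    """Coefficient of variation of per-link byte counts within one ECMP group."""
    mean = statistics.fmean(byte_counts)
    if mean == 0:
        raise ValueError("no traffic observed on this ECMP group")
    return statistics.pstdev(byte_counts) / mean

def classify_cv(cv):
    """Map a CV value onto the acceptance bands."""
    if cv < 0.15:
        return "healthy"
    if cv <= 0.30:
        return "warning"
    return "fail"

balanced = [100, 104, 98, 101, 97, 102, 99, 99]    # arbitrary units
skewed = [120, 80, 110, 90]
polarized = [180, 175, 20, 25, 170, 185, 22, 23]

for name, counts in [("balanced", balanced), ("skewed", skewed), ("polarized", polarized)]:
    cv = ecmp_cv(counts)
    print(name, round(cv, 3), classify_cv(cv))
```

In the polarized group the total byte count looks normal. Only the per-link breakdown shows that half the uplinks carry almost nothing.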

14.2 · Common ECMP failure modes#

Hash polynomial polarization across multiple tiers. Two-tier fabrics where leaf and spine use the same hash polynomial cause flows that took path A at leaf to preferentially take a correlated path at spine — concentrating traffic on a subset. Fix: configure unique hash seeds per switch, or use hash-randomization features.

Insufficient hash field set. Hashing only on IP (3-tuple) produces poor distribution for workloads with few IP pairs but many flows — the AI training case. Fix: configure 5-tuple hashing (IP source, IP dest, source port, dest port, protocol).

Mixed-speed ECMP groups. If your fabric has 400G straight uplinks alongside 4×100G breakout uplinks, standard ECMP treats each leg as an equal next-hop — but they're not equal in bandwidth. Solutions: weighted ECMP (WCMP) where supported, or same-speed ECMP groups configured separately.
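For the mixed-speed case, the weighting idea behind WCMP can be illustrated as follows. This sketch only derives bandwidth-proportional integer weights; how those weights are applied is platform-specific and not shown, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def wcmp_weights(link_speeds_gbps):
    """Smallest integer weights proportional to each next-hop's bandwidth."""
    divisor = reduce(gcd, link_speeds_gbps)
    return [speed // divisor for speed in link_speeds_gbps]

def traffic_shares(weights):
    """Fraction of traffic each next-hop receives for the given weights."""
    total = sum(weights)
    return [w / total for w in weights]

speeds = [400, 100, 100, 100, 100]   # one 400G uplink plus a 4x100G breakout

print(wcmp_weights(speeds))                   # [4, 1, 1, 1, 1]
print(traffic_shares(wcmp_weights(speeds)))   # [0.5, 0.125, 0.125, 0.125, 0.125]
print(traffic_shares([1] * len(speeds)))      # [0.2, 0.2, 0.2, 0.2, 0.2]
```

With equal weights the 400G link receives 20% of the traffic while holding 50% of the capacity, so each 100G leg saturates first.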

14.3 · The fabric-ready checklist#

When all of these pass, the fabric is ready to accept GPUs:

Table 12 · Fabric-ready checklist

The acceptance criteria for handing fabric to the workload team

Criterion	Evidence
Every link passes individual bring-up	Validation records (Ch 15) for every link
Every link completed burn-in successfully	72-hour burn-in records with acceptance criteria met
ECMP distribution CV < 0.15	Per-group CV measurement under representative load
RDMA bandwidth ≥ 90% theoretical	perftest results on representative endpoint pairs
Fabric topology validates clean	ibdiagnet (IB) or routing-table consistency (Ethernet)
Routing stable, no flaps in last 24 hours	BGP/OSPF telemetry review
MPI collective simulation runs error-free	Multi-host all_reduce benchmark output

When GPUs arrive, the first nccl-tests run should closely match pre-GPU RDMA numbers. If it doesn't, the delta is either in the GPU stack (NCCL version, CUDA version, driver mismatch) or GPU-specific (PCIe topology, NVLink config) — not in the fabric. The network team is not the bottleneck.

Chapter Fifteen

The validation report#

Every link that passes bring-up generates a validation record. The record is the artifact that joins the factory test report, the bring-up data, and (later) the production telemetry into a single lifetime record per module. This chapter describes what goes in the record, what schema to use, and how the record gets joined to your inventory and operations systems.

15.1 · What goes in the validation record#

Per module / per link:

Identification: vendor, part number, serial number, revision, manufacturing date code
Deployment location: rack, switch hostname, port, position in fabric topology
Configuration: speed, FEC mode, lane count, far-end module identification
Bring-up evidence: timestamp and result for each of Chapter 11's five checks
DOM at acceptance: TX power per lane, RX power per lane, temperature, voltage, laser bias per lane
FEC at acceptance: pre-FEC BER per lane, corrected/uncorrectable codeword totals
Burn-in results: start/end timestamps, duration, final BER per lane, BER trend, max temperature, flap count, error totals
Validator: who or what system performed the validation, when
Status: production-ready / re-test required / failed

This record is the baseline against which all future telemetry is trended. A module showing pre-FEC BER of 1.2e-9 in production has a different meaning depending on whether its acceptance baseline was 1.0e-9 (modest drift) or 5.0e-11 (significant drift).
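Expressing drift in decades makes the two cases directly comparable. A small worked example, using the figures from the paragraph above (the function name is illustrative):

```python
import math

def ber_drift_decades(baseline_ber, current_ber):
    """Drift from the acceptance baseline, in decades (log10 of the ratio)."""
    return math.log10(current_ber / baseline_ber)

print(round(ber_drift_decades(1.0e-9, 1.2e-9), 2))    # 0.08
print(round(ber_drift_decades(5.0e-11, 1.2e-9), 2))   # 1.38
```

The same production reading is under a tenth of a decade from one baseline and well over a full decade from the other.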

15.2 · The YAML schema#

A representative validation record:

link_id: "rack42-leaf01:Eth1 -> rack42-spine01:Eth25"
timestamp: "2026-05-12T14:33:00Z"
interconnect:
  vendor: "Vitex"
  part_number: "VO-8CSR8CM-AA"
  serial_number: "VTX20260512001234"
  revision: "1.2"
  cable_length_m: 10
configuration:
  speed: "800G"
  fec_mode: "KP4"
  lanes: 8
bring_up:
  check_1_presence: pass
  check_2_link_state: pass (800G, KP4, full, 1.2s to link)
  check_3_dom:
    tx_power_dbm: [0.8, 0.9, 1.1, 0.8, 0.9, 1.0, 0.9, 0.8]
    rx_power_dbm: [-1.2, -1.1, -1.0, -1.2, -1.1, -1.0, -1.3, -1.2]
    temperature_c: 62.3
    pass: true
  check_4_fec:
    pre_fec_ber_per_lane: [1.2e-9, 1.5e-9, 1.1e-9, 1.3e-9,
                           1.4e-9, 1.2e-9, 1.1e-9, 1.3e-9]
    uncorrectable: 0
    pass: true
  check_5_interface:
    crc: 0
    fcs: 0
    pass: true
burn_in:
  start: "2026-05-12T15:00:00Z"
  end: "2026-05-15T15:00:00Z"
  duration_hours: 72
  final_ber_per_lane: [1.3e-9, 1.4e-9, 1.2e-9, 1.2e-9,
                       1.4e-9, 1.3e-9, 1.2e-9, 1.3e-9]
  ber_trend: "flat"
  flaps: 0
  uncorrectable_total: 0
  crc_total: 0
  max_temperature_c: 66.1
  pass: true
validator: "network-ops-shift-b"
status: "production_ready"
factory_report_ref: "VTX-FR-202604-N1234.pdf"

The exact format isn't sacred; YAML, JSON, or a database schema work equivalently. What matters is the binding to module serial number, deployment location, factory report reference, and the data fields above.
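Whatever the storage format, a record is only useful if the bindings are actually present. The sketch below checks a record that has already been loaded into a Python dict (for example from a YAML or JSON file); the list of required keys mirrors the example record, and the specific checks are assumptions about what an operator might enforce, not a defined schema.

```python
REQUIRED_TOP_LEVEL = ("link_id", "timestamp", "interconnect", "configuration",
                      "bring_up", "burn_in", "validator", "status",
                      "factory_report_ref")

def record_problems(record):
    """Return a list of human-readable problems; empty means the record is usable."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in record]
    if problems:
        return problems
    if not record["interconnect"].get("serial_number"):
        problems.append("missing interconnect.serial_number")
    lanes = record["configuration"].get("lanes")
    bring_up = record["bring_up"]
    per_lane_fields = {
        "check_3_dom.tx_power_dbm": bring_up.get("check_3_dom", {}).get("tx_power_dbm"),
        "check_3_dom.rx_power_dbm": bring_up.get("check_3_dom", {}).get("rx_power_dbm"),
        "check_4_fec.pre_fec_ber_per_lane":
            bring_up.get("check_4_fec", {}).get("pre_fec_ber_per_lane"),
        "burn_in.final_ber_per_lane": record["burn_in"].get("final_ber_per_lane"),
    }
    for name, values in per_lane_fields.items():
        if values is None:
            problems.append(f"missing per-lane field: {name}")
        elif len(values) != lanes:
            problems.append(f"{name} has {len(values)} entries, expected {lanes}")
    if record["status"] == "production_ready" and not record["burn_in"].get("pass"):
        problems.append("status is production_ready but burn_in.pass is not true")
    return problems

# Trimmed two-lane record for illustration; values are placeholders.
record = {
    "link_id": "rack42-leaf01:Eth1 -> rack42-spine01:Eth25",
    "timestamp": "2026-05-12T14:33:00Z",
    "interconnect": {"vendor": "Vitex", "serial_number": "VTX20260512001234"},
    "configuration": {"speed": "800G", "fec_mode": "KP4", "lanes": 2},
    "bring_up": {
        "check_3_dom": {"tx_power_dbm": [0.8, 0.9], "rx_power_dbm": [-1.2, -1.1]},
        "check_4_fec": {"pre_fec_ber_per_lane": [1.2e-9, 1.5e-9]},
    },
    "burn_in": {"final_ber_per_lane": [1.3e-9, 1.4e-9], "pass": True},
    "validator": "network-ops-shift-b",
    "status": "production_ready",
    "factory_report_ref": "VTX-FR-202604-N1234.pdf",
}

print(record_problems(record))  # []
```

Running a check like this at write time catches the common failure, a record with a lane missing or no serial, before anyone needs it for an RMA.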

15.3 · Joining the records#

The validation record becomes useful when it's joined to other data sources:

Factory test report — the per-unit report from the vendor, indexed by serial. Joining gives you the comparison point: how does the module's behavior at acceptance compare to its factory measurement?
Production telemetry — the running stream of DOM and FEC data once the module is deployed. The validation record is the baseline; the telemetry is the trend.
Inventory system — what module is in what rack/switch/port, and when did it get there.
RMA and FFA records — when modules are returned, the validation record is part of the evidence package.

15.4 · The lightweight tracking system#

You don't need a full CMDB to do this well. The minimum that works:

A directory of validation records, one file per module serial
An inventory spreadsheet or simple database mapping serial to deployment location
Telemetry storage (Prometheus, InfluxDB, or commercial equivalent) that ingests the production stream and tags by serial
Process discipline that updates the inventory record on every install, swap, and retirement

Operators with thousands of modules sometimes graduate to dedicated DCIM tools; operators with hundreds usually do fine with the lightweight approach. The volume threshold is roughly 5,000 modules, where manual update of inventory records starts to break down.

Part Four

Validation in production

From the moment a link enters production until the moment it is retired — the telemetry, alerting, triage, and vendor-management discipline that keeps a fabric healthy.

Chapter Sixteen

Production telemetry#

Validation does not end at bring-up. The fabric runs for years, and every link in it accumulates a continuous stream of data that — read correctly — tells you which links are healthy, which are drifting, and which are about to fail. This chapter covers what to log, why, and at what cadence. The goal is not maximalist instrumentation; it is the smallest set of measurements that catches the largest fraction of preventable failures.

16.1 · The four telemetry classes#

Production telemetry on an interconnect link falls into four classes, each with different characteristics and different operational use:

Slow-changing health metrics. DOM telemetry — TX power, RX power, temperature, voltage, laser bias. These change on hour-to-day timescales as ambient varies and as components age. Useful for trending, for alerting on excursions from baseline, and for anticipating component aging. Polling cadence: 1–5 minutes is usually sufficient.

Fast-accumulating error counters. Pre-FEC BER, corrected codewords, uncorrectable codewords, CRC errors, FCS errors, symbol errors. These accumulate in real time as the link carries traffic. The instantaneous value matters less than the rate of accumulation. Polling cadence: 30 seconds for trending, with event-driven capture on threshold crossings.

State-change events. Link up, link down, FEC mode changes, CMIS state-machine transitions, firmware version changes. Discrete events that should be captured when they happen, not sampled.

Workload context. The traffic the link is actually carrying — bandwidth, packet rate, burst patterns, RDMA queue depths, PFC pause-frame counts. Useful for distinguishing "the link has a problem" from "the workload is exercising the link unusually."

An operator running all four classes, joined by link identifier and timestamp, has the data to diagnose nearly any production issue. An operator running only one class (typically DOM) has the data to alert on gross failures but to diagnose almost nothing.
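For the fast-accumulating counters, the rate of accumulation has to be derived from successive polls of a cumulative value. A minimal sketch, assuming each poll yields a (counter value, timestamp in seconds) pair; treating a backwards step as "skip this interval" is one reasonable policy, not the only one.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two polls of a cumulative counter, or None if unusable."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        return None   # duplicate or out-of-order poll
    delta = curr_value - prev_value
    if delta < 0:
        # Counter went backwards: cleared, module reseated, or wrapped.
        # Skip the interval rather than report a bogus rate.
        return None
    return delta / elapsed

print(counter_rate(1000, 0.0, 4000, 30.0))   # 100.0
print(counter_rate(5000, 0.0, 10, 30.0))     # None
```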

16.2 · The minimum telemetry set#

If you are building telemetry from nothing, this is the minimum. Each metric, polled at the indicated cadence, with the indicated retention.

Table 13 · Production telemetry — minimum set

Per link, per metric, per cadence, per retention

Polling cadences are operator-recommended floors; tighter cadence is fine if your storage and switch management plane support it. Retentions are minimums that support typical operational use cases — long-term failure trending, vendor scorecards, and post-incident forensics.

Metric	Cadence	Retention	Why
Module temperature	1 minute	13 months	Seasonal trend; thermal interface drift; rack hot-spot detection
TX power per lane	5 minutes	13 months	Laser aging trend; identifies degrading transmitters before failure
RX power per lane	5 minutes	13 months	Endface contamination drift; far-end TX degradation
Laser bias per lane	5 minutes	13 months	Direct aging indicator; rises before TX power drops
Pre-FEC BER per lane	30 seconds	90 days raw, 13 months hourly	Primary fabric-health indicator; trending catches drift early
Corrected codewords (cumulative)	30 seconds	90 days	Normal under load; watch slope change and lane-to-lane spread, not absolute count
Uncorrectable codewords	30 seconds	2 years	Hard failure indicator; any non-zero is an alert
CRC / FCS errors	30 seconds	2 years	End-of-channel error indicator; non-zero is always an alert
Link state changes	Event-driven	2 years	Flap history; identifies marginal links over time
PFC pause frames (RoCE)	30 seconds	90 days	Lossless mechanism evidence; storms indicate fabric issues
ECN marks (RoCE)	30 seconds	90 days	Congestion-control evidence; mark rate predicts performance
Interface bandwidth utilization	1 minute	90 days	Workload context for interpreting other metrics

16.3 · The OCP Optics Telemetry Specification#

OCP's Optics Telemetry Specification is the operator-friendly reference for what factory test reports and ongoing telemetry should expose. Hyperscalers drove its creation; Tier-2 operators benefit from the same standard. When evaluating vendor telemetry depth, OCP-spec compliance is the floor. See §21.1 for the standard reference.

16.4 · Where the data lives#

Telemetry storage is its own discipline. Three tiers in operator practice:

Switch-local buffers. Every modern NOS retains some amount of telemetry locally — last 24 hours of counter values, recent state transitions, syslog history. Useful for spot-debugging; insufficient for trending across fleet.

Time-series database. Prometheus, InfluxDB, VictoriaMetrics, or commercial equivalents. The standard architecture: switches export metrics via gNMI, SNMP, or vendor-streaming-telemetry; a collector aggregates; the time-series store retains. Useful for fleet-wide trending, alerting, dashboards.

Long-term cold storage. Telemetry data for forensics: when a module fails at month 30, you want its data from the 90 days before the failure, not just last week. Object storage (S3 or equivalent) with periodic exports from the time-series database.

The simplest setup that works at Tier 2/3 scale: SONiC or Cumulus exporting via gNMI, Prometheus collecting and retaining 90 days, periodic export to S3 for long-term. Total cost: a few thousand dollars per year in storage; a few engineer-weeks to set up.

16.5 · Workload-aware monitoring#

DOM telemetry interpretation depends on what the workload is doing. A module showing TX-power drop during all-reduce-heavy training looks different from the same module during inference idle. Capture workload state alongside telemetry — at minimum, a marker for "training under sustained workload" vs "idle" vs "inference active" — so trend analysis correctly distinguishes drift from workload-correlated transients. This is the single change that makes telemetry actionable rather than noisy.

Chapter Seventeen

Counter interpretation and alerting#

Telemetry that nobody alerts on is decoration. The job of an alerting system is to distinguish normal operation from incipient failure, fire on the latter, and not fire on the former. Get this right and you catch problems before they become outages. Get it wrong and your engineers ignore the alerts, which is a worse outcome than not alerting at all.

Reference Card 09 · Test program

The 5-minute test that catches most bad modules

A Pareto view of where defect-detection happens in a sample-testing program. The shape is the point: the first few minutes of testing catch the most, and returns diminish from there.

Figure: Defect-catch rate by test phase (characteristic shape; illustrative, not measured)

What 5 minutes catches

Test	Time	Catches
EEPROM ID	30 s	Mis-shipped SKU; wrong FW
Link-up + DPSM	30 s	Stuck-state modules; gross host mismatch
DOM read	30 s	Out-of-band thresholds; calibration anomalies
Switch PRBS	3 min	Per-lane BER defects; lane imbalance
FEC counter snapshot	1 min	Channel marginal; module marginal

The single 5-min protocol catches a large share of vendor-quality defects for a small fraction of the time cost. It also catches the same defects on incoming-inspection sampling — which is why most operators run a lighter 5-min protocol on every module they receive in addition to the deeper first-article protocol on samples.

The diminishing-returns argument

The first 5 minutes of testing catches the most defects per minute by a wide margin. Each successive phase catches fewer defects per hour invested. After roughly 100 hours of cumulative test time per unit, you hit the noise floor — additional testing finds nothing new.

The implication: run the cheap 5-minute protocol on everything; reserve the expensive 100-hour first-article protocol for new vendor or new SKU evaluations. Every operator running a serious program does both. Operators running only one program — usually the deep protocol on a tiny sample — leave most defects in the shipping pile.

The principle

Don't pick between 5 minutes and 100 hours. Run both — at the right cadence on the right population.

First-article protocol on 30 units when evaluating a new vendor or SKU (one-time). 5-min protocol on every unit at incoming inspection (every lot). Continuous telemetry on every link in production (always). The full program is multi-cadence by design — and the 5-min protocol is the unsung hero that catches the long tail of lot-to-lot drift after qualification.

Vitex Validation Handbook · Card 09/15 · vitextech.com

A pre-FEC BER snapshot at Day 30 reads as healthy. The 30-day trend at the same moment shows the slope. The slope is what you alert on, not the value. Operators who alert on instantaneous BER chase noise; operators who alert on trend catch failures.

17.1 · Corrected versus uncorrectable — what to alert on#

Before any threshold discussion, one distinction has to be clear, because getting it wrong is the most common alerting mistake operators make on PAM4 fabrics. Corrected codewords are not a fault signal. On a 100G-per-lane PAM4 link, the FEC is doing continuous correction work as a normal condition of operation — a healthy link under load produces a steady stream of corrected codewords, and that count climbing is expected, not alarming. An operator who pages on "corrected codewords increasing" will page constantly on healthy links and quickly learn to ignore the alerts.

What actually warrants attention is one of three things: trend acceleration — the corrected-codeword rate or the pre-FEC BER rising faster than its established baseline slope; lane imbalance — one lane of an eight-lane module correcting far more than its siblings, which points at a specific lane defect rather than uniform aging; and uncorrectable codewords — any non-zero count, because uncorrectables are post-FEC errors that reach the host as real data corruption. Corrected codewords are evidence the link is working as designed. Uncorrectables, lane spread, and slope are the signals. Alert on those.

17.2 · The threshold ladder#

For each metric, three thresholds in sequence:

Baseline. Normal operating value. Used as reference for trending — a metric drifting away from its baseline is the soft-failure signal.

Action threshold. A value that warrants engineer attention but not immediate response. Soft failure has progressed enough to plan intervention.

Critical threshold. A value that warrants immediate response. Approaching or beyond this is hard failure or imminent hard failure.

Table 14 · Threshold ladder for the primary metrics

Operating, action, and critical thresholds by metric

Action threshold typically generates a ticket for engineering review within the workday; critical threshold typically pages on-call. Adjust per-operator policy. Pre-FEC BER thresholds vary by FEC scheme and variant — values shown are for KP4 FEC at 100G/lane PAM4.

Metric	Baseline (typical healthy)	Action threshold	Critical threshold
Module temperature (case)	50–65 °C	Above 70 °C sustained 1 hr	Above 75 °C any sample
TX power per lane	Within 0.5 dB of baseline	Within 1.5 dB of IEEE limit	Crosses IEEE limit
RX power per lane	Within 1 dB of baseline	Within 2 dB of sensitivity floor	At or below sensitivity floor
Pre-FEC BER (KP4, 100G/lane PAM4)	1e-9 to 5e-8 typical	Sustained 5e-7	Sustained 1e-6 or single 1e-5
Pre-FEC BER trend	Flat or improving	Rising 0.5 decade over 30 days	Rising 1.0 decade over 30 days
Laser bias current	Within 5% of baseline	Rising 10% over 90 days	Rising 15% over 30 days, or above lifetime ceiling
Uncorrectable codewords	Zero	Any non-zero (rate < 1/hr)	Sustained rate > 1/min
CRC / FCS errors	Zero	Any non-zero	Sustained rate > 1/min
Link flaps	Zero in 30 days	Any flap (warrants triage)	≥3 flaps in 24 hours (paged immediately)
PFC pause storm (RoCE)	Brief, isolated	Sustained > 1 sec	Cascading across fabric
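
The pre-FEC BER trend row in Table 14 is expressed in decades, which is a base-10 logarithm of the ratio between two readings. A minimal sketch of that calculation follows, assuming two BER readings taken 30 days apart; the function name and return shape are illustrative only.

```python
import math

def ber_trend_level(ber_30_days_ago, ber_now):
    """Classify a 30-day pre-FEC BER change against the Table 14 trend row.

    Returns (level, decades). Both readings must be positive.
    """
    if ber_30_days_ago <= 0 or ber_now <= 0:
        raise ValueError("BER readings must be positive")
    # One decade is a 10x change, so the rise in decades is log10 of the ratio.
    decades = math.log10(ber_now / ber_30_days_ago)
    if decades >= 1.0:
        return "critical", decades
    if decades >= 0.5:
        return "action", decades
    return "baseline", decades
```

A link that moved from 1e-9 to 5e-9 has risen about 0.7 decade and lands at the action level even though both readings sit inside the healthy absolute range.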

17.3 · The alert hierarchy#

The four-tier alert structure that we see actually work in operator practice:

Critical / page on-call. Hard failures and imminent hard failures. Uncorrectable codewords sustained, link flap clusters, multi-link correlated failures, fabric-wide PFC storms. Response: respond now, triage immediately.

High / ticket and acknowledge within 4 hours. Action threshold crossings. Single link flap. RX power approaching sensitivity. Pre-FEC BER above action level. Response: investigate during business hours, plan remediation.

Medium / weekly review. Drift signals. Rising bias trend. Slow temperature creep. Lane spread widening. Response: rolled into weekly health review; flagged for retirement planning.

Informational / monthly trend. Population statistics. Vendor-level AFR trending. New batch performance comparisons. Response: feeds vendor scorecards and procurement reviews.

17.4 · Correlation and noise reduction#

An alert on every threshold crossing produces alert fatigue. The alerting system has to do work to suppress redundant noise.

Correlation rules that pay for themselves:

Suppress link-down alerts during planned maintenance. The maintenance window calendar should be a feed into the alerting system; alerts during a window go to a tracker, not to a pager.
Group alerts by switch and rack. Twenty link-down alerts from the same switch is one event ("switch X went down") not twenty events. Pager fatigue is real; group correlated alerts.
Alert on rate, not on instantaneous values. A pre-FEC BER sample of 1e-6 from a normally healthy link could be measurement noise; sustained 1e-6 over 5 minutes is a real condition. Hold-down windows reduce false alerts.
Workload-context filtering. A link with elevated BER during an AllReduce phase, returning to baseline during forward-pass, is normal-for-workload. Subtract the workload-induced component before alerting.
Vendor-aware thresholds. If you've established that vendor X's modules normally run 1.5e-9 pre-FEC BER and vendor Y's run 1e-10, applying the same threshold to both produces noise on vendor X. Calibrate per-vendor.
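
Two of the rules above, the hold-down window and grouping by switch, can be sketched in a few lines. This is an illustration under stated assumptions: samples arrive as (unix_time, value) tuples in ascending time order, and alerts are dicts with hypothetical "switch" and "port" keys. Neither layout comes from a specific monitoring product.

```python
from collections import defaultdict

def sustained(samples, threshold, hold_seconds):
    """True if the most recent run of samples at or above threshold
    spans at least hold_seconds.

    samples: list of (unix_time, value) tuples in ascending time order.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start_of_run = None
    # Walk backwards until the first sample that is below the threshold.
    for ts, value in reversed(samples):
        if value < threshold:
            break
        start_of_run = ts
    return start_of_run is not None and end - start_of_run >= hold_seconds

def group_by_switch(alerts):
    """Collapse per-link alerts into one entry per switch.

    alerts: list of dicts with hypothetical keys "switch" and "port".
    """
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[alert["switch"]].append(alert["port"])
    return dict(grouped)
```

With a 300-second hold-down, a single 1e-6 sample does not fire, while five minutes of consecutive samples at or above 1e-6 does.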


Chapter Eighteen

Triage and live module swap#

When an alert fires, the engineer's job is to determine fast: is this the module, the cabling, the far-end module, or the platform? The answer determines what gets replaced. Replacing the wrong thing wastes time, costs an extra module from spare inventory, and leaves the actual problem unfixed. This chapter is the triage discipline that produces the right answer in minutes rather than hours.

Reference Card 10 · Triage

Module symptom → root cause

Ten common production alerts and what each one most likely is — ranked by whether it points at the module, the channel, the host, the firmware, or the configuration. Investigate the highest-ranked cause first.

Likelihood-ranked triage table · directional ranking from field FFA & RMA patterns

Symptom	Module	Channel	Host	FW	Config	Investigate first
RX power dropping (slow drift over weeks)	·	P	–	–	·	Far-end TX power · endface contamination · cable damage
DPInit stuck (link won't come up)	·	–	S	P	P	FW vs CMIS parser · ApSel · host SerDes config · bring-up timeout
Host rejects module (reads OK, won't accept)	–	–	P	S	S	CMIS revision vs host parser · host optic-support package version
Works on one host, fails on another (cross-generation bring-up)	·	–	S	P	·	Retimer / DSP firmware vs host known-good list · host SerDes expectations
Reduced capability / power-capped (links but underperforms)	·	–	P	·	S	CMIS power-class advertised by cage vs module requirement
Breakout legs partly link (one parent port erratic)	–	·	·	·	P	Switch port breakout profile vs module application descriptor
FEC counters rising (pre-FEC BER trending up)	S	P	–	–	·	DOM trend (laser bias rising = module aging) · channel inspection
Link flapping (multiple events per day)	·	S	·	P	S	FW history · hold-down timer config · host PCIe state
Temperature alarm (case temp > 75 °C)	·	–	P	–	·	Rack airflow · airflow shadowing · fan speed · thermal-interface contact
Post-FEC errors (uncorrected codewords)	P	S	–	–	·	Module is at end of margin · cable damage · FEC marginal

Ranking is directional, drawn from field FFA and RMA patterns, not measured probabilities. Legend: P = Primary (check first); S = Secondary (check next); · = Possible; – = Unlikely. Work through the causes in rank order, P then S, before swapping the module.

Principle 01 · Investigate before swapping

Most symptoms above point at something other than the module first — channel, host, firmware, or config. A swap-first culture costs in NFF RMA freight and leaves the original problem in the fabric.

Principle 02 · Channel first, module last

For RX-related symptoms, inspect / clean / re-mate the connector before any module action. The 90 seconds it takes prevents a large share of no-fault-found RMAs.

Principle 03 · Watch the trend, not the value

Most symptoms above present as a slope, not a step. Catching the slope two weeks before the threshold breach gives you scheduled retirement instead of surprise outage.

The principle

"It's the module" is the conclusion of triage, not the assumption.

Most production alerts on optical links are not module faults. The triage discipline — Q1 link state, Q2 DPSM, Q3 DOM range, Q4 RX power — works precisely because it pushes you through the more-likely causes before you reach for the module. Operators who follow the discipline RMA modules in <15% of alerts. Operators who don't return modules at 3–4× that rate, eat the freight on every NFF, and the original problem stays in the fabric.

Vitex Validation Handbook · Card 10/15 · vitextech.com

The triage discipline. Every module swap that fixes a fiber-plant problem becomes a no-fault-found RMA — and the original problem stays in your fabric. Channel-side investigation always precedes module replacement.

18.1 · The five-minute triage#

Five questions answered in roughly five minutes, in this order:

What does the module's own state say? Read CMIS Page 11h for per-lane datapath state. If the module is in DPInit when it should be DPActivated, the module is not ready, and the most common cause is configuration mismatch, not module fault. Check Page 00h status flags, alarm/warning bits.

What does DOM say? Read TX power, RX power, temperature, laser bias. Compare to the validation record baseline. RX power that has dropped 3 dB from baseline is endface contamination or far-end TX degradation, not local module fault. TX power that has dropped at constant bias is a transmitter issue.

Is the problem one-sided? If TX power is healthy but RX power is degraded, the local module is fine and the problem is elsewhere. Same logic on the far end. Symmetric problems (both ends degraded) suggest the channel or both modules; one-sided problems isolate to one half of the link.

Per-lane vs all-lanes? If only one lane out of 8 is showing FEC errors, the failure is lane-specific, typically a single channel of the optical engine, possibly a single fiber strand in the cable. If all lanes are equally degraded, the failure is whole-module or whole-channel.

Has anything changed? Recent firmware push? Config change? Adjacent maintenance? Many "module faults" are actually changes elsewhere. Rule this out before declaring the module bad.
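
The DOM and per-lane questions above reduce to a few comparisons against the validation-record baseline. The sketch below is one way to encode them, assuming drops are expressed in dB relative to baseline (positive means lower power). The 3 dB RX figure comes from the text; the 1 dB and 5% cut-offs and all names are hypothetical placeholders to replace with your own policy.

```python
def triage_hints(rx_drop_db, far_end_tx_drop_db, tx_drop_db, bias_change_pct,
                 lanes_with_fec_errors, lane_count=8):
    """Turn DOM deltas from baseline into ordered triage hints.

    Drops are in dB below baseline. The 1.0 dB and 5 % cut-offs are
    illustrative assumptions, not handbook values.
    """
    hints = []
    # RX down with a healthy far-end TX points at the channel.
    if rx_drop_db >= 3.0 and far_end_tx_drop_db < 1.0:
        hints.append("channel: re-seat, inspect, and clean the connector")
    # RX down and far-end TX down points at the far-end transmitter.
    if rx_drop_db >= 3.0 and far_end_tx_drop_db >= 1.0:
        hints.append("far end: check the far-end transmitter")
    # Local TX down while bias is unchanged is a transmitter issue.
    if tx_drop_db >= 1.0 and abs(bias_change_pct) < 5.0:
        hints.append("local module: TX power dropped at constant bias")
    # Some lanes but not all: lane-specific fault.
    if 0 < lanes_with_fec_errors < lane_count:
        hints.append("lane-specific: one optical-engine channel or fiber strand")
    elif lanes_with_fec_errors == lane_count:
        hints.append("whole module or whole channel")
    return hints
```

The function deliberately returns hints rather than a verdict: the "has anything changed?" question still has to be answered by a person looking at change history.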

18.2 · The triage decision tree#

The mapping from triage findings to action:

Table 15 · Triage decision tree

Symptoms, likely cause, and first action

Triage finding	Most likely cause	First action
Module stuck in DPInit; never reaches DPActivated	CMIS / firmware / ApSel mismatch	Check syslog for CMIS warnings; verify host bring-up timeout; verify ApSel matches module advertising before declaring fault
Module reads correctly but host rejects it, or DPSM behaves erratically	CMIS parser / revision mismatch between module and host NOS	Confirm the host NOS supports the module's CMIS revision; a host parsing an unexpected CMIS minor version mishandles state and looks like a module fault. Update host optic-support package before RMA
Module worked on one host generation, fails to bring up on another	Retimer / DSP firmware mismatch with host SerDes expectations	Check module firmware revision against the host vendor's known-good list; a retimer firmware delta is a firmware fix, not a module defect. Re-test after firmware alignment before declaring fault
Module negotiates but runs at reduced capability, or host caps its power class	Power-class negotiation mismatch — host provides a lower CMIS power class than the module needs	Verify the cage/host advertises a power class at or above the module's requirement; a high-power module throttled into a low-power cage looks like a marginal module. Confirm before RMA
Breakout legs partly link; one parent port behaves inconsistently	Breakout application / profile mismatch between module and switch port config	Verify the switch port breakout profile matches the module's advertised application descriptor; a profile mismatch presents as intermittent or partial link, not module failure
RX power dropped >3 dB from baseline; TX healthy on far end	Endface contamination or partial mate	Re-seat connector; if no improvement, inspect endface and clean
TX power dropped at constant bias	Transmitter degradation	Replace local module; original goes to FFA
Bias rising fast at constant TX power	Laser aging — late-life	Schedule replacement during next maintenance window
Single lane FEC errors; other lanes clean	Single-channel optical engine fault, or single fiber strand issue	Try the cable on a known-good module pair; if errors follow cable, replace cable; otherwise replace module
All lanes elevated FEC; symmetric	Channel or both modules degraded	Inspect connectors at both ends, clean, retest; if persistent, swap one module to isolate
Link flap correlated to thermal events	Marginal thermal interface in module	Replace module; original goes to FFA with thermal observation noted
Multi-link failures correlated to one rack	Rack-level power, thermal, or cooling issue	Check rack PDU, intake temperature; do not replace modules until rack issue is ruled out
Module unreadable / EEPROM not detected	Hard module failure, or cage power issue	Re-seat; try in a different cage; if still unreadable, hard fault — replace
CRC errors with clean DOM and clean FEC	Switch ASIC issue, or upstream forwarding issue	Look upstream — switch hardware or far-side forwarding fault, not module

18.3 · Live module swap procedure#

Modern modules support hot-insertion. The procedure for swapping a module on a live fabric, with minimal workload disruption:

Drain the link. If the link is part of an ECMP group, taking it down forces traffic to the other paths. Temporarily disable the interface on the local switch; verify ECMP redistributes correctly; verify aggregate fabric throughput is unaffected.
Take a final snapshot. Read DOM, VDM, lane states, recent FEC counters. Snapshot the validation record — useful for FFA analysis.
Power-down the port. Use the NOS-specific commands (interface shut, no power-on, etc.) to drop cage power before mechanical extraction.
Extract the module. Release latch, withdraw cleanly. Inspect endface immediately for evidence of damage or contamination. Bag the module in antistatic packaging with the validation record reference.
Inspect the cage and far-end endface. Cage pins look intact? Far-end endface clean? This is where you catch issues that would have been re-introduced with the replacement.
Insert the replacement. Verify DPActivated state, run Checks 1–5 from Chapter 11. Run a 15-minute traffic window with counter monitoring before re-enabling in ECMP.
Re-enable in fabric. Bring the interface up; verify routing reconverges; verify ECMP picks up the path.
Update records. Inventory shows new serial in this slot. Old serial moves to "extracted, pending FFA". Validation record for the new module is created.

Total time for an experienced engineer: 20–30 minutes per swap, of which the 15-minute post-insertion traffic window is the largest single block.

18.4 · The "the module isn't actually broken" case#

A non-trivial fraction of replaced modules — by some operators' tracking, 15–25% — work fine when re-tested at a vendor's failure analysis facility. The labels for this in the industry are "no fault found" (NFF) and "no trouble found" (NTF). The vendor returns the module, often charges restocking, and points out it tested clean.

NFF/NTF rates above 10% indicate a triage discipline problem. The contributing factors:

Modules replaced based on noisy alerts that resolved on their own
Modules replaced because the platform reported "transceiver fault" without further investigation
Modules replaced as part of "swap them all" responses to a cluster failure that was actually a power or thermal issue
Modules replaced for problems that were actually in the cable or far-end module
Modules replaced for a CMIS parser or revision mismatch — the host NOS mishandled the module's management interface and reported a fault that the module did not have
Modules replaced for a retimer or DSP firmware mismatch with the host — a firmware alignment, not a hardware defect
Modules replaced for a power-class negotiation mismatch — the host throttled a module that was never given the power class it needed
Modules replaced for a breakout profile mismatch — the switch port was configured for a different application than the module advertised

The fix is the triage discipline above. The five-minute triage takes longer than just swapping the module — but reduces NFF/NTF dramatically and saves the inventory cost of the unnecessary swap, the engineer time of the swap itself, and the future time spent diagnosing the actual root cause that didn't get fixed.

18.5 · Common triage mistakes#

Four mistakes account for most wasted triage time:

Swapping the module first. RX-low symptoms point at channel four times in five. Inspect, clean, re-mate before the module comes out — the 90 seconds saves on average 40% of RMA freight.
Reading values, not trends. A reading inside healthy is fine in isolation. The slope is the alert. Pull telemetry history before deciding.
Skipping firmware confirmation. DPInit-stuck modules look like hardware faults but are usually firmware mismatch. Confirm both ends are on the qualified firmware before any swap.
Not capturing evidence before action. The DOM and VDM snapshot before extraction is the input to the FFA report and the contemporaneous record. Capture first, then act — even when the action is "swap immediately because the link is critical." Sixty seconds is rarely the constraint.


Chapter Nineteen

Re-validation, retirement, and the RMA/FFA evidence package#

A link validated at bring-up is not validated for life. Conditions change — firmware updates, ambient drift, cable disturbances during adjacent maintenance, slow component aging. The discipline of periodic re-validation maintains the assurance that bring-up established. Proactive retirement removes modules from production before their failure becomes a workload event. And when modules do fail, the RMA evidence package and the FFA loop close the cycle by feeding what you learn back to the vendor — and back into the next procurement decision.

Reference Card 11 · Procurement

Vendor evaluation scorecard

Most procurement evaluates optics on price and lead time. Operators who run reliable fabrics evaluate on these eight criteria, weighted. Use this as a working scorecard.

Eight criteria, weighted · weights total 100 · score 1–5 per criterion

#	Criterion	Weight	What 5 looks like	What 1 looks like
1	Platform qualification	20	NVIDIA LinkX or equivalent on your specific switch + NIC + cable + FW	"Compatible" with no documented test plan
2	Factory test rigor	15	Per-unit reports with full metadata; OCP Optics Telemetry compliance	Pass-only summary; no per-unit data
3	FAE response model	15	Direct engineering contact under 4 hr; on-site for first 50 units	Sales-routed only; no engineering escalation
4	FFA / field-return data	12	Quarterly failure-mode breakdown shared; root-cause analysis available	"Confidential" or simply absent
5	Lead time consistency	10	Quoted ≤ 8 weeks, hit consistently across last 12 months	24+ weeks or unreliable schedule
6	Firmware management	10	Documented release notes; release-window scheduling; backward-compatible by default	Surprise pushes; no release notes; field instability after updates
7	RMA discipline	10	NFF rate < 20%; root-cause shared; freight handled both directions	NFF rate > 50%; freight on operator both ways
8	Price & total cost of ownership	8	Within 20% of compatible-vendor floor with documented value	Significantly higher with no documented engineering differentiation

How to use the scorecard

Score each vendor 1–5 on each criterion. Bring engineering to the scoring exercise — procurement alone tends to over-weight criterion 8.

Multiply by weight, sum. Maximum theoretical score is 500.

Above 350: tier-1 candidate. Real engineering differentiation, defensible procurement decision.

250–350: tier-2. Worth shortlisting; weaker on 1–2 criteria worth probing.

Below 250: walk away. Weak across multiple criteria; the savings on price will be eaten by failures elsewhere.

Re-score annually. Vendor performance is not static — firmware management, RMA discipline, and FFA culture all drift.
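
The scoring arithmetic described above can be sketched as follows. The weights and tier boundaries are taken from the scorecard; the dictionary keys and function name are hypothetical labels chosen for the example.

```python
# Weights from the eight-criterion scorecard; key names are illustrative.
WEIGHTS = {
    "platform_qualification": 20,
    "factory_test_rigor": 15,
    "fae_response_model": 15,
    "ffa_field_return_data": 12,
    "lead_time_consistency": 10,
    "firmware_management": 10,
    "rma_discipline": 10,
    "price_tco": 8,
}

def score_vendor(scores):
    """Return (weighted_total, tier) for a dict of 1-5 scores per criterion."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the eight criteria")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be scored 1-5")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if total > 350:
        tier = "tier-1 candidate"
    elif total >= 250:
        tier = "tier-2"
    else:
        tier = "walk away"
    return total, tier
```

Because the weights sum to 100 and every criterion scores at least 1, totals range from 100 to 500.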

The principle

Price and lead time are easy to measure. Everything else is what determines whether a vendor decision was right.

Most procurement processes weight criterion 8 (price) at 50–70%. The data says criteria 1–4 — qualification, test rigor, FAE support, FFA — predict fleet AFR and total cost of ownership far more than price does. The eight-criterion scorecard reweights the procurement conversation toward what actually matters in production. It's worth the hour to run.

Vitex Validation Handbook · Card 11/15 · vitextech.com

19.1 · The re-validation triggers#

A previously-validated link should be re-validated when any of the following occurs:

Firmware update on the module. The module's behavior under your test conditions is specific to its firmware. A new firmware revision may behave differently. Re-validate at minimum a representative sample (10–20 units) before deploying the firmware to the fleet.

Firmware update on the switch (NOS upgrade). The host's CMIS parser, FEC negotiation logic, and auto-negotiation behavior may change. Validate one rack of links after a NOS upgrade before fleet-wide rollout.

Adjacent maintenance. Anything that physically disturbs the cabling — rack reconfiguration, cable rerouting, patch panel work — risks inducing endface contamination or polarity changes. Re-validate the disturbed area.

Power or thermal events. Rack power excursion, cooling system fault, building HVAC issue. The recovery from these can leave latent thermal stress on modules. Re-validate after recovery.

Periodic schedule. Annual re-test of a 5–10% sample of fleet, regardless of trigger. The sample distribution should cover oldest modules and modules from vendors with elevated AFR.

RMA or FFA finding from elsewhere in the fleet. If FFA on returned modules identifies a defect mode that affects a batch, proactively re-validate other modules from that batch — even ones that haven't shown problems yet.

19.2 · The re-validation protocol#

Lighter than full bring-up but heavier than a routine telemetry read. The re-validation procedure:

Read DOM and VDM; compare to validation-record baseline
Run 5 minutes of line-rate traffic; verify clean FEC and interface counters
Compute pre-FEC BER; compare to baseline (significant drift is the signal)
Read laser bias; compare to baseline (rise indicates aging)
Update the validation record with the re-validation timestamp and results

Total time per link: roughly 10 minutes including the 5-minute traffic window. Multiplied across a 5% annual sample of a 5,000-module fleet, that's 250 modules × 10 minutes = ~42 hours per year of engineering time. Easily affordable for the assurance it provides.
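
The same budget arithmetic, parameterized so you can plug in your own fleet size and sample fraction. This is a small sketch; the function name and defaults are illustrative.

```python
def revalidation_hours(fleet_size, sample_fraction, minutes_per_link=10):
    """Return (modules_sampled, engineering_hours_per_year)."""
    modules = round(fleet_size * sample_fraction)
    return modules, modules * minutes_per_link / 60
```

For the example in the text, `revalidation_hours(5000, 0.05)` gives 250 modules and roughly 41.7 hours; at a 10% sample the same fleet needs about 83 hours.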

19.3 · The retirement decision#

Modules don't last forever. Some retire because they failed; others retire because they're approaching end-of-life and proactive replacement is cheaper than reactive failure response. The decision factors:

Age in service. Most pluggable optics have rated service lifetimes of 5–7 years; some of 10. After the rated lifetime, failure rates rise. Proactive retirement at 80% of rated life avoids the rising failure tail.

Bias-aging trajectory. A module whose laser bias has risen 30% from baseline is at or near its lifetime ceiling for its laser type. Even if the module is still working, it has limited operational life remaining.

Pre-FEC BER trajectory. A module whose pre-FEC BER has risen meaningfully and continues to drift is on a trend that will eventually cross critical thresholds. The question is whether to replace before or after.

Vendor scorecard context. If your fleet AFR data shows a particular vendor or batch has elevated failure rates beyond the trend implied by your individual modules, proactive retirement of the cohort may be cheaper than continued operations.

Workload sensitivity. An aging module on a non-critical management link is not the same as an aging module on an AI training fabric. The cost of a workload disruption shapes how aggressive proactive retirement should be.
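
Two of the decision factors above are directly computable: age against rated life, and bias rise against the lifetime ceiling. A minimal sketch follows, using the 80% and 30% figures from the text as defaults. The function name and flag strings are hypothetical, and the other factors (BER trajectory, vendor context, workload sensitivity) are left to the reviewer.

```python
from datetime import date

def retirement_flags(deployed, today, rated_life_years, bias_rise_pct,
                     life_fraction=0.80, bias_ceiling_pct=30.0):
    """Return the computable retirement reasons that apply to a module.

    life_fraction and bias_ceiling_pct default to the 80 % and 30 %
    figures in the text; tune them per workload sensitivity.
    """
    age_years = (today - deployed).days / 365.25
    flags = []
    if age_years >= life_fraction * rated_life_years:
        flags.append("age")
    if bias_rise_pct >= bias_ceiling_pct:
        flags.append("bias_aging")
    return flags
```

An operator running a training fabric might lower `life_fraction`; one running management links might raise it.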

19.4 · The lifetime tracking record#

The validation record from Chapter 15, augmented with periodic re-validation entries and ultimately a retirement entry, becomes the lifetime record per module. A representative complete lifetime record:

serial: VTX20260512001234
vendor: Vitex
part_number: VO-8CSR8CM-AA
lifecycle:
  - event: factory_test
    date: "2026-04-15"
    reference: "VTX-FR-202604-N1234.pdf"
  - event: receiving_inspection
    date: "2026-05-09"
    pass: true
  - event: first_article_qualification_member
    date: "2026-05-10"
    qualification_lot: "Q2-2026-Vitex-800G-DR8"
    pass: true
  - event: bring_up_validation
    date: "2026-05-12"
    deployment: "rack42-leaf01:Eth1"
    burn_in_hours: 72
    pass: true
  - event: deployed
    date: "2026-05-15"
  - event: re_validation
    date: "2026-11-12"
    trigger: "firmware_update_v1.3.0"
    findings: "ber_baseline_unchanged"
  - event: re_validation
    date: "2027-05-10"
    trigger: "annual_periodic"
    findings: "bias_+8%_from_baseline_normal"
  - event: re_validation
    date: "2028-05-09"
    trigger: "annual_periodic"
    findings: "bias_+18%_flag_for_retirement_planning"
  - event: proactive_retirement
    date: "2028-09-15"
    reason: "bias_aging_trajectory"
    final_state_snapshot: "snapshot_20280915.yaml"
    disposition: "vendor_FFA_program"

This record persists for the module's lifetime. When the module is finally returned to the vendor for FFA, the record is the evidence package that accompanies it. When the same vendor comes up for procurement renewal, the records of their fielded modules are part of the evaluation.

19.5 · RMA and FFA — the distinction#

RMA — Return Merchandise Authorization. The administrative process for returning a failed unit to the vendor. Vendor issues a return authorization, you ship the unit back with the documentation, vendor sends a replacement (or credit) and tests the returned unit. Most vendor support agreements include RMA SLAs measured in days for replacement turnaround.

FFA — Field Failure Analysis. The technical process the vendor uses to determine why the returned unit failed. The deliverable is a written report — the FFA report — documenting root cause. Some vendors do this routinely as part of the RMA flow; others do it only on request; others charge for it.

An operator who only does RMAs is replacing modules without learning. An operator who pursues FFA on a sample of returns is building knowledge: which failure modes are common, which are batch-correlated, which expose vendor process issues. Specify FFA as a contractual deliverable on at least 5–10% of RMAs. Some vendors include this; others charge a few hundred dollars per FFA report; few refuse it outright.

19.6 · The RMA evidence package#

An RMA backed by evidence is processed faster, taken more seriously, and produces better FFA reports than an RMA backed by "this one's broken." The evidence package per returned module:

The validation record from bring-up — original factory test report reference, acceptance baseline, validation results
The lifetime telemetry record — DOM trends, FEC trends, link state history
The triage findings — what was investigated, what was ruled out, what evidence pointed to module fault
The deployment context — switch, port, far-end module, recent maintenance, ambient conditions
The failure description — what happened, when, what the symptoms were
(If available) the final DOM and VDM snapshot taken before extraction

The evidence package gives the vendor what they need to do real FFA: a complete history of the module from manufacture to failure. Without it, the vendor's FFA capacity is bench testing — the module on a tester, looking at it cold. With it, the FFA finding feeds back into the lifetime record above and into the procurement decision next time around.
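
A completeness check run before the RMA request goes out keeps thin packages from leaving the building. The sketch below assumes the package is assembled as a dict; the key names are hypothetical labels for the six items listed above, with the final snapshot treated as optional.

```python
# Hypothetical key names for the evidence items listed above.
REQUIRED = (
    "validation_record",
    "lifetime_telemetry",
    "triage_findings",
    "deployment_context",
    "failure_description",
)
OPTIONAL = ("final_snapshot",)

def missing_evidence(package):
    """Return the required evidence items that are absent or empty."""
    return [key for key in REQUIRED if not package.get(key)]
```

An empty return value means the five required items are present; the optional snapshot is not checked.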



Part Five

Reference

The reference you'll come back to during commissioning, triage, and procurement. Four chapters of consolidated tables, standards citations, definitions, and sources.

Chapter Twenty

Quick reference — pass/fail thresholds at a glance#

The threshold values referenced throughout the handbook, consolidated in one place. Indicative for an 800G OSFP IHS module under typical AI-fabric conditions; specific vendor & SKU thresholds may vary. Always baseline at install and alert on excursion from baseline rather than against absolute thresholds.

20.1 · DOM thresholds (per lane)#

Table 16 · DOM pass/fail thresholds — consolidated reference

Parameter	Healthy	Warning	Alarm	Action on alarm
Module temperature	10–65 °C	65–75 °C	> 75 °C or < 0 °C	Investigate airflow, thermal interface, rack hot-spot
TX power per lane	−5 to +3 dBm	−8 to −5 dBm	< −8 or > +5 dBm	Laser failure (low) or calibration drift (high)
RX power per lane	−8 to +2 dBm	−11 to −8 dBm	< −11 dBm	Channel first — inspect, clean, re-mate before module swap
Laser bias per lane	25–80 mA	15–25 / 80–100 mA	< 15 or > 100 mA	Aging: leading indicator; plan retirement
Module voltage (3.3V rail)	3.135–3.465 V	3.10–3.135 / 3.465–3.50 V	Outside 3.10–3.50 V	Host power-rail or PSU issue, not module
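
For parameters whose warning band surrounds the healthy band on both sides, such as laser bias and module voltage in Table 16, classification is a pair of range checks. A minimal sketch, with the band values copied from the table and the names chosen for illustration:

```python
def classify(value, healthy, warning):
    """Classify a reading given (low, high) healthy and outer warning bands."""
    if healthy[0] <= value <= healthy[1]:
        return "healthy"
    if warning[0] <= value <= warning[1]:
        return "warning"
    return "alarm"

# Bands from Table 16; the warning tuple is the outer envelope.
BIAS_MA = {"healthy": (25, 80), "warning": (15, 100)}
VCC_V = {"healthy": (3.135, 3.465), "warning": (3.10, 3.50)}
```

For example, `classify(90, **BIAS_MA)` lands in the warning band and `classify(3.52, **VCC_V)` is an alarm. Per the chapter introduction, these absolute bands are a backstop; excursion from the installed baseline remains the primary alert.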

20.2 · FEC & counter thresholds#

Table 17 · FEC & error-counter thresholds

Counter	Healthy (per lane, KP4 FEC)	Investigate threshold	Alarm threshold
Pre-FEC BER	< 1×10⁻⁶	1×10⁻⁶ to 1×10⁻⁵	> 1×10⁻⁴ (FEC margin gone)
Post-FEC BER	0 (target)	Any non-zero — investigate	> 1×10⁻¹² sustained
Uncorrected codewords	0 (target)	> 0 in 24h	Any sustained accumulation
CRC / FCS errors	0 / hour	> 0 / hour	> 100 / hour sustained
Link flap rate	< 1 / quarter	> 1 / month	> 1 / week

20.3 · Sample-testing pass/fail#

Table 18 · First-article qualification pass criteria (n = 30)

Test	Pass criteria across the 30-unit sample
Identification	30/30 units match advertised vendor, P/N, FW; no mis-shipped SKUs
DOM at room temp	30/30 inside healthy band; CV across sample < 0.10 on TX power
Switch-resident BER	30/30 lanes < 1×10⁻⁶ pre-FEC; zero post-FEC errors after 5 min
Thermal corner (70°C case)	30/30 maintain DPSM = DPActivated; pre-FEC BER < 1×10⁻⁵ at corner
Mating cycle (50×)	Insertion loss creep < 0.3 dB on 30/30 connectors
72-hour burn-in	30/30 maintain link; no FEC counter excursions; DOM drift < 0.5 dB TX
Decision	0 failures across all phases → pass at 9.5% upper-bound on defect rate (95% CI)
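
The 9.5% figure in the decision row follows from the zero-failure binomial bound: if n units are tested with no failures, the one-sided upper confidence bound on the defect rate p solves (1 − p)^n = 1 − confidence. A short sketch of that calculation, with an illustrative function name:

```python
def zero_failure_upper_bound(n, confidence=0.95):
    """One-sided upper bound on defect rate after n units with 0 failures."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)
```

For n = 30 this evaluates to about 0.095, which is the 9.5% in the table; the familiar "rule of three" approximation, 3/n, gives 10% for the same sample. Doubling the sample to 60 units roughly halves the bound.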

20.4 · Fabric-readiness gates#

Table 19 · Fabric-level acceptance thresholds

Metric	Healthy fabric	Action threshold
ECMP hash CV	< 0.10	> 0.20 — re-tune hash function
RDMA BW (vs theoretical)	> 95% sustained	< 90% — investigate path
PFC pause rate	Bursts only, < 0.1% of time	Sustained pauses or storms
ECN marking rate	Reactive to load; not constant	Constant high marking with no DCQCN response
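
The ECMP hash CV in Table 19 is the standard deviation of per-path load divided by its mean. A minimal sketch follows, assuming per-path byte counters sampled over the same interval. Using the population standard deviation is an assumption here; the sample form differs only slightly for realistic path counts. The "watch" label for the band between the two table thresholds is also an illustrative choice.

```python
from statistics import mean, pstdev

def ecmp_cv(bytes_per_path):
    """Coefficient of variation of load across ECMP paths."""
    return pstdev(bytes_per_path) / mean(bytes_per_path)

def ecmp_status(cv):
    """Map a CV onto the Table 19 bands; 'watch' covers the gap between them."""
    if cv < 0.10:
        return "healthy"
    if cv > 0.20:
        return "action"
    return "watch"
```

Four paths carrying 100, 110, 90, and 100 units give a CV of about 0.07, inside the healthy band; paths carrying 50, 150, 100, and 100 give about 0.35 and cross the action threshold.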


Chapter Twenty-One

Standards and references#

The handbook draws on a small set of standards and operator-published data. The list below is intentionally short — these are the documents an operator references in practice, not the complete bibliography of optical networking.

21.1 · The standards that matter operationally#

IEEE 802.3 (-2022 and amendments). The Ethernet physical-layer specification — defines what 800GBASE-DR8, -SR8, -VR8, and the 1.6T variants actually are. When a vendor says "compliant with IEEE 802.3," this is the document referenced. Most operators don't read it directly; you confirm vendor compliance against it indirectly via test reports and qualification documents.

OIF CMIS 5.x. The Common Management Interface Specification: the data model, register map, and Datapath State Machine for pluggable modules. This is the document you reference when you need to know what's actually on Page 11h (per-lane state and DOM), Page 13h (PRBS and diagnostics controls), Page 14h (diagnostics results), or the VDM pages (20h–2Fh). Available from the OIF website. Operators of any depth need the table of contents; few read it cover to cover.

OCP Optics Telemetry Specification. An operator-friendly specification for what factory test reports should contain — per-unit data, calibration coefficients, real timestamps. When you ask a vendor for "OCP-compliant test reports," this is the document. Hyperscalers drove its creation; Tier 2/3 operators benefit from the same standard now.

IEC 61300-3-35. The optical endface inspection standard. Pass/fail criteria for endface contamination by zone (core, cladding, contact, adhesive). Built into every fiber inspection scope. The right-answer reference for "is this endface clean enough."

NVIDIA LinkX, Cisco TMG, Juniper THCD, Arista TOI, SONiC HwSKU. Platform vendor compatibility lists. These tell you what optics are formally qualified on what switches. Always check the most recent version — these change with every major firmware release.

21.2 · Operator-published reliability data we reference#

Alibaba HPN (SIGCOMM 2024). Source of the 0.68% per-link AFR figure on access links — the most-cited operator-published data point on optical reliability at scale.

Microsoft Azure (2023 paper on optical/copper failure ratios). Source of the "100x more failures on optical vs copper" figure. Establishes the optical-link-as-dominant-failure-mode framing.

Meta (LLaMA 3 paper, Table 5). 35 of 419 documented unexpected training-job interruptions (about 8.4%) attributed to network switch/cable faults. A concrete operator-side failure-attribution data point at hyperscale.

NVIDIA Spectrum-X / Quantum-X reference architectures. The deployment patterns most Tier-2 AI clusters and SuperPOD-reference deliveries are built against. The customer-facing acceptance benchmark for most sysint and neocloud projects.

21.3 · Glossary and acronyms#

See Chapter 22 for the operationally relevant glossary.


Chapter Twenty-Two

Glossary#

Terms used throughout the handbook. Definitions are operational rather than rigorous — where the rigorous version differs, that's noted.

ACC (Active Copper Cable) — Copper twinax cable with redrivers (analog equalization) at each end. Between DAC and AEC in cost and performance.

AEC (Active Electrical Cable) — Copper twinax with retimer ICs at each end. Recovers and re-transmits clean signal; firmware-managed.

AFR (Annualized Failure Rate) — Fraction of fielded units expected to fail per year. The standard fleet reliability metric.

AOC (Active Optical Cable) — Sealed assembly with optical engines at both ends connected by internal fiber. Endface not accessible.

ApSel (Application Selection) — CMIS table advertising the configurations a module supports. Host selects from this menu.

AQL (Acceptance Quality Limit) — Z1.4 parameter: the worst process-average percentage of defective units still considered tolerable when accepting lots. Older documents write "Acceptable Quality Level."

BER (Bit Error Rate) — Probability of a single bit being received in error. Reported as a power of ten (1e-12 = 10⁻¹²).

BERT (Bit Error Rate Tester) — Instrument that drives test patterns and measures error rates. Switch-resident BERT (CMIS Page 13h) covers most uses; instrumented BERT covers more.

CDB (Command Data Block) — Command-and-control surface within CMIS. Used for firmware updates and vendor extensions.

CMIS (Common Management Interface Specification) — Modern module data model. OIF-governed.

CPO (Co-Packaged Optics) — Optical engines integrated into the switch ASIC package. Lasers external (ELSFP).

CRC / FCS — Frame integrity check fields. Errors indicate corrupted frames at L2.

CV (Coefficient of Variation) — Standard deviation divided by mean. Used here as the ECMP distribution health metric.

DAC (Direct Attach Copper) — Passive copper twinax with no active components. Lowest cost, shortest reach.

DOM (Digital Optical Monitoring) — Telemetry read from a module over its management interface: TX/RX power, temperature, voltage, laser bias.

DPSM (Datapath State Machine) — CMIS state machine governing the module's transition from power-on through operational.

DSP (Digital Signal Processor) — Active electronics in a pluggable module that retime and process the signal. Contrast with LPO.

ECMP (Equal-Cost Multi-Path) — Routing technique distributing flows across multiple equal-cost paths via a hash function.

ECN (Explicit Congestion Notification) — IP-layer mechanism marking packets to signal incipient congestion without dropping.

EEPROM — Module's identification and calibration data. Read by host at insertion; format defined by SFF-8472, SFF-8636, or CMIS depending on module.

FEC (Forward Error Correction) — Coding that adds redundancy to detect and correct bit errors. KP4, KR4 are common modes.

FFA (Failure Field Analysis) — Vendor's technical investigation of why a returned unit failed. Distinct from RMA.

FR4 — 2 km single-mode variants. Four wavelength-multiplexed lanes.

IHS (Integrated Heat Sink) — OSFP variant with heat sink mounted to the module body. Finned top.

InfiniBand — Credit-based lossless interconnect protocol. NVIDIA Quantum platform native.

KP4 FEC — Reed-Solomon RS(544,514) FEC, named for the 100GBASE-KP4 backplane PHY where it was first specified. The standard FEC for 400G/800G PAM4.

LinkX — NVIDIA's program of validated transceiver/cable configurations on Quantum/Spectrum platforms.

LPO (Linear Pluggable Optics) — Module without a DSP or retimer; the host SerDes equalizes the whole channel. Lower power; tighter host-pair certification.

LR4 — 10 km single-mode variants. Four wavelength-multiplexed lanes.

MPO (Multi-fiber Push-On) — Connector type for parallel fiber. 12-fiber and 16-fiber variants common in AI fabrics.

NOS (Network Operating System) — Switch software. SONiC, NVOS, Cumulus, Arista EOS, Cisco NX-OS, Juniper Junos.

NRZ — Non-Return-to-Zero. Binary modulation; one bit per symbol. Used at 25G/lane and below.

OCP (Open Compute Project) — Open hardware specifications consortium. Optics Telemetry Spec Rev 0.9 is its 2025 publication.

OMA (Optical Modulation Amplitude) — Transmitter parameter. For PAM4 it is specified as outer OMA (OMAouter): the difference in optical power between the highest and lowest symbol levels.

OSFP (Octal Small Form-factor Pluggable) — 8-lane form factor; 800G at 100G/lane. NVIDIA's choice for Quantum-X and Spectrum-X.

OSFP-XD — Hyperscaler-favored 1.6T form factor. 16 lanes; not interchangeable with OSFP.

OTDR (Optical Time-Domain Reflectometer) — Instrument that characterizes fiber by analyzing reflected pulses.

PAM4 — Pulse Amplitude Modulation, 4 levels. Two bits per symbol. Used at 50G/lane and above.

PFC (Priority-based Flow Control) — Per-priority pause-frame mechanism enabling lossless Ethernet behavior.

PRBS (Pseudo-Random Bit Sequence) — Test pattern for BER measurement. PRBS31Q is the canonical PAM4 test pattern.

QSFP-DD — 400G form factor. 8 lanes at 50G or 4 at 100G.

QSFP-DD800 — 800G in QSFP-DD mechanical. Cage-compatible with QSFP-DD 400G.

RDMA (Remote Direct Memory Access) — Direct memory transfer across network without CPU involvement. Native to InfiniBand, available on Ethernet via RoCE.

RHS (Riding Heat Sink) — OSFP variant with cage-mounted heat sink. Module has flat top.

RoCE (RDMA over Converged Ethernet) — RDMA on Ethernet. RoCEv2 is the IP-routable version.

SerDes (Serializer/Deserializer) — Electrical interface between switch ASIC and module. CEI-112G underlies modern 100G/lane optics.

SSPRQ — Short Stress Pattern Random Quaternary. Official TDECQ test pattern.

TDECQ (Transmitter and Dispersion Eye Closure Quaternary) — PAM4 transmitter quality metric. Measured with TDECQ-capable oscilloscope.

VDM (Versatile Diagnostic Monitoring) — CMIS pages with rich per-lane diagnostic data: BER stats, SNR, equalizer settings.

WCMP (Weighted-Cost Multipath) — ECMP variant that splits flows across paths in proportion to assigned weights rather than equally, typically to match unequal path capacity.

Z1.4 — ANSI/ASQ Z1.4. Attribute sampling standard. Defines lot-size-to-sample-size mapping at given AQL.
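
Two of the definitions above reduce to short formulas, sketched here in Python. The first computes CV over per-path load samples as an ECMP balance check; the second estimates how many error-free bits (and how much time) a test must run to support a BER claim. The function names, the sample loads, the 95% confidence level, and the nominal 100 Gb/s lane rate are illustrative assumptions, and the BER estimate uses the common zero-errors-observed approximation rather than any procedure defined in this handbook.

```python
import math
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(samples) / mean


def bits_for_ber_claim(target_ber: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim BER < target_ber at `confidence`.

    Uses n = -ln(1 - confidence) / BER, valid when zero errors are observed.
    """
    return -math.log(1.0 - confidence) / target_ber


def seconds_for_ber_claim(target_ber: float, line_rate_bps: float,
                          confidence: float = 0.95) -> float:
    """Test duration implied by bits_for_ber_claim at a given line rate."""
    return bits_for_ber_claim(target_ber, confidence) / line_rate_bps


if __name__ == "__main__":
    # Hypothetical per-uplink utilization (percent) across four ECMP paths.
    loads = [40.0, 42.0, 38.0, 40.0]
    print(f"ECMP CV: {coefficient_of_variation(loads):.4f}")

    # Hypothetical nominal lane rate of 100 Gb/s, BER target of 1e-12.
    secs = seconds_for_ber_claim(1e-12, 100e9)
    print(f"Error-free run needed for BER < 1e-12 at 95%: {secs:.1f} s")
```

A lower CV means flows are spread more evenly across paths. The BER helper shows why tight BER targets need long soak times: halving the target BER doubles the required error-free run.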

Chapter Twenty-Three

Sources and references

A short list of the data and documents cited throughout this handbook. Operator-published reliability data is the most useful body of evidence on optical-link AFR; vendor MTBF claims and academic papers fill in around it.

23.1 · Operator-published reliability data

Qian, K. et al. Alibaba HPN: A Data Center Network for Large Language Model Training. ACM SIGCOMM 2024. (Source of the 0.68% per-access-link AFR figure.)

Roy, A. et al. Inside the Social Network's (Datacenter) Network. ACM SIGCOMM 2015. (A Facebook datacenter traffic study.)

Microsoft optical reliability papers, 2022–2023. (Source of the optical-vs-copper AFR ratio framing.)

Meta AI. The Llama 3 Herd of Models. 2024. (Table 5 — training interruption attribution data, including network and cable.)

23.2 · Standards bodies and platform documentation

IEEE 802.3 (multiple amendments through 2024). Ethernet physical layer specifications.

OIF CMIS 5.x. Common Management Interface Specification.

OCP Optics Telemetry Specification, current revision.

IEC 61300-3-35. Fiber endface inspection.

NVIDIA LinkX (current revision), Cisco Transceiver Module Group, Juniper Hardware Compatibility, Arista Transceiver Optical Interoperability. Current platform compatibility documents.

23.3 · A note on data sources

The reliability and failure-mode data in this handbook is aggregated from three sources: vendor failure-field-analysis reports we have visibility into, the operator-published papers cited above, and conversations with operators across Tier-2 AI clusters, neoclouds, and system integrators over the past 24 months. Specific numerical claims (e.g., "35% of failures are connector-related") should be read as indicative orders of magnitude, because operator fleets vary substantially with workload, vendor mix, and deployment vintage. The methodologies in this handbook are the most directly transferable content; the numerical context exists to motivate them.

About Vitex

A US-based optical interconnect supplier for AI data centers, colocation operators, and enterprise builds.

Vitex LLC, headquartered in Hackensack, New Jersey, supplies the full range of interconnects covered in this handbook: 100G through 1.6T optical transceivers, AOC, AEC, ACC, and DAC cables, and the engineering support to specify, validate, and deploy them in modern AI fabric architectures.

What we focus on:

4–7 week lead times on most SKUs, against an industry typical of 24+ weeks. Strategic inventory and direct manufacturing relationships keep schedules predictable.
US-based field applications engineering. Your engineers can talk to ours directly. We can support sample-testing protocol design, bring-up troubleshooting, and triage discussions at no additional cost.
The full AI-fabric portfolio — 100G, 400G, 800G, and 1.6T transceivers across OSFP and QSFP-DD form factors, plus AOC, AEC, ACC, and DAC cables. One supplier across the cable plant and the optics.
Compatibility focus on the platforms operators are deploying in 2026 — NVIDIA Quantum-X, Spectrum-X, Arista 7060X/7800R, Cisco Nexus, and white-box SONiC.

If you're evaluating optical suppliers, the criteria in this handbook are the right ones to apply — to us as much as to anyone else. Samples, factory test reports, and reference introductions are available on request.

Reach us

Email: marketing@vitextech.com
Web: vitextech.com
Headquarters: Hackensack, New Jersey, USA
Engineering & product: based in the US and partner facilities

Vitex has supplied optical transceivers, DAC/AOC/AEC/ACC cables, and fiber for over 23 years from Hackensack, New Jersey. This handbook is part of the Vitex resources library for AI data center operators. © 2026 Vitex LLC. Procedures and thresholds reflect operator practice as of mid-2026; verify current details with vendors and standards bodies before procurement or deployment decisions.
