From 04e7e854d9850c1bc5e36064a7166bed02ac15f0 Mon Sep 17 00:00:00 2001 From: Network Agent Date: Fri, 5 Jun 2026 10:51:07 +0000 Subject: [PATCH] initial commit --- designdoc.md | 728 ++++++++++++++++++++++++++++++++++++++++++++ failure-analysis.md | 642 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 1370 insertions(+) create mode 100644 designdoc.md create mode 100644 failure-analysis.md diff --git a/designdoc.md b/designdoc.md new file mode 100644 index 0000000..000c5a5 --- /dev/null +++ b/designdoc.md @@ -0,0 +1,728 @@ +# Network Design Guide + +This document describes the architectural design principles for a VyOS L3VPN network. It provides enough knowledge for an agent to take high-level network lifecycle instructions and decompose them into a concrete network design expressed as Kubernetes Custom Resources (CRDs). + +--- + +## Table of Contents + +1. [Design Process Overview](#design-process-overview) +2. [L3VPN Layered Architecture](#l3vpn-layered-architecture) +3. [Router Types and Roles](#router-types-and-roles) +4. [VPN Topology Patterns](#vpn-topology-patterns) +5. [IP Addressing Strategy](#ip-addressing-strategy) +6. [VRF and Route Target Design](#vrf-and-route-target-design) +7. [Assembling a Network Design](#assembling-a-network-design) +8. [CRD Reference and Dependency Chain](#crd-reference-and-dependency-chain) +9. [Design Rules and Constraints](#design-rules-and-constraints) +10. [Example: Hub-and-Spoke L3VPN](#example-hub-and-spoke-l3vpn) + +--- + +## Design Process Overview + +An agent receiving a high-level natural language request (e.g., "connect three UK branch offices to a central hub over a private VPN") must follow this decomposition process: + +``` +High-Level Intent + │ + ▼ +1. Identify sites, roles, and connectivity requirements + │ + ▼ +2. Choose VPN topology (hub-and-spoke, any-to-any) + │ + ▼ +3. Assign router roles (P, PE, RR, CE) and count + │ + ▼ +4. Design physical topology and IP address plan + │ + ▼ +5. 
Define underlay protocols (OSPF areas, BGP AS, MPLS) + │ + ▼ +6. Define VRF, Route Distinguisher and Route Target plan + │ + ▼ +7. Emit CRD YAML: VyOSInfrastructure (including Routers, Networks, and Devices) → VyOSUnderlay → VyOSL3VPN +``` + +At each step, consult the current VyOS network deployed in k8s to check which routers and networks already exist before adding new resources. Never duplicate a router or network name that already exists in the live topology. + +--- + +## L3VPN Layered Architecture + +The network is built in three clearly separated layers. Each layer has its own CRD and its own lifecycle. + +``` +┌──────────────────────────────────────────────┐ +│ Customer / Overlay Layer (VRFs, BGP VPNv4) │ ← VyOSL3VPN CRD +│ Route Distinguishers, Route Targets, VRFs │ +├──────────────────────────────────────────────┤ +│ Underlay / Core Layer (OSPF + MPLS + iBGP) │ ← VyOSUnderlay CRD +│ P routers, RR routers, LSPs via LDP │ +├──────────────────────────────────────────────┤ +│ Physical / Infrastructure Layer │ ← VyOSInfrastructure CRD +│ Routers, Links, Devices, IP addresses │ +└──────────────────────────────────────────────┘ +``` + +### Layer 1 – Physical Infrastructure + +Defines all routers, devices, and the point-to-point or multi-access networks connecting them. This is the only layer that knows about hardware (or virtual hardware) — container images, port assignments, IP addresses on links, MTU, VLAN IDs, and physical location. + +**Key concepts:** +- Every router gets a unique loopback IP (e.g., `10.0.0.x/32`) used as its stable router ID. +- Point-to-point (P2P) links use `/24` subnets in the `172.16.x.0/24` range (only `.1` and `.2` are used). +- The management network (`192.168.122.0/24`) connects `eth0` of every router and `eth1` of every device for out-of-band access. +- Each P2P link is assigned a unique VLAN ID (301+) to isolate traffic on Linux bridges. +- **Devices** are simulated end-hosts (traffic agents) that attach to customer LANs. 
+- **QoS and Security** policies are defined at this layer and bound to router interfaces. + +### Layer 2 – Underlay Routing + +Configures OSPF (for link-state flooding and loopback reachability), LDP/MPLS (for label-switched paths between PEs), and iBGP (for distributing VPNv4 routes between PEs via Route Reflectors). + +**Key concepts:** +- OSPF backbone area `0.0.0.0` is used for all P and PE routers. All loopbacks must be reachable via OSPF before MPLS can signal LSPs. +- LDP uses the loopback interface as its router ID and is enabled on all P-to-P and P-to-PE links. It is **not** enabled on PE-to-CE links. +- iBGP runs in a single AS (e.g., `65001`). All PEs peer only with Route Reflectors — never directly with each other. +- Route Reflectors do not participate in MPLS data-plane forwarding; they only reflect VPNv4 routes. + +### Layer 3 – L3VPN Overlay + +Configures VRFs on PE routers, assigns customer-facing interfaces to VRFs, and sets up eBGP sessions between each PE and its attached CE router. MP-BGP carries VPNv4 prefixes (customer routes + RD/RT labels) between PEs via the Route Reflectors. + +**Key concepts:** +- Each customer site is represented by a VRF on its PE router. +- Route Distinguishers (RD) make customer routes globally unique across PEs. +- Route Targets (RT) control which VRFs import which routes, implementing the hub-and-spoke or any-to-any policy. +- CE routers run eBGP in a customer AS (e.g., `65035`) and peer with the PE's VRF interface IP. + +--- + +## Router Types and Roles + +### P – Provider Core Router + +**Role:** MPLS label forwarding only. Carries customer traffic inside LSPs without inspecting the IP payload. 
+ +**Protocols required:** +- OSPF (backbone area, all P-to-P interfaces) +- MPLS/LDP (all P-to-P interfaces) + +**Protocols NOT required:** +- BGP (P routers do not run BGP) +- VRFs + +**Interface pattern:** +- `eth0` → management network +- `eth1`, `eth2`, … → P-to-P links to other P routers or PE/RR routers +- `lo` → loopback network (router ID) + +**When to add more P routers:** Add P routers to increase core capacity or redundancy. A minimal core needs at least one P router; for resilience design two parallel P-router paths between any PE pair. + +--- + +### PE – Provider Edge Router + +**Role:** L3VPN service delivery. Maintains VRFs for customer traffic, runs VPNv4 MP-BGP with RRs, and eBGP with attached CE routers. + +**Protocols required:** +- OSPF (backbone area, P-facing interfaces only) +- MPLS/LDP (P-facing interfaces only — **not** CE-facing interfaces) +- iBGP (peers with Route Reflectors only, using loopback addresses) +- eBGP per VRF (peers with CE router) + +**Protocols NOT required:** +- LDP on CE-facing interfaces + +**Interface pattern:** +- `eth0` → management network +- `eth1` (or `eth1`/`eth2`) → P-to-PE link(s) into the core +- `eth2` (or next available) → PE-to-CE link (customer-facing, inside a VRF) +- `lo` → loopback network (router ID) + +**Rule:** A PE must connect to **at least one** P router in the core. For redundancy, connect to two different P routers on different paths through the core. + +--- + +### RR – Route Reflector + +**Role:** iBGP route reflection. Receives VPNv4 routes from PE clients and re-advertises them to all other PE clients, eliminating the need for a full iBGP mesh. 
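The scaling benefit of route reflection can be made concrete with a little arithmetic. The minimal sketch below compares the iBGP session count of a full PE mesh with the count needed when every PE peers with every RR and the RRs do not peer with each other, as this guide prescribes. The function names are illustrative only and are not part of any CRD or operator API.

```python
def full_mesh_sessions(num_pe: int) -> int:
    """iBGP sessions if every PE peered directly with every other PE."""
    return num_pe * (num_pe - 1) // 2


def route_reflector_sessions(num_pe: int, num_rr: int = 2) -> int:
    """iBGP sessions when each PE peers with every RR and RRs do not peer with each other."""
    return num_pe * num_rr


if __name__ == "__main__":
    # Print PE count, full-mesh sessions, and RR-based sessions side by side.
    for num_pe in (3, 10, 50):
        print(num_pe, full_mesh_sessions(num_pe), route_reflector_sessions(num_pe))
```

With only three PEs the two designs are comparable (3 sessions versus 6), so the saving is not about small labs. It appears as the PE count grows: at 50 PEs a full mesh needs 1225 sessions, while two RRs need 100.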
+ +**Protocols required:** +- OSPF (backbone area, links to P routers) +- MPLS/LDP (links to P routers — needed for loopback reachability via the MPLS core) +- iBGP with `route_reflector: true`, listing all PE routers as `route_reflector_client: true` + +**Protocols NOT required:** +- VRFs +- eBGP + +**Interface pattern:** +- `eth0` → management network +- `eth1`, `eth2` → links to P routers (for OSPF/LDP reachability) +- `lo` → loopback network (router ID) + +**Sizing rule:** Deploy at least **two RRs** for redundancy. Every PE must peer with **all** RRs. RRs do not need to peer with each other; they are peers of the PE clients only. + +--- + +### CE – Customer Edge Router + +**Role:** Customer premises router. Connects the customer LAN to the provider PE over an eBGP session. The CE is aware only of the customer's own AS and the routes the VPN policy allows. + +**Protocols required:** +- eBGP (peers with PE's VRF interface IP) + +**Protocols NOT required:** +- OSPF +- MPLS +- VRFs + +**Interface pattern:** +- `eth0` → management network +- `eth1` → CE-to-PE link (provider-facing, eBGP peer) +- `eth2` → LAN / customer site network +- `lo` → loopback network + +**Rule:** CE router's BGP AS must be **different** from the provider iBGP AS. A common pattern is `65001` for provider iBGP and `65035` for all CE routers. + +--- + +## VPN Topology Patterns + +### Hub-and-Spoke + +The most common telco/enterprise pattern. One site (hub) can communicate with all spoke sites; spoke sites communicate **only** through the hub — they cannot reach each other directly. + +**Use when:** Central services, internet breakout, security inspection, or regulatory compliance require all traffic to pass through a central point. 
+ +**VRF policy:** +| Role | VRF Name | RT Export | RT Import | +|------|----------|-----------|-----------| +| Hub PE | `BLUE_HUB` | `65035:1030` | `65035:1011`, `65035:1030` | +| Spoke PE | `BLUE_SPOKE` | `65035:1011` | `65035:1030` | + +- The hub imports its **own** export target (self-import of `65035:1030`) to allow spoke-to-spoke traffic to hairpin through the hub. +- Spoke PEs import only the hub's routes (`65035:1030`) — they cannot see each other's prefixes directly. + +### Any-to-Any (Full Mesh) + +All sites can communicate with all other sites directly. + +**Use when:** Latency-sensitive peer-to-peer applications (e.g., distributed databases, real-time collaboration). + +**VRF policy:** All PE VRFs share a single RT — all export and all import the same community. Example: `RT export: 65035:1000`, `RT import: 65035:1000`. + +### Choosing a Topology + +| Requirement | Recommended Topology | +|-------------|---------------------| +| Internet/security hub | Hub-and-Spoke | +| Centralised application server | Hub-and-Spoke | +| Branch-to-branch collaboration | Any-to-Any | +| Mixed: hub services + branch direct | Hub-and-Spoke with selected spoke import overrides | + +--- + +## IP Addressing Strategy + +Use a disciplined allocation plan. 
The reference lab uses these ranges — maintain the same scheme for consistency: + +| Purpose | Range | Example | +|---------|-------|---------| +| Loopbacks (Router IDs) | `10.0.0.0/24` | `10.0.0.1` – `10.0.0.254` | +| P-to-P core links | `172.16.x.0/24` | `172.16.30.0/24` – `172.16.140.0/24` | +| PE-to-CE links | `10.50–90.x.0/24` | `10.50.50.0/24` (PE1↔CE1) | +| Customer LANs | `10.100.x.0/24` | `10.100.1.0/24` (spoke1 LAN) | +| Management | `192.168.122.0/24` | `192.168.122.11` – `.254` | + +**Loopback allocation rules:** +- RR routers: start at `10.0.0.1` (e.g., rr1=`.1`, rr2=`.2`) +- P routers: start at `10.0.0.3` (e.g., p1=`.3`, p2=`.4`, p3=`.5`, p4=`.6`) +- PE routers: start at `10.0.0.7` (e.g., pe1=`.7`, pe2=`.8`, pe3=`.10`) +- CE routers: start at `10.0.0.80` (e.g., ce1-spoke=`.80`, ce2-spoke=`.90`, ce1-hub=`.100`) + +**VLAN allocation rules:** +- Assign one unique VLAN per P2P network segment starting at VLAN 301. +- PE-to-CE links start at VLAN 401. +- Management network has no VLAN tag. +- Loopback network has no VLAN tag. + +**BGP AS numbers:** +- Provider iBGP: `65001` +- Customer eBGP (all CEs): `65035` + +--- + +## VRF and Route Target Design + +### Route Distinguisher (RD) + +Every VRF on every PE must have a **globally unique** RD. Use the convention: + +``` +: +``` + +Example: PE1 (`10.0.0.7`) hosting BLUE_SPOKE with service ID `1011`: +``` +rd: "10.50.50.1:1011" # Use the PE-to-CE interface IP, not loopback +``` + +The PE-to-CE link IP is used as the RD IP component (not the loopback) to ensure uniqueness when a PE hosts multiple VRFs for the same customer. + +### Route Target (RT) + +Route Targets implement the traffic policy between sites. 
Use the convention: + +``` +: +``` + +| Policy | RT Value | Purpose | +|--------|----------|---------| +| Spoke export | `65035:1011` | Routes exported by spoke sites | +| Hub export | `65035:1030` | Routes exported by hub site | + +**Hub VRF RT configuration:** +``` +rt_export: ["65035:1030"] +rt_import: ["65035:1011", "65035:1030"] +``` + +**Spoke VRF RT configuration:** +``` +rt_export: ["65035:1011"] +rt_import: ["65035:1030"] +``` + +### VRF Table IDs + +Each VRF requires a unique Linux routing table ID. Use: +- Spoke VRFs: `200` +- Hub VRFs: `400` +- Additional VRFs: increment by 100 + +--- + +## Assembling a Network Design + +Follow these steps to translate a customer intent into a complete set of CRDs. + +### Step 1: Identify Sites and Roles + +From the customer description, extract: +- Number of sites +- Which site is the hub (central services, security) +- Which sites are spokes (branches, remote offices) +- Geographic locations (for Spanner metadata) + +### Step 2: Map Sites to Router Roles + +For each site: +- One **CE** router per customer site (speaks eBGP to its PE) +- One **PE** router per customer site attachment point (may serve multiple customers) +- Shared **P** routers in the core (typically 2–4 for resilience) +- Shared **RR** routers (exactly 2 for redundancy) + +**Minimum viable topology:** 1 P + 2 RR + 2 PE + 2 CE (one hub, one spoke) + +**Production resilient topology:** 4 P + 2 RR + N PE + N CE + +### Step 3: Plan Physical Links + +Every router pair that needs connectivity requires a dedicated P2P network entry in `VyOSInfrastructure.spec.networks`. + +Required links: +- P↔P: form the core mesh +- P↔RR: for RR loopback reachability via OSPF/LDP +- P↔PE: attach PE to core +- PE↔CE: customer-facing link (outside MPLS domain) + +Each CE also needs a LAN network (`network_type: multi-access`) for attached Devices. 
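Before the link plan is written into `VyOSInfrastructure.spec.networks`, it can be sanity-checked offline. The sketch below is a hedged illustration, not the operator's own validation: it assumes the plan is available as plain dictionaries using the field names from the CRD skeleton in this guide (`name`, `subnet`, `vlan`, `network_type`, `connected_routers`, `ip_address`), and `check_networks` is a hypothetical helper name.

```python
import ipaddress


def check_networks(networks: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in a network plan."""
    problems = []
    seen_vlans = {}  # VLAN ID -> name of the first network that used it
    for net in networks:
        name = net["name"]
        subnet = ipaddress.ip_network(net["subnet"])
        members = net.get("connected_routers", [])

        # A P2P link joins exactly two routers.
        if net.get("network_type") == "p2p" and len(members) != 2:
            problems.append(f"{name}: p2p link needs exactly 2 routers, found {len(members)}")

        # Every address must sit inside the declared subnet and be unique on the link.
        seen_ips = set()
        for member in members:
            ip = ipaddress.ip_address(member["ip_address"])
            if ip not in subnet:
                problems.append(f"{name}: {ip} is outside {subnet}")
            if ip in seen_ips:
                problems.append(f"{name}: duplicate address {ip}")
            seen_ips.add(ip)

        # VLAN IDs must not be reused across networks.
        vlan = net.get("vlan")
        if vlan is not None:
            if vlan in seen_vlans:
                problems.append(f"{name}: VLAN {vlan} already used by {seen_vlans[vlan]}")
            else:
                seen_vlans[vlan] = name
    return problems


if __name__ == "__main__":
    plan = [
        {"name": "p1-p2", "subnet": "172.16.30.0/24", "vlan": 301, "network_type": "p2p",
         "connected_routers": [
             {"router_name": "p1", "interface": "eth1", "ip_address": "172.16.30.1"},
             {"router_name": "p2", "interface": "eth1", "ip_address": "172.16.30.2"},
         ]},
    ]
    print(check_networks(plan))  # prints []
```

Returning a list of messages rather than raising on the first problem lets an agent report every defect in a plan in one pass.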
+ +### Step 4: Assign Interface Names + +Interface naming is sequential per router: +- `eth0` → always management +- `eth1` → first P2P link (usually into the core for PE/RR; first P neighbor for P) +- `eth2`, `eth3`, … → additional links in the order they appear in `spec.routers[].interfaces` +- `lo` → always loopback + +Interface names in `connected_routers` must exactly match the names in `spec.routers[].interfaces`. + +### Step 5: Write the VyOSInfrastructure + +Declare all networks, routers, and devices. The infrastructure CRD is self-contained — it does not reference underlay protocols. + +### Step 6: Write the VyOSUnderlay + +Reference the infrastructure with `infrastructureRef`. Configure OSPF and LDP on all P, PE, and RR routers. Configure iBGP on RR routers (as reflectors) and PE routers (as RR clients). CE routers are **not** listed in the underlay — they have no OSPF/MPLS. + +### Step 7: Write the VyOSL3VPN + +Reference the underlay with `underlayRef`. For each PE router, declare its VRF(s) with RD, RT import/export, the CE-facing interface, and the eBGP neighbour entry. + +### Step 8: (Optional) Write Standalone Device Resources + +If not defined within the `VyOSInfrastructure` spec, you can create standalone `Device` resources attached to the CE's LAN network. + +--- + +## CRD Reference and Dependency Chain + +Resources must be applied and reach `Ready` in this strict order: + +``` +VyOSInfrastructure (kind: VyOSInfrastructure, apiVersion: google.dev/v1) + │ generates VyOSRouter, LinuxNetwork, and Device children automatically + ▼ +VyOSUnderlay (kind: VyOSUnderlay, apiVersion: google.dev/v1) + │ spec.infrastructureRef → VyOSInfrastructure name + ▼ +VyOSL3VPN (kind: VyOSL3VPN, apiVersion: google.dev/v1) + │ spec.underlayRef → VyOSUnderlay name + ▼ +TrafficTest (kind: TrafficTest, apiVersion: google.dev/v1) + │ spec.source_devices / spec.destination_device → Device names +``` + +The operator enforces these dependencies automatically.
A resource will wait (retrying every 10 seconds) until its parent is `Ready`. + +## Design Rules and Constraints + +These rules must always be satisfied in a valid design. Validate against them before emitting CRDs. + +### Physical / Infrastructure Rules + +1. Every router must have exactly one `eth0` interface connected to the `management` network. +2. Every router must have exactly one `lo` interface connected to the `loopback` network. +3. Every P2P network must have exactly **two** `connected_routers` entries. +4. A `multi-access` LAN network must have exactly **one** connected router (the CE gateway interface). +5. Interface names must match between `spec.networks[].connected_routers[].interface` and the router's `spec.routers[].interfaces[].name`. +6. Each P2P network must have a unique VLAN ID. VLAN IDs must not overlap across any networks in the same infrastructure. +7. No two routers may share the same `router_id` (loopback IP). +8. No two `connected_routers` entries in the same network may share the same IP address. +9. Router names must be globally unique within the infrastructure. + +### Underlay Rules + +10. Only P, PE, and RR routers are listed in `VyOSUnderlay.spec.routers`. CE routers are excluded. +11. OSPF must be enabled on all P, PE, and RR routers with area `0.0.0.0`. +12. LDP `interfaces` for each router must list only interfaces connected to P2P core links (never the management `eth0`, CE-facing interfaces, or `lo`). +13. Every PE router must have at least one BGP neighbor entry pointing to an RR loopback IP with `remote_as` equal to the provider AS. +14. Every RR router must list **all** PE router loopback IPs as `route_reflector_client: true` neighbors. +15. The `route_reflectors` list in `spec.routing.bgp` must contain the loopback IPs of all RR routers. + +### L3VPN / VRF Rules + +16. Only PE routers are listed in `VyOSL3VPN.spec.routers`. +17. Every VRF `rd` must be unique across all VRFs in all PE routers. +18. 
Hub VRFs must import **both** spoke RT and hub RT (to enable spoke-to-spoke hairpin via the hub). +19. Spoke VRFs must import **only** hub RT (spokes cannot see each other directly). +20. The `interfaces` list in a VRF must reference the PE's CE-facing interface (the one connected to the PE-to-CE network), not any core-facing interface. +21. The BGP neighbor `peer` IP in a VRF must be the CE's IP on the PE-to-CE link. +22. The BGP neighbor `remote_as` in the L3VPN spec must match the CE router's AS number declared in the underlay. + +### Addressing Rules + +23. All loopback IPs (`10.0.0.x`) must be unique across all routers. +24. All P2P link subnets (`172.16.x.0/24`) must not overlap. +25. All PE-to-CE link subnets (`10.x.x.0/24`) must not overlap with each other or with the core. +26. All customer LAN subnets (`10.100.x.0/24`) must not overlap. +27. Management IPs (`192.168.122.x`) must be unique per router. +28. The `ip_address` in `connected_routers` must fall within the network's declared `subnet`. + +--- + +## Example: Hub-and-Spoke L3VPN + +The canonical example is the `l3vpn-hub-spoke` topology in `environment/telco-lab/l3vpn-hub-spoke.yaml`. It provides a full working reference. 
Key design decisions made in this example: + +### Topology Summary + +``` + [ce1-spoke]──(10.50.50.0/24)──[pe1] + │ + (172.16.90.0/24 via p1) + │ +[ce2-spoke]──(10.60.60.0/24)──[pe3] [p1]──[p2]──[rr1] + │ │ + (p4-pe3) │ │(p1-p3) + [p4]──[p3]──[rr2] + │ + (p3-pe2 / p1-pe2) + │ + [pe2] + │ + (10.80.80.0/24) + │ + [ce1-hub] +``` + +### Router Role Assignments + +| Router | Role | Loopback | Connected To | +|--------|------|----------|--------------| +| p1 | P | 10.0.0.3 | p2, p3, rr1, pe1, pe2 | +| p2 | P | 10.0.0.4 | p1, p4, rr1 | +| p3 | P | 10.0.0.5 | p1, p4, pe2, rr2 | +| p4 | P | 10.0.0.6 | p2, p3, pe3, rr2 | +| rr1 | RR | 10.0.0.1 | p1, p2 | +| rr2 | RR | 10.0.0.2 | p3, p4 | +| pe1 | PE | 10.0.0.7 | p1, ce1-spoke | +| pe2 | PE | 10.0.0.8 | p1, p3, ce1-hub | +| pe3 | PE | 10.0.0.10 | p4, ce2-spoke | +| ce1-spoke | CE | 10.0.0.80 | pe1 | +| ce1-hub | CE | 10.0.0.100 | pe2 | +| ce2-spoke | CE | 10.0.0.90 | pe3 | + +### VRF Design + +| PE | VRF | Topology | RD | RT Export | RT Import | +|----|-----|----------|----|-----------|-----------| +| pe1 | BLUE_SPOKE | spoke | 10.50.50.1:1011 | 65035:1011 | 65035:1030 | +| pe2 | BLUE_HUB | hub | 10.80.80.1:1011 | 65035:1030 | 65035:1011, 65035:1030 | +| pe3 | BLUE_SPOKE | spoke | 10.60.60.1:1011 | 65035:1011 | 65035:1030 | + +### CRD Skeleton + +```yaml +# Layer 1: Physical topology +apiVersion: google.dev/v1 +kind: VyOSInfrastructure +metadata: + name: + namespace: network +spec: + networks: + # Core P2P links (one entry per link pair) + - name: p1-p2 + subnet: "172.16.30.0/24" + vlan: 301 + network_type: "p2p" + bandwidth: "1gbit" + connected_routers: + - router_name: "p1" + interface: "eth1" + ip_address: "172.16.30.1" + - router_name: "p2" + interface: "eth1" + ip_address: "172.16.30.2" + # ... (more P2P links) ... 
+ # Management network + - name: mgmt + subnet: "192.168.122.0/24" + gateway: "192.168.122.1" + network_type: "management" + bandwidth: "unlimited" + connected_routers: + - router_name: "p1" + interface: "eth0" + ip_address: "192.168.122.11" + # ... (all routers) ... + # Loopback network (no connected_routers; loopbacks are auto-assigned) + - name: loopbacks + subnet: "10.0.0.0/24" + network_type: "loopback" + bandwidth: "unlimited" + # PE-to-CE links + - name: pe1-ce1-spoke + subnet: "10.50.50.0/24" + vlan: 401 + network_type: "p2p" + bandwidth: "100mbit" + connected_routers: + - router_name: "pe1" + interface: "eth2" + ip_address: "10.50.50.1" + - router_name: "ce1-spoke" + interface: "eth1" + ip_address: "10.50.50.2" + # CE LAN networks + - name: lan-spoke1 + subnet: "10.100.1.0/24" + gateway: "10.100.1.1" + network_type: "multi-access" + bandwidth: "1gbit" + connected_routers: + - router_name: "ce1-spoke" + interface: "eth2" + ip_address: "10.100.1.1" + routers: + - name: "p1" + hostname: "p1" + router_id: "10.0.0.3" + role: "P" + location: + latitude: 51.5074 + longitude: -0.1278 + city: "London" + country: "United Kingdom" + site: "London-DC1" + interfaces: + - name: "eth0" + network: "mgmt" + - name: "eth1" + network: "p1-p2" + - name: "lo" + network: "loopbacks" + # ... (remaining routers) ... 
+ devices: + - name: "dev1" + network_name: "lan-spoke1" + ip_address: "10.100.1.10" + mgmt_ip: "192.168.122.100" + gateway: "10.100.1.1" + qos: + policies: + - name: "bronze" + bandwidth: "10mbit" + security: + firewall: + policies: + - name: "allow-all" + default_action: "accept" +--- +# Layer 2: Underlay routing protocols +apiVersion: google.dev/v1 +kind: VyOSUnderlay +metadata: + name: + namespace: network +spec: + infrastructureRef: + routing: + ospf: + router_id_source: "loopback" + areas: + - area_id: "0.0.0.0" + type: "backbone" + bgp: + as_number: 65001 + router_id_source: "loopback" + route_reflectors: ["10.0.0.1", "10.0.0.2"] # RR loopback IPs + mpls: + enabled: true + ldp: + router_id_interface: "loopback" + routers: + # P router: OSPF + MPLS only + - name: "p1" + protocols: + ospf: + router_id: "10.0.0.3" + areas: + - area: "0.0.0.0" + type: "backbone" + mpls: + enabled: true + ldp: + router_id: "10.0.0.3" + interfaces: ["eth1", "eth2"] # All core-facing interfaces + # RR router: OSPF + MPLS + BGP (route_reflector: true) + - name: "rr1" + protocols: + ospf: + router_id: "10.0.0.1" + areas: + - area: "0.0.0.0" + type: "backbone" + bgp: + as_number: 65001 + router_id: "10.0.0.1" + route_reflector: true + neighbors: + - peer: "10.0.0.7" # pe1 loopback + remote_as: 65001 + route_reflector_client: true + - peer: "10.0.0.8" # pe2 loopback + remote_as: 65001 + route_reflector_client: true + mpls: + enabled: true + ldp: + router_id: "10.0.0.1" + interfaces: ["eth1", "eth2"] + # PE router: OSPF + MPLS + BGP (peers with RRs) + - name: "pe1" + protocols: + ospf: + router_id: "10.0.0.7" + areas: + - area: "0.0.0.0" + type: "backbone" + bgp: + as_number: 65001 + router_id: "10.0.0.7" + neighbors: + - peer: "10.0.0.1" # rr1 loopback + remote_as: 65001 + - peer: "10.0.0.2" # rr2 loopback + remote_as: 65001 + mpls: + enabled: true + ldp: + router_id: "10.0.0.7" + interfaces: ["eth1"] # Core-facing interfaces only + # CE router: eBGP only (no OSPF, no MPLS) + - name: 
"ce1-spoke" + protocols: + bgp: + as_number: 65035 + router_id: "10.0.0.80" + neighbors: + - peer: "10.50.50.1" # PE's IP on the PE-to-CE link + remote_as: 65001 +--- +# Layer 3: L3VPN overlay (VRFs and eBGP to CE) +apiVersion: google.dev/v1 +kind: VyOSL3VPN +metadata: + name: + namespace: network +spec: + underlayRef: + services: + - name: "BLUE_SPOKE" + type: "l3vpn" + topology: "spoke" + - name: "BLUE_HUB" + type: "l3vpn" + topology: "hub" + routers: + # Spoke PE + - name: "pe1" + vrfs: + - name: "BLUE_SPOKE" + table: 200 + rd: "10.50.50.1:1011" # : + rt_export: ["65035:1011"] + rt_import: ["65035:1030"] + interfaces: ["eth2"] # CE-facing interface + bgp: + vrfs: + - name: "BLUE_SPOKE" + neighbors: + - peer: "10.50.50.2" # CE IP on PE-to-CE link + remote_as: 65035 + # Hub PE + - name: "pe2" + vrfs: + - name: "BLUE_HUB" + table: 400 + rd: "10.80.80.1:1011" + rt_export: ["65035:1030"] + rt_import: ["65035:1011", "65035:1030"] # Import both to enable hairpin + interfaces: ["eth3"] + bgp: + vrfs: + - name: "BLUE_HUB" + neighbors: + - peer: "10.80.80.2" + remote_as: 65035 diff --git a/failure-analysis.md b/failure-analysis.md new file mode 100644 index 0000000..84a7035 --- /dev/null +++ b/failure-analysis.md @@ -0,0 +1,642 @@ +# L3VPN Failure Analysis — GNN Root Cause Analysis vs. Traditional Fault Management + +This document analyses all fault injection scenarios for the telco-lab L3VPN network, grounded +in the actual topology and traffic descriptors, and explains how the GNN-based RCA approach +compares with traditional fault management for each failure type. + +For the GNN model definition — graph schema, node features, edge types, model architecture, training pipeline, and fault classification algorithm — see [rca.md](rca.md). 
+ +--- + +## GNN Value Summary + +| # | Fault | Impact | Traditional Approach | GNN Approach | Value | +|---|---|---|---|---|---| +| 1 | **MTU Mismatch** | Large packets silently dropped on PE1 uplink; TCP sessions stall intermittently with no visible cause | No alarm fires. Requires manual `ping -s 1400` probing from the correct vantage — typically triggered hours after customer complaint | Detects `tx_util`/`rx_util` asymmetry across the connected-interface edge in 5 min; `packet_loss_pct` on flow nodes confirms customer impact | 🟡 Passive detection with no probing | +| 2 | **Hub CE Session Down** | Total BLUE VPN blackout — all spoke-to-hub traffic fails immediately | BGP trap fires instantly but generates 3–4 cascade VRF alarms; NOC must manually determine a single root cause | Common-path analysis collapses cascade to 1 root-cause alert; `vrf_active_sessions` and `active_sessions_norm` on flow nodes quantify SLA breach | 🟢 Alarm noise reduction — N→1 | +| 3 | **RR1 Process Crash** | All VPN traffic disrupted for up to 90 seconds during route-reflector reconvergence | 4+ simultaneous BGP alarms fire; NOC investigates each PE independently, unaware of the shared RR root cause | Common-path analysis across all failing sessions identifies RR1 as the sole cause; at production scale (50 PEs) 50 alarms become 1 | 🟢 **50× alarm reduction at scale** | +| 4 | **Wrong Import RT on PE3** | Spoke2 (Liverpool) silently isolated — d2-blue has no route to hub while the hub still reaches d2-blue, creating a deceptive one-way illusion | Zero alarms. All sessions Established. Dashboard shows green. 
Only discovered via customer complaint and manual `show vrf` audit | `rt_import_hash` on VRF node `BLUE_SPOKE@PE3` deviates from trained baseline in 5 min — policy misconfiguration detected directly without any inference | 🔴 **Zero-alarm silent misconfiguration — only GNN** | +| 5 | **Degrading SFP** | Gradual CRC error injection simulates a failing optical SFP; retransmissions rise unnoticed for 30+ minutes before routing protocols destabilise | Traditional alarm fires at t=45–55 min once absolute error thresholds are exceeded — 15–25 minutes after SLA breach has begun | `rx_err_gradient` trend crosses anomaly threshold at t=30 min; `packet_loss_pct` on affected flow nodes rises progressively — proactive alert before any SLA impact | 🔴 **Predictive — 15–25 min early warning** | +| 5b | **Link Down** | Immediate traffic rerouting or blackout depending on redundancy; OSPF and BGP cascade alarms follow | Binary alarm fires in < 1 second via SNMP linkDown trap — fastest traditional detection of any fault type | Suppresses cascade alarms; `(flow, transits)` edges identify exactly which customer flows are affected and which recover via backup paths | 🟡 Blast radius analysis; RCA suppression | +| 6 | **OSPF Area Mismatch** | P2-P4 OSPF adjacency silently fails; physical link UP but L3 paths lost; PE4-originating traffic reroutes over longer detour paths | Zero alarms — SNMP sees interface UP, BGP Established. The L3 failure is completely invisible to single-layer monitoring | Cross-layer contradiction: `ospf_num_routes` drops while `tx_util≈0` on an UP interface — a signal traditional tools cannot form without correlating two separate data sources | 🔴 Cross-layer L1/L3 contradiction — only GNN | +| 7 | **Duplicate IP** | P3 claims a P1 address causing ARP cache poisoning on P4; traffic intermittently black-holes on a 20–30 minute ARP timeout cycle | Zero alarms — ARP table changes are not surfaced by SNMP.
Pings frequently succeed when NOC investigates, masking the fault | `session_uptime_norm` oscillation and overlapping peer IP across two BGP session nodes infers the ARP conflict indirectly from its effect on routing stability | 🔴 ARP-layer conflict inferred from BGP signal | +| **8** | **TX Queue Starvation** ⭐ NEW | Hub PE2 uplink txqueuelen shrunk to 20 packets (a Linux kernel parameter); queue overflows constantly under load causing 30–60% throughput loss on all hub downloads | Not detectable by any tool — kernel parameter is absent from VyOS config, GitOps, and all config management systems; `ifOutDiscards` threshold never triggers | `tx_queue_len_norm=0.02` directly flags the misconfiguration; corroborated by `jitter_norm` spike on the constant 8 Mbps hub UDP monitoring flow | 🔴 **Kernel-layer misconfiguration — only GNN** | +| **9** | **OSPF Cost Inflation** ⭐ NEW | OSPF cost set to 65535 on P2-P1 link; all Brighton/Cardiff PE traffic reroutes via 3-hop detour congesting the 100 Mbps P3-PE2 link; OSPF adjacency stays Full | Not detectable — cost 65535 is a legal configuration value; all adjacencies Full, all interfaces UP, all BGP sessions Established | `tx_util≈0` on a Full-adjacency core link is uniquely contradictory; `latency_ms_norm` spikes and `egresses_at` edge shifts across multiple flow nodes confirm topology-wide rerouting | 🔴 **Legal-but-wrong config — only GNN** | +| *(10)* | *(BGP Update Storm)* | *RR1 CPU saturated by 10,000-prefix route-flap injection; BGP keepalive processing delayed; forwarding latency increases for all transit traffic* | *CPU visible in SNMP `hrProcessorLoad` but not correlated to service impact by standard NMS* | *`bgp_update_rate` spike on RR1 directly flags RESOURCE_EXHAUSTION; validates the cpu/mem RCA classifier branch currently untested by F1–F9* | *🔴 CPU/resource exhaustion — validates GNN classifier* | +| *(11)* | *(Cross-VPN Route Leak)* | *BLUE spoke routes accidentally exported into RED VPN RIB; RED VPN traffic 
to matching prefixes is misdirected; completely silent from control-plane perspective* | *More routes in a session is never alarmed; completely silent until customers complain about misdirected traffic* | *`rt_export_hash` deviation on VRF node + anomalous `leaks_to` cross-VPN edge directly detected; the `leaks_to` edge has never appeared in training — structurally anomalous* | *🔴 Multi-VPN policy violation — only GNN* | + +**Legend**: 🔴 = GNN is the **only** detection path · 🟡 = GNN significantly improves on traditional · 🟢 = GNN reduces alarm noise on top of existing detection + +*Faults 8 and 9 are new silent-performance misconfigurations introduced specifically to demonstrate GNN capabilities against faults that exist outside all configuration management systems. Faults 10 and 11 are recommended additions to fill GNN feature coverage gaps.* + +--- + +## Network Reference + +### Physical Topology + +``` + RR1 (Birmingham, 10.0.0.1) + / \ + P1 (London) ── P2 (Manchester) + 10.0.0.3 10.0.0.4 + / \ \ / \ + P3 PE1 PE2 P4 PE3 PE4 +(Edinburgh) (Oxford)(Cambridge)(Leeds)(Brighton)(Cardiff) +10.0.0.5 10.0.0.7 10.0.0.8 10.0.0.6 10.0.0.10 10.0.0.11 + | \ | | | | + PE2 PE1 P3 PE3 PE4 P2 + RR2 (Bristol, 10.0.0.2) connects P3 + P4 +``` + +**Provider Core (1 Gbps links):** + +| Link | Subnet | P Router | Interface | P Router | Interface | +|---|---|---|---|---|---| +| P1 ↔ P2 | 172.16.30.0/24 | p1 (London) | eth1 | p2 (Manchester) | eth1 | +| P1 ↔ P3 | 172.16.40.0/24 | p1 (London) | eth2 | p3 (Edinburgh) | eth2 | +| P2 ↔ P4 | 172.16.60.0/24 | p2 (Manchester) | eth2 | p4 (Leeds) | eth2 | +| P3 ↔ P4 | 172.16.50.0/24 | p3 (Edinburgh) | eth3 | p4 (Leeds) | eth3 | +| P1 ↔ RR1 | 172.16.10.0/24 | p1 (London) | eth4 | rr1 (Birmingham) | eth2 | +| P2 ↔ RR1 | 172.16.20.0/24 | p2 (Manchester) | eth3 | rr1 (Birmingham) | eth1 | +| P3 ↔ RR2 | 172.16.70.0/24 | p3 (Edinburgh) | eth4 | rr2 (Bristol) | eth2 | +| P4 ↔ RR2 | 172.16.80.0/24 | p4 (Leeds) | eth1 | rr2 (Bristol) | eth1 | + +**PE 
Uplinks (100 Mbps, all PE routers are dual-homed):** + +| Link | Subnet | P Router | Interface | PE Router | Interface | +|---|---|---|---|---|---| +| P1 ↔ PE1 | 172.16.90.0/24 | p1 (London) | eth3 | pe1 (Oxford) | eth1 | +| P3 ↔ PE1 | 172.16.160.0/24 | p3 (Edinburgh) | eth5 | pe1 (Oxford) | eth4 | +| P1 ↔ PE2 | 172.16.100.0/24 | p1 (London) | eth5 | pe2 (Cambridge) | eth1 | +| P3 ↔ PE2 | 172.16.110.0/24 | p3 (Edinburgh) | eth1 | pe2 (Cambridge) | eth2 | +| P4 ↔ PE3 | 172.16.140.0/24 | p4 (Leeds) | eth4 | pe3 (Brighton) | eth1 | +| P2 ↔ PE3 | 172.16.170.0/24 | p2 (Manchester) | eth5 | pe3 (Brighton) | eth4 | +| P2 ↔ PE4 | 172.16.150.0/24 | p2 (Manchester) | eth4 | pe4 (Cardiff) | eth1 | +| P4 ↔ PE4 | 172.16.180.0/24 | p4 (Leeds) | eth5 | pe4 (Cardiff) | eth4 | + +### VPN Services + +**BLUE VPN — Hub-and-Spoke:** + +| Device | Site | PE Router | CE Router | LAN Subnet | +|---|---|---|---|---| +| dh-blue (10.100.2.10) | Nottingham (Hub) | PE2 (Cambridge) | ce1-hub | 10.100.2.0/24 | +| d1-blue (10.100.1.10) | Sheffield (Spoke1) | PE1 (Oxford) | ce1-spoke | 10.100.1.0/24 | +| d2-blue (10.100.3.10) | Liverpool (Spoke2) | PE3 (Brighton) | ce2-spoke | 10.100.3.0/24 | +| d3-blue (10.100.4.10) | Huddersfield (Spoke3) | PE4 (Cardiff) | ce3-spoke | 10.100.4.0/24 | + +**RED VPN — Any-to-Any Mesh:** + +| Device | Site | PE Router | CE Router | LAN Subnet | +|---|---|---|---|---| +| d1-red (10.101.1.10) | Norwich | PE1 (Oxford) | ce1-red | 10.101.1.0/24 | +| d2-red (10.101.2.10) | Coventry | PE2 (Cambridge) | ce2-red | 10.101.2.0/24 | +| d3-red (10.101.3.10) | Plymouth | PE3 (Brighton) | ce3-red | 10.101.3.0/24 | +| d4-red (10.101.4.10) | Leicester | PE4 (Cardiff) | ce4-red | 10.101.4.0/24 | + +### Route Target Policy (BLUE VPN Hub-and-Spoke) + +| Role | PE Router | VRF | Export RT | Import RT | +|---|---|---|---|---| +| Spoke | PE1 (Oxford) | `BLUE_SPOKE` | `65035:1011` | `65035:1030` | +| Spoke | PE3 (Brighton) | `BLUE_SPOKE` | `65035:1011` | `65035:1030` | +| Spoke | PE4 
(Cardiff) | `BLUE_SPOKE` | `65035:1011` | `65035:1030` | +| Hub | PE2 (Cambridge) | `BLUE_HUB` | `65035:1030` | `65035:1011`, `65035:1030` | + +> The hub imports both spoke routes (`65035:1011`) and its own routes (`65035:1030`) to enable spoke-to-spoke traffic via the hub. Fault 4 changes PE3's import RT to `65035:9999`, breaking this policy. + +### Traffic Tests + +**BLUE VPN Traffic Tests (`l3vpn-blue-test.yaml`):** + +| Flow ID | Protocol | Direction | Rate Profile | Peak | +|---|---|---|---|---| +| `d1-to-hub-tcp` | TCP | d1-blue → dh-blue | multi_sine (daily + weekly cycle) | ~45 Mbps at 14:00 UTC | +| `d2-to-hub-udp` | UDP | d2-blue → dh-blue | schedule (business hours) | ~45 Mbps at 13:30 UTC | +| `d1-hub-tcp-bidir` | TCP | d1-blue ↔ dh-blue | upload: multi_sine; download: multi_sine (heavier) | Upload ~45 Mbps, Download ~70 Mbps | +| `d2-hub-udp-bidir` | UDP | d2-blue ↔ dh-blue | upload: schedule; download: **8 Mbps constant** | Upload ~45 Mbps, Hub push constant | + +**RED VPN Traffic Tests (`l3vpn-red-test.yaml`):** + +| Flow ID | Protocol | Direction | Rate Profile | Peak | +|---|---|---|---|---| +| `d1-red-to-d2-red-tcp` | TCP | d1-red → d2-red | multi_sine | ~45 Mbps | +| `d3-red-to-d4-red-udp` | UDP | d3-red → d4-red | schedule (business hours) | ~45 Mbps | +| `d1-red-d3-red-bidir` | TCP | d1-red ↔ d3-red diagonal | upload: multi_sine; download: heavier | Upload ~45 Mbps, Download ~70 Mbps | +| `d2-red-d4-red-bidir` | UDP | d2-red ↔ d4-red diagonal | upload: schedule; download: **8 Mbps constant** | Upload ~45 Mbps, Hub push constant | + +> **Key signal**: The `d2-hub-udp-bidir` and `d2-red-d4-red-bidir` hub-to-spoke reverse directions run a **constant 8 Mbps UDP stream 24/7**. Any fault that degrades this path produces a clean, unambiguous signal with no natural rate variation to hide behind — making it the most sensitive GNN training signal in the test suite. 
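Because the reverse direction of these flows has a fixed expected rate, a deviation can be scored without modelling time of day. The sketch below is a minimal illustration only. It assumes `expected_rate_deviation` is the relative shortfall against the expected rate; that formula and the function signature are assumptions, not the definition used in rca.md.

```python
def expected_rate_deviation(observed_mbps: float, expected_mbps: float = 8.0) -> float:
    """Relative shortfall of an observed rate against a constant expected rate.

    Assumed definition (hypothetical): 0.0 means the flow runs at its
    expected rate, 1.0 means it has dropped to zero. Overshoot is clamped to 0.0.
    """
    if expected_mbps <= 0:
        raise ValueError("expected_mbps must be positive")
    shortfall = (expected_mbps - observed_mbps) / expected_mbps
    return max(0.0, shortfall)
```

With a constant 8 Mbps baseline, any non-zero value is attributable to the network rather than to the traffic profile, which is what makes these two flows useful as probes.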
+ +--- + +> **GNN model details** — for the full graph schema (node types, features, edge types), model architecture, training pipeline, inference scoring, and fault classification algorithm, see [rca.md](rca.md). + +--- + +## On Physical Link Down — Is GNN Appropriate? + +Before the per-fault analysis, this question deserves a direct answer. + +For a **clean physical link-down** (cable pull, port failure), traditional SNMP gives a faster first alarm (< 1 second vs. 5-minute GNN inference cycle). However, the GNN is superior in three of the four scenarios below: + +| Scenario | Traditional NMS | GNN | +|---|---|---| +| **First detection of binary link failure** | ✅ < 1 second (SNMP linkDown trap) | ❌ Up to 5 minutes | +| **Alarm suppression / root cause isolation** | ❌ Fires N alarms (one per BGP session, OSPF peer) | ✅ Common-path analysis → 1 root-cause alert | +| **Pre-failure degradation detection** | ❌ Alarms only after threshold crossed (45+ min) | ✅ Detects `rx_err_gradient` trend ~25 min before failure (see Fault 5) | +| **Cross-layer RCA** (L1 UP but L3 broken) | ❌ Requires manual correlation across layers | ✅ Graph structure enables simultaneous cross-layer inference | + +**Recommendation**: Use traditional SNMP/syslog as the fast alarm layer for binary physical failures. Use the GNN as the root-cause isolation and proactive degradation detection layer. They are complementary, not competing. + +--- + +## GNN Feature Coverage by Fault + +The table below shows which GNN node features each fault exercises. A comprehensive test suite must exercise all features; gaps indicate RCA classifier branches that cannot be validated.
+ +| GNN Feature | Node Type | Faults That Exercise It | +|---|---|---| +| `state` | router, interface | F3 (loopback disable), F5b (link down) | +| `cpu` | router | **F10** (BGP Update Storm) — gap in F1–F9 | +| `mem` | router | **F10** — gap in F1–F9 | +| `bgp_update_rate` | router | **F10** — primary `RESOURCE_EXHAUSTION` signal | +| `vrf_count` | router | **F11** (Cross-VPN Leak) | +| `fib_size_norm` | router | **F11** | +| `ospf_num_routes` | router | F6 (area mismatch), F3 (RR crash), F9 (cost inflation) | +| `pfx_count_norm` | router | F2 (hub CE down), F3 (RR crash), F4 (wrong RT) | +| `rx_drops` / `tx_drops` | interface | F8 (TX queue starvation) | +| `mtu_norm` | interface | F1 (MTU mismatch) | +| `tx_queue_len_norm` | interface | **F8** — direct detection (`txqueuelen 20` = 0.02 vs. healthy 1.0) | +| `rx_err_gradient` | interface | F5 (SFP degradation) | +| `tx_util` / `rx_util` | interface | F1, F8, F9 | +| `bgp_state` | bgp_session | F2 (CE session down), F3 (RR crash) | +| `pfx_count_norm` | bgp_session | F2, F3, F4, F11 (route leak) | +| `prefix_count_delta` | bgp_session | F2, F3, F4 | +| `session_uptime_norm` | bgp_session | F7 (duplicate IP, flapping) | +| `rt_import_count` | bgp_session | F11 | +| **`vrf_route_count`** | **vrf** | **F4, F2** | +| **`vrf_route_count_delta`** | **vrf** | **F4** | +| **`rt_import_hash`** | **vrf** | **F4** — direct detection of wrong RT | +| **`rt_export_hash`** | **vrf** | **F11** — direct detection of route leak | +| **`vrf_mem_bytes_norm`** | **vrf** | **F11** | +| **`vrf_active_sessions`** | **vrf** | **F2** | +| **`throughput_norm`** | **flow** | **F1, F2, F8, F9** | +| **`throughput_delta`** | **flow** | **F2, F8** | +| **`expected_rate_deviation`** | **flow** | **F8, F9** — primary signal for silent performance faults | +| **`jitter_norm`** | **flow** | **F8** — direct queue saturation signal | +| **`packet_loss_pct`** | **flow** | **F1, F5** | +| **`latency_ms_norm`** | **flow** | **F9** — direct OSPF 
rerouting signal | +| **`active_sessions_norm`** | **flow** | **F2, F3** | + +--- + +## Fault-by-Fault Analysis + +--- + +### Fault 1 — MTU Mismatch + +| Property | Value | +|---|---| +| **Type** | `MTU_MISMATCH` | +| **File** | `l3vpn-hub-spoke-fault1-mtu.yaml` | +| **Target** | PE1 / eth1 (Oxford → London, 172.16.90.0/24) | +| **Alarms generated** | ❌ None | +| **Severity** | Performance degradation — intermittent | + +**What happens**: PE1's uplink toward P1 (London) has its MTU reduced from 1500 to 1400 bytes. Any MPLS-encapsulated packets exceeding 1400 bytes (typical BGP UPDATE messages with many prefixes, and customer TCP segments with standard MSS) are silently dropped by the kernel. No ICMP fragmentation-needed is returned. + +**Traffic flows affected**: + +| Flow | Impact | Mechanism | +|---|---|---| +| `d1-to-hub-tcp` | ⚠️ Intermittent degradation | Sheffield → PE1/eth1 → P1: 45 Mbps peak with 10 TCP sessions generates frequent large segments | +| `d1-hub-tcp-bidir` (upload) | ⚠️ Degraded | Same outbound path | +| `d1-hub-tcp-bidir` (download) | ✅ Unaffected | Hub → d1 enters PE1 inbound — MTU on PE1/eth1 does not affect ingress | +| `d1-red-to-d2-red-tcp` | ⚠️ Degraded | Norwich (PE1) → PE1/eth1 → P1 → PE2 | +| `d1-red-d3-red-bidir` (d1 → d3) | ⚠️ Degraded | PE1/eth1 on outbound path | +| All d2, d3 flows | ✅ Unaffected | d2 uses PE3 (Brighton); d3 uses PE4 (Cardiff) | + +**Backup path note**: PE1 also has eth4 (P3-PE1 link, 172.16.160.0/24). If OSPF ECMP distributes load across both uplinks, approximately 50% of d1 traffic may avoid the fault, making the impact intermittent and harder to reproduce on demand. + +**Time-of-day dependency**: The fault is most visible at 14:00 UTC (45 Mbps peak with 10 TCP sessions generating maximum segment sizes). At 02:00 UTC (5 Mbps overnight minimum), smaller traffic volume produces fewer drop events and the anomaly score may fall below threshold. 
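The size arithmetic behind the drops can be sketched as follows. This is an illustration under stated assumptions: a two-label stack (transport plus VPN label), labels counted against the interface MTU, and IPv4/TCP headers without options. The function names are hypothetical.

```python
MPLS_LABEL_BYTES = 4     # one MPLS label stack entry
IPV4_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20    # TCP header without options


def max_ip_packet(link_mtu: int, labels: int = 2) -> int:
    """Largest IP packet that still fits once the label stack is pushed."""
    return link_mtu - labels * MPLS_LABEL_BYTES


def max_tcp_mss(link_mtu: int, labels: int = 2) -> int:
    """Largest TCP payload per segment that crosses the link unfragmented."""
    return max_ip_packet(link_mtu, labels) - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def fits(ip_packet_bytes: int, link_mtu: int, labels: int = 2) -> bool:
    """True if a labelled packet of this IP length fits the link MTU."""
    return ip_packet_bytes <= max_ip_packet(link_mtu, labels)
```

Under these assumptions, a full-size 1500-byte customer packet does not fit a 1400-byte PE1/eth1, while small packets (keepalives, ACKs, short UDP datagrams) pass, which is consistent with the intermittent, load-dependent symptoms described above.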
+ +**GNN detection**: +- `interface` node PE1/eth1: `tx_drops` increases; `tx_util` is high while P1/eth3 (connected interface) `rx_util` is lower than expected +- 2-hop message passing across the `connected_to` edge detects the tx/rx utilisation asymmetry +- Reconstruction error confined to PE1/eth1 — router and BGP session nodes healthy +- **Classifier output**: `INTERFACE → top_feature=mtu_norm → MTU_MISMATCH on PE1/eth1` + +**Traditional tools**: Interface UP, OSPF Full, BGP Established. `ifOutDiscards` SNMP counter increases, but absolute thresholds are typically calibrated for hardware failures, not gradual MTU-induced drops. + +**GNN advantage**: Passive detection from standard telemetry counters. The traditional approach requires explicit MTU probing with the DF bit set (e.g., `ping -M do -s 1472` to test a 1500-byte path) on every link in both directions, which is not standard practice and must be run from exactly the right vantage point. + +--- + +### Fault 2 — Hub CE Session Teardown + +| Property | Value | +|---|---| +| **Type** | `BGP_SESSION_DOWN` | +| **File** | `l3vpn-hub-spoke-fault2-ce-down.yaml` | +| **Target** | PE2 / eBGP session to ce1-hub (10.80.80.0/24, VLAN 402) | +| **Alarms generated** | ✅ BGP session-down trap | +| **Severity** | Service outage — total BLUE VPN blackout | + +**What happens**: The eBGP session between PE2 (Cambridge) and ce1-hub (Nottingham) is deleted. PE2 withdraws all hub customer routes. All spoke VRFs lose their imported hub routes immediately.
+ +**Traffic flows affected**: + +| Flow | Impact | Mechanism | +|---|---|---| +| `d1-to-hub-tcp` | 🔴 100% loss | dh-blue (10.100.2.10) is unreachable | +| `d2-to-hub-udp` | 🔴 100% loss | Same | +| `d1-hub-tcp-bidir` | 🔴 100% loss (both directions) | Hub cannot reach spokes either | +| `d2-hub-udp-bidir` | 🔴 100% loss | The constant 8 Mbps hub push drops to 0 — unambiguous signal | +| All RED VPN flows | ✅ Unaffected | RED VPN uses separate VRF and CE routers (ce2-red at PE2 is independent) | + +**Diagnostic note**: The constant 8 Mbps hub monitoring push (`d2-hub-udp-bidir` reverse direction) is the most sensitive trigger. Any fault on the hub drops this to exactly 0, with no natural rate variation that could mask the outage. + +**GNN detection**: +- `bgp_session` node PE2↔ce1-hub: `bgp_state` → 0.0, `pfx_count_norm` → 0.0, `prefix_count_delta` → large negative +- BGP session reconstruction error spikes; router and interface nodes remain healthy +- **Classifier output**: `BGP_SESSION → parent role=CE → count=1 → Local Access Failure on PE2/ce1-hub` + +**Traditional tools**: ✅ BGP session-down trap fires within seconds. However in a network with many VRFs, this generates one alarm per spoke VRF that loses hub routes (~3–4 cascade alarms). The GNN suppresses these and issues a single alert pointing to the root session. + +**GNN advantage**: In a production network with 50 PE routers, one hub CE session failure can cascade into 50+ downstream alarms. The GNN collapses this to 1 root-cause alert. 
+ +--- + +### Fault 3 — RR1 Process Crash + +| Property | Value | +|---|---| +| **Type** | `PROCESS_CRASH` | +| **File** | `l3vpn-hub-spoke-fault3-rr1-crash.yaml` | +| **Target** | RR1 (Birmingham, 10.0.0.1) — bgpd kill or loopback disable | +| **Alarms generated** | ✅ 4+ BGP session-down traps simultaneously | +| **Severity** | Route reflection instability — up to 90 second disruption | + +**What happens**: RR1's BGP daemon crashes or its loopback (source of iBGP router-ID) is disabled. All 4 PE-to-RR1 sessions drop simultaneously. Traffic re-reflects via RR2 (Bristol, connected to P3 and P4), but reconvergence takes up to 90 seconds. + +**Traffic flows affected during reconvergence window**: + +| Flow | Impact | Mechanism | +|---|---|---| +| ALL flows (both VPNs) | ⚠️ Up to 90s disruption | VPNv4 route re-reflection via RR2 required | +| `d2-hub-udp-bidir` (8 Mbps constant) | 🎯 Clearest detector | Constant baseline makes even a 5-second interruption unambiguous | +| `d1-hub-tcp-bidir` peak at 14:00 UTC | 🔴 Severe | TCP retransmissions + new session re-establishment during convergence | + +After reconvergence (~90 seconds), all flows recover. The event appears as a "network hiccup" in hindsight. + +**GNN detection**: +- All 4 PE-to-RR1 session embeddings spike simultaneously +- Common-path analysis: all failing sessions share RR1 as parent router +- **Classifier output**: `BGP_SESSION → parent role=RR → count=4 → RR_CRASH → root cause: rr1` +- 4 downstream PE-level alarms suppressed; 1 RR alert issued + +**Traditional tools**: ✅ 4+ BGP alarms fire simultaneously. Without common-path analysis, the NOC sees "4 BGP sessions down across 4 sites" and starts individual per-site investigations. Root cause (single RR1) is not obvious from the alarm stream. + +**GNN advantage**: This is the GNN's clearest win for traditional alarm reduction. At scale (50 PEs, 2 RRs), one RR crash generates 50 simultaneous BGP alarms — the GNN collapses this to 1 root-cause alert. 
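The common-path step for this fault can be illustrated with a few lines of plain Python. This is a simplified stand-in for the graph-based analysis, not the classifier itself: it assumes each failed session is given as a `(local, remote)` pair of router names, and the function name is hypothetical.

```python
from collections import Counter


def root_cause_candidate(failed_sessions):
    """Return the router shared by every failed session, or None.

    failed_sessions: list of (local, remote) router-name pairs.
    A single failed session has no common path to analyse.
    """
    if len(failed_sessions) < 2:
        return None
    # Count how many failed sessions each endpoint participates in.
    counts = Counter(ep for session in failed_sessions for ep in set(session))
    node, hits = counts.most_common(1)[0]
    return node if hits == len(failed_sessions) else None


rr1_crash = [("pe1", "rr1"), ("pe2", "rr1"), ("pe3", "rr1"), ("pe4", "rr1")]
print(root_cause_candidate(rr1_crash))
```

When all four PE sessions share `rr1`, the four per-site alarms can be replaced by one alert on the shared router; unrelated failures on different reflectors return `None` and are left as separate alarms.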
+ +--- + +### Fault 4 — Wrong Import Route-Target on PE3 + +| Property | Value | +|---|---| +| **Type** | `BGP_SESSION_DOWN` / VRF misconfiguration | +| **File** | `l3vpn-hub-spoke-fault4-rt-import.yaml` | +| **Target** | PE3 (Brighton) — `BLUE_SPOKE` VRF import RT changed from `65035:1030` to `65035:9999` | +| **Alarms generated** | ❌ None | +| **Severity** | Silent isolation of Spoke2 (Liverpool) | + +**What happens**: PE3's BLUE_SPOKE VRF no longer imports hub routes (RT `65035:1030` is rejected). PE3's VRF routing table empties of hub prefixes. d2-blue cannot reach dh-blue. However, ce2-spoke still exports its own prefix correctly, so the hub can still see PE3's routes and sends packets toward d2-blue that arrive but get no response. + +**Traffic flows affected**: + +| Flow | Impact | Mechanism | +|---|---|---| +| `d2-to-hub-udp` | 🔴 Silent failure | d2-blue (Liverpool) has no route to hub — packets blackholed at PE3 | +| `d2-hub-udp-bidir` (d2 → hub) | 🔴 Fails | Same | +| `d2-hub-udp-bidir` (hub → d2, 8 Mbps constant) | ⚠️ One-way illusion | Hub can still send to d2; d2 receives but cannot respond — one-way connectivity | +| All d1, d3 flows | ✅ Unaffected | PE1 and PE4 VRFs have correct RT | +| All RED VPN flows | ✅ Unaffected | Separate VRF — RED VPN PE3 VRF unaffected | + +**The one-way deception**: Pings from dh-blue → d2-blue **succeed** (hub→spoke path works). Pings from d2-blue → dh-blue **fail**. A naive NOC test from the hub side concludes "connectivity OK." Only bidirectional end-to-end testing from d2-blue's vantage reveals the fault. 
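The asymmetry follows directly from route-target matching, which can be reproduced with set intersection. The sketch below uses the RT values from the Route Target Policy table; the dictionary layout and the `imported_from` helper are illustrative assumptions, and the model covers only RT matching, not BGP itself.

```python
def imported_from(vrfs):
    """For each VRF, list the other VRFs whose exported routes it imports.

    A route is imported when the exporter's export RTs and the
    importer's import RTs share at least one value.
    """
    return {
        name: {
            other
            for other, other_vrf in vrfs.items()
            if other != name and other_vrf["export"] & vrf["import"]
        }
        for name, vrf in vrfs.items()
    }


SPOKE = {"export": {"65035:1011"}, "import": {"65035:1030"}}
healthy = {
    "pe1": dict(SPOKE),
    "pe2": {"export": {"65035:1030"}, "import": {"65035:1011", "65035:1030"}},
    "pe3": dict(SPOKE),
    "pe4": dict(SPOKE),
}
# Fault 4: only PE3's import RT changes; its export RT is untouched.
faulty = {**healthy, "pe3": {**healthy["pe3"], "import": {"65035:9999"}}}

print(imported_from(healthy)["pe3"], imported_from(faulty)["pe3"])
print(imported_from(faulty)["pe2"])
```

In the faulty case PE3 imports from no one while PE2 still imports PE3's routes, which is exactly the one-way reachability that misleads a hub-side ping test.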
+ +**GNN detection**: +- PE3's iBGP sessions toward RR show anomalous `pfx_count_norm` (receiving fewer VPN routes than the baseline) +- HetGNN isolates deviation to PE3's VRF config sub-embedding — the RT import hash deviates from training baseline +- D-GAT detects asymmetric reachability (CE2 advertises normally; hub→PE3 imports zero) +- **Classifier output**: `BGP_SESSION → asymmetric import pattern on PE3 → VRF_RT_MISCONFIGURATION` + +**Traditional tools**: ❌ Zero alarms. All sessions Established. BGP prefix counts appear normal from the hub's perspective. RT policy is not monitored by standard NMS. + +**GNN advantage**: No traditional tool passively monitors RT import policy compliance. This fault is only discoverable via explicit VRF audit scripts or customer complaint. + +--- + +### Fault 5 — Degrading SFP on P1-P3 Link + +| Property | Value | +|---|---| +| **Type** | `PACKET_CORRUPTION` (tc netem) | +| **Target** | P1 / eth2 (London → Edinburgh, 172.16.40.0/24) | +| **Alarms generated** | ⚠️ Late alarm at ~45 minutes | +| **Severity** | Gradual hardware degradation → eventual link failure | + +**What happens**: Progressive packet corruption is injected on P1/eth2 simulating a failing optical SFP. Errors start at <1% and accelerate over 55 minutes until the link is unusable. + +**Degradation timeline vs. 
traffic flows**: + +| Time | Error Rate | Traffic Impact | GNN Signal | +|---|---|---|---| +| t=0–30 min | <1% | Imperceptible — TCP absorbs retransmissions | `rx_err_gradient` rising on P1/eth2; score below threshold | +| t=30 min | ~2–3% | Visible TCP retransmits at 45 Mbps peak | **GNN anomaly score crosses threshold** — alert: hardware degradation | +| t=45 min | ~8% | OSPF LSA drops; link metric instability; ECMP shifts | `rx_err_gradient` high; P1 `ospf_num_routes` fluctuates | +| t=55 min | >20% | LDP drops, routing instability, re-route | Full reconstruction error spike; traditional alarm fires | + +**Traffic flows affected** (flows using P1-P3 path): + +| Flow | Impact | Why | +|---|---|---| +| `d1-hub-tcp-bidir` (download, 70 Mbps peak) | 🔴 High degradation at peak | PE2→P3→PE1 path (or PE1→P1→P3→PE2) uses P1-P3 link under ECMP | +| `d1-red-d3-red-bidir` diagonal | ⚠️ Degraded | PE1↔PE3 diagonal traverses P1-P3 under some OSPF paths | +| Flows not on P1-P3 path | ✅ Initially unaffected | OSPF ECMP shifts load; other paths pick up the slack | + +**GNN advantage**: At t=30 minutes, GNN raises `HARDWARE_DEGRADATION` alert on P1/eth2 with `top_feature=rx_err_gradient`. Traditional monitoring does not alarm until t=45–55 minutes. The GNN provides **15–25 minutes of early warning** — enough time for proactive SFP replacement before SLA breach. + +**GNN detection**: +- `interface` node P1/eth2: `rx_err_gradient` rises steadily across successive inference cycles +- Trajectory analysis: steadily increasing anomaly score = hardware degradation (vs. single spike = transient noise) +- **Classifier output**: `INTERFACE → top_feature=rx_err_gradient → HARDWARE_DEGRADATION → root cause: P1/eth2` + +**Traditional tools**: ⚠️ Late alarm. Fires only when CRC errors exceed absolute SNMP thresholds (~45+ minutes into the fault). By then, SLA is already breached for d1 and d1-red customers. 
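The trajectory test (steady rise versus single spike) can be approximated very simply. The sketch below is an assumption-laden illustration, not the scoring logic from rca.md: it labels a series of per-cycle anomaly scores by the fraction of consecutive increases, and the 0.8 cut-off is an arbitrary illustrative value.

```python
def classify_trajectory(scores, min_rising_fraction=0.8):
    """Label a sequence of anomaly scores from successive inference cycles.

    Returns "sustained_rise" when most consecutive steps increase,
    "transient" otherwise, and "insufficient_data" for fewer than 2 points.
    """
    steps = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if not steps:
        return "insufficient_data"
    rising_fraction = sum(step > 0 for step in steps) / len(steps)
    return "sustained_rise" if rising_fraction >= min_rising_fraction else "transient"


print(classify_trajectory([0.10, 0.15, 0.20, 0.30, 0.45, 0.60]))
print(classify_trajectory([0.10, 0.10, 0.90, 0.10, 0.10, 0.10]))
```

A degrading SFP produces the first pattern across several 5-minute cycles; a one-off burst of errors produces the second and should not raise a hardware alert.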
+ +--- + +### Fault 5b — Link Down (Physical) + +| Property | Value | +|---|---| +| **Type** | `LINK_DOWN` | +| **Target** | Any router interface — e.g., P1/eth2 (P1-P3 link) | +| **Alarms generated** | ✅ Immediate (SNMP linkDown trap < 1s) | +| **Severity** | Immediate traffic rerouting or blackout | + +**GNN role for physical link-down**: Traditional SNMP wins on detection speed. The GNN's value here is: +1. **Alarm suppression**: One link down cascades into OSPF adjacency failures, BGP session drops, and prefix withdrawals. The GNN issues one root-cause alert rather than N cascade alarms. +2. **Blast radius analysis**: The graph structure shows exactly which flows are affected and which have backup paths. +3. **Disambiguation**: Distinguishes a single link failure from a router failure (which would take down all links simultaneously). + +**GNN detection**: +- `interface` node: `state` → 0.0; `tx_util` → 0.0; `rx_util` → 0.0 +- Connected router node: `ospf_num_routes` drops; `pfx_count_norm` changes +- **Classifier output**: `INTERFACE → top_feature=state → INTERFACE_DOWN on [router]/[interface]` + +--- + +### Fault 6 — OSPF Area Mismatch + +| Property | Value | +|---|---| +| **Type** | `OSPF_AREA_MISMATCH` | +| **Target** | P2 / eth2 (Manchester → Leeds, 172.16.60.0/24) — area set to `0.0.0.99` instead of `0.0.0.0` | +| **Alarms generated** | ❌ None (physical link remains UP) | +| **Severity** | OSPF traffic-engineering lost on P2-P4 path | + +**What happens**: P2-P4 OSPF adjacency fails (will not reach Full state). Physical link remains UP and transmitting — only L3 forwarding is affected. MPLS LDP may stay up, but OSPF-computed paths through P2-P4 are lost. 
+ +**Traffic flows affected**: + +| Flow | Impact | Why | +|---|---|---| +| `d3-blue-to-hub` (Huddersfield/PE4 → hub) | ⚠️ Rerouted | PE4→P4→[no P2 OSPF]→must detour via P3→P1→PE2 or P3→PE2 | +| `d3-red-to-d4-red-udp` | ⚠️ Rerouted | PE3→PE4 path normally via P2-P4 direct; now detours | +| `d2-red-d4-red-bidir` | ⚠️ Rerouted | PE2↔PE4 diagonal return path affected | +| BLUE d1 flows (PE1↔PE2) | ✅ Mostly unaffected | PE1↔PE2 via P1 or P3 direct; doesn't require P2-P4 | +| RED d1-d2 flows (PE1↔PE2) | ✅ Unaffected | PE1→P1→PE2 direct path | + +**Key nuance**: PE3 (Brighton) has **two uplinks** — to P4 (Leeds) and to P2 (Manchester) directly. Even if P2-P4 adjacency fails, PE3 can still reach P2 directly via its own PE3-P2 link (172.16.170.0/24). So PE3-sourced traffic is less affected than PE4-sourced traffic (which must route to P2 via P2-P4 or P4-P3-P1-P2 detour). + +**GNN detection**: +- P2 router node: `ospf_num_routes` drops (SPF tree loses P4's LSAs) +- P2/eth2 interface node: `tx_util ≈ 0.0` despite `state=UP` — link transmitting but carrying no routed traffic +- D-GAT: `ospf_peer` edge P2↔P4 shows anomalous OSPF state while physical link state=UP (cross-layer mismatch) +- **Classifier output**: `ROUTER → ospf_num_routes drop + INTERFACE tx_util=0 on UP link → OSPF_AREA_MISMATCH on P2/eth2` + +**Traditional tools**: ❌ No alarm. Physical link UP. BGP Established. Unless OSPF adjacency state is explicitly monitored (non-default in most NMS tools), this is invisible. Models an operator copy-paste error during a maintenance window. + +**GNN advantage**: The cross-layer contradiction — L1 says "UP", L3 says "no routes via this link" — is precisely what the GNN's multi-layer graph captures and traditional single-layer monitoring cannot. 
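The cross-layer contradiction can be written down as an explicit rule to make the idea concrete. This is a hand-written illustration, not how the GNN computes it: the field names mirror the `state` and `tx_util` features used above, while the record layout and the idle threshold are assumptions.

```python
def idle_but_up(interfaces, idle_util=0.01):
    """Return names of interfaces that are UP at L1 yet carry almost no traffic.

    interfaces: dict mapping "router/iface" to a feature dict with
    "state" (1.0 = UP, 0.0 = DOWN) and "tx_util" (0.0 to 1.0).
    """
    return sorted(
        name
        for name, feats in interfaces.items()
        if feats["state"] == 1.0 and feats["tx_util"] <= idle_util
    )


snapshot = {
    "p2/eth2": {"state": 1.0, "tx_util": 0.0},   # Fault 6 target
    "p2/eth1": {"state": 1.0, "tx_util": 0.35},
    "p1/eth2": {"state": 0.0, "tx_util": 0.0},   # genuinely down, not a contradiction
}
print(idle_but_up(snapshot))
```

A fixed rule like this needs a per-link notion of "normal" utilisation to avoid false positives on lightly used links, which is the part the learned baseline is meant to supply.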
+ +--- + +### Fault 7 — Duplicate IP Address + +| Property | Value | +|---|---| +| **Type** | `DUPLICATE_IP` | +| **Target** | P3 / eth3 (Edinburgh → Leeds, 172.16.50.0/24) — duplicate of P1/eth1 address `172.16.30.1` | +| **Alarms generated** | ❌ None | +| **Severity** | Intermittent black-holing on P3-P4 transit paths | + +**What happens**: P3 (Edinburgh) claims the IP address `172.16.30.1/24` which legitimately belongs to P1 (London) on the P1-P2 link. P3 sends gratuitous ARPs claiming this address on the P3-P4 segment. P4 (Leeds) receives conflicting ARP entries. Any traffic P4 forwards toward `172.16.30.1` may be misdirected to P3 instead of P1, depending on which ARP entry is cached at any given moment. + +**Traffic flows affected** (intermittently): + +| Flow | Impact | Why | +|---|---|---| +| `d3-blue-to-hub` | ⚠️ Intermittent | PE4→P4→P2→P1→PE2 path: P4 may misdirect 172.16.30.1-bound traffic | +| `d3-red-to-d4-red-udp` | ⚠️ Intermittent | PE3→P4→PE4: P4 ARP confusion | +| `d2-red-d4-red-bidir` | ⚠️ Intermittent | PE4 as endpoint; P4 as transit | +| d1/d2 BLUE and d1-d2 RED flows | ✅ Mostly unaffected | Primarily use P1-P2 direct paths, not P3-P4 segment | + +**The intermittent pattern**: Impact is worst immediately after P3 sends gratuitous ARPs. It fades as ARP entries timeout (20–30 minutes), then returns on the next gratuitous ARP cycle. This creates a cycling availability pattern — extremely hard to diagnose because pings often succeed when the NOC investigates. + +**GNN detection**: +- `session_uptime_norm` on BGP sessions belonging to P4-adjacent routers oscillates — sessions reset as routing breaks intermittently +- `prefix_count_delta` oscillates as routes withdraw and return +- Two routers show anomalous sessions with overlapping peer IP space +- **Classifier output**: `BGP_SESSION → session_uptime_norm low on 2+ routers → IP_OVERLAP → rogue session on P3` + +**Traditional tools**: ❌ Zero alarms. 
ARP table changes are not surfaced by standard NMS. The intermittent pattern means test pings frequently succeed during investigation. + +**GNN advantage**: The GNN infers ARP-level conflicts from their effect on BGP session stability — an indirect signal that no SNMP MIB directly exposes. + +--- + +### Fault 8 — TX Queue Starvation ⭐ NEW — Silent Performance + +| Property | Value | +|---|---| +| **Type** | `TXQUEUE_STARVATION` (new type) | +| **Target** | PE2 / eth2 (Cambridge → Edinburgh, 172.16.110.0/24) — `txqueuelen` 1000 → 20 | +| **Alarms generated** | ❌ None | +| **Severity** | 30–60% hub throughput loss on Edinburgh-bound paths | + +**What happens**: The Linux transmit queue length on PE2's P3-facing uplink is reduced from the default 1000 packets to 20 via `ip link set eth2 txqueuelen 20`. This is a kernel parameter — it does not appear in VyOS running-config, is not stored in the VyOS commit history, and is not captured by any configuration management or GitOps system. Under the aggregate hub traffic load, the 20-packet queue fills and overflows thousands of times per second. + +**Why PE2/eth2 is the highest-impact target**: PE2 (Cambridge) is the hub router for BLUE VPN. PE2/eth2 is the P3 (Edinburgh) uplink — traffic from PE2 toward Edinburgh-routed paths (PE1 via P3, and ECMP-distributed hub downloads) exits here. 
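For reference, the kernel value can be read from sysfs and normalised against the default. The sketch below assumes `tx_queue_len_norm` is the ratio to the 1000-packet default, capped at 1.0; that formula is inferred from the 0.02 figure quoted earlier and is an assumption, as are the function names. The sysfs path is the standard Linux location for this attribute.

```python
from pathlib import Path

DEFAULT_TXQUEUELEN = 1000  # default queue length stated above


def read_txqueuelen(ifname: str) -> int:
    """Read the current transmit queue length of a Linux interface from sysfs."""
    return int(Path(f"/sys/class/net/{ifname}/tx_queue_len").read_text().strip())


def tx_queue_len_norm(txqueuelen: int, baseline: int = DEFAULT_TXQUEUELEN) -> float:
    """Normalise a queue length against the baseline, capped at 1.0 (assumed formula)."""
    return min(1.0, txqueuelen / baseline)


print(tx_queue_len_norm(20))
```

Collecting this value per interface is what exposes the fault: it never appears in the VyOS configuration, so only a telemetry agent reading the kernel state directly can report it.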
+ +**Traffic flows affected**: + +| Flow | Time of Day | Impact | Mechanism | +|---|---|---|---| +| `d1-hub-tcp-bidir` (hub → d1 downloads) | 14:00 UTC peak (70 Mbps) | 🔴 Severe — 30–60% throughput loss | Hub distributes downloads via PE2/eth2 if ECMP routes Oxford-bound traffic via P3 | +| `d2-hub-udp-bidir` (hub → d2, **8 Mbps constant**) | 24/7 | ⚠️ Detectable — constant baseline makes deviations unambiguous | Any congestion on PE2/eth2 immediately shows in `jitter_ms` and `packet_loss_pct` of the constant stream | +| `d2-to-hub-udp` return path | Business hours | ⚠️ Congested | Hub response traffic exits PE2/eth2 toward Liverpool-bound path | +| BLUE d3 hub traffic | Business hours | ⚠️ Congested | If routed via P3 from PE2 | + +**Time-of-day pattern**: Queue starvation is worst during 09:00–17:00 UTC when multiple flows compete for PE2/eth2. At 02:00 UTC (5 Mbps overnight minimum), the 20-packet queue overflows less frequently — anomaly score may fall below threshold briefly, making the fault appear "intermittent" to traditional monitoring even if it were detectable. + +**GNN detection**: +- `interface` node PE2/eth2: `tx_drops` → high; `tx_util` → high; while P3/eth1 (connected interface) `rx_util` → lower than expected +- 2-hop message passing detects the transmit/receive utilisation asymmetry across the `connected_to` edge +- Reconstruction error concentrated on PE2/eth2 `interface` node +- **Classifier output**: `INTERFACE → top_feature=tx_drops → TX_QUEUE_STARVATION on PE2/eth2` + +**Traditional tools**: ❌ `txqueuelen` is a kernel parameter, not in any config management system. `ifOutDiscards` SNMP counter increases, but absolute thresholds are calibrated for hardware failure rates, not kernel config drift. During business-hours peak, drops are significant but masked by traffic variability. + +**GNN advantage**: This fault exists entirely outside every configuration management system. No config audit, no VyOS diff, no compliance tool can detect it. 
The GNN detects it purely from telemetry behaviour — the only observable signal this fault produces. + +--- + +### Fault 9 — OSPF Interface Cost Inflation ⭐ NEW — Silent Performance + +| Property | Value | +|---|---| +| **Type** | `OSPF_COST_INFLATION` (new type) | +| **Target** | P2 / eth1 (Manchester → London, 172.16.30.2) — OSPF cost 1 → 65535 | +| **Alarms generated** | ❌ None | +| **Severity** | 3-hop detour for Brighton/Cardiff traffic; 100 Mbps P3-PE2 link becomes congestion point | + +**What happens**: The OSPF interface cost on P2's eth1 (P2-to-P1 link, Manchester side) is changed from 1 to 65535. The OSPF adjacency on P2/eth1 remains **Full** — cost changes never break adjacencies. OSPF SPF recalculates: the P2→P1 direction is now prohibitively expensive. Traffic that previously used the direct P2→P1 path reroutes via P2→P4→P3→P1. + +**OSPF path change** (P2→P1 direction only — asymmetric): + +| Before | After | +|---|---| +| PE3 → P2 → P1 → PE2 (3 hops, cost 3) | PE3 → P4 → P3 → PE2 (3 hops, cost 3; the old path now costs 65537) | +| PE4 → P2 → P1 → PE2 (3 hops, cost 3) | PE4 → P4 → P3 → PE2 (3 hops, cost 3; the old path now costs 65537) | +| P2 → P1 (direct, cost 1) | P2 → P4 → P3 → P1 (detour, cost 3; the direct link now costs 65535) | + +**Critical bottleneck created**: The P3-PE2 link (172.16.110.0/24) is a **100 Mbps link**. With P2/eth1 cost inflated, traffic from PE3 (Brighton) and PE4 (Cardiff) returning toward PE2 (Cambridge) reroutes through P4→P3→PE2, converging on this single 100 Mbps link. BLUE VPN hub downloads (up to 70 Mbps) and RED VPN diagonal traffic (up to 45 Mbps) may both reroute through it simultaneously.
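The reroute can be reproduced with a small shortest-path calculation over the links listed in the Physical Topology tables. This is a sketch under stated assumptions: every interface has OSPF cost 1 unless overridden, the route reflectors are excluded from transit, and only one shortest path is returned (equal-cost alternatives are not enumerated). The helper names are hypothetical.

```python
import heapq

# P and PE links from the topology tables (RR links omitted: assumed non-transit).
LINKS = [
    ("p1", "p2"), ("p1", "p3"), ("p2", "p4"), ("p3", "p4"),
    ("p1", "pe1"), ("p3", "pe1"), ("p1", "pe2"), ("p3", "pe2"),
    ("p4", "pe3"), ("p2", "pe3"), ("p2", "pe4"), ("p4", "pe4"),
]


def build_costs(overrides=None):
    """Directed cost map: cost 1 per direction, plus per-direction overrides."""
    costs = {}
    for a, b in LINKS:
        costs[(a, b)] = 1
        costs[(b, a)] = 1
    costs.update(overrides or {})
    return costs


def shortest_path(costs, src, dst):
    """Dijkstra over directed costs; returns (total_cost, path) or None."""
    queue = [(0, src, [src])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for (a, b), link_cost in costs.items():
            if a == node and b not in settled:
                heapq.heappush(queue, (cost + link_cost, b, path + [b]))
    return None


healthy = build_costs()
faulty = build_costs({("p2", "p1"): 65535})  # Fault 9: P2/eth1 outbound cost only

print(shortest_path(healthy, "p2", "p1"))
print(shortest_path(faulty, "p2", "p1"))
print(shortest_path(faulty, "pe3", "pe2"))
```

Because the override applies to the P2→P1 direction only, the adjacency and the reverse direction are unaffected, yet every path that used P2/eth1 outbound shifts onto P4→P3, which is where the 100 Mbps P3-PE2 link becomes the shared bottleneck.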
+ +**Traffic flows affected**: + +| Flow | Before Fault | After Fault | Impact | +|---|---|---|---| +| `d2-to-hub-udp` (Liverpool/PE3 → Cambridge/PE2) | PE3→P2→P1→PE2 (direct) | PE3→P4→P3→PE2 (via 100 Mbps P3-PE2) | ⚠️ Higher latency; potential congestion | +| `d2-hub-udp-bidir` reverse (hub → d2, constant) | PE2→P1→P2→PE3 | PE2→P3→P4→PE3 (P3-PE2 now bidirectional) | 🔴 Congestion + jitter on constant 8 Mbps — clean signal | +| `d3-blue-to-hub` (Huddersfield/PE4 → hub) | PE4→P2→P1→PE2 | PE4→P4→P3→PE2 | ⚠️ Increased latency | +| `d2-red-d4-red-bidir` return | PE2→P1→P2→PE4 | PE2→P3→P4→PE4 | ⚠️ Return path via congested P3-PE2 link | +| `d3-red-to-d4-red-udp` | PE3→P2→P4→PE4 | PE3→P4→PE4 (shorter!) | ✅ Slightly improved | +| BLUE d1 flows (PE1↔PE2) | PE1→P1→PE2 (direct) | ✅ Unchanged | Doesn't traverse P2 Manchester | +| RED d1-d2 flows (PE1↔PE2) | PE1→P1→PE2 | ✅ Unchanged | Doesn't traverse P2 Manchester | + +**Cross-VPN impact**: OSPF is shared infrastructure across both VPNs. One P-router misconfiguration degrades both BLUE and RED VPN simultaneously. + +**GNN detection (multi-node)**: +- **P2/eth1 interface**: `tx_util ≈ 0.0` despite `state=UP` and OSPF Full adjacency — never seen during training. **Highest individual anomaly score.** +- **P3/eth1 (P3-PE2 link) interface**: `tx_util` and `rx_util` elevated above training baseline — congestion +- **P4/eth3 (P3-P4 link) interface**: elevated from rerouted traffic +- **Router nodes P3, P4**: elevated `cpu` from increased forwarding load +- **Classifier output**: `INTERFACE → state=UP + OSPF_Full + tx_util≈0 on P2/eth1 → OSPF_COST_INFLATION` + +**Traditional tools**: ❌ OSPF cost 65535 is a legal configured value. All adjacencies Full. All interfaces UP. BGP Established. Identifying root cause requires manually running `show ip ospf interface` on every P router — impractical without the GNN's graph-wide visibility. 
+
+**GNN advantage**: The GNN knows that a Full OSPF adjacency on a core link should carry traffic proportional to its position in the topology. P2/eth1 with zero traffic but Full adjacency is a contradiction the model has never seen — the reconstruction error reveals the anomaly without any operator intervention.
+
+---
+
+### Fault 10 — BGP Update Storm / CPU Resource Exhaustion ⭐ RECOMMENDED GAP-FILLER
+
+| Property | Value |
+|---|---|
+| **Type** | Candidate new type: `BGP_UPDATE_STORM` |
+| **Target** | RR1 (Birmingham) — controlled route-flap injection from a test peer |
+| **Alarms generated** | ❌ None (BGP sessions remain Established) |
+| **Severity** | CPU saturation on RR1 and P-routers; potential keepalive delays |
+
+**Why this fault is needed**: The GNN RCA classifier contains the branch `CASE dominant layer = 'router': IF top_feature IN ('cpu', 'mem') → RESOURCE_EXHAUSTION` — but **no current fault exercises the `cpu` and `mem` features**. Without a training/validation scenario for this path, the classifier branch is implemented but unvalidated.
+
+**What happens**: A controlled BGP peer connected to RR1 advertises ~10,000 prefixes with rapid withdraw/re-advertise cycles. RR1's bgpd process CPU spikes to 70–80%. P-routers receiving frequent route updates via RR1 spend excessive time on SPF recalculation, competing with packet forwarding interrupt handling.
+
+**Traffic impact**: At 70–80% CPU on RR1, BGP keepalive processing is delayed. If hold-timers are tight (default 90s hold/30s keepalive), sessions may reset. At moderate levels, forwarding latency increases for all transit traffic.
+
+**GNN detection**:
+- RR1 `router` node: `cpu` feature spikes to values never seen in training
+- `ospf_num_routes` fluctuates as updates compete with forwarding
+- **Classifier output**: `ROUTER → top_feature=cpu → RESOURCE_EXHAUSTION on rr1`
+
+**Traditional tools**: ❌ BGP sessions remain Established. CPU load is visible through the SNMP `hrProcessorLoad` object (HOST-RESOURCES-MIB), but most NMS tools don't correlate CPU spikes with specific service impact without additional rules.
+
+---
+
+### Fault 11 — Cross-VPN Route Leak ⭐ RECOMMENDED GAP-FILLER
+
+| Property | Value |
+|---|---|
+| **Type** | Candidate new type: `VRF_RT_EXPORT_LEAK` |
+| **Target** | PE1 (Oxford) — BLUE_SPOKE VRF accidentally exports with RED VPN RT |
+| **Alarms generated** | ❌ None |
+| **Severity** | RED VPN traffic attracted to BLUE spoke prefixes; potential misdirection |
+
+**Why this fault is needed**: The RED and BLUE VPNs coexist with separate VRFs and RT policies, but **no fault exercises the interaction between them**. A realistic operator error during a maintenance window (adding the wrong RT to an export policy) causes routes from one VPN to appear in the other's RIB.
+
+**What happens**: PE1's BLUE_SPOKE VRF export RT is extended to include the RED VPN export RT. PE1 now advertises BLUE spoke routes (10.100.1.0/24) into the RED VPN's RIB. RED VPN devices that have routes overlapping with BLUE spoke prefixes may have their traffic misdirected.
+
+**Traffic impact**: RED VPN d1-red (Norwich/PE1) traffic to any destination matching the leaked prefix gets misdirected. Variable packet loss depending on prefix overlap.
+
+**GNN detection**:
+- `pfx_count_norm` on RED VPN BGP sessions toward PE1 increases anomalously (more prefixes than the model learned for RED sessions)
+- HetGNN isolates the deviation to PE1's bgp_session nodes with anomalous prefix counts
+- **Classifier output**: `BGP_SESSION → pfx_count_norm above baseline on RED sessions at PE1 → VRF_RT_EXPORT_LEAK on PE1`
+
+**Traditional tools**: ❌ A higher route count on a BGP session typically raises no alarm. The leak stays completely silent until a customer complains.
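The classifier outputs quoted for Faults 10 and 11 can be sketched as two branches of a small rule function. This is a hedged Python illustration of those two branches only: the function name `classify`, the signed `deviation` argument, and the `UNCLASSIFIED` fallback are hypothetical and are not taken from the actual RCA classifier.

```python
def classify(dominant_layer: str, top_feature: str, deviation: float) -> str:
    """Map the dominant anomalous layer and its top feature to a candidate fault type.

    `deviation` is assumed to be the signed distance of the feature from its
    learned baseline (positive means above baseline).
    """
    # Fault 10 branch: router node dominated by cpu or mem.
    if dominant_layer == "router" and top_feature in ("cpu", "mem"):
        return "RESOURCE_EXHAUSTION"
    # Fault 11 branch: BGP session carrying more prefixes than learned.
    if (dominant_layer == "bgp_session"
            and top_feature == "pfx_count_norm"
            and deviation > 0):
        return "VRF_RT_EXPORT_LEAK"
    # Anything else is outside the two branches sketched here.
    return "UNCLASSIFIED"


print(classify("router", "cpu", 3.2))
print(classify("bgp_session", "pfx_count_norm", 1.5))
print(classify("bgp_session", "pfx_count_norm", -1.5))
```

The third call returns the fallback on purpose: a prefix count below baseline is a different signature (the comparison table lists a `pfx_count` drop for the wrong-RT fault on PE3), so it should not be labelled as an export leak.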
+
+---
+
+## Consolidated Comparison Table
+
+| # | Fault | Target | Traffic Flows Affected | Alarms | Traditional Detection | GNN Detection Time | GNN Advantage |
+|---|---|---|---|---|---|---|---|
+| 1 | MTU Mismatch | PE1/eth1 (Oxford→London) | d1 BLUE TCP, d1-red TCP | ❌ | Manual MTU probe — hours | 5 min (tx/rx asymmetry) | Passive; no probing needed |
+| 2 | Hub CE Session Down | PE2 ↔ ce1-hub (Nottingham) | ALL BLUE VPN — total blackout | ✅ BGP trap | Seconds | 5 min + cascade suppression | Reduces ~4 VRF alarms → 1 |
+| 3 | RR1 Crash | RR1 (Birmingham) | ALL traffic — 90s disruption | ✅ 4+ BGP traps | Seconds (noisy) | 5 min, 1 RR alert | **4× alarm reduction; 50× at scale** |
+| 4 | Wrong RT on PE3 | PE3 BLUE_SPOKE VRF | d2 BLUE UDP — silent isolation | ❌ | Customer complaint | 5 min (pfx_count drop on PE3) | Zero-alarm scenario: only GNN |
+| 5 | Degrading SFP | P1/eth2 (London↔Edinburgh) | d1 TCP, d1-red diagonal | ⚠️ 45+ min late | 45+ min after degradation | **t=30 min — 35 min earlier** | Proactive vs. reactive |
+| 5b | Link Down | Any P/PE interface | Depends on redundancy | ✅ Immediate | < 1 second | 5 min | RCA suppression + blast radius |
+| 6 | OSPF Area Mismatch | P2/eth2 (Manchester↔Leeds) | PE4 flows; some PE3 flows | ❌ | Never | 5 min (cross-layer L1/L3) | Cross-layer correlation |
+| 7 | Duplicate IP | P3/eth3 (Edinburgh→Leeds) | Intermittent P4-transit flows | ❌ | Never | 5 min (session_uptime oscillation) | Infers ARP conflict from BGP |
+| **8** | **TX Queue Starvation** | **PE2/eth2 (Cambridge→Edinburgh)** | **Hub downloads; d2 constant UDP** | **❌** | **Never (not in config)** | **5 min (tx_drops + asymmetry)** | **Kernel param: only GNN can detect** |
+| **9** | **OSPF Cost Inflation** | **P2/eth1 (Manchester→London)** | **d2/d3 BLUE; d3/d4 RED diagonal** | **❌** | **Never (legal config value)** | **5 min (tx_util≈0 on Full link)** | **Legal-but-wrong: only GNN** |
+| *(10)* | *(BGP Update Storm)* | *(RR1 Birmingham)* | *(All flows — CPU pressure)* | *❌* | *CPU MIB only — no service correlation* | *5 min (cpu feature)* | *Validates RESOURCE_EXHAUSTION path* |
+| *(11)* | *(Cross-VPN Route Leak)* | *(PE1 Oxford — BLUE/RED VRF)* | *(RED VPN d1-red misdirection)* | *❌* | *Never (more routes ≠ alarm)* | *5 min (pfx_count anomaly)* | *Multi-VPN interaction: only GNN* |
+
+*Faults 10 and 11 are recommended additions to fill identified GNN feature coverage gaps.*
+
+---
+
+## Why GNN Is Superior: Three Core Principles
+
+### 1. Behavioural Baseline, Not State Transitions
+
+Traditional tools alarm when something **transitions** from a known-good state to a known-bad state (UP → DOWN, Established → Idle). This is powerful for binary failures but completely blind to misconfigurations that remain in valid states.
+
+The GNN alarms when **behaviour deviates** from the learned normal pattern, regardless of protocol state. Faults 1, 4, 6, 7, 8, and 9 all maintain perfectly healthy protocol states — the GNN is the only system that can detect them.
+
+### 2. Graph-Wide Simultaneous Inference
+
+When Fault 9 (OSPF cost inflation on P2/eth1) causes P2/eth1 `tx_util ≈ 0`, congestion on P3/eth1, elevated CPU on P3 and P4, and latency increases on PE3/PE4 — traditional monitoring tools see each of these as independent signals. A NOC engineer must manually correlate "P2 link traffic dropped" + "P3 link congestion" + "PE3 customers slow" and connect them to a single cause.
+
+The GNN sees all nodes simultaneously in one inference pass. The `ospf_peer` edge structure tells the model exactly which router-router pairs are OSPF adjacencies — allowing it to reason that P2/eth1 with zero traffic but a Full adjacency is anomalous relative to the graph topology, and that the congestion pattern on the detour path is the expected consequence.
+
+### 3. Temporal Feature Engineering — Trends, Not Snapshots
+
+`rx_err_gradient` (hardware degradation) and `session_uptime_norm` (IP conflict flapping) encode *rates of change* and *normalised ages* — features that capture trends over time, not just the current state. Traditional monitoring systems compare each poll to a static threshold. The GNN's gradient features enable it to detect "this metric is getting worse at an accelerating rate" and alert proactively 35+ minutes before the failure becomes severe.
+
+The **constant 8 Mbps monitoring push** in both BLUE (`d2-hub-udp-bidir` reverse) and RED (`d2-red-d4-red-bidir` reverse) traffic tests is specifically designed to serve as a high-sensitivity GNN training signal. Any deviation from a perfectly flat 8 Mbps UDP stream is immediately anomalous — providing a clean, unambiguous signal for any fault that degrades the hub-to-spoke or diagonal-mesh return paths.
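
As a rough illustration of trend features versus static thresholds, the sketch below computes a finite-difference gradient over successive polls, checks whether that gradient keeps growing, and measures how far a nominally flat 8 Mbps stream strays from its expected rate. It is a minimal Python sketch under stated assumptions: the function names, the 300-second poll interval, and the sample values are hypothetical, and the real `rx_err_gradient` feature may be derived differently.

```python
def gradient(samples: list[float], poll_interval_s: float) -> list[float]:
    """Finite-difference rate of change between consecutive polls."""
    return [(b - a) / poll_interval_s for a, b in zip(samples, samples[1:])]


def is_accelerating(samples: list[float], poll_interval_s: float) -> bool:
    """True when each successive gradient is strictly larger than the previous one."""
    g = gradient(samples, poll_interval_s)
    return len(g) >= 2 and all(later > earlier for earlier, later in zip(g, g[1:]))


def flat_stream_deviation(observed_mbps: list[float],
                          expected_mbps: float = 8.0) -> float:
    """Largest absolute departure from the expected constant rate."""
    return max(abs(x - expected_mbps) for x in observed_mbps)


# Illustrative cumulative rx error counts, one value per assumed 300 s poll.
rx_errors = [0.0, 2.0, 6.0, 14.0, 30.0]
print(is_accelerating(rx_errors, poll_interval_s=300.0))

# Illustrative throughput of the constant monitoring stream, in Mbps.
print(flat_stream_deviation([8.0, 8.0, 6.5]))
```

The design point is that every individual error count above could sit below a static alarm threshold while the doubling gradient already shows the trend. In the same way, any nonzero deviation on a stream that is supposed to be flat is informative without needing a tuned threshold.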