networkdesign/failure-analysis.md
2026-06-05 10:51:07 +00:00

47 KiB
Raw Permalink Blame History

L3VPN Failure Analysis — GNN Root Cause Analysis vs. Traditional Fault Management

This document analyses all fault injection scenarios for the telco-lab L3VPN network, grounded in the actual topology and traffic descriptors, and explains how the GNN-based RCA approach compares with traditional fault management for each failure type.

For the GNN model definition — graph schema, node features, edge types, model architecture, training pipeline, and fault classification algorithm — see rca.md.


GNN Value Summary

# Fault Impact Traditional Approach GNN Approach Value
1 MTU Mismatch Large packets silently dropped on PE1 uplink; TCP sessions stall intermittently with no visible cause No alarm fires. Requires manual ping -s 1400 probing from the correct vantage — typically triggered hours after customer complaint Detects tx_util/rx_util asymmetry across the connected-interface edge in 5 min; packet_loss_pct on flow nodes confirms customer impact 🟡 Passive detection with no probing
2 Hub CE Session Down Total BLUE VPN blackout — all spoke-to-hub traffic fails immediately BGP trap fires instantly but generates 34 cascade VRF alarms; NOC must manually determine a single root cause Common-path analysis collapses cascade to 1 root-cause alert; vrf_active_sessions and active_sessions_norm on flow nodes quantify SLA breach 🟢 Alarm noise reduction — N→1
3 RR1 Process Crash All VPN traffic disrupted for up to 90 seconds during route-reflector reconvergence 4+ simultaneous BGP alarms fire; NOC investigates each PE independently, unaware of the shared RR root cause Common-path analysis across all failing sessions identifies RR1 as the sole cause; at production scale (50 PEs) 50 alarms become 1 🟢 50× alarm reduction at scale
4 Wrong Import RT on PE3 Spoke2 (Liverpool) silently isolated — d2-blue has no route to hub while the hub still reaches d2-blue, creating a deceptive one-way illusion Zero alarms. All sessions Established. Dashboard shows green. Only discovered via customer complaint and manual show vrf audit rt_import_hash on VRF node BLUE_SPOKE@PE3 deviates from trained baseline in 5 min — policy misconfiguration detected directly without any inference 🔴 Zero-alarm silent misconfiguration — only GNN
5 Degrading SFP Gradual CRC error injection simulates a failing optical SFP; retransmissions rise unnoticed for 30+ minutes before routing protocols destabilise Traditional alarm fires at t=4555 min once absolute error thresholds are exceeded — 1525 minutes after SLA breach has begun rx_err_gradient trend crosses anomaly threshold at t=30 min; packet_loss_pct on affected flow nodes rises progressively — proactive alert before any SLA impact 🔴 Predictive — 35 min early warning
5b Link Down Immediate traffic rerouting or blackout depending on redundancy; OSPF and BGP cascade alarms follow Binary alarm fires in < 1 second via SNMP linkDown trap — fastest traditional detection of any fault type Suppresses cascade alarms; (flow, transits) edges identify exactly which customer flows are affected and which recover via backup paths 🟡 Blast radius analysis; RCA suppression
6 OSPF Area Mismatch P2-P4 OSPF adjacency silently fails; physical link UP but L3 paths lost; PE4-originating traffic reroutes over longer detour paths Zero alarms — SNMP sees interface UP, BGP Established. The L3 failure is completely invisible to single-layer monitoring Cross-layer contradiction: ospf_num_routes drops while tx_util≈0 on an UP interface — a signal traditional tools cannot form without correlating two separate data sources 🔴 Cross-layer L1/L3 contradiction — only GNN
7 Duplicate IP P3 claims a P1 address causing ARP cache poisoning on P4; traffic intermittently black-holes on a 2030 minute ARP timeout cycle Zero alarms — ARP table changes are not surfaced by SNMP. Pings frequently succeed when NOC investigates, masking the fault session_uptime_norm oscillation and overlapping peer IP across two BGP session nodes infers the ARP conflict indirectly from its effect on routing stability 🔴 ARP-layer conflict inferred from BGP signal
8 TX Queue Starvation NEW Hub PE2 uplink txqueuelen shrunk to 20 packets (a Linux kernel parameter); queue overflows constantly under load causing 3060% throughput loss on all hub downloads Not detectable by any tool — kernel parameter is absent from VyOS config, GitOps, and all config management systems; ifOutDiscards threshold never triggers tx_queue_len_norm=0.02 directly flags the misconfiguration; corroborated by jitter_norm spike on the constant 8 Mbps hub UDP monitoring flow 🔴 Kernel-layer misconfiguration — only GNN
9 OSPF Cost Inflation NEW OSPF cost set to 65535 on P2-P1 link; all Brighton/Cardiff PE traffic reroutes via 3-hop detour congesting the 100 Mbps P3-PE2 link; OSPF adjacency stays Full Not detectable — cost 65535 is a legal configuration value; all adjacencies Full, all interfaces UP, all BGP sessions Established tx_util≈0 on a Full-adjacency core link is uniquely contradictory; latency_ms_norm spikes and egresses_at edge shifts across multiple flow nodes confirm topology-wide rerouting 🔴 Legal-but-wrong config — only GNN
(10) (BGP Update Storm) RR1 CPU saturated by 10,000-prefix route-flap injection; BGP keepalive processing delayed; forwarding latency increases for all transit traffic CPU visible in SNMP hrProcessorLoad but not correlated to service impact by standard NMS bgp_update_rate spike on RR1 directly flags RESOURCE_EXHAUSTION; validates the cpu/mem RCA classifier branch currently untested by F1F9 🔴 CPU/resource exhaustion — validates GNN classifier
(11) (Cross-VPN Route Leak) BLUE spoke routes accidentally exported into RED VPN RIB; RED VPN traffic to matching prefixes is misdirected; completely silent from control-plane perspective More routes in a session is never alarmed; completely silent until customers complain about misdirected traffic rt_export_hash deviation on VRF node + anomalous leaks_to cross-VPN edge directly detected; the leaks_to edge has never appeared in training — structurally anomalous 🔴 Multi-VPN policy violation — only GNN

Legend: 🔴 = GNN is the only detection path · 🟡 = GNN significantly improves on traditional · 🟢 = GNN reduces alarm noise on top of existing detection

Faults 8 and 9 are new silent-performance misconfigurations introduced specifically to demonstrate GNN capabilities against faults that exist outside all configuration management systems. Faults 10 and 11 are recommended additions to fill GNN feature coverage gaps.


Network Reference

Physical Topology

              RR1 (Birmingham, 10.0.0.1)
             /        \
   P1 (London)  ──  P2 (Manchester)
   10.0.0.3          10.0.0.4
      / \   \           / \
  P3    PE1  PE2     P4   PE3  PE4
(Edinburgh) (Oxford)(Cambridge)(Leeds)(Brighton)(Cardiff)
10.0.0.5  10.0.0.7 10.0.0.8 10.0.0.6 10.0.0.10 10.0.0.11
  |  \                |             |      |       |
 PE2  PE1           P3             PE3   PE4      P2
      RR2 (Bristol, 10.0.0.2) connects P3 + P4

Provider Core (1 Gbps links):

Link Subnet P Router Interface P Router Interface
P1 ↔ P2 172.16.30.0/24 p1 (London) eth1 p2 (Manchester) eth1
P1 ↔ P3 172.16.40.0/24 p1 (London) eth2 p3 (Edinburgh) eth2
P2 ↔ P4 172.16.60.0/24 p2 (Manchester) eth2 p4 (Leeds) eth2
P3 ↔ P4 172.16.50.0/24 p3 (Edinburgh) eth3 p4 (Leeds) eth3
P1 ↔ RR1 172.16.10.0/24 p1 (London) eth4 rr1 (Birmingham) eth2
P2 ↔ RR1 172.16.20.0/24 p2 (Manchester) eth3 rr1 (Birmingham) eth1
P3 ↔ RR2 172.16.70.0/24 p3 (Edinburgh) eth4 rr2 (Bristol) eth2
P4 ↔ RR2 172.16.80.0/24 p4 (Leeds) eth1 rr2 (Bristol) eth1

PE Uplinks (100 Mbps, all PE routers are dual-homed):

Link Subnet P Router Interface PE Router Interface
P1 ↔ PE1 172.16.90.0/24 p1 (London) eth3 pe1 (Oxford) eth1
P3 ↔ PE1 172.16.160.0/24 p3 (Edinburgh) eth5 pe1 (Oxford) eth4
P1 ↔ PE2 172.16.100.0/24 p1 (London) eth5 pe2 (Cambridge) eth1
P3 ↔ PE2 172.16.110.0/24 p3 (Edinburgh) eth1 pe2 (Cambridge) eth2
P4 ↔ PE3 172.16.140.0/24 p4 (Leeds) eth4 pe3 (Brighton) eth1
P2 ↔ PE3 172.16.170.0/24 p2 (Manchester) eth5 pe3 (Brighton) eth4
P2 ↔ PE4 172.16.150.0/24 p2 (Manchester) eth4 pe4 (Cardiff) eth1
P4 ↔ PE4 172.16.180.0/24 p4 (Leeds) eth5 pe4 (Cardiff) eth4

VPN Services

BLUE VPN — Hub-and-Spoke:

Device Site PE Router CE Router LAN Subnet
dh-blue (10.100.2.10) Nottingham (Hub) PE2 (Cambridge) ce1-hub 10.100.2.0/24
d1-blue (10.100.1.10) Sheffield (Spoke1) PE1 (Oxford) ce1-spoke 10.100.1.0/24
d2-blue (10.100.3.10) Liverpool (Spoke2) PE3 (Brighton) ce2-spoke 10.100.3.0/24
d3-blue (10.100.4.10) Huddersfield (Spoke3) PE4 (Cardiff) ce3-spoke 10.100.4.0/24

RED VPN — Any-to-Any Mesh:

Device Site PE Router CE Router LAN Subnet
d1-red (10.101.1.10) Norwich PE1 (Oxford) ce1-red 10.101.1.0/24
d2-red (10.101.2.10) Coventry PE2 (Cambridge) ce2-red 10.101.2.0/24
d3-red (10.101.3.10) Plymouth PE3 (Brighton) ce3-red 10.101.3.0/24
d4-red (10.101.4.10) Leicester PE4 (Cardiff) ce4-red 10.101.4.0/24

Route Target Policy (BLUE VPN Hub-and-Spoke)

Role PE Router VRF Export RT Import RT
Spoke PE1 (Oxford) BLUE_SPOKE 65035:1011 65035:1030
Spoke PE3 (Brighton) BLUE_SPOKE 65035:1011 65035:1030
Spoke PE4 (Cardiff) BLUE_SPOKE 65035:1011 65035:1030
Hub PE2 (Cambridge) BLUE_HUB 65035:1030 65035:1011, 65035:1030

The hub imports both spoke routes (65035:1011) and its own routes (65035:1030) to enable spoke-to-spoke traffic via the hub. Fault 4 changes PE3's import RT to 65035:9999, breaking this policy.

Traffic Tests

BLUE VPN Traffic Tests (l3vpn-blue-test.yaml):

Flow ID Protocol Direction Rate Profile Peak
d1-to-hub-tcp TCP d1-blue → dh-blue multi_sine (daily + weekly cycle) ~45 Mbps at 14:00 UTC
d2-to-hub-udp UDP d2-blue → dh-blue schedule (business hours) ~45 Mbps at 13:30 UTC
d1-hub-tcp-bidir TCP d1-blue ↔ dh-blue upload: multi_sine; download: multi_sine (heavier) Upload ~45 Mbps, Download ~70 Mbps
d2-hub-udp-bidir UDP d2-blue ↔ dh-blue upload: schedule; download: 8 Mbps constant Upload ~45 Mbps, Hub push constant

RED VPN Traffic Tests (l3vpn-red-test.yaml):

Flow ID Protocol Direction Rate Profile Peak
d1-red-to-d2-red-tcp TCP d1-red → d2-red multi_sine ~45 Mbps
d3-red-to-d4-red-udp UDP d3-red → d4-red schedule (business hours) ~45 Mbps
d1-red-d3-red-bidir TCP d1-red ↔ d3-red diagonal upload: multi_sine; download: heavier Upload ~45 Mbps, Download ~70 Mbps
d2-red-d4-red-bidir UDP d2-red ↔ d4-red diagonal upload: schedule; download: 8 Mbps constant Upload ~45 Mbps, Hub push constant

Key signal: The d2-hub-udp-bidir and d2-red-d4-red-bidir hub-to-spoke reverse directions run a constant 8 Mbps UDP stream 24/7. Any fault that degrades this path produces a clean, unambiguous signal with no natural rate variation to hide behind — making it the most sensitive GNN training signal in the test suite.


GNN model details — for the full graph schema (node types, features, edge types), model architecture, training pipeline, inference scoring, and fault classification algorithm, see rca.md.


Before the per-fault analysis, this question deserves a direct answer.

For a clean physical link-down (cable pull, port failure), traditional SNMP gives a faster first alarm (< 1 second vs. 5-minute GNN inference cycle). However the GNN is superior in three specific scenarios:

Scenario Traditional NMS GNN
First detection of binary link failure < 1 second (SNMP linkDown trap) Up to 5 minutes
Alarm suppression / root cause isolation Fires N alarms (one per BGP session, OSPF peer) Common-path analysis → 1 root-cause alert
Pre-failure degradation detection Alarms only after threshold crossed (45+ min) Detects rx_err_gradient trend 35+ min before failure
Cross-layer RCA (L1 UP but L3 broken) Requires manual correlation across layers Graph structure enables simultaneous cross-layer inference

Recommendation: Use traditional SNMP/syslog as the fast alarm layer for binary physical failures. Use the GNN as the root-cause isolation and proactive degradation detection layer. They are complementary, not competing.


GNN Feature Coverage by Fault

The table below shows which GNN node features each fault exercises. A comprehensive test suite must exercise all features; gaps indicate RCA classifier branches that cannot be validated.

GNN Feature Node Type Faults That Exercise It
state router, interface F3 (loopback disable), F5b (link down)
cpu router F10 (BGP Update Storm) — gap in F1F9
mem router F10 — gap in F1F9
bgp_update_rate router F10 — primary RESOURCE_EXHAUSTION signal
vrf_count router F11 (Cross-VPN Leak)
fib_size_norm router F11
ospf_num_routes router F6 (area mismatch), F3 (RR crash), F9 (cost inflation)
pfx_count_norm router F2 (hub CE down), F3 (RR crash), F4 (wrong RT)
rx_drops / tx_drops interface F8 (TX queue starvation)
mtu_norm interface F1 (MTU mismatch)
tx_queue_len_norm interface F8 — direct detection (txqueuelen 20 = 0.02 vs. healthy 1.0)
rx_err_gradient interface F5 (SFP degradation)
tx_util / rx_util interface F1, F8, F9
bgp_state bgp_session F2 (CE session down), F3 (RR crash)
pfx_count_norm bgp_session F2, F3, F4, F11 (route leak)
prefix_count_delta bgp_session F2, F3, F4
session_uptime_norm bgp_session F7 (duplicate IP, flapping)
rt_import_count bgp_session F11
vrf_route_count vrf F4, F2
vrf_route_count_delta vrf F4
rt_import_hash vrf F4 — direct detection of wrong RT
rt_export_hash vrf F11 — direct detection of route leak
vrf_mem_bytes_norm vrf F11
vrf_active_sessions vrf F2
throughput_norm flow F1, F2, F8, F9
throughput_delta flow F2, F8
expected_rate_deviation flow F8, F9 — primary signal for silent performance faults
jitter_norm flow F8 — direct queue saturation signal
packet_loss_pct flow F1, F5
latency_ms_norm flow F9 — direct OSPF rerouting signal
active_sessions_norm flow F2, F3

Fault-by-Fault Analysis


Fault 1 — MTU Mismatch

Property Value
Type MTU_MISMATCH
File l3vpn-hub-spoke-fault1-mtu.yaml
Target PE1 / eth1 (Oxford → London, 172.16.90.0/24)
Alarms generated None
Severity Performance degradation — intermittent

What happens: PE1's uplink toward P1 (London) has its MTU reduced from 1500 to 1400 bytes. Any MPLS-encapsulated packets exceeding 1400 bytes (typical BGP UPDATE messages with many prefixes, and customer TCP segments with standard MSS) are silently dropped by the kernel. No ICMP fragmentation-needed is returned.

Traffic flows affected:

Flow Impact Mechanism
d1-to-hub-tcp ⚠️ Intermittent degradation Sheffield → PE1/eth1 → P1: 45 Mbps peak with 10 TCP sessions generates frequent large segments
d1-hub-tcp-bidir (upload) ⚠️ Degraded Same outbound path
d1-hub-tcp-bidir (download) Unaffected Hub → d1 enters PE1 inbound — MTU on PE1/eth1 does not affect ingress
d1-red-to-d2-red-tcp ⚠️ Degraded Norwich (PE1) → PE1/eth1 → P1 → PE2
d1-red-d3-red-bidir (d1 → d3) ⚠️ Degraded PE1/eth1 on outbound path
All d2, d3 flows Unaffected d2 uses PE3 (Brighton); d3 uses PE4 (Cardiff)

Backup path note: PE1 also has eth4 (P3-PE1 link, 172.16.160.0/24). If OSPF ECMP distributes load across both uplinks, approximately 50% of d1 traffic may avoid the fault, making the impact intermittent and harder to reproduce on demand.

Time-of-day dependency: The fault is most visible at 14:00 UTC (45 Mbps peak with 10 TCP sessions generating maximum segment sizes). At 02:00 UTC (5 Mbps overnight minimum), smaller traffic volume produces fewer drop events and the anomaly score may fall below threshold.

GNN detection:

  • interface node PE1/eth1: tx_drops increases; tx_util is high while P1/eth3 (connected interface) rx_util is lower than expected
  • 2-hop message passing across the connected_to edge detects the tx/rx utilisation asymmetry
  • Reconstruction error confined to PE1/eth1 — router and BGP session nodes healthy
  • Classifier output: INTERFACE → top_feature=mtu_norm → MTU_MISMATCH on PE1/eth1

Traditional tools: Interface UP, OSPF Full, BGP Established. ifOutDiscards SNMP counter increases but absolute thresholds are typically calibrated for hardware failures, not gradual MTU-induced drops.

GNN advantage: Passive detection from standard telemetry counters. Traditional approach requires explicit MTU probing (ping -s 1400) from every link in both directions — not standard practice and must target exactly the right vantage point.


Fault 2 — Hub CE Session Teardown

Property Value
Type BGP_SESSION_DOWN
File l3vpn-hub-spoke-fault2-ce-down.yaml
Target PE2 / eBGP session to ce1-hub (10.80.80.0/24, VLAN 402)
Alarms generated BGP session-down trap
Severity Service outage — total BLUE VPN blackout

What happens: The eBGP session between PE2 (Cambridge) and ce1-hub (Nottingham) is deleted. PE2 withdraws all hub customer routes. All spoke VRFs lose their imported hub routes immediately.

Traffic flows affected:

Flow Impact Mechanism
d1-to-hub-tcp 🔴 100% loss dh-blue (10.100.2.10) is unreachable
d2-to-hub-udp 🔴 100% loss Same
d1-hub-tcp-bidir 🔴 100% loss (both directions) Hub cannot reach spokes either
d2-hub-udp-bidir 🔴 100% loss The constant 8 Mbps hub push drops to 0 — unambiguous signal
All RED VPN flows Unaffected RED VPN uses separate VRF and CE routers (ce2-red at PE2 is independent)

Diagnostic note: The constant 8 Mbps hub monitoring push (d2-hub-udp-bidir reverse direction) is the most sensitive trigger. Any fault on the hub drops this to exactly 0, with no natural rate variation that could mask the outage.

GNN detection:

  • bgp_session node PE2↔ce1-hub: bgp_state → 0.0, pfx_count_norm → 0.0, prefix_count_delta → large negative
  • BGP session reconstruction error spikes; router and interface nodes remain healthy
  • Classifier output: BGP_SESSION → parent role=CE → count=1 → Local Access Failure on PE2/ce1-hub

Traditional tools: BGP session-down trap fires within seconds. However in a network with many VRFs, this generates one alarm per spoke VRF that loses hub routes (~34 cascade alarms). The GNN suppresses these and issues a single alert pointing to the root session.

GNN advantage: In a production network with 50 PE routers, one hub CE session failure can cascade into 50+ downstream alarms. The GNN collapses this to 1 root-cause alert.


Fault 3 — RR1 Process Crash

Property Value
Type PROCESS_CRASH
File l3vpn-hub-spoke-fault3-rr1-crash.yaml
Target RR1 (Birmingham, 10.0.0.1) — bgpd kill or loopback disable
Alarms generated 4+ BGP session-down traps simultaneously
Severity Route reflection instability — up to 90 second disruption

What happens: RR1's BGP daemon crashes or its loopback (source of iBGP router-ID) is disabled. All 4 PE-to-RR1 sessions drop simultaneously. Traffic re-reflects via RR2 (Bristol, connected to P3 and P4), but reconvergence takes up to 90 seconds.

Traffic flows affected during reconvergence window:

Flow Impact Mechanism
ALL flows (both VPNs) ⚠️ Up to 90s disruption VPNv4 route re-reflection via RR2 required
d2-hub-udp-bidir (8 Mbps constant) 🎯 Clearest detector Constant baseline makes even a 5-second interruption unambiguous
d1-hub-tcp-bidir peak at 14:00 UTC 🔴 Severe TCP retransmissions + new session re-establishment during convergence

After reconvergence (~90 seconds), all flows recover. The event appears as a "network hiccup" in hindsight.

GNN detection:

  • All 4 PE-to-RR1 session embeddings spike simultaneously
  • Common-path analysis: all failing sessions share RR1 as parent router
  • Classifier output: BGP_SESSION → parent role=RR → count=4 → RR_CRASH → root cause: rr1
  • 4 downstream PE-level alarms suppressed; 1 RR alert issued

Traditional tools: 4+ BGP alarms fire simultaneously. Without common-path analysis, the NOC sees "4 BGP sessions down across 4 sites" and starts individual per-site investigations. Root cause (single RR1) is not obvious from the alarm stream.

GNN advantage: This is the GNN's clearest win for traditional alarm reduction. At scale (50 PEs, 2 RRs), one RR crash generates 50 simultaneous BGP alarms — the GNN collapses this to 1 root-cause alert.


Fault 4 — Wrong Import Route-Target on PE3

Property Value
Type BGP_SESSION_DOWN / VRF misconfiguration
File l3vpn-hub-spoke-fault4-rt-import.yaml
Target PE3 (Brighton) — BLUE_SPOKE VRF import RT changed from 65035:1030 to 65035:9999
Alarms generated None
Severity Silent isolation of Spoke2 (Liverpool)

What happens: PE3's BLUE_SPOKE VRF no longer imports hub routes (RT 65035:1030 is rejected). PE3's VRF routing table empties of hub prefixes. d2-blue cannot reach dh-blue. However, ce2-spoke still exports its own prefix correctly, so the hub can still see PE3's routes and sends packets toward d2-blue that arrive but get no response.

Traffic flows affected:

Flow Impact Mechanism
d2-to-hub-udp 🔴 Silent failure d2-blue (Liverpool) has no route to hub — packets blackholed at PE3
d2-hub-udp-bidir (d2 → hub) 🔴 Fails Same
d2-hub-udp-bidir (hub → d2, 8 Mbps constant) ⚠️ One-way illusion Hub can still send to d2; d2 receives but cannot respond — one-way connectivity
All d1, d3 flows Unaffected PE1 and PE4 VRFs have correct RT
All RED VPN flows Unaffected Separate VRF — RED VPN PE3 VRF unaffected

The one-way deception: Pings from dh-blue → d2-blue succeed (hub→spoke path works). Pings from d2-blue → dh-blue fail. A naive NOC test from the hub side concludes "connectivity OK." Only bidirectional end-to-end testing from d2-blue's vantage reveals the fault.

GNN detection:

  • PE3's iBGP sessions toward RR show anomalous pfx_count_norm (receiving fewer VPN routes than the baseline)
  • HetGNN isolates deviation to PE3's VRF config sub-embedding — the RT import hash deviates from training baseline
  • D-GAT detects asymmetric reachability (CE2 advertises normally; hub→PE3 imports zero)
  • Classifier output: BGP_SESSION → asymmetric import pattern on PE3 → VRF_RT_MISCONFIGURATION

Traditional tools: Zero alarms. All sessions Established. BGP prefix counts appear normal from the hub's perspective. RT policy is not monitored by standard NMS.

GNN advantage: No traditional tool passively monitors RT import policy compliance. This fault is only discoverable via explicit VRF audit scripts or customer complaint.


Property Value
Type PACKET_CORRUPTION (tc netem)
Target P1 / eth2 (London → Edinburgh, 172.16.40.0/24)
Alarms generated ⚠️ Late alarm at ~45 minutes
Severity Gradual hardware degradation → eventual link failure

What happens: Progressive packet corruption is injected on P1/eth2 simulating a failing optical SFP. Errors start at <1% and accelerate over 55 minutes until the link is unusable.

Degradation timeline vs. traffic flows:

Time Error Rate Traffic Impact GNN Signal
t=030 min <1% Imperceptible — TCP absorbs retransmissions rx_err_gradient rising on P1/eth2; score below threshold
t=30 min ~23% Visible TCP retransmits at 45 Mbps peak GNN anomaly score crosses threshold — alert: hardware degradation
t=45 min ~8% OSPF LSA drops; link metric instability; ECMP shifts rx_err_gradient high; P1 ospf_num_routes fluctuates
t=55 min >20% LDP drops, routing instability, re-route Full reconstruction error spike; traditional alarm fires

Traffic flows affected (flows using P1-P3 path):

Flow Impact Why
d1-hub-tcp-bidir (download, 70 Mbps peak) 🔴 High degradation at peak PE2→P3→PE1 path (or PE1→P1→P3→PE2) uses P1-P3 link under ECMP
d1-red-d3-red-bidir diagonal ⚠️ Degraded PE1↔PE3 diagonal traverses P1-P3 under some OSPF paths
Flows not on P1-P3 path Initially unaffected OSPF ECMP shifts load; other paths pick up the slack

GNN advantage: At t=30 minutes, GNN raises HARDWARE_DEGRADATION alert on P1/eth2 with top_feature=rx_err_gradient. Traditional monitoring does not alarm until t=4555 minutes. The GNN provides 1525 minutes of early warning — enough time for proactive SFP replacement before SLA breach.

GNN detection:

  • interface node P1/eth2: rx_err_gradient rises steadily across successive inference cycles
  • Trajectory analysis: steadily increasing anomaly score = hardware degradation (vs. single spike = transient noise)
  • Classifier output: INTERFACE → top_feature=rx_err_gradient → HARDWARE_DEGRADATION → root cause: P1/eth2

Traditional tools: ⚠️ Late alarm. Fires only when CRC errors exceed absolute SNMP thresholds (~45+ minutes into the fault). By then, SLA is already breached for d1 and d1-red customers.


Property Value
Type LINK_DOWN
Target Any router interface — e.g., P1/eth2 (P1-P3 link)
Alarms generated Immediate (SNMP linkDown trap < 1s)
Severity Immediate traffic rerouting or blackout

GNN role for physical link-down: Traditional SNMP wins on detection speed. The GNN's value here is:

  1. Alarm suppression: One link down cascades into OSPF adjacency failures, BGP session drops, and prefix withdrawals. The GNN issues one root-cause alert rather than N cascade alarms.
  2. Blast radius analysis: The graph structure shows exactly which flows are affected and which have backup paths.
  3. Disambiguation: Distinguishes a single link failure from a router failure (which would take down all links simultaneously).

GNN detection:

  • interface node: state → 0.0; tx_util → 0.0; rx_util → 0.0
  • Connected router node: ospf_num_routes drops; pfx_count_norm changes
  • Classifier output: INTERFACE → top_feature=state → INTERFACE_DOWN on [router]/[interface]

Fault 6 — OSPF Area Mismatch

Property Value
Type OSPF_AREA_MISMATCH
Target P2 / eth2 (Manchester → Leeds, 172.16.60.0/24) — area set to 0.0.0.99 instead of 0.0.0.0
Alarms generated None (physical link remains UP)
Severity OSPF traffic-engineering lost on P2-P4 path

What happens: P2-P4 OSPF adjacency fails (will not reach Full state). Physical link remains UP and transmitting — only L3 forwarding is affected. MPLS LDP may stay up, but OSPF-computed paths through P2-P4 are lost.

Traffic flows affected:

Flow Impact Why
d3-blue-to-hub (Huddersfield/PE4 → hub) ⚠️ Rerouted PE4→P4→[no P2 OSPF]→must detour via P3→P1→PE2 or P3→PE2
d3-red-to-d4-red-udp ⚠️ Rerouted PE3→PE4 path normally via P2-P4 direct; now detours
d2-red-d4-red-bidir ⚠️ Rerouted PE2↔PE4 diagonal return path affected
BLUE d1 flows (PE1↔PE2) Mostly unaffected PE1↔PE2 via P1 or P3 direct; doesn't require P2-P4
RED d1-d2 flows (PE1↔PE2) Unaffected PE1→P1→PE2 direct path

Key nuance: PE3 (Brighton) has two uplinks — to P4 (Leeds) and to P2 (Manchester) directly. Even if P2-P4 adjacency fails, PE3 can still reach P2 directly via its own PE3-P2 link (172.16.170.0/24). So PE3-sourced traffic is less affected than PE4-sourced traffic (which must route to P2 via P2-P4 or P4-P3-P1-P2 detour).

GNN detection:

  • P2 router node: ospf_num_routes drops (SPF tree loses P4's LSAs)
  • P2/eth2 interface node: tx_util ≈ 0.0 despite state=UP — link transmitting but carrying no routed traffic
  • D-GAT: ospf_peer edge P2↔P4 shows anomalous OSPF state while physical link state=UP (cross-layer mismatch)
  • Classifier output: ROUTER → ospf_num_routes drop + INTERFACE tx_util=0 on UP link → OSPF_AREA_MISMATCH on P2/eth2

Traditional tools: No alarm. Physical link UP. BGP Established. Unless OSPF adjacency state is explicitly monitored (non-default in most NMS tools), this is invisible. Models an operator copy-paste error during a maintenance window.

GNN advantage: The cross-layer contradiction — L1 says "UP", L3 says "no routes via this link" — is precisely what the GNN's multi-layer graph captures and traditional single-layer monitoring cannot.


Fault 7 — Duplicate IP Address

Property Value
Type DUPLICATE_IP
Target P3 / eth3 (Edinburgh → Leeds, 172.16.50.0/24) — duplicate of P1/eth1 address 172.16.30.1
Alarms generated None
Severity Intermittent black-holing on P3-P4 transit paths

What happens: P3 (Edinburgh) claims the IP address 172.16.30.1/24 which legitimately belongs to P1 (London) on the P1-P2 link. P3 sends gratuitous ARPs claiming this address on the P3-P4 segment. P4 (Leeds) receives conflicting ARP entries. Any traffic P4 forwards toward 172.16.30.1 may be misdirected to P3 instead of P1, depending on which ARP entry is cached at any given moment.

Traffic flows affected (intermittently):

Flow Impact Why
d3-blue-to-hub ⚠️ Intermittent PE4→P4→P2→P1→PE2 path: P4 may misdirect 172.16.30.1-bound traffic
d3-red-to-d4-red-udp ⚠️ Intermittent PE3→P4→PE4: P4 ARP confusion
d2-red-d4-red-bidir ⚠️ Intermittent PE4 as endpoint; P4 as transit
d1/d2 BLUE and d1-d2 RED flows Mostly unaffected Primarily use P1-P2 direct paths, not P3-P4 segment

The intermittent pattern: Impact is worst immediately after P3 sends gratuitous ARPs. It fades as ARP entries timeout (2030 minutes), then returns on the next gratuitous ARP cycle. This creates a cycling availability pattern — extremely hard to diagnose because pings often succeed when the NOC investigates.

GNN detection:

  • session_uptime_norm on BGP sessions belonging to P4-adjacent routers oscillates — sessions reset as routing breaks intermittently
  • prefix_count_delta oscillates as routes withdraw and return
  • Two routers show anomalous sessions with overlapping peer IP space
  • Classifier output: BGP_SESSION → session_uptime_norm low on 2+ routers → IP_OVERLAP → rogue session on P3

Traditional tools: Zero alarms. ARP table changes are not surfaced by standard NMS. The intermittent pattern means test pings frequently succeed during investigation.

GNN advantage: The GNN infers ARP-level conflicts from their effect on BGP session stability — an indirect signal that no SNMP MIB directly exposes.


Fault 8 — TX Queue Starvation NEW — Silent Performance

Property Value
Type TXQUEUE_STARVATION (new type)
Target PE2 / eth2 (Cambridge → Edinburgh, 172.16.110.0/24) — txqueuelen 1000 → 20
Alarms generated None
Severity 3060% hub throughput loss on Edinburgh-bound paths

What happens: The Linux transmit queue length on PE2's P3-facing uplink is reduced from the default 1000 packets to 20 via ip link set eth2 txqueuelen 20. This is a kernel parameter — it does not appear in VyOS running-config, is not stored in the VyOS commit history, and is not captured by any configuration management or GitOps system. Under the aggregate hub traffic load, the 20-packet queue fills and overflows thousands of times per second.

Why PE2/eth2 is the highest-impact target: PE2 (Cambridge) is the hub router for BLUE VPN. PE2/eth2 is the P3 (Edinburgh) uplink — traffic from PE2 toward Edinburgh-routed paths (PE1 via P3, and ECMP-distributed hub downloads) exits here.

Traffic flows affected:

Flow Time of Day Impact Mechanism
d1-hub-tcp-bidir (hub → d1 downloads) 14:00 UTC peak (70 Mbps) 🔴 Severe — 3060% throughput loss Hub distributes downloads via PE2/eth2 if ECMP routes Oxford-bound traffic via P3
d2-hub-udp-bidir (hub → d2, 8 Mbps constant) 24/7 ⚠️ Detectable — constant baseline makes deviations unambiguous Any congestion on PE2/eth2 immediately shows in jitter_ms and packet_loss_pct of the constant stream
d2-to-hub-udp return path Business hours ⚠️ Congested Hub response traffic exits PE2/eth2 toward Liverpool-bound path
BLUE d3 hub traffic Business hours ⚠️ Congested If routed via P3 from PE2

Time-of-day pattern: Queue starvation is worst during 09:0017:00 UTC when multiple flows compete for PE2/eth2. At 02:00 UTC (5 Mbps overnight minimum), the 20-packet queue overflows less frequently — anomaly score may fall below threshold briefly, making the fault appear "intermittent" to traditional monitoring even if it were detectable.

GNN detection:

  • interface node PE2/eth2: tx_drops → high; tx_util → high; while P3/eth1 (connected interface) rx_util → lower than expected
  • 2-hop message passing detects the transmit/receive utilisation asymmetry across the connected_to edge
  • Reconstruction error concentrated on PE2/eth2 interface node
  • Classifier output: INTERFACE → top_feature=tx_drops → TX_QUEUE_STARVATION on PE2/eth2

Traditional tools: txqueuelen is a kernel parameter, not in any config management system. ifOutDiscards SNMP counter increases, but absolute thresholds are calibrated for hardware failure rates, not kernel config drift. During business-hours peak, drops are significant but masked by traffic variability.

GNN advantage: This fault exists entirely outside every configuration management system. No config audit, no VyOS diff, no compliance tool can detect it. The GNN detects it purely from telemetry behaviour — the only observable signal this fault produces.


Fault 9 — OSPF Interface Cost Inflation NEW — Silent Performance

Property Value
Type OSPF_COST_INFLATION (new type)
Target P2 / eth1 (Manchester → London, 172.16.30.2) — OSPF cost 1 → 65535
Alarms generated None
Severity 3-hop detour for Brighton/Cardiff traffic; 100 Mbps P3-PE2 link becomes congestion point

What happens: The OSPF interface cost on P2's eth1 (P2-to-P1 link, Manchester side) is changed from 1 to 65535. The OSPF adjacency on P2/eth1 remains Full — cost changes never break adjacencies. OSPF SPF recalculates: the P2→P1 direction is now prohibitively expensive. Traffic that previously used the direct P2→P1 path reroutes via P2→P4→P3→P1.

OSPF path change (P2→P1 direction only — asymmetric):

Before After
PE3 → P2 → P1 → PE2 (2 hops, cost 3) PE3 → P4 → P3 → PE2 (3 hops, but cost 3)
PE4 → P2 → P1 → PE2 (2 hops, cost 3) PE4 → P4 → P3 → PE2 (3 hops, cost 3)
P2 → P1 (direct, cost 1) P2 → P4 → P3 → P1 (cost 65537 vs. detour)

Critical bottleneck created: The P3-PE2 link (172.16.110.0/24) is a 100 Mbps link. With P2/eth1 cost inflated, traffic from PE3 (Brighton) and PE4 (Cardiff) returning toward PE2 (Cambridge) reroutes through P4→P3→PE2, converging on this single 100 Mbps link. BLUE VPN hub downloads (up to 70 Mbps) and RED VPN diagonal traffic (up to 45 Mbps) may both reroute through it simultaneously.

Traffic flows affected:

Flow Before Fault After Fault Impact
d2-to-hub-udp (Liverpool/PE3 → Cambridge/PE2) PE3→P2→P1→PE2 (direct) PE3→P4→P3→PE2 (via 100 Mbps P3-PE2) ⚠️ Higher latency; potential congestion
d2-hub-udp-bidir reverse (hub → d2, constant) PE2→P1→P2→PE3 PE2→P3→P4→PE3 (P3-PE2 now bidirectional) 🔴 Congestion + jitter on constant 8 Mbps — clean signal
d3-blue-to-hub (Huddersfield/PE4 → hub) PE4→P2→P1→PE2 PE4→P4→P3→PE2 ⚠️ Increased latency
d2-red-d4-red-bidir return PE2→P1→P2→PE4 PE2→P3→P4→PE4 ⚠️ Return path via congested P3-PE2 link
d3-red-to-d4-red-udp PE3→P2→P4→PE4 PE3→P4→PE4 (shorter!) Slightly improved
BLUE d1 flows (PE1↔PE2) PE1→P1→PE2 (direct) Unchanged Doesn't traverse P2 Manchester
RED d1-d2 flows (PE1↔PE2) PE1→P1→PE2 Unchanged Doesn't traverse P2 Manchester

Cross-VPN impact: OSPF is shared infrastructure across both VPNs. One P-router misconfiguration degrades both BLUE and RED VPN simultaneously.

GNN detection (multi-node):

  • P2/eth1 interface: tx_util ≈ 0.0 despite state=UP and OSPF Full adjacency — never seen during training. Highest individual anomaly score.
  • P3/eth1 (P3-PE2 link) interface: tx_util and rx_util elevated above training baseline — congestion
  • P4/eth3 (P3-P4 link) interface: elevated from rerouted traffic
  • Router nodes P3, P4: elevated cpu from increased forwarding load
  • Classifier output: INTERFACE → state=UP + OSPF_Full + tx_util≈0 on P2/eth1 → OSPF_COST_INFLATION

Traditional tools: OSPF cost 65535 is a legal configured value. All adjacencies Full. All interfaces UP. BGP Established. Identifying root cause requires manually running show ip ospf interface on every P router — impractical without the GNN's graph-wide visibility.

GNN advantage: The GNN knows that a Full OSPF adjacency on a core link should carry traffic proportional to its position in the topology. P2/eth1 with zero traffic but Full adjacency is a contradiction the model has never seen — the reconstruction error reveals the anomaly without any operator intervention.


Property Value
Type Candidate new type: BGP_UPDATE_STORM
Target RR1 (Birmingham) — controlled route-flap injection from a test peer
Alarms generated None (BGP sessions remain Established)
Severity CPU saturation on RR1 and P-routers; potential keepalive delays

Why this fault is needed: The GNN RCA classifier contains the branch CASE dominant layer = 'router': IF top_feature IN ('cpu', 'mem') → RESOURCE_EXHAUSTION — but no current fault exercises the cpu and mem features. Without a training/validation scenario for this path, the classifier branch is implemented but unvalidated.

What happens: A controlled BGP peer connected to RR1 advertises ~10,000 prefixes with rapid withdraw/re-advertise cycles. RR1's bgpd process CPU spikes to 7080%. P-routers receiving frequent route updates via RR1 spend excessive time on SPF recalculation, competing with packet forwarding interrupt handling.

Traffic impact: At 7080% CPU on RR1, BGP keepalive processing is delayed. If hold-timers are tight (default 90s/30s keepalive), sessions may reset. At moderate levels, forwarding latency increases for all transit traffic.

GNN detection:

  • RR1 router node: cpu feature spikes to values never seen in training
  • ospf_num_routes fluctuates as updates compete with forwarding
  • Classifier output: ROUTER → top_feature=cpu → RESOURCE_EXHAUSTION on rr1

Traditional tools: BGP sessions remain Established. CPU is visible in SNMP hrProcessorLoad MIB, but most NMS tools don't correlate CPU spikes with specific service impact without additional rules.


Property Value
Type Candidate new type: VRF_RT_EXPORT_LEAK
Target PE1 (Oxford) — BLUE_SPOKE VRF accidentally exports with RED VPN RT
Alarms generated None
Severity RED VPN traffic attracted to BLUE spoke prefixes; potential misdirection

Why this fault is needed: The RED and BLUE VPNs coexist with separate VRFs and RT policies, but no fault exercises the interaction between them. A realistic operator error during a maintenance window (adding wrong RT to an export policy) causes routes from one VPN to appear in the other's RIB.

What happens: PE1's BLUE_SPOKE VRF export RT is extended to include the RED VPN export RT. PE1 now advertises BLUE spoke routes (10.100.1.0/24) into the RED VPN's RIB. RED VPN devices that have routes overlapping with BLUE spoke prefixes may have their traffic misdirected.

Traffic impact: RED VPN d1-red (Norwich/PE1) traffic to any destination matching the leaked prefix gets misdirected. Variable packet loss depending on prefix overlap.

GNN detection:

  • pfx_count_norm on RED VPN BGP sessions toward PE1 increases anomalously (more prefixes than the model learned for RED sessions)
  • HetGNN isolates the deviation to PE1's bgp_session nodes with anomalous prefix counts
  • Classifier output: BGP_SESSION → pfx_count_norm above baseline on RED sessions at PE1 → VRF_RT_EXPORT_LEAK on PE1

Traditional tools: More routes in a BGP session is typically not alarmed. The leak is completely silent until customer complaint.


Consolidated Comparison Table

# Fault Target Traffic Flows Affected Alarms Traditional Detection GNN Detection Time GNN Advantage
1 MTU Mismatch PE1/eth1 (Oxford→London) d1 BLUE TCP, d1-red TCP Manual MTU probe — hours 5 min (tx/rx asymmetry) Passive; no probing needed
2 Hub CE Session Down PE2 ↔ ce1-hub (Nottingham) ALL BLUE VPN — total blackout BGP trap Seconds 5 min + cascade suppression Reduces ~4 VRF alarms → 1
3 RR1 Crash RR1 (Birmingham) ALL traffic — 90s disruption 4+ BGP traps Seconds (noisy) 5 min, 1 RR alert 4× alarm reduction; 50× at scale
4 Wrong RT on PE3 PE3 BLUE_SPOKE VRF d2 BLUE UDP — silent isolation Customer complaint 5 min (pfx_count drop on PE3) Zero-alarm scenario: only GNN
5 Degrading SFP P1/eth2 (London↔Edinburgh) d1 TCP, d1-red diagonal ⚠️ 45+ min late 45+ min after degradation t=30 min — 35 min earlier Proactive vs. reactive
5b Link Down Any P/PE interface Depends on redundancy Immediate < 1 second 5 min RCA suppression + blast radius
6 OSPF Area Mismatch P2/eth2 (Manchester↔Leeds) PE4 flows; some PE3 flows Never 5 min (cross-layer L1/L3) Cross-layer correlation
7 Duplicate IP P3/eth3 (Edinburgh→Leeds) Intermittent P4-transit flows Never 5 min (session_uptime oscillation) Infers ARP conflict from BGP
8 TX Queue Starvation PE2/eth2 (Cambridge→Edinburgh) Hub downloads; d2 constant UDP Never (not in config) 5 min (tx_drops + asymmetry) Kernel param: only GNN can detect
9 OSPF Cost Inflation P2/eth1 (Manchester→London) d2/d3 BLUE; d3/d4 RED diagonal Never (legal config value) 5 min (tx_util≈0 on Full link) Legal-but-wrong: only GNN
(10) (BGP Update Storm) (RR1 Birmingham) (All flows — CPU pressure) CPU MIB only — no service correlation 5 min (cpu feature) Validates RESOURCE_EXHAUSTION path
(11) (Cross-VPN Route Leak) (PE1 Oxford — BLUE/RED VRF) (RED VPN d1-red misdirection) Never (more routes ≠ alarm) 5 min (pfx_count anomaly) Multi-VPN interaction: only GNN

Faults 10 and 11 are recommended additions to fill identified GNN feature coverage gaps.


Why GNN Is Superior: Three Core Principles

1. Behavioural Baseline, Not State Transitions

Traditional tools alarm when something transitions from a known-good state to a known-bad state (UP → DOWN, Established → Idle). This is powerful for binary failures but completely blind to misconfigurations that remain in valid states.

The GNN alarms when behaviour deviates from the learned normal pattern, regardless of protocol state. Faults 1, 4, 6, 7, 8, and 9 all maintain perfectly healthy protocol states — the GNN is the only system that can detect them.

2. Graph-Wide Simultaneous Inference

When Fault 9 (OSPF cost inflation on P2/eth1) causes P2/eth1 tx_util ≈ 0, congestion on P3/eth1, elevated CPU on P3 and P4, and latency increases on PE3/PE4 — traditional monitoring tools see each of these as independent signals. A NOC engineer must manually correlate "P2 link traffic dropped" + "P3 link congestion" + "PE3 customers slow" and connect them to a single cause.

The GNN sees all nodes simultaneously in one inference pass. The ospf_peer edge structure tells the model exactly which router-router pairs are OSPF adjacencies — allowing it to reason that P2/eth1 with zero traffic but a Full adjacency is anomalous relative to the graph topology, and that the congestion pattern on the detour path is the expected consequence.

rx_err_gradient (hardware degradation) and session_uptime_norm (IP conflict flapping) encode rates of change and normalised ages — features that capture trends over time, not just the current state. Traditional monitoring systems compare each poll to a static threshold. The GNN's gradient features enable it to detect "this metric is getting worse at an accelerating rate" and alert proactively 35+ minutes before the failure becomes severe.

The constant 8 Mbps monitoring push in both BLUE (d2-hub-udp-bidir reverse) and RED (d2-red-d4-red-bidir reverse) traffic tests is specifically designed to serve as a high-sensitivity GNN training signal. Any deviation from a perfectly flat 8 Mbps UDP stream is immediately anomalous — providing a clean, unambiguous signal for any fault that degrades the hub-to-spoke or diagonal-mesh return paths.