Cutting wld CPU on real hardware — measured, fixed, validated
A reproducible real-device study of the prplOS WiFi daemon under client scale — and an AI-agent-driven path from a 44% CPU hotspot to 18%.
wld CPU climbs steeply with client densityOn the OSPv2 access point, the WiFi daemon's CPU rises hard as more clients associate — long before any data traffic. At 114 clients it burned ~44% of a core just keeping its datamodel in sync.
Measured wld CPU vs associated clients, three build stages. Real hardware, 3-rep sweeps.
mt7915e), CDRouter 16.4.2perf record of wld + 1 Hz CPU / IO / RSS samplerBuild → flash → associate → measure → symbolize → report — orchestrated end-to-end by an LLM agent against the live testbed.
Verify by binary fingerprint (prplMesh git hash + .so sha256), assert overlay wiped — not opkg/DISTRIB_REVISION (which lie when patched-not-bumped).
performance governor, turbo off, CLK_TCK verified, dwarf,32768+cycles:pp PEBS, static-IP path so the result is assoc-bound.
Topology assert re-enables scaling VAPs, pins channels, forces ProcessManager on — reboots silently drift these off.
perf record -e cycles:pp (PEBS precise), --call-graph dwarf,32768kallsyms (cross-perf can't dwarf-unwind)<0.1%libwld/libswla exportsgetStationStats RPC/s and datamodel write-txns/sutime+stime), IO, RSS per windowProfiles say where; counters say how often and how much. Together they pin the mechanism.
perf probe segfaultsperf probe segfaults on this perf 5.15 musl build — so probes are registered manually via tracefs (/sys/kernel/debug/tracing/uprobe_events) and counted with perf stat -e <probe> -p $(pidof wld)./usr/lib/libwld.so.7.27.8 (the on-DUT plugin is stripped). Offsets = .dynsym st_value (executable PT_LOAD has p_offset==p_vaddr).swl_object_finalizeTransactionOnLocalDm in libswla (2 mappers) — not amxd_trans_apply in libamxd (92 mappers — would int3 the whole box).monotonic CPU 18<60<114 · unknown-frame % within budget · assoc count matches datamodel · governor/CLK_TCK verified · RSS bounded · functional parity (clients held, disassoc cleans up). Every shipped run passed all six first-pass.
The agent SIGSTOPs the polling consumer for one window, then resumes it — a reversible, decisive attribution test.
Stop the prplMesh monitor for 20 s @114: wld CPU collapses 42.0% → 1.4%. Clients held; no reboot.
The prplMesh monitor polls getStationStats ~1×/s per radio. Each call ran a full
per-station datamodel write-back of mostly-static fields.
event loop → USP "operate" RPC → amxd_object_invoke_function
_getStationStats 41.9% (≈57% of resolved wld CPU)
└ s_addStaStatsValues 36.0%
└ wld_vap_sync_assoclist 27.3% ← loops ALL N stations / VAP
└ wld_ad_syncInfo 25.2% ← per-station write of 40+ params
└ finalizeTransaction 22.4% ← one DM write-txn / station / poll
| @ | stations/VAP | syncInfo / call | DM write-txn/s | wld CPU |
|---|---|---|---|---|
| 60 | ~21 | 21.1 | 65.6 | 28.5% |
| 114 | ~39 | 41.1 | 98.9 | 42.7% |
Write-txns/call ≈ stations-per-VAP ⇒ one datamodel write-transaction per associated station per poll. ~1.9 ms/station-sync, constant — total scales with density, independent of load.
Before getStationStats dominated, the top hotspot was
sahTraceGetZone + strcmp — a trace-zone lookup on every log
check — eating ~28–30% of CPU under DHCP load.
Collapsed the hotspot to ~1.5% — which uncovered the getStationStats cost underneath.
sahTraceGetZone;strcmp share of CPU, before/after the trace-walk gate.
int beats an 80-zone strcmp walksahtrace registered ~80 trace zones; every trace check walked the whole
list + an "all" fallback, strcmp-ing each — even when the message would never
print. The hot swl/pwhm netlink path fires ~1036 INFO traces.
/* libsahtrace fork (patch 0100): short-circuit before the zone walk */
if (lvl > maxEffectiveZoneLevel) /* one int, recomputed at the 4 zone mutators */
return; /* was: walk ~80 zones + "all" fallback, strcmp each */
The mxl zones were pinned at level 400 (INFO) — exactly the hot
trace level — so 400 > 400 is false and the dominant path still walked every zone.
Config fix (option-b): lower mxl zones 400 → 200 (re-applied each
prepare — they revert to 400 on wld restart) ⇒ maxEffectiveZoneLevel < 400 ⇒ the guard fires.
| zones @400 | @200 | |
|---|---|---|
| wld CPU | 18.47% | 16.55% |
| sahtrace zone-lookup | 7.06% | 0.17% |
| getStationStats/win | ~180 | ~180 |
−1.92 pp — offered load identical, so the drop is the lever, not less work.
Caught pre-flight: sah_trace_level is unsigned on this toolchain, so
the −1 sentinel promoted to UINT_MAX and silently suppressed all tracing — a
perf-only run would have passed. A host equivalence test (now a regression guard) caught it; fixed with a signed cast.
The per-poll write-back rebuilt 40+ mostly-static identity/capability params every time, because they shared the volatile-sample freshness trigger.
Fix: gate wld_ad_syncInfo's static block on actual change
(first sync / assoc-state), not the per-sample timestamp. Steady-state polls now
trans_clean instead of apply.
| @114 | before | after |
|---|---|---|
| DM write-txn/s | 98.5 | 3.47 |
| txn / call | ~43.5 | ~1.3 |
| wld CPU | 43.95% | 28.5% |
Datamodel write-transactions/s @114 — a 96% drop to the NrDev floor. wld CPU −15.5 pp (strict A/B, same image ±patch).
lastSampleTime.wld_assocdev.c:1929 compares lastSampleSyncTime != lastSampleTime → always true.
/* before: any new sample re-applies the whole static block */
if (pAD->lastSampleSyncTime != pAD->lastSampleTime) { rebuild_40_params(); apply(); }
/* after (0901): only on genuine change; otherwise clean, no apply */
if (first_sync || assoc_state_changed) { rebuild_40_params(); apply(); }
else { amxd_trans_clean(&trans); } /* steady state */
pAD->lastSampleSyncTime = pAD->lastSampleTime; /* watermark still advanced */
Result in the profile: wld_ad_syncInfo 25.2% → 0.5%; finalizeTransaction 22.5% → 0.6%; amxd_trans_apply 21.9% → 0.6%. The readback (swla_dm_getObjectParams, ~13.7%) is now the top slice → Fix #2.
getStationStatsBrief() RPCTo build the reply, the RPC re-marshalled the entire ~70-param
AssociatedDevice object (+ AffiliatedSta) back out of the
datamodel per station — yet the monitor reads only 14 fields.
Fix: a dedicated getStationStatsBrief() that keeps every
datamodel write but builds the reply directly from live state, emitting exactly the
14-key contract. The monitor polls it; getStationStats() is untouched.
| @114 (A/B) | Fix #1 | + Fix #2 |
|---|---|---|
| readback slice | 11.34% | 0% |
| full-reply builder | 34.08% | 0% |
| wld CPU | 28.89% | 18.39% |
Eliminated profile slices @114. wld CPU −10.5 pp (−36%). 14-key values verified identical to the full reply.
Shared s_addStaStatsPrologue keeps the driver pull + wld_vap_sync_assoclist → syncStats + the Fix #1-gated syncInfo. Only the reply build differs: s_addStaStatsValuesBrief reads live pAD instead of re-reading the DM. getStationStats left byte-for-byte unchanged.
14 keys per station, values == full reply (checked station B0:75:0C:DA:40:B4). DM write-txn flat 6.7/s (no regression). RSS unchanged (~70 KB/client). Full curve 8.62/18.05/28.89 → 7.01/12.33/18.39. All 6 gates pass.
Cumulative on the hot path: ~44% → 28.5% → 18.4%
getStationStats hot pathBoth fixes shipped to the forks and validated by strict same-image A/B. Monotonic, tight (stdev ≤ 0.14 pp).
BEFORE — the syncInfo write-back tower (~44% CPU)
AFTER — write-back gated, readback bypassed (~18% CPU)
↗ New tab⤢ Fullscreen for the interactive flamegraph (click a frame to zoom, Esc to close), or open in a new tab. What remains in "after" is the inherent per-station driver pull — the target of Fix #6.
Fix #5 targeted alloc churn in the bulk driver pull
(wld_nl80211_getAllStationsInfo): drop a max-capacity calloc-zero
+ a redundant copy. Written, applies clean, compiles -Werror. Zero behaviour change.
Then we measured it. On this MaxLinear platform the bulk path is structurally absent — the vendor HAL pulls stations one-by-one. The code Fix #5 edits is never executed here.
Decision: do not flash. Keep the zero-risk patch on a branch (valid for generic mac80211), but don't disturb the shared baseline for a no-op.
The method disproved its own hypothesis on real silicon — and we let it.
mfn_wvap_get_station_stats binds to the MXL HAL, which loops per station
— 3 netlink round-trips × N per radio per poll, not one bulk dump.
| @114 path | % wld CPU |
|---|---|
wld_nl80211_getAllStationsInfo (Fix #5 target) | 0.000% |
whm_mxl_vap_getSingleStationStats (actual MXL pull, ×N) | 22–46% |
After Fix #1+#2, the residual is the driver pull itself: on MXL,
3 netlink round-trips per station (generic GET_STATION +
vendor DEV_DIAG3 + PEER_FLOW) ≈ 351 round-trips/sweep @114.
DEV_DIAG3 (~16%) is redundant for the 14-key brief.sahTraceGetZone class resurfaces (~8%) inside the vendor attr-list build.3N → 1), parity-by-construction.Projected: wld CPU on the brief path ~18.4% → ~8–10%. Plus Fix #3 (poll-cadence) as an orthogonal config-only lever under investigation.
The driver's nl80211 GET_STATION is itself fed by the same tr181 helpers
the vendor subcmds use. Every brief key except DownlinkBandwidth derives from
mtlk_sta_get_tr181_peer_stats ⇒ a new bulk subcmd is parity-by-construction.
| source | brief keys covered |
|---|---|
nl80211 GET_STATION | Tx_Retransmissions, Tx/RxErrors, DownlinkBandwidth |
vendor devDiag3 (droppable) | Tx/RxPacketCount, SignalNoiseRatio |
vendor peerFlow | Tx/RxBytes, LastData*Rate, SignalStrength, TxUnicastPacketCount |
Snapshot-safe: tr181 deltas read read-only; reset only on RESET_STATISTICS — no double-consume. The unused lean GET_ASSOCIATED_STATIONS_STATS covers only ~6/14 with absolute (not delta) semantics → a new subcmd, not reuse.
| metric | value |
|---|---|
| Calendar span | 3 days (Jun 20–22) |
| Agent turns | 8,357 |
| User messages | 3,765 |
| Commits (this repo) | 56 |
| Agent-active | ~25 h |
| Human-attended (est.) | ~23–25 h |
Human time estimated from session message density (interactive supervision) — not separately logged; an upper bound where sessions overlapped.
~1.61 B tokens total — dominated by 1,551 M cache-reads (×0.1 price).
Uncached the same tokens would cost ~$24,937. Cache-reads alone: $2,327 (would be $23,266). Figure is a floor — long-context (>200K) turns bill higher.
Fix #1 + Fix #2 — 44% → 18.4% wld CPU @114, validated by strict A/B on real hardware.
Fix #5 — zero-risk, but measured to a no-op on MXL. Kept for generic mac80211.
Fix #6 — bulk vendor subcmd, 3N→1, projected ~8–10%. Fix #3 cadence in study.
A repeatable real-device harness + disciplined measurement turned a vague "WiFi is slow at scale" into named, file-line root causes and validated fixes — most of it automated.