Scaling prplOS WiFi

Cutting wld CPU on real hardware — measured, fixed, validated

A reproducible real-device study of the prplOS WiFi daemon under client scale — and an AI-agent-driven path from a 44% CPU hotspot to 18%.

The problem

`wld` CPU climbs steeply with client density

On the OSPv2 access point, the WiFi daemon's CPU rises hard as more clients associate — long before any data traffic. At 114 clients it burned ~44% of a core just keeping its datamodel in sync.

Not caused by traffic — caused by polling.
Scales with stations-per-radio, independent of offered load.
A real ceiling for high-density deployments.

Measured wld CPU vs associated clients, three build stages. Real hardware, 3-rep sweeps.

What we built · the harness

A push-button, agent-driven measurement rig

Load: CDRouter scaling ladder

6× NTA-3000 radios (mt7915e), CDRouter 16.4.2
Up to 19 real clients/radio → fixed 6-radio topology
Density sweep 18 / 60 / 114 = 3 / 10 / 19 per radio, round-robin even-split
Real associations across 2.4 / 5 / 6 GHz to the DUT's APs
Topology constant at every point — only density changes
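The even split behind the 18 / 60 / 114 ladder can be sketched in a few lines of C. This is an illustrative sketch, not the harness code: the function name and signature are assumptions, and it only shows how a round-robin assignment yields 3 / 10 / 19 clients per radio on six radios.

```c
#include <assert.h>
#include <stddef.h>

/* Spread `clients` across `radios` round-robin.
 * out[i] receives the number of clients assigned to radio i.
 * Hypothetical helper, named here for illustration only. */
void split_round_robin(unsigned clients, size_t radios, unsigned *out)
{
    assert(radios > 0 && out != NULL);
    for (size_t i = 0; i < radios; i++)
        out[i] = 0;
    for (unsigned c = 0; c < clients; c++)
        out[c % radios]++;   /* client c lands on radio c mod radios */
}
```

With six radios, a total that is a multiple of six gives every radio the same count; any remainder goes to the lowest-numbered radios first.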

Measure: per-point, repeatable

60 s window × 3 repetitions per density point
Idle baseline at the same client count (isolates polling from traffic)
perf record of wld + 1 Hz CPU / IO / RSS sampler
Custom uprobes count RPC calls & datamodel write-txns/s
One HTML report + flamecharts per run; 6 acceptance gates

Build → flash → associate → measure → symbolize → report — orchestrated end-to-end by an LLM agent against the live testbed.

Deep dive · the pipeline

Fully automated build → deploy → measure loop

Deploy gate

Verify each deploy by binary fingerprint (prplMesh git hash + .so sha256) and assert that the overlay was wiped. Do not trust opkg versions or DISTRIB_REVISION: both stay unchanged when a package is patched without a version bump.

Determinism

CPU governor pinned to performance, turbo off, CLK_TCK verified. perf samples cycles:pp (PEBS) with --call-graph dwarf,32768. Clients use a static-IP path, so the result is association-bound.

Self-healing

A topology assert re-enables the scaling VAPs, pins channels, and forces ProcessManager on, because reboots silently reset these settings.

How we measure

Flamecharts that actually resolve, plus live counters

Profiling

perf record -e cycles:pp (PEBS precise), --call-graph dwarf,32768
Symbolized on the host with a symfs + DUT kallsyms (cross-perf can't dwarf-unwind)
Per-window unknown-frame % reported — typically <0.1%

Live counters (our addition)

uprobes via tracefs on libwld/libswla exports
Count getStationStats RPC/s and datamodel write-txns/s
1 Hz sampler: CPU (utime+stime), IO, RSS per window

Profiles say where; counters say how often and how much. Together they pin the mechanism.
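The CPU part of the 1 Hz sampler can be sketched as follows. On Linux, fields 14 and 15 of /proc/<pid>/stat are utime and stime in clock ticks; dividing the tick delta by CLK_TCK and by the window length gives the share of one core. The function names are assumptions for illustration, and this is not the harness's own sampler.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract utime + stime (clock ticks) from one /proc/<pid>/stat line.
 * The comm field may contain spaces or parentheses, so parsing starts
 * after the last ')'. Returns 0 on success, -1 on a malformed line. */
int parse_stat_ticks(const char *stat_line, unsigned long long *ticks)
{
    unsigned long long utime, stime;
    const char *p = strrchr(stat_line, ')');

    if (p == NULL || ticks == NULL)
        return -1;
    /* Skip state, ppid, pgrp, session, tty_nr, tpgid (fields 3-8) and
     * flags, minflt, cminflt, majflt, cmajflt (fields 9-13). */
    if (sscanf(p + 1,
               " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %llu %llu",
               &utime, &stime) != 2)
        return -1;
    *ticks = utime + stime;
    return 0;
}

/* CPU usage over one window, as a percentage of a single core. */
double cpu_percent(unsigned long long ticks_start,
                   unsigned long long ticks_end,
                   long clk_tck, double window_s)
{
    assert(clk_tck > 0 && window_s > 0.0 && ticks_end >= ticks_start);
    return 100.0 * ((double)(ticks_end - ticks_start) / (double)clk_tck)
           / window_s;
}
```

A sampler would read the stat line once per second, keep the first and last tick totals of the window, and pass the value of sysconf(_SC_CLK_TCK) as clk_tck.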

Deep dive · instrumentation

uprobes when `perf probe` segfaults

perf probe segfaults on this perf 5.15 musl build — so probes are registered manually via tracefs (/sys/kernel/debug/tracing/uprobe_events) and counted with perf stat -e <probe> -p $(pidof wld).
Functions live in /usr/lib/libwld.so.7.27.8 (the on-DUT plugin is stripped). Offsets = .dynsym st_value (executable PT_LOAD has p_offset==p_vaddr).
DM write-txns are counted on swl_object_finalizeTransactionOnLocalDm in libswla (mapped by 2 processes), not on amxd_trans_apply in libamxd (mapped by 92 processes, so a uprobe there would plant an int3 breakpoint across the whole box).
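A manual registration boils down to two steps: turn the symbol's st_value into a file offset, then write a line of the form p:GROUP/EVENT PATH:OFFSET to uprobe_events. The sketch below shows both, under assumptions: the helper names, the group and event names, and any offset passed in are hypothetical. When the executable PT_LOAD segment has p_offset == p_vaddr, the offset equals st_value, which is the case described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* File offset of a symbol inside the PT_LOAD segment that contains it. */
uint64_t sym_file_offset(uint64_t st_value, uint64_t p_vaddr,
                         uint64_t p_offset)
{
    assert(st_value >= p_vaddr);
    return st_value - p_vaddr + p_offset;
}

/* Build one uprobe_events definition: "p:GROUP/EVENT PATH:0xOFFSET".
 * Returns 0 on success, -1 if the buffer is too small. */
int format_uprobe(char *buf, size_t len, const char *group,
                  const char *event, const char *path, uint64_t file_offset)
{
    int n = snprintf(buf, len, "p:%s/%s %s:0x%llx", group, event, path,
                     (unsigned long long)file_offset);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting line is appended to /sys/kernel/debug/tracing/uprobe_events, after which the event can be counted with perf stat -e GROUP:EVENT.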

The 6 acceptance gates

Monotonic CPU: 18 < 60 < 114
Unknown-frame % within budget
Association count matches the datamodel
Governor / CLK_TCK verified
RSS bounded
Functional parity (clients held, disassoc cleans up)

Every shipped run passed all six on the first pass.
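The monotonicity gate is the simplest of the six to express in code. A minimal sketch, assuming the per-density CPU means are already collected in an array (the function name is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every value is strictly greater than the one before it,
 * e.g. mean wld CPU at 18, 60 and 114 clients. */
bool strictly_increasing(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (!(v[i] > v[i - 1]))
            return false;
    return true;
}
```

A flat or inverted curve usually points at a measurement problem (wrong client count, drifted topology) rather than a real result, which is why it is treated as a gate.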

What we measure & why

The cost is the polling, not the traffic

wld CPU% — the headline cost.
DM write-txns/s — the mechanism (ambiorix transaction churn).
RSS — does footprint scale safely with clients?
Offered-load independence — idle baseline ≈ loaded ⇒ polling, not data.

The agent SIGSTOPs the polling consumer for one window, then resumes it — a reversible, decisive attribution test.

Stop the prplMesh monitor for 20 s @114: wld CPU collapses 42.0% → 1.4%. Clients held; no reboot.
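The pause itself needs nothing more than two signals. The sketch below is an assumption about how such a step could look in C, not the agent's actual tooling: it stops a process, waits out the window, and resumes it. The guard against pid values of 1 or lower is deliberate, since kill() treats 0 and negative values as process groups.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Stop `pid` for `seconds`, then resume it.
 * Returns 0 on success, -1 on a refused pid or a failed signal. */
int pause_process_for(pid_t pid, unsigned seconds)
{
    unsigned left = seconds;

    if (pid <= 1)                 /* never signal groups, everyone, or init */
        return -1;
    assert(seconds > 0);          /* a zero-length pause measures nothing */
    if (kill(pid, SIGSTOP) != 0)
        return -1;
    while (left > 0)
        left = sleep(left);       /* sleep() returns the time left if interrupted */
    return kill(pid, SIGCONT);    /* always attempt the resume */
}
```

Because SIGSTOP cannot be caught or ignored, the consumer is guaranteed to stop polling for the window, and SIGCONT restores it without a restart.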

Deep dive · attribution

Why it scales with stations, not bytes

The prplMesh monitor polls getStationStats ~1×/s per radio. Each call ran a full per-station datamodel write-back of mostly-static fields.


 event loop → USP "operate" RPC → amxd_object_invoke_function
   _getStationStats                    41.9%   (≈57% of resolved wld CPU)
    └ s_addStaStatsValues              36.0%
       └ wld_vap_sync_assoclist        27.3%   ← loops ALL N stations / VAP
          └ wld_ad_syncInfo            25.2%   ← per-station write of 40+ params
             └ finalizeTransaction     22.4%   ← one DM write-txn / station / poll

clients	stations/VAP	syncInfo / call	DM write-txn/s	wld CPU
60	~21	21.1	65.6	28.5%
114	~39	41.1	98.9	42.7%

Write-txns/call ≈ stations-per-VAP ⇒ one datamodel write-transaction per associated station per poll. ~1.9 ms/station-sync, constant — total scales with density, independent of load.

Act 0 · before the main hunt

Retiring the original trace-zone hotspot

Before getStationStats dominated, the top hotspot was sahTraceGetZone + strcmp — a trace-zone lookup on every log check — eating ~28–30% of CPU under DHCP load.

libusp PPW-2084: gate the trace-walk above the effective level
libsahtrace: short-circuit zone-level lookup above max level
MXL trace-zones → 200 (sahtrace option-b)

Collapsed the hotspot to ~1.5% — which uncovered the getStationStats cost underneath.

sahTraceGetZone;strcmp share of CPU, before/after the trace-walk gate.

Deep dive · Act 0 · trace zones

One `int` beats an 80-zone `strcmp` walk

sahtrace registered ~80 trace zones; every trace check walked the whole list + an "all" fallback, strcmp-ing each — even when the message would never print. The hot swl/pwhm netlink path fires ~1036 INFO traces.


/* libsahtrace fork (patch 0100): short-circuit before the zone walk */
if (lvl > maxEffectiveZoneLevel)   /* one int, recomputed at the 4 zone mutators */
    return;                        /* was: walk ~80 zones + "all" fallback, strcmp each */

Why the short-circuit alone gave only −0.8 pp

The mxl zones were pinned at level 400 (INFO) — exactly the hot trace level — so 400 > 400 is false and the dominant path still walked every zone. Config fix (option-b): lower mxl zones 400 → 200 (re-applied each prepare — they revert to 400 on wld restart) ⇒ maxEffectiveZoneLevel < 400 ⇒ the guard fires.
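The interaction between the guard and the zone levels is easy to reproduce in a toy model. The sketch below is not the libsahtrace code: the data structure, function names, and walk counter are assumptions made for illustration. It keeps a cached maximum that is recomputed in the mutator and counts how often a trace check still walks the zone list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ZONES 128

struct zone { const char *name; int level; };

static struct zone zones[MAX_ZONES];
static size_t zone_count;
static int max_effective_level = -1;   /* -1: no zone registered yet */
static unsigned long zone_walks;       /* instrumentation for this sketch */

/* Mutator: set a zone level and recompute the cached maximum. */
void zone_set(const char *name, int level)
{
    size_t i;

    for (i = 0; i < zone_count; i++)
        if (strcmp(zones[i].name, name) == 0)
            break;
    if (i == zone_count) {
        assert(zone_count < MAX_ZONES);
        zones[zone_count++].name = name;
    }
    zones[i].level = level;

    max_effective_level = -1;
    for (size_t j = 0; j < zone_count; j++)
        if (zones[j].level > max_effective_level)
            max_effective_level = zones[j].level;
}

/* Returns 1 if a message at `lvl` in `zone` would be printed. */
int trace_enabled(const char *zone, int lvl)
{
    if (lvl > max_effective_level)
        return 0;                      /* one int compare, no walk */
    zone_walks++;                      /* slow path: strcmp per zone */
    for (size_t i = 0; i < zone_count; i++)
        if (strcmp(zones[i].name, zone) == 0)
            return lvl <= zones[i].level;
    return 0;
}

unsigned long zone_walk_count(void) { return zone_walks; }
```

With any zone at 400, a level-400 check passes the guard and walks the list even if its own zone is quieter; once every zone sits at 200, the same check returns after a single comparison.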

Measured — strict in-place A/B @114

	zones @400	zones @200
wld CPU	18.47%	16.55%
sahtrace zone-lookup	7.06%	0.17%
getStationStats/win	~180	~180

−1.92 pp — offered load identical, so the drop is the lever, not less work.

Caught pre-flight: sah_trace_level is unsigned on this toolchain, so the −1 sentinel promoted to UINT_MAX and silently suppressed all tracing — a perf-only run would have passed. A host equivalence test (now a regression guard) caught it; fixed with a signed cast.
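The hazard is a general C one and can be shown in isolation. The sketch below is a hypothetical reconstruction, not the actual patch: it assumes the maximum is recomputed in a loop that starts from a −1 sentinel, and it writes the implicit conversion out as an explicit cast so the effect is visible. With an unsigned level type, the comparison against −1 becomes a comparison against UINT_MAX, the maximum never rises, and every trace level then looks "above max".

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int trace_level_t;   /* stand-in for an unsigned level type */

/* Buggy recompute: `levels[i] > max` converts max (-1) to UINT_MAX,
 * so the condition is never true and the sentinel survives. */
int max_level_buggy(const trace_level_t *levels, size_t n)
{
    int max = -1;

    assert(levels != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (levels[i] > (trace_level_t)max)   /* what the implicit conversion does */
            max = (int)levels[i];
    return max;
}

/* Fixed recompute: compare in the signed domain. */
int max_level_fixed(const trace_level_t *levels, size_t n)
{
    int max = -1;

    assert(levels != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if ((int)levels[i] > max)
            max = (int)levels[i];
    return max;
}
```

A CPU-only measurement cannot tell the two apart, because suppressing every trace is also cheap. Only a behavioural check, such as the host equivalence test, exposes the difference.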

Improvement #1 shipped + validated

Gate the static datamodel write-back

The per-poll write-back rebuilt 40+ mostly-static identity/capability params every time, because they shared the volatile-sample freshness trigger.

Fix: gate wld_ad_syncInfo's static block on actual change (first sync / assoc-state), not the per-sample timestamp. Steady-state polls now trans_clean instead of apply.

@114	before	after
DM write-txn/s	98.5	3.47
txn / call	~43.5	~1.3
wld CPU	43.95%	28.5%

Datamodel write-transactions/s @114 — a 96% drop to the NrDev floor. wld CPU −15.5 pp (strict A/B, same image ±patch).

Deep dive · Fix #1

Why the freshness gate never fired

Every poll pulls a fresh driver sample → advances lastSampleTime.
The gate at wld_assocdev.c:1929 compares lastSampleSyncTime != lastSampleTime → always true.
So all 40+ static params rebuilt + applied every poll, per station — though identity hasn't changed since association.


/* before: any new sample re-applies the whole static block */
if (pAD->lastSampleSyncTime != pAD->lastSampleTime) { rebuild_40_params(); apply(); }

/* after (0901): only on genuine change; otherwise clean, no apply */
if (first_sync || assoc_state_changed) { rebuild_40_params(); apply(); }
else                                   { amxd_trans_clean(&trans); }   /* steady state */
pAD->lastSampleSyncTime = pAD->lastSampleTime;   /* watermark still advanced */

Result in the profile: wld_ad_syncInfo 25.2% → 0.5%; finalizeTransaction 22.5% → 0.6%; amxd_trans_apply 21.9% → 0.6%. The readback (swla_dm_getObjectParams, ~13.7%) is now the top slice → Fix #2.
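The difference between the two gates can be checked with a small self-contained model. This is a sketch under assumptions: the struct, its fields, and the counter are hypothetical stand-ins, and the real code rebuilds and applies a datamodel transaction where this model only increments a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct station {
    long last_sample_time;        /* advanced by every driver sample */
    long last_sample_sync_time;   /* watermark of the last sync */
    bool synced_once;
    int  assoc_state;
    int  synced_assoc_state;
    unsigned long static_applies; /* counts static-block write-backs */
};

/* Old gate: any new sample re-applies the static block. */
void sync_info_before(struct station *s)
{
    assert(s != NULL);
    if (s->last_sample_sync_time != s->last_sample_time)
        s->static_applies++;
    s->last_sample_sync_time = s->last_sample_time;
}

/* New gate: apply only on first sync or an association-state change. */
void sync_info_after(struct station *s)
{
    assert(s != NULL);
    if (!s->synced_once || s->assoc_state != s->synced_assoc_state) {
        s->static_applies++;
        s->synced_once = true;
        s->synced_assoc_state = s->assoc_state;
    }
    s->last_sample_sync_time = s->last_sample_time;  /* watermark still advances */
}
```

Over ten polls of an unchanged station the old gate applies ten times and the new one once, which matches the shape of the measured drop in write-transactions.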

Improvement #2 shipped + validated

A lean `getStationStatsBrief()` RPC

To build the reply, the RPC re-marshalled the entire ~70-param AssociatedDevice object (+ AffiliatedSta) back out of the datamodel per station — yet the monitor reads only 14 fields.

Fix: a dedicated getStationStatsBrief() that keeps every datamodel write but builds the reply directly from live state, emitting exactly the 14-key contract. The monitor polls it; getStationStats() is untouched.

@114 (A/B)	Fix #1	+ Fix #2
readback slice	11.34%	0%
full-reply builder	34.08%	0%
wld CPU	28.89%	18.39%

Eliminated profile slices @114. wld CPU −10.5 pp (−36%). 14-key values verified identical to the full reply.

Deep dive · Fix #2

Parity by construction, isolated by A/B

Same writes, lean reply

Shared s_addStaStatsPrologue keeps the driver pull + wld_vap_sync_assoclist → syncStats + the Fix #1-gated syncInfo. Only the reply build differs: s_addStaStatsValuesBrief reads live pAD instead of re-reading the DM. getStationStats left byte-for-byte unchanged.

Verified

14 keys per station, values == full reply (checked station B0:75:0C:DA:40:B4). DM write-txn flat 6.7/s (no regression). RSS unchanged (~70 KB/client). Full curve 8.62/18.05/28.89 → 7.01/12.33/18.39. All 6 gates pass.
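A parity check of this kind reduces to: every key in the brief reply must exist in the full reply with the same value. The sketch below assumes both replies have been flattened to key/value pairs with integer values; the struct and function names are hypothetical, and the actual verification tooling is not shown in the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kv { const char *key; long long value; };

/* Linear lookup; replies are small, so no index is needed. */
static const struct kv *find_key(const struct kv *kvs, size_t n,
                                 const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(kvs[i].key, key) == 0)
            return &kvs[i];
    return NULL;
}

/* Returns 1 if every brief key is present in the full reply with an
 * identical value, 0 otherwise. Extra keys in the full reply are fine. */
int brief_matches_full(const struct kv *brief, size_t nb,
                       const struct kv *full, size_t nf)
{
    assert(brief != NULL && full != NULL);
    for (size_t i = 0; i < nb; i++) {
        const struct kv *f = find_key(full, nf, brief[i].key);
        if (f == NULL || f->value != brief[i].value)
            return 0;
    }
    return 1;
}
```

Counters keep moving between two RPC calls, so a check like this is only meaningful when both replies are taken from the same sample.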

Cumulative on the hot path: ~44% → 28.5% → 18.4%

Where we are

~58% off the `getStationStats` hot path

−25.6 pp wld CPU @114 (44% → 18.4%)

−15.5 pp · Fix #1

−10.5 pp · Fix #2

Both fixes shipped to the forks and validated by strict same-image A/B. Monotonic, tight (stdev ≤ 0.14 pp).
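The headline numbers mix two units: percentage points (an absolute difference) and percent (a relative reduction). A small sketch of both calculations, with hypothetical function names, using the measured @114 values from the A/B tables:

```c
#include <assert.h>

/* Absolute change in percentage points (negative means a drop). */
double pp_delta(double before_pct, double after_pct)
{
    return after_pct - before_pct;
}

/* Relative reduction, in percent of the starting value. */
double relative_reduction_pct(double before_pct, double after_pct)
{
    assert(before_pct > 0.0);
    return 100.0 * (before_pct - after_pct) / before_pct;
}
```

For 43.95% → 18.39% this gives roughly −25.6 pp and a 58% relative reduction; for Fix #2 alone, 28.89% → 18.39% is −10.5 pp or about 36%.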

Deep dive · the flamecharts

Before → after, @114, real symbolized profiles

BEFORE — the syncInfo write-back tower (~44% CPU)

AFTER — write-back gated, readback bypassed (~18% CPU)

What remains in "after" is the inherent per-station driver pull, the target of Fix #6.

Improvement #5 parked — premise refuted

When the data said no

Fix #5 targeted alloc churn in the bulk driver pull (wld_nl80211_getAllStationsInfo): drop a max-capacity calloc-zero + a redundant copy. Written, applies clean, compiles -Werror. Zero behaviour change.

Then we measured it. On this MaxLinear platform the bulk path is structurally absent — the vendor HAL pulls stations one-by-one. The code Fix #5 edits is never executed here.

Decision: do not flash. Keep the zero-risk patch on a branch (valid for generic mac80211), but don't disturb the shared baseline for a no-op.

0.000% of wld CPU (the bulk path, all 3 reps @114)

The method disproved its own hypothesis on real silicon — and we let it.

Deep dive · Fix #5

The bulk API is never taken on MaxLinear

mfn_wvap_get_station_stats binds to the MXL HAL, which loops per station — 3 netlink round-trips × N per radio per poll, not one bulk dump.

Fix #5 target search — the bulk path is 0.000% of wld CPU

@114 path	% wld CPU
`wld_nl80211_getAllStationsInfo` (Fix #5 target)	0.000%
`whm_mxl_vap_getSingleStationStats` (actual MXL pull, ×N)	22–46%

What's next

Fix #6 — collapse the per-station netlink storm

After Fix #1+#2, the residual is the driver pull itself: on MXL, 3 netlink round-trips per station (generic GET_STATION + vendor DEV_DIAG3 + PEER_FLOW) ≈ 351 round-trips/sweep @114.

DEV_DIAG3 (~16%) is redundant for the 14-key brief.
The sahTraceGetZone class resurfaces (~8%) inside the vendor attr-list build.
Fix #6: one new bulk vendor subcmd carrying all 14 fields in one call/VAP (3N → 1), parity-by-construction.
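The scale of the change is plain arithmetic: the per-station pull costs a fixed number of round-trips per station, while a bulk subcommand costs one per VAP. A sketch with hypothetical function names, where the VAP count is an input, not a figure taken from the source:

```c
#include <assert.h>

/* Round-trips per sweep when each station needs `calls_per_station` calls. */
unsigned long per_station_round_trips(unsigned long stations,
                                      unsigned long calls_per_station)
{
    assert(calls_per_station > 0);
    return stations * calls_per_station;
}

/* Round-trips per sweep when one bulk call serves a whole VAP. */
unsigned long bulk_round_trips(unsigned long vaps)
{
    return vaps;
}
```

For 114 stations and 3 calls each the nominal count is 342, close to the ≈351 quoted above; the source does not say what accounts for the difference. With a bulk call the count no longer depends on the number of stations at all, only on the number of VAPs polled.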

Projected: wld CPU on the brief path ~18.4% → ~8–10%. Plus Fix #3 (poll-cadence) as an orthogonal config-only lever under investigation.

Deep dive · Fix #6

All 14 fields are reproducible from one bulk read

The driver's nl80211 GET_STATION is itself fed by the same tr181 helpers the vendor subcmds use. Every brief key except DownlinkBandwidth derives from mtlk_sta_get_tr181_peer_stats ⇒ a new bulk subcmd is parity-by-construction.

source	brief keys covered
nl80211 `GET_STATION`	Tx_Retransmissions, Tx/RxErrors, DownlinkBandwidth
vendor `devDiag3` (droppable)	Tx/RxPacketCount, SignalNoiseRatio
vendor `peerFlow`	Tx/RxBytes, LastData*Rate, SignalStrength, TxUnicastPacketCount

Snapshot-safe: tr181 deltas read read-only; reset only on RESET_STATISTICS — no double-consume. The unused lean GET_ASSOCIATED_STATIONS_STATS covers only ~6/14 with absolute (not delta) semantics → a new subcmd, not reuse.

The method · agent + human

An LLM agent drove the whole testbed

3 days · 4 repos · real silicon

metric	value
Calendar span	3 days (Jun 20–22)
Agent turns	8,357
User messages	3,765
Commits (this repo)	56
Agent-active	~25 h
Human-attended (est.)	~23–25 h

Human time estimated from session message density (interactive supervision) — not separately logged; an upper bound where sessions overlapped.

Tokens & cost

~1.61 B tokens total — dominated by 1,551 M cache-reads (×0.1 price).

Deep dive · cost

List-price equivalent — and what caching saved

≈ $4,144 list-price equivalent (Opus 4.8, cache-aware)

≈ $20.8k saved by prompt caching

Uncached the same tokens would cost ~$24,937. Cache-reads alone: $2,327 (would be $23,266). Figure is a floor — long-context (>200K) turns bill higher.
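The cache-read figures can be cross-checked with a short calculation. The per-million-token price used below (15 USD) is not stated in the source; it is inferred from the quoted numbers ($23,266 for 1,551 M tokens) and should be treated as an assumption. The function names are hypothetical.

```c
#include <assert.h>

/* Cost in USD for `mtokens` million tokens at `usd_per_mtok`,
 * scaled by a price multiplier (0.1 for cache reads, 1.0 uncached). */
double token_cost_usd(double mtokens, double usd_per_mtok, double multiplier)
{
    assert(mtokens >= 0.0 && usd_per_mtok >= 0.0 && multiplier >= 0.0);
    return mtokens * usd_per_mtok * multiplier;
}

/* Saving from caching: uncached total minus cache-aware total. */
double caching_saving_usd(double uncached_total, double cached_total)
{
    return uncached_total - cached_total;
}
```

At the inferred price, 1,551 M cache-read tokens cost about $2,327 with the ×0.1 multiplier and about $23,265 without it, and $24,937 − $4,144 gives the ≈ $20.8k saving.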

Where we are — and where it goes

shipped

Fix #1 + Fix #2 — 44% → 18.4% wld CPU @114, validated by strict A/B on real hardware.

parked

Fix #5 — zero-risk, but measured to a no-op on MXL. Kept for generic mac80211.

next

Fix #6: bulk vendor subcmd, 3N→1, projected ~8–10% wld CPU on the brief path. Fix #3 (poll cadence) is still under study.

A repeatable real-device harness + disciplined measurement turned a vague "WiFi is slow at scale" into named, file-line root causes and validated fixes — most of it automated.
