Privacy Guard: Learning Privacy-Budgeted Active Sensing Policies via Reinforcement Learning in Smart Home Environments

Mohammed Alshehri — March 2026

Abstract

We present Privacy Guard, a reinforcement learning framework for privacy-budgeted active sensing in smart homes. The agent detects intrusions by selectively activating cameras and microphones under a finite privacy budget, balancing security with minimal surveillance. Using GRPO, we run 12 experiments varying architecture (4B dense vs. 30B MoE), reasoning mode (standard vs. chain-of-thought), curriculum difficulty, and episode length. The main result is that training distribution matters more than model scale: a 4B model trained on medium-difficulty scenarios for 100 steps reaches 0.912 peak reward and 86.4% detection, outperforming a larger 30B MoE and substantially exceeding rule-based baselines. Chain-of-thought fails in this token-constrained multi-turn setting (97% truncation). Longer-horizon tests show strong detection generalisation but weaker privacy-budget management, identifying temporal budget planning as the key remaining challenge.

Introduction and Motivation

Your home camera is probably watching you right now. Not because you’re in danger — because nobody programmed it to know the difference.

The proliferation of smart home sensors — doorbell cameras, PIR motion detectors, always-listening voice assistants — has made residential security powerful and pervasive at the same time. Rich sensor data enables reliable intrusion detection. But continuous, high-fidelity capture of in-home activity is also a form of surveillance that most people find deeply uncomfortable, and that regulators (GDPR Article 25, CCPA) explicitly constrain through data minimisation principles.
Two Bad Bets

Classic security systems resolve this tension through one of two degenerate strategies:

| Strategy | What it does | The problem |
|---|---|---|
| Always-On | Cameras + mics at full resolution, always | Perfectly surveils your entire life |
| Event-Triggered | Fires when PIR crosses a threshold | Misses stealth intrusions by design |

Neither strategy treats privacy as a finite resource to be allocated. Yet that is exactly how occupants experience it. A resident may accept the system recording the hallway at 2 AM when the door opens. They are far less accepting of a system that captures their kitchen, living room, and bedroom in high resolution throughout every waking hour.


A Better Frame: Graph Search Over Time

The real question isn’t whether to activate sensors — it’s when, where, and at what fidelity. That’s a sequential decision problem. At every timestep, an agent observes the home and must search through possible actions, reserving budget for the moments that actually matter:

graph LR
    A(["Observe: PIR, Door, Audio, Budget"]) --> B{"Threat level?"}
    B -- Low --> C["Sensors OFF - Save budget"]
    B -- Ambiguous --> D["LOW-res camera - Cost: 1 unit"]
    B -- Confirmed --> E["HIGH-res + ESCALATE - Cost: 4-6 units"]
    C --> F(["Next timestep"])
    D --> F
    E --> F
    F --> A

This graph loops for every timestep in an episode. Budget spent on step 3 is unavailable at step 17 when a real threat arrives — creating genuine intertemporal commitment. No fixed-threshold rule set can navigate this. It requires learning which signals actually matter, and when spending is worth it.
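The loop above can be sketched in a few lines. This is a minimal illustrative version, not the trained policy: the action costs follow the diagram (LOW-res = 1 unit, HIGH-res + ESCALATE = 4-6, fixed at 5 here), while the threat heuristic, thresholds, and observation fields are assumptions standing in for what the RL agent learns.

```python
from dataclasses import dataclass

# Action costs per the diagram; HIGH_RES_ESCALATE fixed at 5 (diagram says 4-6).
ACTION_COST = {"OFF": 0, "LOW_RES": 1, "HIGH_RES_ESCALATE": 5}

@dataclass
class Observation:
    pir: bool           # passive motion detector fired
    door: bool          # door contact opened
    audio_level: float  # ambient audio energy in [0, 1]

def threat_level(obs: Observation) -> str:
    """Toy threat assessment; the trained policy replaces this heuristic."""
    if obs.door and obs.pir:
        return "confirmed"
    if obs.pir or obs.audio_level > 0.5:
        return "ambiguous"
    return "low"

def choose_action(obs: Observation, budget: int) -> str:
    """Spend only when the threat level and remaining budget justify it."""
    level = threat_level(obs)
    if level == "confirmed" and budget >= ACTION_COST["HIGH_RES_ESCALATE"]:
        return "HIGH_RES_ESCALATE"
    if level == "ambiguous" and budget >= ACTION_COST["LOW_RES"]:
        return "LOW_RES"
    return "OFF"  # save budget for later timesteps

# One short episode: budget spent early is unavailable later.
budget = 6
for obs in [Observation(False, False, 0.1),   # quiet night
            Observation(True, False, 0.2),    # ambiguous motion
            Observation(True, True, 0.8)]:    # likely intrusion
    action = choose_action(obs, budget)
    budget -= ACTION_COST[action]
    print(action, budget)
```

The last line of the trace is the intertemporal commitment in miniature: after escalating on a confirmed threat, the budget hits zero, so any earlier frivolous spend would have left the real intrusion uncovered.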


The Gap No Classical Policy Reaches

Before any training, we benchmarked six classical strategies — always-on, event-triggered, audio-gated, random, and two rule-based variants. None of them occupies the upper-right region of the detection-privacy tradeoff:

*Figure: the Pareto frontier of detection vs. privacy for all baseline policies.*
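For concreteness, two of the rule-based baselines named above can be sketched as one-line trigger rules. The thresholds and exact gating logic here are illustrative assumptions, not the paper's implementations; the point is that each rule is a fixed mapping from a single signal to a sensor action, with no budget reasoning at all.

```python
# Hypothetical sketches of two classical baselines (assumed logic, not the paper's).

def event_triggered(pir_fired: bool) -> str:
    # Records whenever the PIR threshold is crossed; blind to stealth entry.
    return "HIGH_RES" if pir_fired else "OFF"

def audio_gated(audio_level: float, threshold: float = 0.6) -> str:
    # Gates the camera on ambient audio energy instead of motion.
    return "HIGH_RES" if audio_level > threshold else "OFF"

print(event_triggered(True), audio_gated(0.3))
```

Because neither rule conditions on remaining budget or time-of-day context, each sits at a fixed point on the detection-privacy plane rather than adapting along it.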