System-Level Safety Monitoring and Recovery for Perception Failures in Autonomous Vehicles
Kaustav Chakraborty*1
Zeyuan Feng*1
Sushant Veer2
Apoorva Sharma2
Boris Ivanovic2
Marco Pavone2
Somil Bansal1
[Paper]
[GitHub]
Submitted to ICRA 2025.

Abstract

The safety-critical nature of autonomous vehicle (AV) operation necessitates the development of task-relevant algorithms that can reason about safety at the system level and not just at the component level. To reason about the impact of a perception failure on overall system performance, such task-relevant algorithms must contend with several challenges: the complexity of AV stacks, high uncertainty in the operating environments, and the need for real-time performance. To overcome these challenges, in this work we introduce a Q-network called SPARQ (Safety evaluation for Perception And Recovery Q-network) that evaluates the safety of a plan generated by a planning algorithm, accounting for perception failures that the planning process may have overlooked. This Q-network can be queried at runtime to assess whether a proposed plan is safe for execution or poses potential safety risks. If a violation is detected, the network can then recommend a corrective plan while accounting for the perceptual failure. We validate our algorithm on the nuPlan Vegas dataset, demonstrating its ability to handle cases where a perception failure compromises a proposed plan while the corrective plan remains safe. We observe an overall accuracy and recall of 90% while sustaining a frequency of 42 Hz on the unseen testing dataset. We compare our performance to a popular reachability-based baseline and analyze some interesting properties of our approach in improving the safety of an AV pipeline.



Motivation

Although a plethora of works have developed monitors aimed at detecting perception failures, triggering a fail-safe maneuver for every detected perception error is impractical and detrimental to the AV's navigation goals. Not all perception failures are equal: some bear no impact on the AV's safety (e.g., a missed parked vehicle far from the AV's motion plan), while others can be catastrophic (e.g., a missed pedestrian along the AV's motion plan). In this work, we address this challenge by developing a Q-network-based monitor called SPARQ that evaluates the severity of task-relevant perception failures and reacts to those identified as safety-critical by repairing the ego plans. Specifically, SPARQ approximates the safety scores generated by a plausible-scene generator, a safety assessor, and a recovery planner using a lightweight transformer-based encoder-decoder architecture. Owing to this neural approximation, it enables real-time assessment of the system-level impact of perception failures as well as real-time recovery plan generation.



Approach


SPARQ is modeled as a transformer-based encoder-decoder network that takes the scene, a candidate ego plan, and the perception failure as monitor inputs and classifies the plan into one of three classes: safe, risky, or critical. Using the SPARQ network, our algorithm first checks whether a candidate plan is deemed unsafe by SPARQ. It then performs plan repair for unsafe plans by searching over a set of dynamically feasible plan primitives and returning the best one while accounting for the perception error at runtime.
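The monitor-and-repair logic above can be sketched as a short runtime loop. This is a minimal illustration, not the paper's implementation: `sparq` stands in for the trained network and is assumed to return a (class label, risk score) pair per plan; all names are hypothetical.

```python
# Illustrative sketch of SPARQ's runtime monitor-and-repair loop.
# `sparq(scene, plan, failure)` is assumed to return (label, risk),
# where label is one of the three classes below; names are hypothetical.
SAFE, RISKY, CRITICAL = 0, 1, 2

def monitor_and_repair(sparq, scene, candidate_plan, failure, primitives):
    """Return the candidate plan if SPARQ deems it safe; otherwise repair it."""
    label, _ = sparq(scene, candidate_plan, failure)
    if label == SAFE:
        return candidate_plan                 # plan passes the monitor
    # Plan repair: score every dynamically feasible plan primitive.
    scored = [(sparq(scene, p, failure), p) for p in primitives]
    safe = [(risk, p) for (lbl, risk), p in scored if lbl == SAFE]
    if safe:                                  # prefer the safest safe primitive
        return min(safe, key=lambda t: t[0])[1]
    # No safe primitive exists: fall back to the least risky plan.
    return min(scored, key=lambda t: t[0][1])[1]
```

Representing plans as opaque objects keeps the filter agnostic to the primitive library's parameterization; only the network's scores drive the selection.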
SPARQ evaluates the safety of plans by adding spatial and temporal encodings to each input feature, then passing the unified embeddings through a transformer-based attention mechanism, a decoder, and finally an aggregation MLP. SPARQ is trained via supervised learning on a modified version of the nuPlan Vegas dataset. Further details are available in the paper.
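The data flow through the network can be sketched in a few lines of NumPy. This is a toy single-head, single-layer stand-in for the actual architecture (all dimensions, parameter names, and the random encodings are illustrative assumptions, not the paper's design); it only shows how the inputs are unified, encoded, decoded, and aggregated into three class logits.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                     # illustrative embedding width

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def sparq_forward(scene_feats, plan_feats, failure_feats, params):
    # 1) Unify the inputs into one token sequence, add spatial/temporal codes.
    tokens = np.concatenate([scene_feats, plan_feats, failure_feats], axis=0)
    tokens = tokens + params["spatial_enc"][: len(tokens)] \
                    + params["temporal_enc"][: len(tokens)]
    # 2) Transformer-style self-attention over the unified embeddings.
    enc = attention(tokens, tokens, tokens)
    # 3) Decoder: learned queries cross-attend to the encoded sequence.
    dec = attention(params["query"], enc, enc)
    # 4) Aggregation MLP -> logits over {safe, risky, critical}.
    h = np.maximum(dec.mean(axis=0) @ params["W1"], 0.0)
    return h @ params["W2"]

params = {
    "spatial_enc": rng.normal(size=(64, D)),
    "temporal_enc": rng.normal(size=(64, D)),
    "query": rng.normal(size=(4, D)),
    "W1": rng.normal(size=(D, 32)),
    "W2": rng.normal(size=(32, 3)),
}
logits = sparq_forward(rng.normal(size=(10, D)),   # agent/scene tokens
                       rng.normal(size=(6, D)),    # candidate-plan tokens
                       rng.normal(size=(1, D)),    # perception-failure token
                       params)
```

Note that because the plan and the perception failure are just extra tokens in the sequence, the same forward pass scores any (scene, plan, failure) triple without architectural changes.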

Dataset collection for training starts by sampling a perception-failure position in the ground-truth (GT) scene. If an agent lies near the sampled position, that agent becomes the perception failure in the monitored scene while remaining unchanged in the GT scene for evaluation. Conversely, if no agent lies near the sampled position, we inject a perception failure in the monitored scene and place an extra agent in the GT scene at that position. The monitored scenes are made available to the planner, while the plans are evaluated on the GT scenes to yield the safety scores. Finally, we store the safety scores along with agent histories, candidate plans, monitor inputs, and road-graph elements in the dataset.
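The two branches of this sampling procedure can be summarized in a small sketch. All names, the matching radius, and the 2D-point scene representation are simplifying assumptions for illustration; the actual pipeline operates on full nuPlan scenes.

```python
import math
import random

NEARBY_M = 2.0   # illustrative matching radius, not the paper's value

def make_scene_pair(gt_agents, map_extent, rng=None):
    """Build a (monitored, GT) scene pair with one injected perception failure.

    `gt_agents` is a list of (x, y) agent positions; the sampled position
    `pos` serves as the monitor input in both branches.
    """
    rng = rng or random.Random(0)
    pos = (rng.uniform(0, map_extent), rng.uniform(0, map_extent))
    near = [a for a in gt_agents if math.dist(a, pos) < NEARBY_M]
    if near:
        # An agent exists near the sample: hide it from the monitored scene
        # (a missed detection) but keep it in the GT scene for evaluation.
        monitored = [a for a in gt_agents if a != near[0]]
        gt = list(gt_agents)
    else:
        # No agent nearby: the monitored scene stays as perceived, and the
        # GT scene gains an extra agent at the sampled position.
        monitored = list(gt_agents)
        gt = list(gt_agents) + [pos]
    return monitored, gt, pos
```

Either branch yields the same invariant: the GT scene always contains exactly one agent more than the monitored scene, which is the agent the planner fails to see.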



Results

The gif above illustrates how SPARQ acts as a general-purpose safety filter in the presence of perception errors. If SPARQ cannot find a safe plan among all the primitives, it returns the least risky plan.

We further compare SPARQ with FRT, an HJ-reachability baseline, evaluating with five metrics: accuracy, precision, recall, F1 score, and runtime. Accuracy measures the correctness of the detections, precision measures the quality of positive detections (soundness), recall measures the ability to capture all positive detections (completeness), and the F1 score balances precision and recall to provide a single performance measure for detection. The metrics are averaged over the three prediction classes.
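For concreteness, the class-averaged (macro) versions of these metrics can be computed as follows; this is a standard textbook computation, shown here only to make the averaging over the three prediction classes explicit.

```python
CLASSES = ("safe", "risky", "critical")

def macro_metrics(y_true, y_pred):
    """Accuracy plus precision/recall/F1 macro-averaged over the classes."""
    prec, rec, f1 = [], [], []
    for c in CLASSES:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p_ = tp / (tp + fp) if tp + fp else 0.0   # soundness per class
        r_ = tp / (tp + fn) if tp + fn else 0.0   # completeness per class
        prec.append(p_)
        rec.append(r_)
        f1.append(2 * p_ * r_ / (p_ + r_) if p_ + r_ else 0.0)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(CLASSES)
    return acc, sum(prec) / n, sum(rec) / n, sum(f1) / n
```

Macro averaging weights each class equally, which matters here because safety-critical cases are much rarer than safe ones.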

The FRT and Two-Player Game baselines exhibit high recall for the safety-critical class, ensuring safety, but tend to issue false alarms, as evidenced by their lower precision. SPARQ, on the other hand, balances safety and performance, as shown by its consistent scores across all metrics. Furthermore, SPARQ achieves a strong AUROC of 0.9 for all three classes, nearing the performance of an ideal detector (AUROC = 1). Interestingly, SPARQ also demonstrated an unexpected ability to detect failure cases not caused by perception failures, functioning as a general-purpose safety filter: it filtered 96% of plans from the base planner that would have led to safety violations, even without perception failures. We hypothesize that the diversity of our dataset enabled SPARQ to identify fundamental features causing safety failures.
Although SPARQ is more time-consuming than the FRT baseline, its inference time is still significantly smaller than that of prior work (the RHPlanner-based pipeline), making it feasible for real-time deployment in AV stacks. Our method can classify the task-relevant safety scores of approximately 256 plans for a given scenario in 0.024 s (versus 0.05 s for RHPlanner). Additionally, it requires only 0.006 s to propose a safer alternative plan if the original candidate plan is deemed unsafe.


Acknowledgements

1The authors are with the Department of Electrical and Computer Engineering at the University of Southern California.
2The authors are with NVIDIA Research.
The webpage template was borrowed from the colorful ECCV project page.