XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling

Tran, Tony; Suganda, Richie Ryu; Hu, Bin

Tony Tran¹, Richie Ryu Suganda², Bin Hu²,³

¹Cullen College of Engineering Research Computing, University of Houston
²Department of Electrical and Computer Engineering, University of Houston
³Department of Engineering Technology, University of Houston

XiYOLO energy-aware NAS framework overview

Energy-aware NAS framework. Candidate detectors from a searchable energy-aware architecture space are ranked by an mAP proxy and a two-stage energy estimator, then refined iteratively to obtain scalable models under hardware-specific energy constraints.

Abstract

Object detection on heterogeneous edge devices must satisfy strict energy, latency, and memory constraints while still providing reliable perception for downstream autonomy. Existing energy-aware NAS methods often target limited deployment settings, while real energy remains difficult to optimize because it is highly device-dependent and costly to measure. We address these challenges with an energy-adaptive framework that combines an energy-aware XiResOFA search space, a two-stage energy estimator, and iterative search to identify a single energy-efficient base architecture. We then apply compound scaling to transform this base design into the XiYOLO family across deployment budgets, enabling interpretable accuracy–energy tradeoffs under sparse hardware measurements.

Experiments on PascalVOC, COCO, and real-device deployment show that XiYOLO achieves a stronger energy–accuracy tradeoff than YOLO baselines. On PascalVOC, the medium XiYOLO model reaches 86.15 mAP50 while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU. On COCO, XiYOLO reduces energy relative to YOLOv12 by up to 53.7% on GPU and 51.6% on NPU at the small scale. The proposed two-stage estimator also improves sample efficiency over a joint predictor under few-shot adaptation with only 2–20 target-device samples.

Contributions

We formulate energy-adaptive perception as the design of an energy-efficient base detector architecture that can be scaled across deployment budgets on heterogeneous edge devices. We instantiate this idea with XiYOLO, whose medium PascalVOC model achieves 86.15 mAP50, the highest among all compared medium-sized detectors.
We propose an iterative neural architecture search framework for object detection that progressively searches the backbone, FPN, and PAN to identify an energy-efficient base architecture. After scaling, XiYOLO achieves strong accuracy–energy tradeoffs on both PascalVOC and COCO.
We introduce an energy-aware search space based on a novel XiResOFA block with controllable compression ratio, kernel size, and lite/full attention choices, enabling detector search over interpretable accuracy–energy tradeoffs. Randomly sampled XiResOFA medium models reduce average energy by 21.1% relative to the YOLOv12 medium baseline.
We develop a two-stage energy approximation model that combines a generic architecture–device energy prior with a lightweight hardware-specific residual model, validated on HW-NAS-Bench under few-shot adaptation with only 2–20 target-device samples.

XiResOFA Search Block

A central design requirement of the search space is that its architectural choices should have a direct and interpretable relationship to both detection quality and hardware energy. The proposed XiResOFA block preserves a residual shortcut while introducing configurable architectural choices compatible with once-for-all search. This design makes the block both trainable and searchable, while allowing its internal structure to adapt to different accuracy–energy operating points.

(a) Default Bottleneck Block

(b) XiResOFA Block

Comparison between the standard bottleneck block and the proposed XiResOFA block. XiResOFA augments a residual design with elastic architectural choices for energy-aware detector search.

Elastic Architectural Choices

Each XiResOFA block is parameterized by a tuple (r, k, t), where r ∈ {1, 0.5, 0.25} is the compression ratio, k ∈ {1, 3, 5} is the kernel size, and t ∈ {lite, full} is the attention type. These dimensions induce clear accuracy–energy tradeoffs across compute, memory traffic, and feature recalibration.
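The per-block choice set can be written out explicitly. The following is a minimal sketch, not the authors' implementation: the names `XiResOFAConfig`, `all_block_configs`, and `sample_architecture` are hypothetical, and it assumes each block's (r, k, t) tuple is chosen independently of the other blocks.

```python
import itertools
import random
from typing import List, NamedTuple


class XiResOFAConfig(NamedTuple):
    """Hypothetical container for one block's elastic choices."""
    r: float  # compression ratio
    k: int    # kernel size
    t: str    # attention type


# Value sets taken from the text above.
COMPRESSION_RATIOS = (1.0, 0.5, 0.25)
KERNEL_SIZES = (1, 3, 5)
ATTENTION_TYPES = ("lite", "full")


def all_block_configs() -> List[XiResOFAConfig]:
    """Enumerate every (r, k, t) combination for a single block."""
    return [
        XiResOFAConfig(r, k, t)
        for r, k, t in itertools.product(
            COMPRESSION_RATIOS, KERNEL_SIZES, ATTENTION_TYPES
        )
    ]


def sample_architecture(num_blocks: int, rng: random.Random) -> List[XiResOFAConfig]:
    """Draw one config per block, assuming blocks are chosen independently."""
    choices = all_block_configs()
    return [rng.choice(choices) for _ in range(num_blocks)]


configs = all_block_configs()
arch = sample_architecture(num_blocks=4, rng=random.Random(0))
```

Under the independence assumption, each block has 3 × 3 × 2 = 18 options, so a detector with n searchable blocks has 18^n candidate configurations.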

(a) Compression Ratio

(b) Kernel Size

(c) Lite / Full Attention

Two-Stage Energy Estimation

A major obstacle in energy-aware detector search is that true hardware energy is expensive to measure at scale. Simple analytical surrogates such as FLOPs, MACs, or parameter count are insufficient because measured energy also depends on memory traffic, intermediate activations, operator composition, and hardware-specific execution behavior.

We decompose energy prediction into a generic prior and a device-specific correction: Ê(a, h) = f_base(a) + r(a, h), where a denotes the candidate architecture and h the target hardware. The first stage, f_base, captures transferable architecture-level energy structure, while the second stage, r, adapts that prior to device-specific effects such as backend implementation and memory hierarchy. The residual can be calibrated with only a few measured samples on the target device.

Sparse energy samples from the target hardware are combined with architecture and hardware encodings to learn a generic energy prior and a device-specific residual.
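The additive decomposition Ê(a, h) = f_base(a) + r(a, h) can be illustrated with a toy example. This is a sketch under stated assumptions rather than the estimator used in the paper: the base prior is a stand-in function, the residual for one fixed device is assumed to be affine in the base prediction, and all names (`fit_residual`, `make_estimator`, `toy_base`) and numbers are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Arch = Sequence[float]  # hypothetical architecture feature vector


def fit_residual(
    f_base: Callable[[Arch], float],
    samples: List[Tuple[Arch, float]],
) -> Callable[[Arch], float]:
    """Fit r(a) = alpha + beta * f_base(a) for one device by least squares.

    `samples` holds (architecture, measured energy) pairs from the target
    device. The affine form of the residual is an assumption of this sketch.
    """
    xs = [f_base(a) for a, _ in samples]
    ys = [energy - x for (_, energy), x in zip(samples, xs)]  # residual targets
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        beta = 0.0  # degenerate case: fall back to a constant offset
    else:
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        beta = cov / var_x
    alpha = mean_y - beta * mean_x
    return lambda a: alpha + beta * f_base(a)


def make_estimator(
    f_base: Callable[[Arch], float],
    residual: Callable[[Arch], float],
) -> Callable[[Arch], float]:
    """Combine the generic prior and the device-specific residual."""
    return lambda a: f_base(a) + residual(a)


def toy_base(a: Arch) -> float:
    """Stand-in for a pretrained generic prior (toy, not a real model)."""
    return float(sum(a))


# Three toy "measurements" from a device whose true energy is 1.5 * base + 2.
few_shot = [([1.0, 1.0], 5.0), ([2.0, 2.0], 8.0), ([3.0, 1.0], 8.0)]
estimate = make_estimator(toy_base, fit_residual(toy_base, few_shot))
predicted = estimate([5.0, 5.0])
```

Because only the two residual parameters are fitted on the target device, a handful of measurements is enough to calibrate the correction in this toy setting.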

Accuracy–Energy Tradeoffs

We compare XiYOLO against YOLOv5, YOLOv8, YOLO11, and YOLOv12 on the ModalAI Sentinel Development Drone, which contains a Qualcomm QRB5165 CPU, a Qualcomm Adreno 650 GPU, and a 15 TOPS NPU. All models are exported to FP16 TFLite, and we measure energy per inference, latency, and detection accuracy. Power is monitored using the VOXL Power Module v3.
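Energy per inference can be derived from a sampled power trace by integrating power over time and dividing by the number of inferences in the window. The sketch below assumes the power monitor yields timestamps in seconds and power in watts; that logging format and the function names are assumptions made for illustration, not a description of the authors' measurement pipeline.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate a sampled power trace with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_per_inference(
    timestamps_s: Sequence[float],
    power_w: Sequence[float],
    num_inferences: int,
) -> float:
    """Average energy per inference over the measured window."""
    if num_inferences <= 0:
        raise ValueError("num_inferences must be positive")
    return energy_joules(timestamps_s, power_w) / num_inferences


# Toy trace: constant 4 W over 2 s while running 4 inferences.
toy_energy = energy_per_inference([0.0, 1.0, 2.0], [4.0, 4.0, 4.0], num_inferences=4)
```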

PascalVOC Pareto frontier on GPU and NPU

PascalVOC. XiYOLO achieves the strongest overall tradeoff on both GPU (left) and NPU (right). At the medium scale, XiYOLO reaches 86.15 mAP50, the highest among the compared models and 0.07 mAP50 above YOLOv12m, while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU.

COCO. At the small scale, XiYOLO reduces energy relative to YOLOv12s by 53.7% on GPU and 51.6% on NPU, while improving accuracy by +1.6% mAP50-95 over YOLOv8s on GPU and +2.4% mAP50-95 over YOLOv12s on NPU.
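A Pareto frontier over (energy, accuracy) pairs, as used in the comparisons above, keeps only models that no other model beats on both axes. The sketch below shows one common way to compute it. The model names and numbers are placeholders and are not results from the paper.

```python
from typing import List, Tuple

Model = Tuple[str, float, float]  # (name, energy per inference, accuracy)


def pareto_front(models: List[Model]) -> List[Model]:
    """Return models not dominated in (lower energy, higher accuracy).

    Sort by energy ascending (ties broken by higher accuracy first), then
    keep a model only if it is more accurate than every cheaper model.
    """
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    front: List[Model] = []
    best_acc = float("-inf")
    for name, energy, acc in ordered:
        if acc > best_acc:
            front.append((name, energy, acc))
            best_acc = acc
    return front


# Placeholder values for illustration only.
toy_models = [
    ("A", 1.0, 70.0),
    ("B", 2.0, 65.0),  # dominated by A: more energy, less accuracy
    ("C", 3.0, 80.0),
    ("D", 3.0, 75.0),  # dominated by C: same energy, less accuracy
]
front_names = [name for name, _, _ in pareto_front(toy_models)]
```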

Sustained Deployment Energy

Both XiYOLO-VOC and XiYOLO-COCO consume substantially less cumulative energy over time than the YOLO baselines on both the GPU and the NPU. By the end of the run, XiYOLO-COCO has used roughly 65.6 J less cumulative energy than the strongest YOLO baseline on GPU and about 165.3 J less on NPU.

Cumulative energy over time on GPU and NPU

Cumulative energy consumption over time for XiYOLO and other YOLO-series detectors at the medium model scale, on GPU (left) and NPU (right).

Few-Shot Energy Estimation on HW-NAS-Bench

We evaluate the proposed two-stage energy estimator under limited device-specific supervision on HW-NAS-Bench, using the NAS-Bench-201 search space with EdgeGPU and Eyeriss as target devices. For each target, we pretrain on the remaining hardware targets and fine-tune using only 2–20 target-device samples. The two-stage estimator achieves lower test RMSE than a joint model across nearly the full few-shot range, indicating more sample-efficient adaptation.

HW-NAS-Bench few-shot adaptation results

Two-stage energy estimator vs. joint model under few-shot adaptation with 2–20 samples on HW-NAS-Bench, on EdgeGPU and Eyeriss hardware targets.
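The evaluation protocol described in this section (pretrain on the other hardware targets, adapt on k target-device samples, report test RMSE on the rest) can be sketched as a data split plus a metric. The dictionary layout, the function names, and the toy values below are assumptions for illustration; they do not reflect the HW-NAS-Bench file format or API.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, float]  # (architecture index, measured energy), hypothetical layout


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between predictions and measurements."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def few_shot_split(
    measurements: Dict[str, List[Row]],
    target_device: str,
    k: int,
    rng: random.Random,
) -> Tuple[Dict[str, List[Row]], List[Row], List[Row]]:
    """Hold out one device; return (pretrain data, k-shot support, test rows)."""
    pretrain = {d: rows for d, rows in measurements.items() if d != target_device}
    target_rows = list(measurements[target_device])
    rng.shuffle(target_rows)
    return pretrain, target_rows[:k], target_rows[k:]


# Toy measurements; device keys other than the target are placeholders.
toy = {
    "edgegpu": [(i, 1.0 + i) for i in range(10)],
    "eyeriss": [(i, 2.0 * i) for i in range(10)],
    "other_device": [(i, 0.5 * i) for i in range(10)],
}
pretrain, support, test_rows = few_shot_split(toy, "eyeriss", k=2, rng=random.Random(0))
```

Repeating the split for k between 2 and 20 and for several random seeds would give an RMSE-versus-k curve per model in the spirit of the comparison above.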

Search Space Characteristics

Compared with the YOLOv12 medium baseline, randomly sampled XiResOFA models consistently occupy a lower-energy region, with average energy reduced by 21.1%. The iterative search space yields about 3.5% higher average predicted mAP50 than the global search space, suggesting that iterative decomposition concentrates search on stronger candidate regions.

(a) Energy distribution of randomly sampled XiResOFA medium models.

(b) Predicted accuracy of sampled models from global and iterative search spaces.
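The difference between global and iterative search can be made concrete with a stage-wise sketch: instead of scoring every joint (backbone, FPN, PAN) combination, each component is searched in turn while the others stay fixed. This is a simplified single-pass illustration under assumed interfaces. The proxy functions, the integer "options", and the energy budget are all hypothetical, and the actual framework may use a different refinement schedule.

```python
from typing import Callable, Dict, List

Design = Dict[str, int]  # hypothetical: one integer option per component

STAGES = ("backbone", "fpn", "pan")


def iterative_search(
    candidates: Dict[str, List[int]],
    init: Design,
    accuracy_proxy: Callable[[Design], float],
    energy_estimate: Callable[[Design], float],
    energy_budget: float,
) -> Design:
    """Search one component at a time, keeping the others fixed."""
    best = dict(init)
    for stage in STAGES:
        stage_best = None
        stage_score = float("-inf")
        for option in candidates[stage]:
            trial = {**best, stage: option}
            if energy_estimate(trial) > energy_budget:
                continue  # violates the energy constraint
            score = accuracy_proxy(trial)
            if score > stage_score:
                stage_best, stage_score = trial, score
        if stage_best is not None:
            best = stage_best  # freeze this component before the next stage
    return best


# Toy proxies: larger options look more accurate but cost more energy.
def toy_accuracy(d: Design) -> float:
    return float(sum(d.values()))


def toy_energy(d: Design) -> float:
    return float(sum(v * v for v in d.values()))


result = iterative_search(
    candidates={"backbone": [1, 2, 3], "fpn": [1, 2], "pan": [1, 2]},
    init={"backbone": 1, "fpn": 1, "pan": 1},
    accuracy_proxy=toy_accuracy,
    energy_estimate=toy_energy,
    energy_budget=14.0,
)
```

In this toy setup the stage-wise pass evaluates 3 + 2 + 2 = 7 candidates instead of the 3 × 2 × 2 = 12 joint combinations, which is the kind of reduction that lets the search concentrate on fewer, more promising regions.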

BibTeX

@article{tran2026xiyolo,
  title={XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling},
  author={Tran, Tony and Suganda, Richie Ryu and Hu, Bin},
  journal={Preprint},
  year={2026}
}