Conference Program - ASP-DAC 2026

January 19-22, 2026 | Hong Kong Disneyland Hotel

Tutorial I

09:00-12:00 | Monday, January 19, 2026
(Tutorial 1) On-Device AI to Better Mobile and Implantable Devices in Healthcare
09:00-12:00
Sleeping Beauty 1/2

Yiyu Shi (University of Notre Dame)

Abstract
The increasing prevalence of chronic diseases, an aging population, and a shortage of healthcare professionals have prompted the widespread adoption of mobile and implantable devices to effectively manage various health conditions. In recent years, there has been growing interest in leveraging rapid advances in artificial intelligence (AI) to enhance the performance of these devices, resulting in better patient outcomes, reduced healthcare costs, and improved patient autonomy. Due to privacy, security, and safety considerations, inference must often be performed at the edge, with limited hardware resources. This challenge is compounded by inter-patient and intra-patient variability, heavy dependence on medical domain knowledge, and a lack of diversified training data.
In this tutorial, we will explore how hardware-AI co-design techniques, such as joint hardware and neural architecture optimization and fairness-aware pruning, can fundamentally transform mobile and implantable devices. We will share case studies, including the world's first smart Implantable Cardioverter Defibrillator (ICD) enabled by our research, illustrating how advanced edge AI methodologies can make these devices safer, more efficient, and more personalized. Attendees will gain actionable insights into deploying AI models under stringent constraints while addressing fairness, adaptability, and reliability challenges unique to healthcare applications.
(Tutorial 2) Design Methodologies and Toolchains for Compute-in-Memory: From Architectures to Systems
09:00-12:00
Sleeping Beauty 3

Xiaoming Chen (Institute of Computing Technology, Chinese Academy of Sciences)
Jianlei Yang (Beihang University)
Zhenhua Zhu (Tsinghua University)

Abstract
As the demand for computational efficiency in modern AI applications continues to rise, Compute-in-Memory (CIM) has emerged as a promising computation paradigm. By performing computations directly within memory arrays, CIM architectures overcome the von Neumann bottleneck of traditional architectures. While recent CIM hardware designs have demonstrated impressive efficiency gains for neural network workloads, architectural innovation has significantly outpaced the development of cohesive software toolchains necessary to program, optimize, and evaluate these novel architectures.
This tutorial addresses a critical gap in system-level CIM design by presenting comprehensive design methodologies and software frameworks, which bridge the divide between algorithm development and hardware implementation. Specifically, it aims to:
☆ Introduce the fundamentals of CIM design and analyze the algorithmic and architectural design space for CIM systems.
☆ Present state-of-the-art open-source frameworks that enable end-to-end design, simulation, compilation, and evaluation.
☆ Demonstrate practical workflows for algorithm mapping, performance modeling, and hardware-aware optimization.
Through detailed examination of existing design tools, intuitive examples, and hands-on demonstrations, this tutorial will offer attendees an opportunity to gain comprehensive insights into the current landscape of CIM design automation and methodologies that are essential for developing efficient AI accelerators.
(Tutorial 3) Design Automation for the Early Fault Tolerant Quantum Computing
09:00-12:00
Sleeping Beauty 5

Shigeru Yamashita (Ritsumeikan University)
He Li (Southeast University)
Zhiding Liang (CUHK)
Robert Wille (Technical University of Munich)

Abstract
As quantum computing transitions from NISQ experimentation to the early fault-tolerant (Early FTQC) era, progress hinges on cross-layer methods that can solve critical design automation challenges. Success in Early FTQC will require (i) reducing expensive non-Clifford resources (T-count/T-depth), (ii) co-designing algorithms and ansätze with problem structure and hardware constraints, and (iii) sustaining low physical error rates through scalable, hardware-aware calibration. This tutorial brings together four complementary perspectives to address these needs. We will explore T-depth-aware decomposition for MCT-intensive oracles, application-driven algorithm/ansatz co-design using contextual subspace strategies, and fine-grained, graph-parallel calibration protocols validated on real devices. Finally, we will present a unifying design-automation (QDA) view that connects today’s tools to the emerging requirements of Early FTQC. Attendees will leave with concrete techniques, open-source pointers, and evaluation checklists to apply immediately in their research and development.

Tutorial II

14:00-17:00 | Monday, January 19, 2026
(Tutorial 4) Bi-Directional Synergy: A Tutorial on Hardware Design for Agentic AI and Agentic AI for Hardware Design
14:00-17:00
Sleeping Beauty 1/2

Chaojian Li (The Hong Kong University of Science and Technology)
Zhongzhi Yu (NVIDIA Research)
Zhiyao Xie (The Hong Kong University of Science and Technology)

Abstract
Agentic AI systems, capable of reasoning, planning, and autonomous decision-making, are transforming how we design and deploy both AI algorithms and hardware systems. This tutorial focuses on the bi-directional synergy between hardware design for agentic AI and agentic AI for hardware design. We will cover three representative works: (1) ORCHES, which accelerates Large Language Model (LLM) reasoning for agentic AI using collaborative GPU–Processing-In-Memory (PIM) heterogeneous architectures, (2) Spec2RTL-Agent, an LLM-agent system that automates RTL code generation from complex design specifications, and (3) SLM-Agents, which makes the case that Small Language Models (SLMs) will be the future of agentic AI because of their efficiency and scalability. Participants will gain insights into (1) hardware challenges and opportunities in supporting reasoning-centric agentic AI applications, (2) LLM-based multi-agent workflows that can revolutionize hardware design automation, and (3) the significance of SLMs in shaping efficient and sustainable agentic AI systems. The tutorial concludes with a discussion on how hardware design and agentic AI can together drive a virtuous cycle of progress.
(Tutorial 5) APS: An MLIR-Based Hardware-Software Co-design Framework for Agile Processor Specialization
14:00-17:00
Sleeping Beauty 3

Yun (Eric) Liang (Peking University)
Youwei Xiao (Peking University)
Yuyang Zou (Peking University)

Abstract
The rapid evolution of domain-specific applications demands specialized processors with competitive performance and efficiency. While the open RISC-V instruction set architecture (ISA) simplifies the adoption of custom instruction extensions (ISAXs), the overall process of processor specialization remains challenging. It involves a complex interplay of multiple tasks, including behavioral architecture description, hardware synthesis and implementation, processor-ISAX adaptation, and compiler co-generation. Existing RISC-V ecosystems often address these challenges manually, lacking a fully automated and integrated solution. This tutorial introduces APS for agile processor specialization based on Multi-Level Intermediate Representation (MLIR). MLIR supports these diverse requirements within a unified infrastructure. APS provides a unified framework of powerful, open-source EDA tools for seamless hardware-software co-design, empowering designers to navigate the complexities of specialization with greater ease and efficiency.
(Tutorial 6) Post-Silicon Validation & Hardware Security in Modern Processors
14:00-17:00
Sleeping Beauty 5

Ravi Monani (Senior System Design Engineer, AMD; former Intel)

Abstract
Modern processors face a dual challenge: achieving peak performance while ensuring robust security and reliability. With increasing complexity in CPU/GPU/SoC architectures, post-silicon validation has become critical in detecting design flaws, mitigating microarchitectural vulnerabilities, and balancing power-performance tradeoffs. This tutorial provides a practitioner’s perspective, drawing on experiences from AMD and Intel, to bridge the gap between academic research and industrial practice. Topics will include silicon bring-up methodologies, case studies of hardware security vulnerabilities (e.g., speculative execution, side-channel attacks), debug and measurement techniques, and future challenges in secure processor design. Participants will gain insights into practical validation flows, security-hardening strategies, and opportunities for research collaboration with industry.

Opening and Keynote Session I

Opening Ceremony
08:05-08:20 | Tuesday, January 20, 2026 | Cinderella Ballroom 1/6/7/8
Keynote Addresses
08:20-09:50 | Tuesday, January 20, 2026 | Cinderella Ballroom 1/6/7/8
Chenming Hu
TSMC Distinguished Professor Emeritus
University of California, Berkeley
08:20-09:05
Keynote Address

FinFET - from Lab to Foundry to EDA/Fabless

Biography
Chenming Hu is the Emeritus TSMC Chair Professor of UC Berkeley and former CTO of TSMC. He led the creation of the BSIM standard model and the 3D transistor FinFET used in all phones, computers, data centers, and AI chips.
He received the US National Medal of Technology from President Obama and IEEE's highest honor (Medal of Honor). The EDA industry's Kaufman Award cited his “tremendous career of creativity and innovation that fueled the past four decades of the semiconductor industry, including its adoption of FinFET.”
Abstract
25 years ago, the keynote speaker of the 2001 ISSCC in San Francisco projected that processor chips would dissipate more heat per area than nuclear reactor cores and rocket engine nozzles in a decade. His projection echoed the 1996 industry consensus of an end to Moore's Law in 2007 "with no known solution" (in the ITRS - International Technology Roadmap for Semiconductors).
What was the cause of that chip heating crisis? How did FinFET prevent it from happening? How did FinFET find its way from the research laboratory to fabs, and the EDA/DAC and fabless communities? These and other FinFET stories will be told.
Yiran Chen
John Cocke Distinguished Professor
Duke University
09:05-09:50
Keynote Address

Edge AI: Everything, Everywhere, All at Once

Biography
Dr. Yiran Chen is the John Cocke Distinguished Professor of Electrical and Computer Engineering at Duke University. He serves as the Principal Investigator and Director of the NSF AI Institute for Edge Computing Leveraging Next Generation Networks (Athena) and Co-Director of the Duke Center for Computational Evolutionary Intelligence (DCEI). His research group focuses on innovations in emerging memory and storage systems, machine learning and neuromorphic computing, and edge computing. Dr. Chen has authored or coauthored over 700 publications and holds 96 U.S. patents. His work has received widespread recognition, including two Test-of-Time Awards and 14 Best Paper/Poster Awards. He is the recipient of the IEEE Circuits and Systems Society's Charles A. Desoer Technical Achievement Award and the IEEE Computer Society's Edward J. McCluskey Technical Achievement Award. He also serves as the inaugural Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI) and the founding Chair of the IEEE Circuits and Systems Society's Machine Learning Circuits and Systems (MLCAS) Technical Committee. Dr. Chen is a Fellow of the AAAS, ACM, IEEE, and NAI, and a member of the European Academy of Sciences and Arts.
Abstract
Edge Artificial Intelligence (Edge AI) refers to systems that execute AI models directly on devices located at or near the point of data generation. Operating locally, these interconnected systems collect and process diverse forms of data, offering distinct advantages such as enhanced privacy and reduced latency. However, deploying AI models on resource-constrained platforms remains a major challenge. Such devices are limited in computing power, memory, energy, and communication capacity, creating a gap between the demands of advanced AI models and the capabilities of current hardware—ultimately hindering the widespread adoption of Edge AI systems. In this talk, we will explore algorithmic and hardware innovations that enable Edge AI to process multimodal data efficiently and effectively (Everything), operate reliably under stringent resource constraints (Everywhere), and collaborate seamlessly across heterogeneous platforms (All at Once).

Session 1A

(T4-C) AI Applications for Edge and Domain-Specific Systems
10:20-12:00 | Tuesday, January 20, 2026 | Snow White 1
Chair(s):
Sunmean Kim (Kyungpook National University)
Jeongwoo Park (Sungkyunkwan University)
1A-1
10:20-10:45

Video-based Visible-Event Cross-modal Person Re-identification for Edge AI Surveillance Systems

*Xinyun Zhang, Zixiao Wang (The Chinese University of Hong Kong), Yurui Kuang (The Chinese University of Hong Kong), Bei Yu (The Chinese University of Hong Kong)
Keywords
Edge AI, Event Camera, Person Re-identification, Deep Learning
Abstract
Video-based cross-modal person re-identification (ReID) is a critical task for video surveillance and security systems, particularly in resource-constrained edge AI environments. While existing cross-modal ReID methods primarily focus on thermal-visible matching, event cameras, with their low power consumption, high temporal resolution, and sparse data representation, offer significant advantages for edge-based surveillance systems by reducing data processing overhead and enabling robust performance under challenging lighting conditions. In this paper, we introduce a novel task: video-based visible-event person re-identification (VE ReID), which aims to match identities across RGB and event camera modalities. To the best of our knowledge, this is the first work to systematically define and investigate this cross-modal task in the context of event-driven edge AI. Specifically, we curate evaluation benchmarks from existing RGB-event datasets and synthesize a new RGB-event dataset, explicitly adapting them to the cross-modal ReID setting to enable a more comprehensive evaluation of VE ReID. Extensive experiments reveal that existing cross-modal state-of-the-art (SOTA) methods fail to effectively address the unique challenges posed by event data, highlighting the importance of tailored solutions for this task. To this end, we propose a novel method that constructs auxiliary modalities using frequency information from RGB and event tracklets, aligning them effectively through a fine-grained metric learning loss. Our approach not only achieves significant accuracy improvements over existing methods but also demonstrates the potential of event cameras for efficient and scalable edge AI surveillance applications. All codes and benchmarks will be made publicly available.
1A-2
10:45-11:10

REDM: Regression-Guided Diffusion Modeling for Universal Soft Sensor Enhancement in Semiconductor Process Control

*Weiping Xie, Yumeng Shi, Pang Guo, Yining Chen (Zhejiang University)
Keywords
Soft sensor technology, Diffusion model, Sample selection, Data generation
Abstract
In semiconductor manufacturing, soft sensors play a key role in Advanced Process Control (APC) by enabling real-time wafer-to-wafer monitoring. However, their performance is often limited by sparse labeled data, process variability, and model-specific tuning. To address these challenges, we propose REDM: a Regression-Guided Diffusion Modeling framework designed to boost the accuracy and robustness of soft sensor prediction across diverse fabrication stages. REDM generates high-fidelity virtual data guided by predictive regression objectives and incorporates a quality-aware filtering mechanism based on Sliced Wasserstein Distance and intra-subset Cosine Similarity. Through multi-objective selection techniques, REDM identifies informative virtual samples that balance distributional similarity and internal diversity, thereby enhancing downstream model training. We evaluate REDM on real-world datasets from three major semiconductor process stages: Chemical Vapor Deposition (CVD), Etching, and Chemical Mechanical Polishing (CMP). Across various regression models, REDM consistently enhances soft sensor performance, with an average R^2 improvement of 3.27%. Its independence from process-specific customization makes REDM a scalable and process-aware solution for soft sensor enhancement in smart manufacturing.
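For readers unfamiliar with the selection criteria named above, the sketch below illustrates, under our own assumptions rather than the authors' code, how a Sliced Wasserstein Distance term and an intra-subset cosine-similarity term can be combined to score candidate virtual-sample subsets; all function names and weightings are hypothetical.

    # Hypothetical illustration of quality-aware virtual-sample filtering:
    # score a candidate subset by distributional similarity to real data
    # (Sliced Wasserstein Distance) plus internal redundancy (mean cosine
    # similarity), then keep the lowest-scoring subset.
    import numpy as np

    def sliced_wasserstein(real, virtual, n_proj=64, seed=0):
        rng = np.random.default_rng(seed)
        dirs = rng.normal(size=(n_proj, real.shape[1]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        q = np.linspace(0.0, 1.0, 100)
        dists = [np.mean(np.abs(np.quantile(real @ u, q) - np.quantile(virtual @ u, q)))
                 for u in dirs]
        return float(np.mean(dists))

    def mean_cosine_similarity(subset):
        x = subset / (np.linalg.norm(subset, axis=1, keepdims=True) + 1e-12)
        sim = x @ x.T
        n = len(subset)
        return float((sim.sum() - n) / (n * (n - 1)))

    def select_subset(real, virtual, k, n_trials=200, alpha=1.0, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        best_idx, best_score = None, np.inf
        for _ in range(n_trials):
            idx = rng.choice(len(virtual), size=k, replace=False)
            score = (alpha * sliced_wasserstein(real, virtual[idx]) +
                     beta * mean_cosine_similarity(virtual[idx]))
            if score < best_score:
                best_idx, best_score = idx, score
        return best_idx  # indices of informative virtual samples to add to training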
1A-3
11:10-11:35

Benchmarking Continual Learning on Netlists with Circuit-Targeted Graph Neural Networks

*Rupesh Raj Karn, Johann Knechtel, Ozgur Sinanoglu (New York University Abu Dhabi)
Keywords
Continual Learning, Circuit Netlist, Lifelong Learning, Graph Convolutional Network (GCN), Catastrophic Forgetting, ISCAS85, EPFL
Abstract
The rapid evolution of integrated circuits demands that machine learning (ML) for electronic design automation (EDA) adapts to new circuit semantics without catastrophic forgetting (CF) of prior knowledge—a challenge unaddressed by commonly established, static training paradigms. Continual learning (CL) offers a promising approach, but its application to evolving netlists remains unexplored. Here, we present the first benchmarking study of CL on netlists with circuit-targeted graph neural networks (GNNs). We evaluate six CL methods, including spanning parameter regularization, replay-based, and hybrid approaches, all for a fixed GNN architecture for fair comparison. Our benchmarking covers three distinct GNN problems commonly used in EDA: gate-level node classification, connectivity-based link prediction, and structural graph classification. We find that replay-based CL techniques are particularly suitable for hindering CF in such circuit-targeted GNN applications. This work paves the way for future adaptive EDA tools for emerging design landscapes.
1A-4
11:35-12:00

LiveHPS-Lite: A Lightweight LiDAR-based Motion Capture System for Edge Applications

*Yiren Zhu, Junsheng Zhou, Yiming Ren, Hanshu Hezi, Yuexin Ma, Xin Lou (ShanghaiTech University)
Keywords
Human motion capture, LiDAR, Lightweight, minGRU, Edge Devices
Abstract
Recent advances in LiDAR-based 3D human motion capture have demonstrated significant potential for large-scale applications in unconstrained environments. However, achieving real-time performance remains challenging, particularly under the computational constraints of edge devices where deploying large deep learning models is often impractical. To address these limitations, we propose LiveHPS-Lite, a lightweight single-LiDAR-based human motion capture system, offering enhanced computational efficiency with competitive performance. In particular, we introduce a novel architecture by streamlining backbone components across all processing stages in the LiveHPS++ framework and replacing inconsistent sequential modules with parallelizable minGRUs. We implement the proposed architecture on NVIDIA Jetson Xavier NX with TensorRT acceleration, achieving real-time performance on the edge. Comprehensive evaluations on benchmark datasets show that LiveHPS-Lite achieves comparable or superior accuracy while significantly reducing computational complexity. Experimental results demonstrate that LiveHPS-Lite achieves up to 6.71x faster inference speeds compared to existing solutions, delivering real-time performance even on a computationally limited edge device. This work contributes a practical solution for deploying high-performance 3D human pose estimation models in real-world applications.

Session 1B

(SS-4) Toward Fully Automated DTCO: ML Frameworks across Technology, Cell, and Library Layers
10:20-11:35 | Tuesday, January 20, 2026 | Snow White 2
Chair(s):
Taewhan Kim (Seoul National University)
1B-1
10:20-10:45

ML-driven Design Technology Co-Optimization Framework for Advanced Technology Nodes

Hyunbae Seo, Handong Cho, Sehyeon Chung (Seoul National University), Kyumyung Choi (Sungkyunkwan University), *Taewhan Kim (Seoul National University)
Keywords
Design and technology co-optimization, standard cells, machine learning, physical design
Abstract
The goal of design and technology co-optimization (DTCO) is to find a combination of parameter options (i.e., parameter setting values) of a target process technology that produces a target design implementation with optimal PPA (performance, power, area). Since the number of parameters increases sharply as technology scales, much attention has recently been paid to automating this DTCO process in both semiconductor foundries and the academic research community. This paper addresses the problem of full DTCO automation that deals with analyzing the numerous parameter options at advanced technology nodes. Precisely, we develop a machine learning (ML) based DTCO automation framework, which supports three key features: (1) an effective analysis of the changes of DTCO parameter options within an acceptable runtime; (2) a full exploration of chip/block-level PPA metrics through automatic standard cell (SC) library generation, for which we develop a new technique that accelerates the iterative physical design process; (3) support for both Complementary FET (CFET) based SCs and multi-row-height SCs to account for future generations of technology. Through experiments with benchmark circuits, it is shown that our DTCO automation framework is able to accurately predict the direction and magnitude of PPA changes of target designs with 5x sampling efficiency. In addition, it is shown that our SC layout generator supporting CFET and multi-row-height SCs provides a timely DTCO process relevant to ongoing technology advancements.
1B-2
10:45-11:10

Standard Cell Layout Generation: Methodological Evolution and Architectural Impacts

Junghyun Yoon, Ikkyum Kim, Gyumin Kim, Sojung Park, *Heechun Park (Ulsan National Institute of Science and Technology)
Keywords
standard cell layout
Abstract
Over the past decade and beyond, standard cell layout generation has been a cornerstone of digital integrated circuit design, evolving from manual design practices to advanced AI-assisted methodologies. This survey systematically reviews the evolution of standard cell layout generation across two major axes: methodological advances and changes in transistor and cell architecture. First, we trace the methodological progression from manual design practices and early design-rule-based approaches to heuristic algorithms, exact optimization methods, and the most recent AI-driven paradigms. Then, we examine the evolution of transistor and cell structures: the transistor from planar CMOS to the emerging vertical devices (CFETs and Flip-FETs), and the cell architecture from multi-row structures to the integration of buried power rails (BPR) and backside metal layers. Finally, we highlight future research directions, including transistor-cell-chip co-design methodologies and optimization techniques that leverage the unique electrical characteristics of emerging CMOS devices. This comprehensive survey aims to provide insights into current state-of-the-art techniques while highlighting promising avenues for future innovation in standard cell layout generation for next-generation technologies.
1B-3
11:10-11:35

Fast Timing Library Characterization Through Selective Use of Regression Models

Manikanta Prahlad Manda (Sejong University), Seunggyu Lee (Korea Advanced Institute of Science and Technology), *Daijoon Hyun (Sejong University)
Keywords
Library characterization, regression model, table entry, timing parameter
Abstract
Timing behavior of standard cells is represented as two-dimensional tables in a timing library, where each table entry is obtained through transistor-level simulation. As technology scales, the number of design corners and standard cells has increased dramatically, leading to a substantial increase in simulation time for timing characterization. This may delay the design schedule or impose additional demands on tool licenses. To address this challenge, we propose a fast timing characterization method that selectively uses transistor-level simulation and model-based prediction. In this method, a subset of table entries is obtained through simulation, while the remaining entries are predicted by regression models trained on the simulated data. Multiple regression models are employed to capture the diverse characteristics of each entry location, and the most accurate model for each entry is identified at one corner, called an anchor corner. The selected models are then used to predict the corresponding entry at target corners. Experimental results show that the proposed method achieves high accuracy with a 40% reduction in runtime; the mean and 3-sigma absolute errors are 0.4% and 2.3%, respectively, representing a significant improvement over conventional methods. The accuracy of the proposed method is further validated on 7-nm technology libraries.
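As a rough illustration of the entry-wise model-selection idea described above (a simplified sketch under our own assumptions, not the authors' implementation), the snippet below fits several candidate regressors on the simulated subset of one table, identifies the best model per entry location at the anchor corner, and reuses those choices at a target corner; the data layout and model choices are assumptions.

    # Hypothetical sketch: per-entry regression-model selection at an anchor
    # corner, reused to predict non-simulated entries at target corners.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR
    from sklearn.ensemble import GradientBoostingRegressor

    CANDIDATES = (LinearRegression, SVR, GradientBoostingRegressor)

    def fit_candidates(sim_xy, sim_delay):
        # sim_xy: (n_sim, 2) simulated (input slew, output load) points of one table
        return [cls().fit(sim_xy, sim_delay) for cls in CANDIDATES]

    def select_models_at_anchor(models, rest_xy, rest_delay_truth):
        # At the anchor corner the remaining entries are simulated once as well,
        # so the most accurate candidate can be identified per entry location.
        preds = np.stack([m.predict(rest_xy) for m in models])   # (n_models, n_rest)
        return np.abs(preds - rest_delay_truth[None, :]).argmin(axis=0)

    def predict_at_target(sim_xy, sim_delay, rest_xy, winners):
        # Re-fit all candidates on the target corner's simulated subset, then
        # take, for each entry, the prediction of its pre-selected model type.
        preds = np.stack([m.predict(rest_xy) for m in fit_candidates(sim_xy, sim_delay)])
        return preds[winners, np.arange(len(winners))]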

Session 1C

(T12-D) Physical Attacks and Countermeasures
10:20-12:00 | Tuesday, January 20, 2026 | Snow White 3
Chair(s):
Qiang Liu (Tianjin University)
Jiaji He (Tianjin University)
1C-1
10:20-10:45

HyFault: Targeted Fault Injection Attacks on Hyperdimensional Computing Accelerators

*Brojogopal Sapui, Mehdi Tahoori (Karlsruhe Institute of Technology)
Keywords
Fault attack, Profiling, AI Accelerators, HDC, countermeasure
Abstract
Edge AI accelerators are a critical building block of numerous AI-driven applications deployed in resource-constrained environments such as IoT, automotive systems, and wearable devices. Hyperdimensional Computing (HDC) has recently emerged as a promising lightweight AI model for these edge scenarios, offering efficiency, simplicity, and inherent robustness against random computational faults. However, despite its advantages, the security implications of deploying HDC accelerators, particularly their resilience against targeted fault injection attacks, remain insufficiently explored. Such attacks pose tangible security risks, including intentional misclassification leading to denial-of-service or reliability degradation in critical decision-making systems. In this work, we precisely attack FPGA-based HDC accelerators using profiling and advanced voltage-level fault injection methods to evaluate their vulnerability. Our experiments reveal significant susceptibility during the critical similarity computation phase of the HDC inference pipeline, achieving targeted misclassification rates of up to ≈89% in BRAM-based implementations. To address these security vulnerabilities, we propose dual XOR masking and query hypervector randomization as practical, hardware-friendly countermeasures. Extensive real-hardware evaluations confirm that these defenses substantially reduce misclassification rates to ≈2%, significantly enhancing the security and reliability of edge-deployed HDC accelerators.
1C-2
10:45-11:10

PIR-Cache: Mitigating Conflict-Based Cache Side-Channel Attacks via Partial Indirect Replacement

*Hao Ma, Zhidong Wang, Wei Song (Institute of Information Engineering, Chinese Academy of Sciences)
Keywords
micro architecture, conflict-based cache side-channel attacks, cache randomization, eviction set searching algorithms, partial indirect replacement
Abstract
Conflict-based side-channel attacks allow attackers to monitor victims' access patterns by asserting malicious cache conflicts. While cache randomization has emerged as a potential defense, existing solutions face critical limitations. CEASER-S and DT4+EV10 fail to fully prevent existing eviction set searching algorithms. MIRAGE suffers from intolerable area and power overheads. Chameleon's relocation mechanism faces the problem of excessive power/energy consumption. To alleviate these limitations, we employ a randomized skewed set-associative cache with partial indirect replacement (PIR-Cache). Our approach effectively mitigates conflict-based side-channel attacks while incurring negligible runtime performance impact with moderate area and power overhead.
1C-3
11:10-11:35

An Efficient Defense Method Based on Progressive Fault-Aware Training and JS Divergence-Guided TMR for DNNs against Bit-Flip Attacks

*Huarun Zhou, Ran Dong, Zhaohui Guo, Qiang Liu (Tianjin University)
Keywords
Security for Deep Neural Networks, Progressive Fault-Aware Training, Bit-flip Attacks
Abstract
Deep neural networks (DNNs) have been increasingly deployed on edge devices, enabling edge-intelligent applications. However, this introduces significant security vulnerabilities, especially the degradation of network performance caused by external attacks such as bit-flip attacks (BFAs). Traditional defense methods based on spatial redundancy, such as triple modular redundancy (TMR), consume a significant amount of hardware resources, and existing fault-aware training (FAT) methods do not significantly improve robustness. To address these issues, we propose an efficient defense method, which combines fault-aware training and spatial redundancy, against BFAs for DNNs. Specifically, a progressive fault-aware training method is proposed to enhance the inherent robustness of DNN models. Subsequently, a JS divergence-guided TMR approach is developed, which identifies a small number of critical weights in the trained model that significantly impact model accuracy by JS divergence analysis and applies TMR only to the critical weights, to further enhance the model's robustness. The experimental results obtained on the VGG-13/ResNet-20/ResNet-34 models and the CIFAR-10 dataset show that compared with the FAT-type methods, our proposed method improves the robustness of the models by up to 5.6x; compared with the spatial redundancy methods, our method improves the robustness by 1.9x and reduces memory storage overhead by 23% on the experimental platform.
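The sketch below gives a rough, hypothetical picture of the JS divergence-guided selection step described above (not the authors' code): each weight is scored by the divergence its bit-flip induces on the model's output distribution over a calibration set, and only the top-ranked weights receive triplicated storage with majority voting.

    # Hypothetical sketch of JS-divergence-guided selective TMR.
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def criticality_scores(forward, weights, calib_inputs, flip_fn):
        # forward(weights, inputs) -> class-probability matrix; flip_fn flips one
        # stored bit of a weight. Both are stand-ins for the real model/encoding.
        base = forward(weights, calib_inputs)
        scores = np.zeros(len(weights))
        for k in range(len(weights)):
            flipped = weights.copy()
            flipped[k] = flip_fn(weights[k])
            pert = forward(flipped, calib_inputs)
            scores[k] = np.mean([jensenshannon(p, q) ** 2 for p, q in zip(base, pert)])
        return scores

    def protect_critical(weights, scores, fraction=0.01):
        # Triplicate only the most critical weights; the rest stay unprotected.
        k = max(1, int(fraction * len(weights)))
        critical = np.argsort(scores)[-k:]
        return {int(i): (weights[i], weights[i], weights[i]) for i in critical}

    def tmr_read(copy_a, copy_b, copy_c):
        # Bitwise majority vote over the three stored integer copies.
        return (copy_a & copy_b) | (copy_a & copy_c) | (copy_b & copy_c)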
1C-4
11:35-12:00

X-Matrix Shield: Defeating Tilted FIB and Rerouting Attacks through 3D-Interlaced Protection

Jiaji He, *Junfeng Cai (Tianjin University), Yaohua Wang (National University of Defense Technology), Fei Zhao (Peking University), Mao Ye (Tianjin University), Yongqiang Lyu (Tsinghua University)
Keywords
Integrated Circuits Security, Active Shield, Dual-Layer Structure, Direction Entropy, Artificial Fish-Swarm Algorithm
Abstract
As focused ion beam (FIB) technology advances and invasive attack methods evolve, tilted FIB and rerouting attacks pose serious threats to chip security. Existing single-layer active shields exhibit inherent limitations in defending against these advanced invasive attacks. This paper presents the X-Matrix Shield, a dual-layer structure that fortifies integrated circuits against sophisticated physical tampering. This research establishes a novel information theory-based evaluation framework with quantifiable direction entropy metrics. Additionally, the work develops an improved artificial fish-swarm algorithm (D-AFSA) to generate optimized 3D-interlaced protection paths. X-Matrix Shield can be directly integrated into the standard EDA layout flow, enabling security-aware physical design co-optimization. When implemented in 55 nm standard process technology, the X-Matrix Shield achieves a quality metric of 1.99. GDS3D simulations of protected AES modules further confirm the shield's effectiveness against advanced physical attacks compared to single-layer protection.

Session 1D

(SS-2) Design Automation for Quantum Error Correction: From Algorithms to Architectures
10:20-12:00 | Tuesday, January 20, 2026 | Sleeping Beauty 1/2
Chair(s):
Zhiyao Xie (The Hong Kong University of Science and Technology)
1D-1
10:20-10:45

Fault-tolerant State Preparation for Quantum Error Correction Codes: Leveraging Design Automation

*Robert Wille (TUM) with contributions from Lucas Berent (TUM), Markus Müller (Forschungszentrum Jülich), Tom Peham (TUM), Ludwig Schmid (TUM), Erik Weilandt (TUM)
Keywords
TBA
Abstract
State preparation is a foundational element of QEC, enabling the initialization of logical qubits in protected subspaces. This task is often performed manually and is prone to inefficiency, particularly as system sizes grow. This talk presents design automation techniques to synthesize fault-tolerant state preparation circuits for large Calderbank-Shor-Steane (CSS) codes. These techniques are implemented in the open-source Munich Quantum Toolkit (MQT-QECC) and allow for the automated generation of optimized circuits that are infeasible to construct by hand. This contribution highlights the vital role of CAD tools in expanding the scalability and reliability of quantum fault-tolerant workflows. Open source implementations of the presented methods are available at https://github.com/munich-quantum-toolkit/qecc.
1D-2
10:45-11:10

Hardware-Efficient Union-Find Decoder Towards Scalable Topological Quantum Codes

Shuang Liang, Jubo Xu, Yuncheng Lu, Hao (Mark) Chen (Imperial College London), Bo Yuan (Rutgers University), *Hongxiang Fan (Imperial College London)
Keywords
Quantum Error Correction, Union-Find, Hardware Design, QEC Decoder, Surface Code
Abstract
Quantum Error Correction (QEC) is essential for realizing large-scale, fault-tolerant quantum computing. Among QEC codes, topological quantum codes, which encode logical qubits in a lattice of physical qubits, have attracted considerable attention in both academia and industry. A key challenge, however, lies in designing decoders that meet the stringent latency and hardware efficiency requirements of practical quantum systems. Among mainstream approaches, the Union-Find (UF) decoder offers exceptionally low latency, but existing implementations often suffer from substantial hardware inefficiency. To address this limitation, we focus on developing a scalable UF decoder through general-purpose hardware architecture optimization. Experimental results demonstrate that our optimized UF decoder achieves good scalability in hardware overhead. Future directions include exploring advanced memory technologies and scheduling strategies to further improve memory efficiency in UF-based decoders, co-designing decoding algorithms and custom hardware to achieve better accuracy-latency tradeoffs, and optimizing QEC solutions for large-scale distributed quantum computing systems.
1D-3
11:10-11:35

Reinforcement Learning for Enhanced Advanced QEC Architectures Decoding

Yidong Zhou (Rensselaer Polytechnic Institute), Lingyi Kong (The Chinese University of Hong Kong), Yifeng Peng (Stevens Institute of Technology), *Zhiding Liang (The Chinese University of Hong Kong)
Keywords
quantum computing, quantum error correction, reinforcement learning, quantum low-density parity check codes
Abstract
The advent of promising quantum error correction (QEC) codes with efficient resource utilization and high-performance fault-tolerant quantum memories signifies a critical step towards realizing practical quantum computation. While surface codes have been a dominant approach, their limitations have spurred the development of more advanced QEC architectures. These advanced codes often present increased complexity, demanding innovative decoding methodologies. This work investigates the application of reinforcement learning (RL) techniques, including hybrid and multi-agent approaches, to enhance the decoding of various advanced QEC architectures. By leveraging the ability of RL to learn optimal strategies from noisy syndrome measurements, we explore the potential for achieving improved logical error rates and scalability compared to traditional decoding methods. Our approach examines the adaptation of reinforcement learning to exploit the structural properties of these modern QEC models. We also explore the benefits of combining different RL algorithms to address the multifaceted nature of the decoding problem, considering factors such as code degeneracy and real-world noise characteristics. With our proposed method, we are able to demonstrate that an autonomously trained agent can derive decoding schemes for the complex decoding requirement of advanced QEC architectures.
1D-4
11:35-12:00

Quantum Instruction Set Architecture: The Good, the Bad, and the Future

*Jianxin Chen (Tsinghua University)
Keywords
quantum instruction set, quantum error correction, fault-tolerant quantum computing, quantum design automation
Abstract
This presentation provides an overview of my team’s recent work, spanning the design of a quantum instruction set and its implications for system performance, quantum error correction, and chip architecture.
Since Shor’s algorithm demonstrated quantum computing’s potential for exponential speedups in solving critical problems like integer factorization, the field has drawn sustained, intense interest from both academia and industry. From fundamental physics experiments in laboratories of the 1990s—where only a handful of qubits could be manipulated—to today’s ability to precisely control hundreds of qubits for preliminary computing trials, humanity stands at a point analogous to that of our ancestors when they first mastered fire: we are now learning to harness the revolutionary power of qubit manipulation. In this talk, I will focus on key advancements in the design principles and engineering implementation of quantum computing instruction set architectures, outline the major challenges currently faced, and discuss directions for future development.
The systematic exploration of non-conventional quantum instruction sets commenced only a few years ago. Our recent work demonstrates that the underexplored √iSWAP gate offers significant advantages in both expressivity and fidelity—challenging the long-held assumption of a trade-off between these two properties. Partially inspired by parallel efforts to explore alternative two-qubit instructions, we introduced a unified control scheme capable of implementing arbitrary two-qubit gates efficiently. By tuning physical parameters such as pulse envelope amplitudes and frequency detuning, this approach enables, for the first time, direct and flexible realization of any two-qubit unitary operation. The so-called AshN scheme was subsequently validated through a series of experiments on superconducting quantum processors, confirming its feasibility.
This line of research, while striving to achieve a better balance between expressivity and accuracy, may nonetheless introduce unforeseen drawbacks—particularly concerning compatibility with established techniques. For instance, the widely used virtual-Z technique can fail when applied to two-qubit gates that do not preserve phase (i.e., those that are not phase carriers). To address this limitation, we propose a compilation scheme for arbitrary single-qubit gates on superconducting processors. The method leverages tunable phase shifts of microwave pulses to realize a continuous gate set, is compatible with any two-qubit gate, and requires calibration of only the X(π) and X(π/2) pulses.
With all the aforementioned advances integrated, it is unsurprising that quantum algorithms can now be implemented far more effectively than with conventional synthesis into CNOT and single-qubit gates. Notably—and somewhat surprisingly—early work in this direction not only demonstrates significant performance gains but also shows promise in mitigating the challenges posed by limited qubit connectivity, a key limitation of superconducting platforms compared to trapped ions or neutral atoms.
However, in the regime of quantum error correction—where stabilizer operations and Clifford gates are of primary interest—it remains unclear how much benefit the aforementioned non-conventional instruction sets will offer. By leveraging both CNOT and iSWAP instructions, our approach mitigates the impact of ancilla qubit defects during surface code stabilizer measurements, thereby enhancing the robustness and reliability of quantum computation. Moreover, similar techniques can be extended to quantum low-density parity-check (qLDPC) codes, offering the advantage of halving the number of required long-range interactions.
Much like early fire-builders learned to shape flame into tools, we are now shaping quantum interactions into reliable, programmable computation—turning raw physical potential into engineered reality. And just as the spark was essential to kindling fire, the quantum instruction set serves as the ignition point in this transformation—deserving far greater attention and dedicated research.

Session 1E

(T11-A) Ensuring High Quality Designs through Simulation and Verification Advances
10:20-11:35 | Tuesday, January 20, 2026 | Sleeping Beauty 3
Chair(s):
Yutaka Masuda (Nagoya University)
Senling Wang (Ehime University)
1E-1
10:20-10:45

Old School Never Die: A Classic Yet Novel Algorithm for Computing RC Current Response in VLSI

*Zongfeng Ma, Zhong Guan (Sun Yat-sen University)
Keywords
Signal line, RC network, Current response, Time domain model, Effective capacitance
Abstract
Accurate computation of signal line current response is paramount for timing, power, signal integrity, and electromigration analyses. The intricate interplay between nonlinear transistor characteristics and parasitic interconnect effects poses significant computational challenges, creating bottlenecks for rapid and precise waveform evaluation. Departing from emerging neural network approaches requiring extensive training datasets and complex models, this work revisits classical circuit principles to propose an “old school yet novel” algorithm that efficiently computes current waveforms without SPICE simulation. Our method achieves exceptional accuracy, with merely 1% deviation in key metrics versus SPICE references, while delivering approximately 100x speedup. Rigorously validated at the 7nm FinFET technology node, the algorithm has been integrated into our commercial EDA tool’s latest trial build. Comparative evaluation against state-of-the-art industrial solutions demonstrates superior accuracy and substantially reduced runtime, reaffirming the practical advantages and relevance of traditional methodologies amidst the rising tide of AI-driven modeling solutions.
1E-2
10:45-11:10

TargetFuzz: Enabling Directed Graybox Fuzzing via SAT-Guided Seed Generation

*Raghul Saravanan, Sai Manoj Pudukotai Dinakarrao (George Mason University)
Keywords
Fuzzing, Coverage-Guided-Fuzzing, Directed Fuzzing
Abstract
The ever-increasing complexity of design specifications for processors and intellectual property (IP) presents a formidable challenge for early bug detection in the modern IC design cycle. The recent advancements in hardware fuzzing have proven effective in the design verification of complex hardware designs. The modern IC design flow involves incremental updates and modifications to the hardware designs, necessitating rigorous verification and extending the overall verification period. A major challenge lies in generating high-quality seeds that maximize coverage and verification efficiency. While Coverage-Guided Fuzzing (CGF) enhances overall exploration, it lacks precision when targeting specific sites. DirectFuzz addresses this with directed test generation but suffers from key limitations, including limited HDL support, abstraction mismatches, and poor scalability for large target regions. In this work, to overcome these challenges, we propose TargetFuzz, a Directed Graybox Fuzzing (DGF) framework that integrates SAT (Boolean satisfiability) engines for precise and scalable seed generation. Our experimental results demonstrate that TargetFuzz scales to 30x more target sites while achieving 100% state coverage, reaches site coverage 1.5x faster, and delivers a 90x improvement in target state coverage compared to Coverage-Guided Fuzzing, demonstrating its potential to advance the state of the art in directed hardware fuzzing.
1E-3
11:10-11:35

VeriRAG: A Knowledge Graph-Augmented RAG for Verilog and Assertion Generation

Jayanth Thangellamudi, Raghul Saravanan, *Sai Manoj Pudukotai Dinakarrao (George Mason University)
Keywords
Hardware Description Language (HDL), Electronic Design Automation (EDA), Large Language Model (LLM), Retrieval-Augmented Generation (RAG), Verilog, SystemVerilog Assertions
Abstract
The adoption of Large Language Models (LLMs) in Electronic Design Automation (EDA) has demonstrated significant potential for automating HDL generation and verification; however, conventional prompt-based or fine-tuned approaches often fail to produce structurally consistent RTL and meaningful assertions for complex designs. We present VeriRAG, a hybrid retrieval-augmented generation framework that combines hardware-specific knowledge graphs with semantic vector embeddings. This hybrid retrieval strategy provides both symbolic structural context and semantic content, enabling the LLM to generate synthesizable Verilog and valid SystemVerilog Assertions (SVAs) without relying on rigid manual intervention or costly retraining. Experimental results across a diverse set of representative designs show that VeriRAG achieves up to 97% syntax correctness and 100% functional success for RTL generation, with SVAs reaching 100% syntax validity and 95% Formal Property Verification (FPV) pass rates using standard EDA tools. These results highlight the potential of combining symbolic knowledge graphs with retrieval-augmented generation for scalable, verifiable hardware design workflows.

Session 1F

(T7-A) Efficient Design of Spiking Neural Network Accelerators
10:20-12:00 | Tuesday, January 20, 2026 | Sleeping Beauty 5
Chair(s):
Can Li (Hong Kong University)
Tingting Zhang (McGill University)
1F-1
10:20-10:45

Spiking-NeRF: Neural Graphics Acceleration With Spiking Feature Encoding for Edge 3D Rendering

*Jianzhen Gao, Wei Liu, Yue Liu, Hengyi Zhou, Zhiyi Yu, Shanlin Xiao (Sun Yat-sen University)
Keywords
Neural Rendering, Spiking Neural Network, Hardware Accelerator, Neural Networks
Abstract
Neural Radiance Fields (NeRF) have demonstrated remarkable potential for high-fidelity 3D scene reconstruction and rendering. However, achieving real-time performance on edge GPUs and accelerators remains a major challenge due to two critical bottlenecks: the high memory demand of multi-resolution hash encoding and the considerable computational cost of floating-point interpolation. To address these limitations, we propose Spiking-NeRF, a brain-inspired algorithm-hardware co-design framework. On the algorithm side, we introduce a spiking feature encoding scheme based on Integrate-and-Fire (IF) neurons, which transforms continuous voxel features into sparse spikes, reducing hash storage overhead by 75%. We further propose a global importance-based pruning strategy that compresses hash tables by 71.3% by removing low-accessed entries. To reduce interpolation complexity, we design a hard-threshold weight discretization method that eliminates floating-point operations in favor of bitwise logic. On the hardware side, we accelerate critical stages of the NeRF pipeline by integrating a spike-skipping mechanism that dynamically bypasses hash entries, reducing memory traffic by 32.46%. We also co-optimize on-chip storage by leveraging access locality patterns across different resolution levels of the hash structure. Experimental results demonstrate that Spiking-NeRF achieves real-time rendering performance while maintaining high visual fidelity. Compared to edge GPUs, our design improves throughput by 111.2x and reduces power consumption by 41.67x. Against SOTA NeRF accelerators, Spiking-NeRF achieves up to 2.48x higher throughput and 5.45x lower energy usage, underscoring the potential of spike-based computing for next-generation low-power neural graphics systems.
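To make the spiking feature encoding step concrete, here is a minimal Integrate-and-Fire sketch under our own assumptions (illustrative only; the threshold, step count, and reset rule are not taken from the paper): continuous voxel features are integrated over a few time steps and emitted as sparse binary spikes that can replace dense floating-point features downstream.

    # Hypothetical IF-neuron encoding of continuous voxel features into spikes.
    import numpy as np

    def if_encode(features, n_steps=4, threshold=1.0):
        # features: (n_voxels, feat_dim) array, e.g. normalized hash-table entries
        membrane = np.zeros_like(features, dtype=np.float32)
        spikes = np.zeros((n_steps,) + features.shape, dtype=np.uint8)
        for t in range(n_steps):
            membrane += features              # integrate a constant input current
            fired = membrane >= threshold
            spikes[t] = fired                 # sparse binary spike map at step t
            membrane[fired] -= threshold      # reset by subtraction
        return spikes

    # Example: with threshold 1.0 and 4 steps, a feature of 0.3 fires once,
    # while a feature of 0.9 fires three times, preserving relative magnitude.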
1F-2
10:45-11:10

FlowQ: Fixed-point Low-precision Post-Training Quantization Framework for Efficient and Accurate SNN Inference

*Faaiz Asim, Sanhtet Aung, Jongeun Lee (Ulsan National Institute of Science and Technology)
Keywords
Spiking neural networks, Brain-inspired Computing, Post training quantization
Abstract
We propose FlowQ, a post-training quantization (PTQ) framework for spiking neural networks (SNNs) that balances accuracy and hardware efficiency through quantizer design and calibration-based scale optimization. While using different scales for weights and membrane potentials preserves accuracy, it typically incurs high hardware cost. In contrast, shared scale factors reduce hardware complexity but lead to significant accuracy degradation. FlowQ bridges this gap by using a hardware-friendly quantizer whose scales differ by a power of two, allowing multiplications to be replaced with simple bit-shift operations for negligible overhead. To further improve accuracy, we present FlowTune, a calibration algorithm that iteratively optimizes FlowQ’s scale factors by minimizing mean-square error (MSE), outperforming the commonly used absolute max-based scaling in SNN PTQ. Extensive experiments on CIFAR-10, CIFAR-100, DVS-Gestures, and ImageNet demonstrate the effectiveness of our approach. For example, on CIFAR-10 with VGG-16 at 4-bit precision, FlowQ achieves only a 1.48% accuracy drop. Compared to shared scale factors, FlowQ improves accuracy by 77.4% with just 1% energy and 0.25% area overhead.
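The power-of-two scale relation mentioned above can be illustrated with a tiny numeric sketch (an assumption of how such a quantizer could look, not FlowQ itself): when the weight scale is the membrane-potential scale divided by 2^k, the rescaling multiply in the integer membrane update reduces to a right shift by k.

    # Hypothetical sketch: power-of-two-related scales turn rescaling into a shift.
    import numpy as np

    def quantize(x, scale, bits=4):
        lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        return np.clip(np.round(x / scale), lo, hi).astype(np.int32)

    s_v, k = 0.05, 2                 # membrane-potential scale and shift amount
    s_w = s_v / (2 ** k)             # weight scale, constrained to s_v / 2**k

    w_q = quantize(np.array([0.09, -0.05, 0.04]), s_w)    # integer weights
    spikes = np.array([1, 0, 1], dtype=np.int32)          # binary input spikes
    v_q = np.int64(0)                                      # integer membrane potential

    # Integer-only update: (w_q * s_w) accumulates into v_q * s_v via a shift,
    # because s_w / s_v = 2**(-k); no floating-point multiply is needed.
    v_q = v_q + (np.sum(w_q * spikes) >> k)
    print(v_q, v_q * s_v)            # quantized potential and its real-valued view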
1F-3
11:10-11:35

An Algorithm-Hardware Co-Design for Efficient and Robust Spiking Neural Networks via Sparsity

*Wei Liu, Yinsheng Chen, Jilong Luo, Yusa Wang, Zhiyi Yu, Shanlin Xiao (Sun Yat-Sen University)
Keywords
hardware acceleration, fault tolerance, neuromorphic computing, sparse coding, sparsity-aware architecture
Abstract
Spiking Neural Network (SNN) deployment is trapped by a fundamental dilemma: robust rate codes are power-intensive, while energy-efficient temporal codes lack noise resilience. This paper resolves this conflict through an algorithm-hardware co-design that strategically leverages sparsity to achieve unprecedented efficiency and fault tolerance simultaneously. Our approach pairs a novel sparse coding scheme, which encodes information into a few highly significant spikes to inherently prune redundancy and enhance noise immunity, with a purpose-built accelerator. This accelerator features a hierarchical zero-skipping architecture that dynamically eliminates redundant computations. The resulting synergy is profound. Our system slashes network spike activity by 88% compared to rate coding while demonstrating superior resilience to injected faults. When benchmarked against state-of-the-art accelerators, our design consumes 88% less energy and delivers 4.5x higher throughput than a leading rate-coding design, while using 82% fewer LUTs. Furthermore, it outperforms an advanced temporal-coding accelerator with an 89% energy reduction and a staggering 26.8x increase in throughput. This work establishes that strategically engineered sparsity is not a compromise, but a direct pathway to creating SNNs that are simultaneously efficient, robust, and primed for mission-critical edge applications.
1F-4
11:35-12:00

LOKI: a 0.266 pJ/SOP Digital SNN Accelerator with Multi-Cycle Clock-Gated SRAM in 22nm

*Rick Luiken, Lorenzo Pes, Manil Dev Gomony, Sander Stuijk (TU Eindhoven)
Keywords
Spiking Neural Networks, Neuromorphic Computing, Edge Computing
Abstract
Bio-inspired sensors like Dynamic Vision Sensors (DVS) and silicon cochleas are often combined with Spiking Neural Networks (SNNs), enabling efficient, event-driven processing similar to biological sensory systems. To meet the low-power constraints of the edge, the SNN should run on a hardware architecture that can exploit the sparse nature of the spikes. In this paper, we introduce LOKI, a digital architecture for Fully-Connected (FC) SNNs. By using Multi-Cycle Clock-Gated (MCCG) SRAMs, LOKI can operate at 0.59 V, while running at a clock frequency of 667 MHz. At full throughput, LOKI only consumes 0.266 pJ/SOP. We evaluate LOKI on both the Neuromorphic MNIST (N-MNIST) and the Keyword Spotting (KWS) tasks, achieving 98.0% accuracy at 119.8 nJ/inference and 93.0% accuracy at 546.5 nJ/inference respectively.

Luncheon Talk I

12:30-13:15 | Tuesday, January 20, 2026 | Cinderella Ballroom 1/6/7/8
Patrick Groeneveld
Senior Fellow at AMD
Adjunct Professor, Stanford University
12:30-13:15
Luncheon Talk

When Moore Surpasses Mind: The Impact of 6 decades of Relentless Design Automation

Biography
Dr. Patrick Groeneveld is Senior Fellow at AMD and adjunct lecturer in Stanford University's Department of Electrical Engineering. With an extensive career in Electronic Design Automation, he has held roles at both Cadence and Synopsys and served as Chief Technologist at Magma Design Automation, where he contributed to the development of a pioneering RTL-to-GDS2 synthesis tool. Patrick has also worked with AI hardware startups and held a Full Professorship in Electrical Engineering at Eindhoven University. He is the Finance Chair on the Executive Committee of the Design Automation Conference. Patrick earned his MSc and PhD degrees from Delft University of Technology in the Netherlands.
Abstract
After sixty years of scaling, we've crossed a symbolic threshold: a single chip now contains more transistors than the human brain has neurons. Machines built from these devices are beginning to rival—or surpass—human intelligence. This transformation forces us to revisit a question raised at the very first Design Automation Conference in 1964: how does automation reshape our work and our society? Today, that question is more urgent than ever—not only for electronic designers but for the broader world that depends on automation. Decades of progress in Electronic Design Automation made these trillion-transistor systems possible. Synthesis, placement, and routing of billions of components—while balancing cost, performance, power, and reliability—represent one of the most intricate engineering achievements in human history.

Session 2A

(T5-B) Vision and Transformer Acceleration Architectures
13:30-15:35 | Tuesday, January 20, 2026 | Snow White 1
Chair(s):
Caiwen Ding (University of Minnesota - Twin Cities)
Sungju Ryu (Sogang University)
2A-1
13:30-13:55

PipeViT: Accelerating Vision Transformers via Intra-Layer Pipelining

*Xilang Zhou, Yiheng Xu, Haodong Lu, Jun Yu, Kun Wang (Fudan University)
Keywords
Vision Transformers, FPGA, Accelerator
Abstract
Vision Transformers (ViTs) have achieved high performance across various computer vision tasks by leveraging the attention mechanism. However, the attention module in ViTs severely hinders inference performance due to its low operational intensity. Existing approaches improve ViTs efficiency through pruning, sparsity, and linearization, but at the cost of fine-tuning overhead and accuracy degradation. In this paper, we propose PipeViT, a memory-efficient and low-latency accelerator for ViTs inference. The key insight of PipeViT is to exploit intra-layer acceleration opportunities. Specifically, we first fuse the attention operations into a single operator to reduce memory access overhead. Then, we divide the input of attention into multiple tiles to reduce the on-chip memory requirement. Finally, we pipeline the tiled attention computation to improve overall throughput. Based on the optimized dataflow, we design a heterogeneous dual-core architecture for efficient pipeline execution. Furthermore, to maximize hardware utilization, the architecture can be reconfigured into a single core with higher parallelism during the execution of the feed-forward network. Experimental results show that PipeViT achieves up to 19.3x, 1.5x, 2.1x, and 2.0x improvements in Frames Per Second (FPS) compared to state-of-the-art accelerators including ViTA, Auto-ViT, ME-ViT, and HeatViT. Additionally, PipeViT achieves up to 8.0x and 2.6x higher energy efficiency compared to CPU and GPU implementations, respectively.
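For readers who want a concrete picture of tiled, fused attention in general (a generic online-softmax sketch under our own assumptions, not PipeViT's dual-core dataflow), the snippet below processes key/value tiles one at a time so that only a single tile needs to be resident on chip.

    # Generic tiled attention with online softmax; illustrative only.
    import numpy as np

    def tiled_attention(q, k, v, tile=64):
        # q, k, v: (seq_len, d) matrices for one attention head
        n, d = q.shape
        out = np.zeros((n, v.shape[1]))
        row_max = np.full(n, -np.inf)    # running max of attention scores per query
        row_sum = np.zeros(n)            # running softmax denominator per query
        for s in range(0, n, tile):
            kt, vt = k[s:s + tile], v[s:s + tile]
            scores = (q @ kt.T) / np.sqrt(d)
            new_max = np.maximum(row_max, scores.max(axis=1))
            rescale = np.exp(row_max - new_max)          # fix up earlier partial sums
            p = np.exp(scores - new_max[:, None])
            row_sum = row_sum * rescale + p.sum(axis=1)
            out = out * rescale[:, None] + p @ vt
            row_max = new_max
        return out / row_sum[:, None]

    # Matches softmax(q @ k.T / sqrt(d)) @ v while touching one k/v tile at a time.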
2A-2
13:55-14:20

ConfASR: A Conformer Block Accelerator for Speech Recognition Optimized for Edge Devices

*Malte Wabnitz, Max Nilovic, Finn Scholz, Dominik Friedrich, Christian Lanius, Jie Lou, Tobias Gemmeke (RWTH Aachen University)
Keywords
Automatic Speech Recognition, Transformer, Attention, Convolution, ASIC, Edge Devices
Abstract
Attention-based neural networks, like transformers, have significantly improved automatic speech recognition (ASR). Adding convolution operations to transformers results in conformers, which enable better learning of local dependencies and reduce word error rate. We introduce ConfASR, the first conformer block accelerator designed for efficient ASR inference on edge devices. Our system is optimized both in terms of algorithms and hardware to support all transformer operations and additional features required for conformers, including depthwise-separable convolution and learned positional encoding. We propose a hardware-friendly normalization, shared scaling factors for non-linear functions, and an efficient dataflow with a shared MAC array that keeps all activations on chip. Implemented in a 22 nm FDSOI technology, ConfASR operates at 250 MHz with a power consumption of 359 mW, and a die area of 1.19 mm^2. It performs over 900 times faster than necessary for real-time streaming requirements. This makes the architecture suitable not only for ASR but also for other transformer-based applications. ConfASR reduces latency by over 4x and power consumption by 16x during real-time use compared to previous solutions, while supporting more functionality.
2A-3
14:20-14:45

LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model

Huizheng Wang, *Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin (Tsinghua University)
Keywords
Transformers, dynamic sparsity, low complexity, attention
Abstract
Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models exhibit dynamic behavior across stages, calling for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches are single-stage based, and their sparsity prediction mechanisms lead to significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is further proposed to mitigate the accumulation overhead. A data-feature dependent filter (DFF) strategy is designed to work in concert with the MRSA process. Finally, an elaborate accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works Spatten, Sanger and FACT, respectively.
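As a rough illustration of the log-domain idea above (leading-one based approximation of multiplications used to cheaply rank attention candidates before exact computation), the sketch below works on assumed integer activations; it does not reproduce LAPA's ALOC, MRSA, or DFF mechanisms.

```python
def leading_one(x: int) -> int:
    """Position of the most significant set bit (floor(log2 x)) for x > 0."""
    return x.bit_length() - 1

def approx_mul(a: int, b: int) -> int:
    """Replace a*b by shifting one operand to the other's leading-one position."""
    if a == 0 or b == 0:
        return 0
    return a << leading_one(b)

def approx_dot(q, k):
    # Cheap, shift-only estimate of a dot product, keeping the sign exact.
    return sum(approx_mul(abs(a), abs(b)) * (1 if (a >= 0) == (b >= 0) else -1)
               for a, b in zip(q, k))

q = [3, -7, 12, 0]
keys = [[5, 2, -1, 9], [1, 1, 1, 1], [-8, 6, 4, 2]]
# Rank keys by the cheap log-domain score; exact attention would then be spent
# only on the top candidates (here: the top-2).
ranked = sorted(range(len(keys)), key=lambda i: approx_dot(q, keys[i]), reverse=True)
print("keep for exact attention:", ranked[:2])
```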
2A-4
14:45-15:10

BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

Huizheng Wang, *Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin (Tsinghua University)
Keywords
Transformer, attention sparsity, stage-fusion, bit-grained processing, out-of-order execution
Abstract
Attention-based large language models (LLMs) have transformed natural language processing, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparse attention mitigates this, yet its hardware efficiency is limited by the added prediction stage, coupled with costly memory accesses. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design tailored for sparse attention. First, a bit-serial enable stage fusion (BSF) mechanism is proposed to reuse and minimize the memory access by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, an adaptive and lightweight max-oriented threshold selection (MOTS) strategy is developed to work in concert with the bit-wise processing. Third, a bit-level out-of-order processing (BOOP) scheme is employed to enhance hardware utilization during the bit-wise termination. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
2A-5
15:10-15:35

JaneEye: A 12-nm 2K-FPS 18.9-μJ/Frame Event-based Eye Tracking Accelerator

*Tao Han, Ang Li (Delft University of Technology), Qinyu Chen (Leiden University), Chang Gao (Delft University of Technology)
Keywords
Eye Tracking, Extended Reality, ASIC, Deep Neural Network
Abstract
Eye tracking has become a key technology for gaze-based interactions in Extended Reality (XR). However, conventional frame-based eye-tracking systems often fall short of XR's stringent requirements for high accuracy, low latency, and energy efficiency. Event cameras present a compelling alternative, offering ultra-high temporal resolution and low power consumption. In this paper, we present JaneEye, an energy-efficient event-based eye-tracking hardware accelerator designed specifically for wearable devices, leveraging sparse, high-temporal-resolution event data. We introduce an ultra-lightweight neural network architecture featuring a novel ConvJANET layer, which simplifies the traditional ConvLSTM by retaining only the forget gate, thereby halving computational complexity without sacrificing temporal modeling capability. Our proposed model achieves high accuracy with a pixel error of 2.45 on the 3ET+ dataset, using only 17.6K parameters, with up to 1250 Hz event frame rate. To further enhance hardware efficiency, we employ custom linear approximations of activation functions (hardsigmoid and hardtanh) and fixed-point quantization. Through software-hardware co-design, our 12-nm ASIC implementation operates at 400 MHz, delivering an end-to-end latency of 0.5 ms (equivalent to 2000 Frames Per Second (FPS)) at an energy efficiency of 18.9 μJ/frame. JaneEye sets a new benchmark in low-power, high-performance eye-tracking solutions suitable for integration into next-generation XR wearables.
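A minimal sketch of the forget-gate-only recurrence described above: a JANET-style cell keeps a single gate that blends the previous hidden state with a new candidate. The paper's ConvJANET layer uses convolutions for the linear maps; the per-pixel (1x1) linear maps, sizes, and random weights below are assumptions used only to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def janet_step(x, h, Wf, Uf, bf, Wg, Ug, bg):
    """One time step: a single forget gate f blends old state and candidate."""
    f = sigmoid(x @ Wf + h @ Uf + bf)      # forget gate
    g = np.tanh(x @ Wg + h @ Ug + bg)      # candidate state
    return f * h + (1.0 - f) * g           # no input/output gates

rng = np.random.default_rng(1)
C_in, C_hid, hw = 2, 4, 8 * 8              # channels and flattened pixels
params = [rng.standard_normal(s) * 0.1 for s in
          [(C_in, C_hid), (C_hid, C_hid), (C_hid,)] * 2]
h = np.zeros((hw, C_hid))
for _ in range(5):                         # roll the cell over 5 event frames
    x = rng.standard_normal((hw, C_in))
    h = janet_step(x, h, *params)
print(h.shape)
```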

Session 2B

CEDA-EDS Special Session: 100 Years of the FET: From Technology Foundations to the EDA Ecosystem
13:30-15:35 | Tuesday, January 20, 2026 | Snow White 2
Chair(s):
Yu [Kevin] Cao (University of Minnesota)
2B-1
13:30-13:55

FET100: Celebrating the Past and Inspiring the Future

*Bin Zhao (Jr. Past President, IEEE Electron Devices Society)
Abstract
The field-effect transistor (FET) stands as one of the most consequential inventions in the history of modern electronics. From Julius Lilienfeld’s pioneering patents filed in 1925 and 1926 to today’s highly integrated nanoscale devices, the FET has enabled decades of transformative advances in electronics, computing, and communications. At the CEDA–EDS Special Session, this talk marks the FET100 milestone—100 Years of the FET—by reflecting on the origins and evolution of FET technologies, their central role in the transition from vacuum tubes to solid-state electronics, and their foundational impact on integrated circuits. The talk highlights how continued innovations in device architectures, materials, and 3D integration increasingly demand closer interactions between device technology, circuit design, and electronic design automation (EDA). EDA plays a critical role in translating FET innovations into scalable, manufacturable systems and in bridging device technology with circuit and system design. To further advance AI, energy-efficient computing, and emerging applications, FET100 serves not only as a celebration of the past, but also as an opportunity to strengthen the connection between FET technology foundations and the future EDA ecosystem.
2B-2
13:55-14:20

The FET at 100: Old and Needing Assistance

*Greg Yeric
Abstract
This special session celebrates the 100th anniversary of J.E. Lilienfeld's patent of the Field Effect Transistor. Lilienfeld did not make a working FET, due to the limited materials quality and understanding of semiconductor physics in 1926. He did live to see the first working FET, but the success and impact of his invention 100 years on would have been unfathomable to him. In 2026, the FET is in some ways a victim of its own success. The FET is used at such a scale that it contributes to appreciable amounts of global energy demand, and it has unlocked such value in fields such as communications and artificial intelligence that there is no end in sight to a continued increase in demand for more FETs. Yet it has scaled to such extreme nanoscale dimensions that it is no longer the controlling factor in addressing energy use. This talk will address challenges for the FET going forward, the needs and opportunities to replace it with something else, and the challenges of bringing any new technology up to the task of challenging the mighty FET.
2B-3
14:20-14:45

CMOS 2.0: UnFETtering the Scaling of CMOS

*Julien Ryckaert (IMEC)
Abstract
The Field Effect Transistor has long been a cornerstone of VLSI scaling. Indeed, together with interconnect scaling, it enabled a long series of technology generations that exponentially enhanced computing performance. Starting in the late 2000s, it has been supported by Design and System-Technology Co-Optimization. This evolution has shifted the focus from pure dimensional scaling to re-engineering the device architecture, better balancing current drive and device capacitance under geometric scaling. VLSI has more recently been supported by system-technology boosters that offer efficient scaling at the level of the SoC. These innovations include Backside technology and die stacking in 3D. Moving forward, in order to keep supporting the high demand in compute scaling, we will need to double down on the heterogeneity offered by 3D and backside processing. This transition from VLSI to Heterogeneous Large Scale Integration (HLSI) will require profound readjustments of the semiconductor ecosystem. Built on a more intimate optimization between geometric scaling and the assembly of various technologies, it not only requires innovation in process and material, but more importantly in system architecture and design enablement. Indeed, it will require breaking the SoC paradigm by better capitalizing on two properties offered by 3D technologies: heterogeneity and volumetric optimization.
2B-4
14:45-15:10

Compact Modeling - A Bridge between Foundry and Circuit Design

*Yogesh Singh Chauhan (IIT, Kanpur)
Abstract
Compact Models have been the backbone of IC design. In this talk, I will discuss the history and future of compact models.
2B-5
15:10-15:35

Nanoelectronic Modeling (NEMO): From Esoteric Quantum Theory to Software that Helps Design Tomorrow's Atomic-scaled Transistors and Global Impact in nanoHUB

*Gerhard Klimeck (Purdue University)
Abstract
TBA

Session 2C

(T13-B) From HDL to Hardware: Scalable Design Automation for Quantum Computing
13:30-15:35 | Tuesday, January 20, 2026 | Snow White 3
Chair(s):
Zhiding Liang (Chinese University of Hong Kong)
Johannes Geier (Technical University of Munich)
2C-1
13:30-13:55

Scalable Optimization with GIS-PIM: A Generalized Integer-State Probabilistic Ising Machine

*Chirag Garg, Sayeef Salahuddin (University of California, Berkeley)
Keywords
Ising Machine, Integer-state optimization, Combinatorial Optimization Problems (COP), Quantum-inspired computing
Abstract
Physics-inspired hardware solvers based on Ising machines have gained significant attention for addressing combinatorial optimization problems across diverse domains, including logistics, communication networks, and financial optimization. While the problems in these domains typically involve integer states, they are often reformulated to fit the quadratic unconstrained binary optimization (QUBO) framework supported by Ising machines. This is usually achieved by introducing penalty terms, which can degrade solution quality. In this work, we present a generalized integer-state probabilistic Ising machine (GIS-PIM) that natively supports integer-state encoding using compact binary representations. By avoiding one-hot encoding and associated constraints, GIS-PIM efficiently explores the solution space with fewer spin variables. The framework is implemented on a GPU and evaluated on publicly available graph coloring benchmarks, including both synthetic and real-world datasets. It demonstrates strong scalability, solving problem instances with up to 13 colors and graph sizes reaching approximately 20,000 nodes, a level not previously reported for other Ising-based methods. Our results indicate that the proposed approach: i) achieves solution accuracies exceeding 98%, competitive with leading classical methods such as Tabucol and graph neural network-based solvers; and ii) performs significantly better than state-of-the-art simulated bifurcation (SBM) and QUBO-based probabilistic Ising machines (Qu-PIM), achieving higher success probability and faster time-to-solution.
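To make the spin-count argument concrete, the sketch below contrasts one-hot and compact binary encodings of per-node color states, using the problem size quoted in the abstract (13 colors, roughly 20,000 nodes); the encoding arithmetic is generic and is not the GIS-PIM hardware mapping.

```python
import math

def spins_one_hot(n_nodes, n_colors):
    # One spin per (node, color) pair, plus penalty terms to enforce validity.
    return n_nodes * n_colors

def spins_binary(n_nodes, n_colors):
    # ceil(log2(colors)) spins per node encode the color index directly.
    return n_nodes * math.ceil(math.log2(n_colors))

def decode_color(bits):
    """Map a node's binary spin pattern, e.g. [1, 0, 1], to an integer color."""
    return sum(b << i for i, b in enumerate(bits))

n_nodes, n_colors = 20_000, 13
print(spins_one_hot(n_nodes, n_colors))   # 260000 spins with one-hot encoding
print(spins_binary(n_nodes, n_colors))    # 80000 spins with 4 bits per node
print(decode_color([1, 0, 1]))            # color 5
```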
2C-2
13:55-14:20

A Scalable and High-Quality Qubit Mapping and Shuttling Framework for Neutral Atom Quantum Devices

*Sung-Ying Hsieh, Wai-Kei Mak (National Tsing Hua University)
Keywords
neutral atom quantum computing, quantum circuit compilation, shuttle scheduling
Abstract
Advanced neutral atom quantum systems are being developed by the industry due to their unique hardware features that enable efficient quantum circuit execution. In fact, neutral atom quantum systems are the only platforms that simultaneously support both long-range qubit interactions and native multi-qubit gates. However, the unique characteristics of neutral atom devices limit the applicability of existing qubit mapping methods developed for other quantum devices, and the state-of-the-art mapping approach [13] for neutral atom devices suffers from long runtime and low shuttle scheduling parallelism. We introduce a novel mapping and shuttling framework for neutral atom devices. We first partition the input circuit into a sequence of subcircuits, each associated with a mapping that enables the execution of all gates in the subcircuit. To find these mappings efficiently, we adopt a flexible strategy that dynamically switches between two mapping search methods. Then, for each pair of consecutive mappings, we schedule the shuttling operations required to transition between them and prioritize timing-critical operations to improve parallelism. We performed experiments on three benchmark sets, including circuits with up to 1617 qubits and more than 100,000 gates. Our proposed framework produces high-quality routing solutions and consistently achieves higher fidelity with shorter runtime compared to existing approaches.
2C-3
14:20-14:45

Quantum Oracle Synthesis from HDL Designs via Multi Level Intermediate Representation

*Giacomo Lancellotti, Filippo Buda, Giacomo Carugati, Daniele Gazzola, Alessandro Barenghi, Giovanni Agosta, Gerardo Pelosi (Politecnico di Milano)
Keywords
Quantum circuit synthesis, compiler optimizations, MLIR
Abstract
Quantum computing is increasingly recognized as a promising approach for tackling computationally intractable problems. However, achieving the scalability necessary for real-world applications requires substantial advancements in the quantum software stack. In this work, we introduce a compiler toolchain based on a Multi-Level Intermediate Representation (MLIR) that automatically synthesizes quantum circuits from Hardware Description Language (HDL) specifications of classical functions into quantum assembly languages. Many quantum algorithms rely on combinatorial circuits as subroutines, which traditionally require extensive resources in terms of quantum gates and qubits and are often manually optimized. Our toolchain integrates a sequence of optimization passes that combine classical compiler techniques with quantum-specific improvements, resulting in an average qubit reduction of 30% and an average gate-count reduction of 20% in widely adopted benchmark circuits, including those used in cryptographic applications.
2C-4
14:45-15:10

Survival of the Optimized: An Evolutionary Approach to T-depth Reduction

*Archisman Ghosh, Avimita Chatterjee, Swaroop Ghosh (The Pennsylvania State University)
Keywords
Quantum Error Correction, T-depth optimization, Surface Code, Genetic Algorithm
Abstract
Quantum Error Correction (QEC) is the cornerstone of practical Fault-Tolerant Quantum Computing (FTQC), but incurs enormous resource overheads. Circuits must decompose into Clifford+T gates, and the non-transversal T gates demand costly magic-state distillation. As circuit complexity grows, sequential T-gate layers ("T-depth") increase, amplifying the spatiotemporal overhead of QEC. Optimizing T-depth is NP-hard, and existing greedy or brute-force strategies are either inefficient or computationally prohibitive. We frame T-depth reduction as a search optimization problem and present a Genetic Algorithm (GA) framework that approximates optimal layer-merge patterns across the non-convex search space. We introduce a mathematical formulation of the circuit expansion for systematic layer reordering and a greedy initial merge-pair selection, accelerating the convergence and enhancing the solution quality. In our benchmark with ~90-100 qubits, our method reduces T-depth by 79.23% and overall T-count by 41.86%. On standard reversible circuit benchmarks, we achieve a 2.58x average improvement in T-depth over the state-of-the-art methods, demonstrating its viability for near-term FTQC.
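The toy sketch below shows the shape of a genetic-algorithm search over layer-merge patterns. It uses a deliberately simplified model in which a layer is just the set of qubits carrying T gates and adjacent layers may merge only if those sets are disjoint; the real formulation must respect gate commutation and scheduling constraints, so this illustrates the GA loop rather than the paper's method.

```python
import random

random.seed(0)
LAYERS = [frozenset(random.sample(range(12), k=random.randint(1, 4)))
          for _ in range(30)]

def t_depth(mask):
    """Decode a merge mask into a T-depth: greedily merge where allowed."""
    depth, current = 1, set(LAYERS[0])
    for want_merge, layer in zip(mask, LAYERS[1:]):
        if want_merge and current.isdisjoint(layer):
            current |= layer
        else:
            depth, current = depth + 1, set(layer)
    return depth

def evolve(pop_size=40, generations=60, p_mut=0.05):
    n = len(LAYERS) - 1
    pop = [[random.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=t_depth)
        parents = pop[: pop_size // 2]                 # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)
            child = a[:cut] + b[cut:]                  # one-point crossover
            child = [g ^ (random.random() < p_mut) for g in child]
            children.append(child)
        pop = parents + children
    return min(t_depth(ind) for ind in pop)

print("unmerged T-depth:", len(LAYERS), "-> optimized:", evolve())
```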
2C-5
15:10-15:35

Subgraph-based Qubit Mapping for Noisy Intermediate-Scale Quantum Computing

*Junyi Gao, Wei-Hsiang Tseng, Yao-Wen Chang (National Taiwan University)
Keywords
Noisy intermediate-scale quantum (NISQ) computing, Qubit mapping, Graph matching, Graph isomorphism
Abstract
The noisy intermediate-scale quantum (NISQ) computer significantly advances quantum computing technology. Due to the physical connectivity constraints of the NISQ device, its induced qubit mapping problem becomes more challenging. Recent works employ heuristics to achieve promising outcomes. However, they are limited to using only one type of center for graph matching, and their exhaustive traversal of the coupling graph results in high computation time. This paper strategically generates specific subgraphs during the initial mapping stage to reduce the solution space for the coupling graph. Then, we employ a bidirectional graph isomorphism search to improve initial mapping. In the main mapping stage, we develop an efficient search algorithm to minimize the number of inserted gates. Experimental results show that our method significantly outperforms the state-of-the-art work in reducing the number of inserted CNOT gates by 13.22% and the runtime by 19.61%.

Session 2D

University Design Contest
13:30-15:35 | Tuesday, January 20, 2026 | Sleeping Beauty 1/2
Chair(s):
Fengbin Tu (The Hong Kong University of Science and Technology)
2D-1

A 100V 86.2% Efficiency Fibonacci-Dickson Hybrid Boost Converter for Acoustic Screen Applications

*Chen Hu, Yifan Jiang (Southern University of Science and Technology), Yan Lu (Tsinghua University), Junmin Jiang (Southern University of Science and Technology)
Keywords
DC-DC converter, boost converter, high VCR, hybrid converter, switched-capacitor converter, acoustic surface audio
Abstract
This summary presents a 100V output hybrid boost converter for an acoustic display driver. The proposed design incorporates two switched-capacitor topologies, the Fibonacci and Dickson topologies, to balance the on-chip area of power switches and the number of off-chip capacitors. The power stage utilizes laterally diffused N-type MOS (LDMOS) devices to enhance overall performance. A stacked active bootstrap circuit is employed to ensure sufficient gate driving voltage in multilevel bootstrapping, thereby preventing forward voltage drops from the passive diode. The converter achieves a voltage conversion ratio (VCR) of 20-40x, with a maximum output voltage of 100 V from an input of 2.5-5 V. It delivers a maximum output power of 2 W, peak efficiency of 86.2%, and a power density of 7.43 W/cm³.
2D-2

A 5-to-1V DLDO-Hybrid-Sigma Converter Achieving Fast Transient for High-Density Power Delivery

*Zizhe Huang, Yuxiang Li, Yuekang Guo, Jing Jin, Jianjun Zhou (State Key Laboratory of Radio Frequency Heterogeneous Integration (Shanghai Jiao Tong University)), Junmin Jiang (Southern University of Science and Technology)
Keywords
digital low dropout (DLDO) regulator, hybrid dc-dc converter, sigma converter, auxiliary loop, fast transient response, high power density
Abstract
A hybrid sigma converter integrated with a digital low dropout (DLDO) regulator is presented in this paper and designed for high-density power delivery applications. The proposed converter employs an input-series and output-parallel (ISOP) topology, combining a high-side hybrid converter for high power efficiency and a low-side DLDO for rapid transient regulation. An auxiliary control loop minimizes the DLDO’s dropout voltage across wide input and output voltage ranges, eliminating efficiency degradation under non-optimal operating conditions. The converter achieves a peak efficiency of 92%, with a system power density of 101.3W/cm³. A 1.6µs response time and 36mV voltage droop are achieved with a 2A load transient.
2D-3

Full-Stack System Design and Prototyping for Fully Programmable Electronic-Photonic Neurocomputing

*Yinyi Liu (The Hong Kong University of Science and Technology), Peiyu Chen, Bohan Hu (MICS Thrust, The Hong Kong University of Science and Technology (Guangzhou)), Wei Zhang (ECE Department, The Hong Kong University of Science and Technology), Jiang Xu (MICS Thrust, The Hong Kong University of Science and Technology (Guangzhou))
Keywords
electronic-photonic computing, programmable architecture, neuromorphic system, full-stack solution, chip prototyping, RISC-V, MLIR
Abstract
This paper presents a fully-programmable electronic-photonic computing system designed for neuromorphic applications. It integrates diverse photonic arithmetic units, capable of matrix multiplications, dot-product operations, and 1D convolutions, fabricated using the LIGENTEC 200nm silicon nitride process. It also features a customized RISC-V instruction set architecture (ISA) for cross-domain control and scheduling. Electronic logics and peripheral drivers are implemented on an FPGA attached to custom-designed PCB boards. To streamline neural network migration and deployment onto our chip, we propose an auto-compilation and optimization framework built upon Torch and Multi-Level Intermediate Representation (MLIR). This framework is compatible with the RISC-V ecosystem and our customized photonic-involved ISA. Our approach enables efficient and flexible mapping of arbitrary neural networks to electronic-photonic hardware.
2D-4

Analysis and Design of Oblong Coils and Standard-Cell-Based Receiver for Area-Efficient Edge-Coupled Inductive Coupling Transceiver

*Yuki Mitarai, Mototsugu Hamada, Atsutake Kosuge (The University of Tokyo)
Keywords
3D integration, inductive coupling, hysteresis comparator
Abstract
Proximity inductive coupling interfaces provide a low-cost, high-yield solution for 3D assembly, thanks to their compatibility with standard CMOS processes. However, they suffer from challenges related to the design complexity of both the coil and the receiver. To address these issues, this work proposes a comprehensive approach that includes an analytical coil design methodology applicable to edge-coupled configurations, an oblong coil structure to improve layout efficiency, and a standard-cell-based receiver architecture that enables simplified and scalable implementation. The proposed oblong coil achieves a 4.5 times improvement in area efficiency compared to traditional square coils, while maintaining adequate coupling strength and crosstalk tolerance, as validated through a test chip fabricated in a 40 nm CMOS process. The proposed receiver leverages bias sharing and a digitally tunable, standard-cell-based hysteresis comparator, resulting in 0.23 times the area and 0.37 times the energy consumption relative to a conventional analog comparator, as confirmed through simulations in a 16 nm FinFET process.
2D-5

A RHP-Zero-Free Hybrid Step-Up Converter With 95.1% Peak Efficiency for Fast-Transient Applications

*Junyi Ruan (The Chinese University of Hong Kong, Shenzhen), Junmin Jiang (Southern University of Science and Technology), Chenzhou Ding (The Chinese University of Hong Kong, Shenzhen), Ka Nang Leung (The Chinese University of Hong Kong), Xun Liu (The Chinese University of Hong Kong, Shenzhen)
Keywords
DC-DC, Step-up converters, fast response, RHP zero elimination
Abstract
This paper presents a hybrid step-up converter with only one inductor and one flying capacitor. The proposed converter features a left-half-plane zero rather than a right-half-plane one. With a broad bandwidth up to a tenth of the switching frequency, the converter can attain a transient response as fast as that of a buck converter. In addition, the voltage conversion ratio of the proposed converter is identical to that of the conventional boost converter, thereby enabling a wide range of output voltage in systems powered by lithium-ion batteries. Measurement results demonstrate that a peak efficiency of 95.1% is obtained. Given a load current stepping from 50 mA to 500 mA in less than 160 ns, the settling time is merely 2.8 μs.
2D-6

A Relaxation Oscillator with 2.93µJ/cycle Energy Efficiency and 0.068% Period Jitter

*Yongjuan Shi (Southern University of Science and Technology), Xun Liu (Chinese University of Hong Kong, Shenzhen), Chen Hu (Southern University of Science and Technology), Xiyuan Tang (Peking University), Junmin Jiang (Southern University of Science and Technology)
Keywords
Relaxation Oscillator, Low Jitter, Low Power, temperature coefficient (TC), Supply Variations, Phase Noise.
Abstract
This paper presents a 2MHz relaxation oscillator (ROSC) designed for ultra-low power internet-of-things (IoT) applications. Dynamic comparator with dual slope booster (DSB) is utilized to decrease the output jitter of oscillating frequency. A feedback loop with cascaded floating inverter amplifier (C-FIA) is adopted such that (1) the requirement of comparator speed is significantly alleviated and (2) the power consumption of the amplifier is further reduced. The proposed ROSC was fabricated in a 180nm CMOS process and occupies only 0.1mm2 active area. The measurement results with 8 samples show that the average power consumption is only 2.93µJ/cycle (µW/MHz) at 1V supply voltage at room temperature. The average standard variation of the period jitter is 345ps, which is as low as 0.068% of the 500ns typical oscillation period (TOSC).
2D-7

TFLOP: Towards Energy-Efficient LLM Inference: An FPGA-Affinity Accelerator with Unified LUT-based Optimization

*Zongwu Wang (Shanghai Jiao Tong University), Zhongyi Tang (Shanghai Qizhi Institute), Fangxin Liu, Chenyang Guan, Li Jiang, Haibing Guan (Shanghai Jiao Tong University)
Keywords
LLM, FPGA, Accelerator, Product Quantization
Abstract
Large Language Models (LLMs) suffer from significant performance and energy efficiency bottlenecks during the memory-bound decoding stage, where GPUs are often underutilized. We propose TFLOP, a novel CPU-FPGA heterogeneous prototype system that addresses this challenge by employing a 4-bit product quantization scheme on model weights and the KV cache. This approach decomposes GEMV operations in the decoding stage into two hardware-friendly steps: centroid reconstruction and table lookup, which are efficiently mapped onto an FPGA's heterogeneous resources. Our key innovation is a unified FPGA architecture that can handle both row- and column-wise quantization, simplifying hardware design and improving efficiency. Evaluations show that TFLOP achieves superior performance, delivering a 2.76x speedup over the NVIDIA A100 GPU on the LLaMA-2-7B model, while maintaining high accuracy and exceptional energy efficiency.
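To illustrate the product-quantization decomposition mentioned above (a GEMV becomes small centroid dot products plus per-row table lookups), here is a NumPy sketch; the group width, codebook size, and nearest-centroid assignment are illustrative assumptions, not TFLOP's exact 4-bit scheme or its FPGA mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, sub, K = 8, 32, 4, 16          # 16 centroids per 4-wide group
n_groups = in_dim // sub

W = rng.standard_normal((out_dim, in_dim))
# Per-group codebooks and codes (a real flow would learn these, e.g. k-means).
codebooks = rng.standard_normal((n_groups, K, sub))
codes = np.stack([
    np.argmin(((W[:, g*sub:(g+1)*sub][:, None, :] - codebooks[g]) ** 2).sum(-1), axis=1)
    for g in range(n_groups)], axis=1)           # (out_dim, n_groups) indices

def pq_gemv(x):
    y = np.zeros(out_dim)
    for g in range(n_groups):
        table = codebooks[g] @ x[g*sub:(g+1)*sub]   # K small dot products per group
        y += table[codes[:, g]]                     # one table lookup per output row
    return y

x = rng.standard_normal(in_dim)
W_hat = np.concatenate([codebooks[g][codes[:, g]] for g in range(n_groups)], axis=1)
assert np.allclose(pq_gemv(x), W_hat @ x)          # matches the dequantized GEMV
```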

Session 2E

(T10-A) Reliability-Driven and Low-Power Design
13:30-15:35 | Tuesday, January 20, 2026 | Sleeping Beauty 3
Chair(s):
Bei Yu (The Chinese University of Hong Kong)
Yu-Guang Chen (National Central University)
2E-1
13:30-13:55

MF-ECC: Memory-Free Error Correction for Hyperdimensional Computing Edge Accelerators

*Mahboobe Sadeghipour Roodsari, Mahta Mayahinia, Mehdi Tahoori (Karlsruhe Institute of Technology)
Keywords
Hyperdimensional Computing, Error-Correcting Code, Memory-free, Edge devices, fault tolerant, reliability
Abstract
Brain-inspired Hyperdimensional Computing (HDC) is emerging as a compelling paradigm for learning at the edge because of its one-shot learning capability, inherent scalability, and exceptionally low computational overhead. While HDC is robust to noise, soft and hard faults in the memory components of HDC accelerators can still significantly degrade accuracy. Conventional error correction codes (ECC) are commonly used to mitigate such faults, but their associated overhead makes them impractical for resource-constrained edge devices. In this paper, we present a novel memory-free error correction technique to enhance the fault tolerance of HDC systems without requiring any dedicated memory to store check-bits.
2E-2
13:55-14:20

Thermo-NAS: Thermal-resilient ultralow-cost IGZO-based Flexible Neuromorphic Circuits

*Priyanjana Pal, Tara Gheshlaghi (Karlsruhe Institute of Technology, Germany), Suman Balaji, Emre Ozer (Pragmatic Semiconductor), Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany)
Keywords
Thermal-resilient design, Temperature modeling, Neuromorphic Computing, Flexible Electronics, Neural Architecture Search, Activation Function, Neural Networks
Abstract
The demand for next-generation flexible electronics (FE) is rapidly increasing, especially in cost-sensitive consumer markets such as smart packaging, smart bandages, drug delivery systems, RFID tags, and wearable devices. Traditional silicon-based electronics, constrained by high manufacturing costs and rigid form-factor, are inadequate for these emerging applications. However, the lack of rigid packaging in FE, combined with their complex and variable operating conditions, makes them more susceptible to thermal issues, thereby leading to significant performance degradation, abnormal heating, and potential risks to device reliability and safety. To address these thermal challenges, we propose a novel approach to design thermal-resilient (TR) flexible analog neuromorphic circuits (f-NCs) based on amorphous indium-gallium-zinc oxide (a-IGZO) thin-film transistors (TFTs). This cross-layer approach integrates TR circuit design for activation functions (AFs) and evolutionary algorithm (EA) based TR training using Neural Architecture Search (NAS), optimizing both the circuit-level thermal resilience and the architecture-level training, ensuring robust performance of f-NCs under varying thermal conditions. Experiments on 13 benchmark datasets demonstrate that thermal variations result in up to a 50.3% accuracy loss, and the proposed evolutionary algorithm-based thermal-resilient training fully recovers this accuracy at the expense of 1.81x area and 1.39x power overhead.
2E-3
14:20-14:45

FIawase: A SET Fault Injection Framework Towards Exhaustive System-Level Impact Evaluation

*Mingtao Zhang, Quan Cheng, Masanori Hashimoto (Kyoto University)
Keywords
Reliability, fault injection, soft errors, single event transients, hardware emulation
Abstract
Single-event transients (SETs) threaten modern reliability-demanding SoCs equipped with error correction codes (ECC) for single event upset (SEU) mitigation. However, conventional gate-level SET fault injection (FI) remains prohibitively slow for practical reliability evaluation. This work presents FIawase, a high-throughput SET injection framework that enables comprehensive system-level evaluation of SET-induced soft errors. FIawase consists of two phases: a netlist-level inject-and-capture simulation, which systematically flips the output of every gate-cycle pair during program execution to record flip-flop changes one cycle later, and a scan-chain-based replay-and-measure emulation, which replays these patterns at hardware speed to quantify system-level impact. Implemented on an open-source RISC-V system, FIawase reduces a comprehensive SET injection campaign from decades of pure simulation to nearly a day, achieving over four orders of magnitude end-to-end speedup. FIawase takes a critical first step toward exhaustive, cycle-accurate SET analysis, enabling architectural and reliability research at previously infeasible scales.
2E-4
14:45-15:10

MASS: A Masking-aware Search Framework for Reliable QC-LDPC Code Construction in SSDs

*Xiaolu Li, Dingxin Wang, Zhengyao Ding, Jinye Wu, Qingnan Hu (Huazhong University of Science and Technology), Patrick Lee (The Chinese University of Hong Kong), Yuchong Hu, Dan Feng (Huazhong University of Science and Technology)
Keywords
QC-LDPC codes, SSD reliability, SSD performance, SSD simulation
Abstract
Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes have been widely adopted in flash-based solid-state drives (SSDs) to ensure storage reliability due to their efficient encoding and decoding operations, as well as their compact structure. However, as SSD capacities grow, existing QC-LDPC codes face higher error rates, leading to both lifetime and performance degradation. We propose MASS, a new QC-LDPC code construction framework that leverages masking to remove error-prone substructures from the coding matrix during code construction, so as to enhance SSD reliability. MASS further adopts a smart decoding policy that selectively bypasses decoding operations to boost I/O performance. We implement MASS in the SSD simulator, MQSim. Evaluation shows that MASS reduces the decoding failure rate by up to 93%, and its smart decoding policy reduces the average response time by up to 72.6%.
2E-5
15:10-15:35

TIMBER: A Fast Algorithm for Timing and Power Optimization using Multi-bit Flip-flops

Aditya Das Sarma (University of Wisconsin at Madison), *Shui Jiang (The Chinese University of Hong Kong), Wan Luan Lee (University of Wisconsin at Madison), Tsung-Yi Ho (The Chinese University of Hong Kong), Tsung-Wei Huang (University of Wisconsin at Madison)
Keywords
power optimization, multi-bit flip-flop banking and debanking, ICCAD CAD Contest
Abstract
Multi-bit flip-flop (MBFF) banking and debanking is a widely adopted technique for optimizing power and total negative slack (TNS) during the post-placement stage of digital design. While banking flip-flops can reduce both power and area, excessive banking may lead to increased TNS due to significant register displacement, as well as bin density violations (BDVs) caused by over-placing MBFFs in legalized regions. To address these challenges, the EDA community recently organized a CAD Contest seeking innovative solutions from both academia and industry. In response, we present TIMBER, a fast and effective optimization algorithm that balances competing objectives in MBFF placement. Unlike existing methods, TIMBER employs a bin-density-aware placement strategy that simultaneously minimizes BDVs and TNS, while also achieving gains in power and area efficiency. To further enhance the runtime performance, TIMBER incorporates a parallelization strategy. Experimental results on the official 2024 CAD Contest benchmarks demonstrate that TIMBER outperforms the first-place winner, delivering on average 13.08x better solution quality, zero BDVs, 5.06x faster single-threaded runtime, 3.56x lower memory usage and up to 72.49x speedup in multi-threaded execution.
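A much-simplified sketch of the bin-density-aware idea: a candidate 2-bit banking is accepted only if the destination bin stays under a utilization cap. The areas, bin size, midpoint placement, and the omission of timing/TNS terms are all assumptions for illustration; this is not the TIMBER algorithm.

```python
BIN_SIZE, CAP = 10.0, 0.8
FF_AREA, MBFF2_AREA = 2.0, 3.4            # a 2-bit MBFF is smaller than two FFs

def bin_of(x, y):
    return (int(x // BIN_SIZE), int(y // BIN_SIZE))

def try_bank(ff_a, ff_b, bin_usage):
    """Return the MBFF location if banking keeps the target bin under the cap."""
    x = (ff_a[0] + ff_b[0]) / 2.0
    y = (ff_a[1] + ff_b[1]) / 2.0
    b = bin_of(x, y)
    new_usage = bin_usage.get(b, 0.0) + MBFF2_AREA
    if new_usage / (BIN_SIZE * BIN_SIZE) > CAP:
        return None                        # would create a bin-density violation
    bin_usage[b] = new_usage
    return (x, y)

usage = {(0, 0): 76.0}                     # bin (0,0) is already 76% utilized
print(try_bank((1.0, 2.0), (3.0, 4.0), usage))   # accepted: prints (2.0, 3.0)
print(try_bank((1.0, 2.0), (3.0, 4.0), usage))   # rejected: bin would exceed the cap
```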

Session 2F

(T8-C) Application of Generative and Predictive Methods to Design Optimization
13:30-15:35 | Tuesday, January 20, 2026 | Sleeping Beauty 5
Chair(s):
Victor Kravets (IBM Inc.)
Xinyu Chen (The Hong Kong University of Science and Technology (Guangzhou))
2F-1
13:30-13:55

GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units

Maxence Bouvier, *Ryan Amaudruz, Felix Arnold, Renzo Andri, Lukas Cavigelli (Huawei Technologies Switzerland AG)
Keywords
RTL Generation, Logic Synthesis, High-level Synthesis, Artificial Intelligence, Machine Learning, Design Space Exploration
Abstract
As AI workloads proliferate, optimizing arithmetic units is becoming increasingly important to reduce the footprint of digital systems. Conventional design flows, which often rely on manual or heuristics-based optimization, are limited in their ability to thoroughly explore the vast design space. In this paper, we introduce GENIAL, a machine learning-based framework for the automatic generation and optimization of arithmetic units, more specifically multipliers. At the core of GENIAL is a Transformer-based surrogate model trained in two stages, involving self-supervised pretraining followed by supervised fine-tuning, to robustly forecast key hardware metrics such as power and area from abstracted design representations. By inverting the surrogate model, GENIAL efficiently searches for new operand encodings that directly minimize power consumption in arithmetic units for specific input data distributions. Extensive experiments on large datasets demonstrate that GENIAL is consistently more sample efficient than other methods, and converges faster towards optimized designs. This makes it possible to deploy a high-effort logic synthesis optimization flow in the loop, improving the accuracy of the surrogate model. Notably, GENIAL automatically discovers encodings that achieve up to 18% switching activity savings within multipliers on representative AI workloads compared with the conventional two’s complement. We also demonstrate the versatility of our approach by achieving significant improvements on Finite State Machines, highlighting GENIAL's applicability for a wide spectrum of logic functions. Together, these advances mark a significant step toward automated Quality-of-Results-optimized combinational circuit generation for digital systems.
2F-2
13:55-14:20

REvolution: An Evolutionary Framework for RTL Generation driven by Large Language Models

Kyungjun Min, *Kyumin Cho, Junhwan Jang, Seokhyeong Kang (Pohang University of Science and Technology)
Keywords
Large Language Models, Evolutionary Computation, RTL Generation, Electronic Design Automation
Abstract
Large Language Models (LLMs) are used for Register-Transfer Level (RTL) code generation, but they face two main challenges: functional correctness and Power, Performance, and Area (PPA) optimization. Iterative, feedback-based methods partially address these, but they are limited to local search, hindering the discovery of a global optimum. This paper introduces REvolution, a framework that combines Evolutionary Computation (EC) with LLMs for automatic RTL generation and optimization. REvolution evolves a population of candidates in parallel, each defined by a design strategy (Thought), RTL implementation (Code), and evaluation feedback. The framework includes a dual-population algorithm that divides candidates into Fail and Success groups for bug fixing and PPA optimization, respectively. An adaptive mechanism further improves search efficiency by dynamically adjusting the selection probability according to the success rates. Experiments on the VerilogEval and RTLLM benchmarks show that REvolution increased the initial pass rate of various LLMs by up to 24.0 percentage points. The DeepSeekV3 model achieved a final pass rate of 95.5%, comparable to state-of-the-art results, without the need for separate training or domain-specific tools. Additionally, the generated RTL designs showed significant PPA improvements over reference designs. This work introduces a new RTL design approach by combining LLMs' generative capabilities with EC's broad search power, overcoming the local-search limitations of previous methods.
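The adaptive selection mechanism can be pictured as below: the probability of drawing a candidate from the Fail pool (bug fixing) versus the Success pool (PPA optimization) drifts toward whichever pool has recently been productive. The update rule, window size, and the simulated success rates are assumptions for illustration, not the exact REvolution mechanism.

```python
import random

random.seed(0)

def pick_pool(p_fail):
    return "fail" if random.random() < p_fail else "success"

p_fail, alpha = 0.5, 0.2                  # start balanced; smoothing factor
recent = {"fail": [], "success": []}

for step in range(200):
    pool = pick_pool(p_fail)
    # Stand-in for "one LLM mutation attempt": assume fixes from the Fail pool
    # succeed 30% of the time and PPA tweaks from the Success pool 60%.
    succeeded = random.random() < (0.3 if pool == "fail" else 0.6)
    recent[pool].append(succeeded)
    rates = {k: (sum(v[-20:]) / len(v[-20:]) if v else 0.5) for k, v in recent.items()}
    # Spend more selections on whichever pool is currently paying off.
    target = rates["fail"] / (rates["fail"] + rates["success"] + 1e-9)
    p_fail = (1 - alpha) * p_fail + alpha * target

print(f"final P(select Fail pool) = {p_fail:.2f}")
```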
2F-3
14:20-14:45

AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models

*Chenhao Xue (Peking University), Kezhi Li (The Chinese University of Hong Kong), Jiaxing Zhang, Yi Ren (Peking University), Zhengyuan Shi (The Chinese University of Hong Kong), Chen Zhang (Shanghai Jiao Tong University), Yibo Lin, Lining Zhang (Peking University), Qiang Xu (The Chinese University of Hong Kong), Guangyu Sun (Peking University)
Keywords
Diffusion models, Arithmetic circuits, Design automation
Abstract
Arithmetic circuits, such as adders and multipliers, are fundamental components of digital systems, directly impacting the performance, power efficiency, and area footprint. However, optimizing these circuits remains challenging due to the vast design space and complex physical constraints. While recent deep learning-based approaches have shown promise, they struggle to consistently explore high-potential design variants, limiting their optimization efficiency. To address this challenge, we propose AC-Refiner, a novel arithmetic circuit optimization framework leveraging conditional diffusion models. Our key insight is to reframe arithmetic circuit synthesis as a conditional image generation task. By carefully conditioning the denoising diffusion process on target quality-of-results (QoRs), AC-Refiner consistently produces high-quality circuit designs. Furthermore, the explored designs are used to fine-tune the diffusion model, which focuses the exploration near the Pareto frontier. Experimental results demonstrate that AC-Refiner generates designs with superior Pareto optimality, outperforming state-of-the-art baselines. The performance gain is further validated by integrating AC-Refiner into practical applications.
2F-4
14:45-15:10

DeepCut: Structure-Aware GNN Framework for Efficient Cut Timing Prediction in Logic Synthesis

*Lingfeng Zhou, Yilong Zhou, Hao Gong (Hangzhou Dianzi University), Zhengyuan Shi, Qiang Xu (The Chinese University of Hong Kong), Zhufei Chu (Ningbo University), Yue Wu, Xiaoyan Yang (Hangzhou Dianzi University)
Keywords
Machine Learning in EDA, Graph Neural Networks, Delay Prediction, Logic Synthesis
Abstract
Achieving timing convergence during logic synthesis remains a significant challenge due to the weak correlation between pre-mapping optimization metrics and post-mapping performance. While recent approaches have introduced Graph Neural Networks (GNNs) to predict Quality-of-Result (QoR) during optimization, these methods often suffer from scalability limitations and considerable timing overhead. To address these issues, we present DeepCut, a learning-based framework for efficient and adaptable cut-level QoR prediction in logic optimization. DeepCut introduces a heterogeneous graph representation that encodes cut structures as supernodes and integrates skip-list connections to broaden the receptive field of the GNN. The proposed GNN architecture incorporates a structure-aware attention aggregator that dynamically captures both local and hierarchical features, enhancing prediction performance and speeding up inference. Evaluated on a comprehensive benchmark dataset of 46 designs, DeepCut achieves significant improvements over baseline models, with up to 48.8% higher R^2 scores and a 22.8% reduction in Mean Absolute Percentage Error (MAPE), while also delivering superior inference efficiency. Further experiments explored the application of DeepCut within the ABC rewrite framework, showing that the modified rewrite outperforms previous work in terms of delay optimization.
2F-5
15:10-15:35

Lorecast: Layout-Aware Performance and Power Forecasting from Natural Language

Runzhi Wang, Prianka Sengupta, Cristhian Roman Vicharra (Texas A&M University), *Yiran Chen (Duke University), Jiang Hu (Texas A&M University)
Keywords
LLM-assisted circuit design prediction, electronic design automation, performance and power estimation
Abstract
In chip design planning, obtaining reliable performance and power forecasts for various design options is of critical importance. Traditionally, this involves using system-level models, which often lack accuracy, or trial synthesis, which is both labor-intensive and time-consuming. We introduce a new methodology, called Lorecast, which accepts English prompts as input to rapidly generate layout-aware performance and power estimates. To the best of our knowledge, Lorecast is the first approach to enable performance and power forecasting directly from natural language descriptions. This approach bypasses the need for HDL code development and synthesis, making it both fast and user-friendly. Experimental results show that Lorecast achieves accuracy within a few percent of error compared to post-layout analysis, while significantly reducing turnaround time.

Session 3A

(T5-A) Advanced Accelerators for Emerging AI Workloads
15:55-18:00 | Tuesday, January 20, 2026 | Snow White 1
Chair(s):
Guohao Dai (Shanghai Jiao Tong University)
Zhenhua Zhu (HKUST)
3A-1
15:55-16:20

DeepPiC: xPU-PIM Cluster Architecture with Adaptive Resource-Aware Task Orchestration for DeepSeek-Style MoE Inference

*Zixu Li, Manni Li, Zijian Huang, Jiayu Yang, Wending Zhao, Yinyin Lin (Fudan University), Chengchen Wang, Haidong Tian, Xiankui Xiong (ZTE Corporation)
Keywords
Processing-in-Memory (PIM), DRAM, Large Language Model (LLM), Mixture-of-Experts (MoE), Cluster
Abstract
The success of DeepSeek has driven demand for deploying high-performance inference clusters. However, due to its Transformer-based autoregressive structure, DeepSeek remains severely bandwidth-bound, limiting the scalability of traditional xPU (e.g., GPU/TPU). While DRAM-based processing-in-memory (PIM) offers a promising solution to overcome memory bottlenecks, its use in inference clusters for DeepSeek remains underexplored due to three challenges: (1) non-trivial inter-device communication overhead; (2) the need for expert parallelism in the mixture-of-experts (MoE) module; and (3) lack of efficient task offloading to PIM. To this end, we propose DeepPiC, a novel xPU-PIM cluster architecture designed for DeepSeek-style models with multi-latent attention (MLA) and MoE modules. DeepPiC introduces a heterogeneous xPU+HBM-PIM device to accelerate low arithmetic intensity operations. It can seamlessly replace conventional xPU devices without any modification to cluster-level interconnect topology. However, DeepPiC cannot fully realize its performance potential under static scheduling, which fails to adapt to shifting compute and memory demands driven by multidimensional variability (model heterogeneity, cluster-scale volatility, runtime dynamics). This induces inter-device communication overhead and intra-device underutilization. Thus, we propose Adaptive Resource-Aware Task Orchestration (ARTO), a two-phase strategy that decouples global model partitioning from local task assignment by dynamically coordinating (1) cross-device parallelism optimization and (2) intra-device xPU/PIM mapping. Evaluated on DeepSeek V3-671B using H20-, A100-, and H200-Clusters (H20 serves as a compute-limited alternative to high-end GPUs), DeepPiC (H20+HBM-PIM) achieves up to 3x, 2x and 1.3x speedup over H20-, A100-, and H200-Cluster at small batch sizes, while maintaining 74% and 54% of A100- and H200-Cluster performance at large batch sizes. These results demonstrate that DeepPiC enables low-end xPU to approach or even exceed premium ones by fundamentally overcoming memory bottlenecks via adaptive scheduling that orchestrates PIM and xPU heterogeneous resources.
3A-2
16:20-16:45

MoEA: A Mixture of Experts Accelerator with Direct Token Access and Dynamic Expert Scheduling

*Zifeng Zhao (Fudan University; Jiashan Fudan Institute), Jiewen Zheng, Tianxing Xie, Xinghao Zhu (Fudan University), Gengsheng Chen (Fudan University; Jiashan Fudan Institute)
Keywords
AI Accelerator, Transformer, Mixture-of-Experts, Hardware Architectural Design
Abstract
Transformer-based large language models (LLMs) have achieved widespread adoption, but their growing model sizes impose substantial computational costs. The Mixture-of-Experts (MoE) mechanism alleviates this by sparsely activating a subset of experts for each token. However, it still faces two critical challenges: (1) token rearrangement incurs non-trivial overhead of off-chip data movement, and (2) static execution flow fails to exploit inter-expert token reuse. In this paper, we present MoEA, a specialized accelerator designed to address the inefficiencies in MoE inference. MoEA introduces two key innovations: a Direct Token Access mechanism that leverages a hardware-managed metadata queue to eliminate off-chip token rearrangement, and a Dynamic Expert Scheduler that captures inter-expert token reuse patterns and optimizes expert execution order to maximize token reuse. Evaluations on representative MoE benchmarks show that MoEA reduces off-chip memory access by 12.06% and achieves 259.24x, 9.63x and 1.16x speedups over CPU, GPU and EdgeMoE accelerator, respectively.
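The dynamic-scheduling idea (ordering expert execution so consecutively executed experts share routed tokens, letting on-chip activations be reused) can be sketched with a simple greedy ordering; the routing example and the greedy heuristic below are assumptions, not MoEA's scheduler.

```python
def schedule_experts(routing):
    """routing: dict expert_id -> set of token ids routed to that expert."""
    remaining = dict(routing)
    order = [max(remaining, key=lambda e: len(remaining[e]))]  # start with busiest
    del remaining[order[0]]
    while remaining:
        prev = routing[order[-1]]
        # Next, pick the expert sharing the most tokens with the one just run.
        nxt = max(remaining, key=lambda e: len(prev & remaining[e]))
        order.append(nxt)
        del remaining[nxt]
    return order

routing = {0: {1, 2, 3, 8}, 1: {4, 5}, 2: {2, 3, 8, 9}, 3: {4, 5, 6, 7}}
print(schedule_experts(routing))
# -> [0, 2, 1, 3]: experts 0 and 2 run back to back because they share tokens 2, 3, 8.
```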
3A-3
16:45-17:10

MoEA: A Mixed-Precision Edge Accelerator for CNN-MSA Models with Fine-Tuning Support

*Qiwei Dang, Chengyu Ma, Zhiwang Huo, Guoming Yang, Tian Xia, Wenzhe Zhao, Pengju Ren (Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University)
Keywords
Deep Neural Network, Low-precision data format, RISC-V, Hardware Accelerator
Abstract
Recent vision models integrating Convolutional Neural Networks (CNNs) and attention-based Transformers have achieved unprecedented accuracy, but face significant obstacles for edge deployment: sensitivity to quantization formats (especially for nonlinear functions in Transformers), high computational instruction complexity, and the inability to perform on-device fine-tuning for distribution shifts. To address these challenges, we propose Mixture-of-Edge-Architectures (MoEA), a RISC-V-based accelerator with three key innovations. First, the Mixed Fixed/Floating-Point (MFP) format unifies INT8 and Shared Exponent Floating-Point (SFP8) into a single datapath, enabling optimal mixed-precision strategies. Second, the typical VLIW instruction is condensed into a single direct memory access instruction to improve the computation performance. Third, a specialized engine integrates an FPGA-optimized Matrix Processing Unit (MPU), a Single Instruction Multiple Data (SIMD)-based Vector Processing Unit (VPU), and an enhanced Direct Memory Access (DMA) for back-propagation. Implemented on FPGAs, MoEA delivers 420 GOPS (0.99 GOPS/DSP) on ZCU102, with ResNet18 inference at 16 ms and fine-tuning at 70 ms. An 8-cluster variant on VX690T reduces ViT-Base latency to 23 ms, 1.81x ∼ 2.13x faster than prior accelerators, supporting versatile model deployment and adaptive edge intelligence.
3A-4
17:10-17:35

Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems

*En-Ming Huang (National Taiwan University), Li-Shang Lin (National Tsing Hua University), Chun-Yi Lee (National Taiwan University)
Keywords
CPU-GPU optimization, LLM Inference, MoE
Abstract
Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture of Experts (MoE) models provide an efficient solution through selective activation of parameter subsets, which reduces computation requirements. Despite this efficiency, state-of-the-art MoE models still require substantial memory beyond typical consumer GPU capacities. Traditional offloading methods that transfer model weights between CPU and GPU introduce latency that limits inference performance. This paper presents a novel CPU-GPU collaborative inference framework that incorporates an expert caching mechanism on the GPU to reduce data transfer requirements and enable faster inference through cache hits. Computations are offloaded to CPU for efficient cache miss handling, which benefits from CPU multithreading optimizations. The evaluations of our framework demonstrate performance improvements and highlight the potential of CPU-GPU collaboration to maximize hardware utilization for single-request inference scenarios on consumer-grade systems.
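A minimal sketch of the expert-caching behaviour described above: recently used experts stay resident (standing in for GPU memory) and a miss falls back to CPU execution while the expert is staged into the cache for later tokens. Cache capacity, the LRU policy, and the routing trace are illustrative assumptions rather than the paper's framework.

```python
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()        # expert_id -> placeholder for weights

    def run_expert(self, expert_id, token):
        if expert_id in self.cache:       # hit: execute on GPU, refresh recency
            self.cache.move_to_end(expert_id)
            return f"token {token}: expert {expert_id} on GPU"
        # Miss: compute on the CPU now, and stage the expert onto the GPU so
        # later tokens routed to it hit the cache.
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
        self.cache[expert_id] = object()
        return f"token {token}: expert {expert_id} on CPU (cached for next time)"

cache = ExpertCache(capacity=2)
for tok, expert in enumerate([3, 1, 3, 7, 1, 3]):
    print(cache.run_expert(expert, tok))
```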
3A-5
17:35-18:00

BOA-3DGS: Backward-Striding Optimized Accelerator for Reduced Memory Contention in 3D Gaussian Splatting Training

*Hyukjun Kweon (Department of Semiconductor Convergence Engineering, Sungkyunkwan University), Jongyeop Kim (Department of Electrical and Computer Engineering, Sungkyunkwan University), Jeongwoo Park (Sungkyunkwan University)
Keywords
Neural Rendering, 3D Gaussian Splatting, Real-time Rendering, Scene Reconstruction, SLAM, Augmented Reality, Virtual Reality, Differentiable Rasterization, High-throughput Backpropagation
Abstract
3D Gaussian Splatting (3DGS) algorithms have gained increasing attention due to their ability to enable realistic 3D scene reconstruction with faster runtime. However, training these models on graphics processing units (GPUs) faces unique challenges during the backpropagation stage primarily due to memory barriers in the execution model, where every pixel within a tile computes the gradient for a single Gaussian per thread. In this paper, we propose the Backward-Striding Optimized Accelerator for 3D Gaussian Splatting (BOA-3DGS), a hardware-software co-optimized accelerator designed to optimize backward rasterization of 3D Gaussian Splatting by enabling pixels to stride through and select only relevant Gaussians for computing. However, such processing styles lead to a large number of memory stalls due to conflicting gradient storage and Gaussian parameter fetches. By introducing a pixel-independent, funnel-like multi-Gaussian alpha computation and a majority-based gradient accumulation method, we avoid such memory stalls to efficiently accelerate backpropagation of 3DGS. Together, these enhancements improve gradient calculation efficiency and accelerate the backpropagation stage of 3DGS rasterization. Experimental results demonstrate that BOA-3DGS achieves up to 1.58x speedup during backpropagation compared to prior work, while utilizing only 0.86x the area and consuming 0.99x power.

Session 3B

(T1-C) Advances in Agile Design Acceleration
15:55-18:00 | Tuesday, January 20, 2026 | Snow White 2
Chair(s):
Hiroki Nishikawa (The University of Osaka)
Shih-Hao Hung (National Taiwan University)
3B-1
15:55-16:20

Scalarium: A Unified Scala-based Co-Simulation Framework for Agile Chip Development

*Yuefeng Zhang, Cheng Zhang, Wenkai Zhou, Binzhe Yuan, Junsheng Chen, Xiangyu Zhang, Hao Geng, Xin Lou (ShanghaiTech University)
Keywords
Scala, SpinalHDL, Co-simulation, Unified Design Workflow, Agile Development
Abstract
Modern digital integrated circuit and system design workflows rely on hardware/software co-simulation that often employs multi-language methodologies (e.g., SystemC for modeling and VerilogHDL for implementation), introducing significant overhead from manual interface synchronization, cross-toolchain integration, and loss of high-level abstraction. To address these limitations, we propose Scalarium, a unified Scala-based co-simulation framework that integrates a cycle-driven simulator and SpinalHDL hardware modules within a single Scala environment, eliminating Verilog translation and proprietary DPI/PLI glue code. In particular, we propose: 1) a Scala-based iterative hardware design workflow for large-scale digital chip design; 2) an extensible cycle-driven simulation library for agile system modeling and accurate simulation, leveraging Scala’s expressive syntax and type system; and 3) a unified co-simulation platform enabling automatic type-safe hardware/software binding and direct data exchange. Evaluation on a neural rendering accelerator design project demonstrates a 74.8 times simulation speedup over register-transfer level (RTL) with minimal functional deviation (6.5%) and performance mismatch (2.4%), attributable to design differences rather than simulator inaccuracy. Scalarium enhances productivity, debuggability, and maintainability while preserving SpinalHDL's verification advantages.
3B-2
16:20-16:45

VFlow: Discovering Optimal Agentic Workflows for Verilog Generation

*Yangbo Wei, Zhen Huang (Shanghai Jiao Tong University), Lei He (Eastern Institute of Technology, Ningbo), Huang Li (Shanghai Jiao Tong University), Ting-Jung Lin (Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo), W. Xing Wei (The University of Sheffield)
Keywords
Hardware Description Languages, Verilog, Large Language Models, Automated Workflow Optimization, Monte Carlo Tree Search, Digital Circuit Design
Abstract
Hardware design automation faces challenges in generating high-quality Verilog code efficiently. This paper introduces VFlow, an automated framework that optimizes agentic workflows for Verilog code generation. Unlike traditional approaches relying on fixed prompts or manually designed flows, VFlow treats workflow discovery as a search over graph-structured LLM invocation sequences. It introduces a multi-population cooperative evolution (CEPE-MCTS) algorithm that balances multiple hardware objectives—functional correctness, area, power, timing and token cost—while sharing successful patterns and avoiding repeated failures. Integrated multi-level verification ensures syntactic correctness, functional behavior, and synthesizability. Experiments on VerilogEval and RTLLM2.0 show VFlow improves pass@1 by 20-30% over prompting baselines and closely matches designer-level area/power. Remarkably, VFlow enables small LLMs to outperform larger models with up to 10.9x ROI, offering a cost-effective solution for RTL design. This work paves the way for intelligent, automated hardware development, advancing LLM applications in EDA.
3B-3
16:45-17:10

SemanticBBV: A Semantic Signature for Cross-Program Knowledge Reuse in Microarchitecture Simulation

*Zhenguo Liu (The Hong Kong University of Science and Technology (Guangzhou)), Chengao Shi (The Hong Kong University of Science and Technology (HKUST)), Chen Ding (University of Rochester), Jiang Xu (The Hong Kong University of Science and Technology (Guangzhou))
Keywords
Semantic, Program Analysis, Microarchitecture Simulation, Embedding, Binary, Representation Learning
Abstract
For decades, sampling-based techniques have been the de facto standard for accelerating microarchitecture simulation, with the Basic Block Vector (BBV) serving as the cornerstone program representation. Yet the BBV's fundamental limitations, namely order-dependent IDs that prevent cross-program knowledge reuse and a lack of semantic content predictive of hardware performance, have left a massive potential for optimization untapped. To address these gaps, we introduce SemanticBBV, a novel two-stage framework that generates robust, performance-aware signatures for cross-program simulation reuse. First, a lightweight RWKV-based semantic encoder transforms assembly basic blocks into rich Basic Block Embeddings (BBEs), capturing deep functional semantics. Second, an order-invariant Set-Transformer aggregates these BBEs, weighted by execution frequency, into a final signature. Crucially, this stage is co-trained with a dual objective: a triplet loss for signature distinctiveness and a Cycles Per Instruction (CPI) regression task, directly imbuing the signature with performance sensitivity. Our evaluation demonstrates that SemanticBBV not only matches traditional BBVs in single-program accuracy but also enables unprecedented cross-program analysis. By simulating just 14 universal program points, we estimated the performance of ten SPEC benchmarks with 86.3% average accuracy, achieving a 7143x simulation speedup. Furthermore, the signature shows strong adaptability to new microarchitectures with minimal fine-tuning.
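To make the aggregation and training objective concrete, the sketch below (illustrative Python only, not the authors' code) shows a frequency-weighted, order-invariant pooling of basic-block embeddings and a combined triplet-plus-CPI-regression loss; the function names and toy vectors are invented for illustration.

```python
# Illustrative sketch only (not the paper's code): frequency-weighted,
# order-invariant aggregation of basic-block embeddings plus a combined
# triplet + CPI-regression objective, using plain NumPy.
import numpy as np

def aggregate_signature(bbes, freqs):
    """Order-invariant signature: execution-frequency-weighted mean of BBEs."""
    w = np.asarray(freqs, dtype=float)
    w = w / w.sum()
    return (np.asarray(bbes) * w[:, None]).sum(axis=0)

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def dual_objective(anchor, positive, negative, cpi_pred, cpi_true, alpha=0.5):
    """Weighted sum of signature-distinctiveness and CPI-regression terms."""
    return alpha * triplet_loss(anchor, positive, negative) + \
           (1 - alpha) * (cpi_pred - cpi_true) ** 2

# Toy usage: three 4-dimensional block embeddings with execution counts.
sig = aggregate_signature([[1, 0, 0, 2], [0, 1, 1, 0], [2, 2, 0, 1]],
                          freqs=[100, 10, 1])
print(sig, dual_objective(sig, sig + 0.1, sig + 5.0, cpi_pred=1.2, cpi_true=1.0))
```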
3B-4
17:10-17:35

CausalTuner: Will Causality Help High-Dimensional EDA Tool Parameter Tuning

*Ziyang Yu, Peng Xu, Su Zheng (The Chinese University of Hong Kong), Siyuan Xu (Huawei Noah’s Ark Lab), Hao Geng (ShanghaiTech University), Bei Yu (The Chinese University of Hong Kong), Martin D.F. Wong (Hong Kong Baptist University)
Keywords
Design space exploration, Causal inference, Bayesian optimization, VLSI
Abstract
Electronic Design Automation (EDA) tools are central to Very Large Scale Integration (VLSI) design, where numerous parameters govern the Quality-of-Result (QoR) metrics, including performance, power, and area. The high dimensionality of the parameter space, coupled with complex interactions, makes manual tuning inefficient and hinders the scalability of automated methods. Existing methods typically treat parameters as flat vectors, neglecting the EDA flow's hierarchical causal structure, where early-stage decisions constrain downstream stages. To address this, we propose CausalTuner, a causality-aware design space exploration framework for efficient parameter tuning. It employs a hybrid causal attention mechanism to capture stage-wise parameter interactions and embeds them into deep kernel Gaussian processes for accurate and generalizable surrogate modeling. The causal exploration strategies enhance sampling efficiency. Experiments show that CausalTuner outperforms state-of-the-art methods in both final QoR and efficiency.
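As background, the minimal Python sketch below shows a generic Bayesian-optimization loop with a Gaussian-process surrogate and expected-improvement acquisition; `run_flow` is a hypothetical stand-in for one EDA flow evaluation, and the paper's causal attention and deep-kernel GP are not reproduced here.

```python
# Minimal Bayesian-optimization sketch for tool-parameter tuning (illustration
# only; CausalTuner's causal attention and deep-kernel GP are not modeled).
# `run_flow` is a hypothetical stand-in for one EDA flow evaluation.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def run_flow(x):                      # hypothetical QoR metric (lower = better)
    return np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 4))          # 5 initial samples, 4 tool parameters
y = np.array([run_flow(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cand = rng.uniform(size=(256, 4))             # random candidate settings
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, run_flow(x_next))

print("best QoR found:", y.min())
```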
3B-5
17:35-18:00

Synergistic Bayesian Optimization and Reinforcement Learning with Bidirectional Interaction for Efficient VLSI Constraint Tuning

*Jiayi Tu (Southeast University), Jindong Tu (The Chinese University of Hong Kong, Shenzhen), Yuxuan Cai, Yi Zhang, Meng Zhang (Southeast University), Tinghuan Chen (The Chinese University of Hong Kong, Shenzhen)
Keywords
Design space exploration, Co-Adaptive Optimization, Closed-loop interaction, Automatic constraint tuning
Abstract
The exponential complexity growth of the very large-scale integration (VLSI) design space demands efficient automated tuning of constraint parameters. Confronting the dual limitations of Bayesian optimization (BO) in high-dimensional spaces, namely low efficiency and dependence on initial data, and the defect of reinforcement learning (RL), namely excessively high simulation costs due to inefficient exploration, this work proposes a dynamic closed-loop co-adaptive framework. The framework establishes a synergistic enhancement cycle through a bidirectional interaction mechanism: BO generates high-quality initial candidate points to guide RL exploration, while RL refines the Bayesian surrogate model through strategy-driven adaptive search of the parameter space and iterative feedback of new points. A novel coupled HV-WA reward function enhances Pareto frontier diversity while ensuring convergence. Compared to state-of-the-art methods, evaluations on the TV80 CPU across multiple process nodes demonstrate significantly improved Pareto frontier quality, a reduced number of electronic design automation (EDA) tool flow runs, and consistent robustness without historical data, establishing a new paradigm for automated constraint tuning.
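For readers unfamiliar with hypervolume-based rewards, the sketch below computes the two-objective hypervolume of a Pareto front in plain Python/NumPy; it is a generic illustration, not the paper's HV-WA reward.

```python
# Conceptual sketch of a hypervolume (HV) term such a reward might use
# (2-objective minimization, e.g. delay vs. power); not the paper's HV-WA code.
import numpy as np

def hypervolume_2d(points, ref):
    """HV dominated by `points` relative to reference `ref` (both minimized)."""
    pts = np.asarray([p for p in points if p[0] < ref[0] and p[1] < ref[1]])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]           # sort by first objective
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                         # keep only non-dominated steps
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

front = [(1.0, 4.0), (2.0, 2.5), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(5.0, 5.0)))   # 11.5 for this toy front
```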

Session 3C

(T13-A) Beyond Silicon: Emerging Paradigms in EDA for Atomic Scale Computing, Photonics, and Microfluidics
15:55-18:00 | Tuesday, January 20, 2026 | Snow White 3
Chair(s):
Xunzhao Yin (Zhejiang University)
Andy Yu-Guang Chen (National Central University)
3C-1
15:55-16:20

Mastering the Exponential Complexity of Exact Physical Simulation of Silicon Dangling Bonds

*Willem Lambooy, Jan Drewniok, Marcel Walter (Technical University of Munich), Robert Wille (Technical University of Munich & SCCH GmbH)
Keywords
Silicon Dangling Bonds, Atom-Scale Computation, Physical Simulation
Abstract
Silicon Dangling Bond (SiDB) logic is a promising technology for energy-efficient computation, supported by significant advancements in manufacturing and design automation. However, physical simulation, essential for accurately predicting the behavior of SiDB logic prior to costly manufacturing, lags behind these developments. In particular, exact physical simulation, which scales exponentially with base 3, remains infeasible for larger SiDB assemblies, limiting its utility to small structures such as single gates. This computational bottleneck slows progress in SiDB technology and hinders the establishment of reliable ground truths for heuristic approaches. To address the challenge, this work presents a novel methodology for exact SiDB simulation that restructures the exponential search space according to a hierarchical clustering. The hierarchical structure enables systematic pruning of the search space at its different levels: it provides an ordering of interactions between clusters of SiDBs that facilitates effective exploitation of dynamically inferred, problem-specific constraints, much as in solving a Sudoku. Experimental results demonstrate that the effective exponential base can be lowered to approximately 1.3, enabling, for the first time, the exact physical simulation of entire multi-gate SiDB circuits in minutes, where the state of the art would take millions of years. This breakthrough establishes a robust ground truth for SiDB logic validation, marking a pivotal step toward scalable, energy-efficient, and atomic-scale computing.
3C-2
16:20-16:45

Built-In Self-Test for Locating Leakage Defects on Continuous-Flow Microfluidic Chips

*Jiahui Peng, Mengchu Li, Tsun-Ming Tseng, Ulf Schlichtmann (Technical University of Munich)
Keywords
BIST, coding design, CFMBs
Abstract
Continuous-flow microfluidic chips (CFMBs) are effective platforms for biochemical experiments using small volumes of fluids. Inside these chips, fluids are transported through flow channels and controlled by valves, which are actuated by pressure applied via control channels. As the scale of continuous-flow microfluidic chips grows, the likelihood of defects, such as blockage and leakage, increases, highlighting the growing importance of thorough chip testing. However, current leakage testing methods focus on detecting the existence of defects but cannot precisely locate the defective channels. This paper proposes the first coding methodology for the design of a built-in self-test (BIST) module that can precisely locate one or multiple leakage defects, generally with an extra cost of no more than two flow channels. Based on the new BIST module, a novel locating method is then proposed to reduce the number of test operations. For example, given 128 control channels and compared to the state-of-the-art approaches, our method reduces the average number of test operations by up to 85.9% and 76.6%, given one or two pair(s) of random leaky control channels, respectively.
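The sketch below illustrates the general idea of coding-based defect location in Python for the single-leak case: each control channel receives a distinct binary code, each test exercises one bit position, and the pass/fail outcomes spell out the leaky channel's code. It is a conceptual analogue only, not the paper's scheme, which also handles multiple simultaneous leaks with minimal extra flow channels.

```python
# Conceptual illustration of coding-based defect location (not the paper's
# exact scheme): give every control channel a distinct nonzero binary code;
# test j exercises the group whose j-th code bit is 1, and the vector of
# pass/fail outcomes spells out the code of a single leaky channel.
import math

def locate_single_leak(n_channels, leaky):
    k = math.ceil(math.log2(n_channels + 1))          # number of group tests
    code = lambda ch: ch + 1                          # nonzero code per channel
    outcome = 0
    for j in range(k):                                # run k group tests
        group = [c for c in range(n_channels) if (code(c) >> j) & 1]
        if leaky in group:                            # leak observed in test j
            outcome |= 1 << j
    return outcome - 1                                # decode back to channel id

assert locate_single_leak(128, leaky=42) == 42
print("located channel:", locate_single_leak(128, leaky=42),
      "using", math.ceil(math.log2(129)), "tests")
```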
3C-3
16:45-17:10

Accessible Ratio-Specific Mixing: Single-Pressure-Driven Multi-Reagent Mixer Design and Synthesis for 3D-Printed Microfluidics

*Yushen Zhang, Debraj Kundu, Tsun-Ming Tseng (Technical University of Munich), Sudip Roy (Indian Institute of Technology Roorkee), Shigeru Yamashita (Ritsumeikan University), Ulf Schlichtmann (Technical University of Munich)
Keywords
Microfluidics, Biochip, Lab-on-a-Chip, Sample Preparation, Design Synthesis, Design Automation, EDA, 3D-Printing, Bioengineering, Micromixing
Abstract
Precise reagent mixing in user-defined ratios is a fundamental requirement in many microfluidic applications, including diagnostics, chemical synthesis, and biological assays. However, existing solutions for ratio-specific mixing often rely on complex active components, such as multiple pressure sources, flow controllers, or on-chip valves, making them costly, bulky, and unsuitable for portable or low-resource settings. In this work, we present a mixer design and a synthesis method for generating 3D-printable microfluidic devices that achieve ratio-specific mixing using only a single constant pressure source. Our method decomposes the desired mixing ratio into additive subcomponents, each represented by a dedicated inlet channel with a tailored length to enforce the correct hydraulic resistance. The method outputs a complete microfluidic layout, ready for direct fabrication via 3D printers. We validate our approach through numerical simulations and physical prototyping across eight diverse mixing scenarios. Results show that the achieved mixing ratios closely resemble the target, demonstrating the method’s accuracy and robustness. This work enables low-cost, portable, and accessible microfluidic devices for ratio-specific solution delivery, broadening the scope of microfluidics in settings where simplicity, reproducibility, and affordability are critical.
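The following back-of-the-envelope Python sketch shows the underlying resistance argument: under a shared pressure source and equal cross-sections, the flow from each inlet scales inversely with channel length, so lengths chosen proportional to the reciprocals of the target ratio reproduce that ratio. The numbers and helper names are illustrative and do not reflect the paper's synthesis algorithm.

```python
# Back-of-the-envelope sketch of the resistance idea (not the paper's
# synthesis method): with one shared pressure source and equal channel
# cross-sections, hydraulic resistance grows with channel length, so the flow
# from inlet i scales as 1/L_i.  Choosing L_i proportional to 1/r_i therefore
# yields a mixing ratio r_1 : r_2 : ... at the junction.
import numpy as np

def channel_lengths(ratio, l_min=5.0):
    """Lengths (arbitrary units) realizing the target ratio; shortest = l_min."""
    r = np.asarray(ratio, dtype=float)
    lengths = 1.0 / r
    return lengths * (l_min / lengths.min())

def achieved_ratio(lengths):
    flows = 1.0 / np.asarray(lengths)          # Q_i proportional to 1/L_i
    return flows / flows.min()

target = [3, 2, 1]                             # desired 3:2:1 mix
L = channel_lengths(target)
print("lengths:", L, "-> achieved ratio:", achieved_ratio(L))
```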
3C-4
17:10-17:35

ENLighten: Lighten the Transformer, Enable Efficient Optical Acceleration

*Hanqing Zhu (UT Austin), Zhican Zhou (KAUST), Shupeng Ning (UT Austin), Xuhao Wu (KAUST), Ray Chen (UT Austin), Yating Wan (KAUST), David Pan (UT Austin)
Keywords
optical ai, transformer, prune, accelerator
Abstract
Photonic computing has emerged as a promising substrate for accelerating the dense linear-algebra operations at the heart of AI, but its adoption for large Transformer models remains in its infancy. In supporting these massive models, we identify two key bottlenecks: (1) costly electro-optic conversions and data-movement overheads that erode energy efficiency as model sizes scale; (2) a mismatch between limited on-chip photonic resources and the scale of Transformer workloads, which forces frequent reuse of photonic tensor cores and dilutes throughput gains. To address these challenges, we introduce a hardware-software co-design framework. First, we propose Lighten, a PTC-aware compression flow that post-hoc decomposes each Transformer weight matrix into a low-rank component plus a structured sparse component aligned to photonic tensor-core granularity, all without lengthy retraining. Second, we present ENLighten, a reconfigurable photonic accelerator architecture featuring dynamically adaptive tensor cores, driven by broadband light redistribution, for fine-grained sparsity support and full power gating of inactive parts. On ImageNet, Lighten prunes a Base-scale vision transformer by 50% with only a ~1% drop in top-1 accuracy after three epochs of fine-tuning, and when deployed on ENLighten it achieves a 2.5x improvement in energy-delay product over the previous state-of-the-art photonic Transformer accelerator.
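To illustrate the flavor of such a decomposition, the NumPy sketch below splits a weight matrix into a truncated-SVD low-rank part plus a block-structured sparse residual that keeps only the highest-energy tiles; the rank, tile size, and selection rule are arbitrary placeholders, not Lighten's actual settings.

```python
# Rough sketch of the "low-rank + structured sparse" post-hoc decomposition
# idea (illustration only): W ~= low_rank + S, where S keeps only the
# highest-energy blocks of the residual, matching a tensor-core tile size.
import numpy as np

def lighten_sketch(W, rank=8, block=16, keep_blocks=8):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # rank-`rank` part
    R = W - low_rank
    rows, cols = W.shape[0] // block, W.shape[1] // block
    tiles = R.reshape(rows, block, cols, block).transpose(0, 2, 1, 3)
    norms = np.linalg.norm(tiles, axis=(2, 3))             # per-block energy
    keep = np.argsort(norms, axis=None)[-keep_blocks:]     # top-energy blocks
    S = np.zeros_like(W)
    for idx in keep:
        i, j = divmod(idx, cols)
        S[i*block:(i+1)*block, j*block:(j+1)*block] = \
            R[i*block:(i+1)*block, j*block:(j+1)*block]
    return low_rank, S

W = np.random.default_rng(0).standard_normal((64, 64))
low, sparse = lighten_sketch(W)
print("relative error:", np.linalg.norm(W - low - sparse) / np.linalg.norm(W))
```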
3C-5
17:35-18:00

DCPPC: Digital Computation in Programmable Photonic Circuits

*Jun-Wei Liang, Iris Hui-Ru Jiang, Kai-Hsiang Chiu (National Taiwan University)
Keywords
Programmable photonic circuits, Digital computation, Boolean logic
Abstract
Photonic integrated circuits (PICs) have become a promising alternative to CMOS circuits due to their high speed and energy-saving characteristics. Programmable photonic circuits (PPCs), the field-programmable gate array (FPGA) counterpart of PICs, further offer reconfigurability and rapid integration. In this work, we propose a comprehensive methodology for digital computation on PPCs. We first construct the logic building blocks, which serve as unit cells. We devise a novel garbage collection scheme to resolve the signal distortion issue in these cells. A hybrid electronic-photonic scheme and three purely photonic schemes are further proposed to realize the PPC implementation flow of multilayer optical paths. Experimental validation confirms that the proposed framework successfully synthesizes all 3-input Boolean functions under NPN equivalence. We further extend our approach to support 4-input functions and evaluate its scalability through a case study on implementing a majority voting function, comparing our different proposed schemes.

Session 3D

(SS-3) Advances in AI-Driven Circuit Verification and Reliability Analysis
15:55-17:35 | Tuesday, January 20, 2026 | Sleeping Beauty 1/2
Chair(s):
Zhiyao Xie (The Hong Kong University of Science and Technology)
3D-1
15:55-16:20

SuperSAGA: A Supervisor-Subordinate Agentic Workflow for the Generation of Assertions

Subhajit Paul, *Ansuman Banerjee, Sumana Ghosh (Indian Statistical Institute), Sudhakar Surendran (Texas Instruments India), Raj Kumar Gajavelly (IBM Systems India)
Keywords
SystemVerilog Assertion, Agentic Workflow, Code Coverage, Formal Verification, Large Language Model
Abstract
We present SuperSAGA, an agentic semi-automated formal verification framework that assists in generating, debugging, and refining SystemVerilog Assertions (SVA) from natural language specifications. Rather than relying on fully manual workflows, SuperSAGA combines Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to guide assertion development based on human-reviewed verification plans using an agentic workflow. The framework translates specifications into syntactically correct assertions, integrates feedback from formal verification tools, and supports iterative refinement using an orchestration of supervisor and subordinate agents. Evaluation on OpenTitan IP modules shows improved quantitative coverage over the state of the art and reduced manual effort, demonstrating the potential of guided automation in simplifying the assertion generation process for hardware designers.
3D-2
16:20-16:45

Understanding and Predicting Vmin Failures in Power Delivery Networks through Multi-Order Droop Signatures

Songyu Sun, Jingchao Hu, Zhou Jin, *Cheng Zhuo (Zhejiang University)
Keywords
Power Delivery Network, Voltage Droop, Minimum Supply Voltage, Power Integrity, Machine Learning
Abstract
As voltage margins continue to shrink in modern high-performance ICs, circuits become increasingly vulnerable to power supply noise, making the minimum supply voltage (Vmin) a critical metric for reliable operation. These voltage fluctuations arise from the multi-level characteristics of the power delivery network (PDN), whose frequency-dependent impedance induces multi-order voltage droops under dynamic loads. This paper presents a systematic framework for understanding and predicting Vmin failures in PDNs through multi-order droop signatures. We examine how varying input current profiles affect the relative impact of each PDN level and conduct a comprehensive statistical study to quantify the relationship between droop characteristics and multi-level contributions to Vmin. A machine-learning model is further developed to rapidly and accurately predict multilevel contribution ratios from input current profiles and droop signatures, offering insights into Vmin failures and facilitating efficient PDN optimization for improved power integrity.
3D-3
16:45-17:10

AssertMiner: Module-Level Spec Generation and Assertion Mining using Static Analysis Guided LLMs

Hongqin Lyu (Institute of Computing Technology, CAS; University of Chinese Academy of Sciences), Yonghao Wang (Institute of Computing Technology), Jiaxin Zhou (Beijing Normal University), Zhiteng Chao, Tiancheng Wang (Institute of Computing Technology), *Huawei Li (Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
Keywords
Functional Verification, Assertion Mining, Large Language Model, Specification Extraction
Abstract
Assertion-based verification (ABV) is a key approach to checking whether a logic design complies with its architectural specifications. Existing assertion generation methods based on design specifications typically produce only top-level assertions, overlooking verification needs on the implementation details in the modules at the micro-architectural level, where design errors occur more frequently. To address this limitation, we present AssertMiner, a module-level assertion generation framework that leverages static information generated from abstract syntax tree (AST) to assist LLMs in mining assertions. Specifically, it performs AST-based structural extraction to derive the module call graph, I/O table, and dataflow graph, guiding the LLM to generate module-level specifications and mine module-level assertions. Our evaluation demonstrates that AssertMiner outperforms existing methods such as AssertLLM and Spec2Assertion in generating high-quality assertions for modules. When integrated with these methods, AssertMiner can enhance the structural coverage and significantly improve the error detection capability, enabling a more comprehensive and efficient verification process.
3D-4
17:10-17:35

LLM-Assisted Circuit Verification: A Comprehensive Survey

Hongduo Liu, Yuntao Lu, Mingjun Wang, Xufeng Yao, *Bei Yu (The Chinese University of Hong Kong)
Keywords
LLM-Assisted Circuit Verification
Abstract
Circuit verification constitutes a significant bottleneck in modern electronic design, often consuming up to 70% of the development cycle due to the escalating complexity of circuits. The emergence of large language models (LLMs) presents a promising new frontier, demonstrating profound capabilities in code generation, debugging, and automated reasoning. This paper provides a comprehensive survey of LLM-assisted hardware verification. We systematically review state-of-the-art approaches that leverage LLMs for critical verification tasks, including assertion synthesis, testbench generation, automated debugging, and the development of collaborative verification frameworks. We analyze the effectiveness of various LLM strategies specifically designed for hardware verification applications, while also discussing the integration of LLMs with existing verification workflows and tools. Furthermore, the paper identifies key remaining challenges and outlines a forward-looking perspective on future research directions.

Session 3E

(T9-E) Timing Analysis and Timing-Aware Physical Synthesis
15:55-18:00 | Tuesday, January 20, 2026 | Sleeping Beauty 3
Chair(s):
Jeong-Tyng Li (National Tsing Hua University)
Yi-Yu Liu (National Taiwan University of Science and Technology)
3E-1
15:55-16:20

HeteroLatch: A CPU-GPU Heterogeneous Latch-Aware Timing Analysis Engine

*Xizhe Shi, Zizheng Guo, Yibo Lin, Zuodong Zhang, Yun Liang, Runsheng Wang (Peking University)
Keywords
Static timing analysis, CPU-GPU Heterogeneous, Latch-Aware
Abstract
Latches, prevalent in high-frequency circuits, challenge timing analysis due to time borrowing and latch loops, complicating static timing analysis (STA) algorithms and parallelization strategies. To address these issues, we propose HeteroLatch, a CPU-GPU cooperative framework that enables efficient latch-aware timing analysis. By integrating adaptive loop handling with hierarchical parallel timing propagation, our method mitigates sequential bottlenecks through CPU-GPU collaboration, hiding graph decomposition overhead via early termination, while optimizing GPU throughput with dynamic workload allocation. Experimental results demonstrate an average speed-up of 14.43x, 10.46x and 2.02x compared to industrial timers PrimeTime, OpenSTA and SOTA work, respectively. HeteroLatch bridges the gap between latch-specific timing complexities and GPU acceleration, offering a scalable solution for advanced-node verification.
3E-2
16:20-16:45

Novel Multi-Corner Delay Padding using Path Relationship Analysis and Dual Decomposition

*Kaixiang Zhu, Jiangnan Li, Lingli Wang, Wai-Shing Luk (Fudan University)
Keywords
Clock skew scheduling, Delay padding, Process variation, Dual decomposition
Abstract
Multi-corner timing analysis is essential for ensuring the robustness of circuits under variations in process, voltage, and temperature (PVT). Along with clock skew scheduling, delay padding is used to address hold violations. However, applying padding consistently across multiple corners is challenging due to conflicting constraints and the prevalence of "ping-pong" effects. This paper presents a novel methodology that uses dual decomposition to tackle this challenge. The problem is divided into a set of network flow problems, one for each corner. These problems are coupled through shared delay variables. Coordinating these subproblems using Lagrange multipliers ensures consistent padding assignments across corners. Additionally, traditional padding methods often struggle with physical feasibility. The incorporation of path relationship analysis is proposed to identify viable, physically feasible padding locations. Experimental results on industrial benchmarks demonstrate that the proposed method efficiently identifies feasible padding solutions and achieves the minimum clock period that satisfies the setup and hold time constraints for all corners. Compared to the single worst-case corner baseline, the optimized clock period is reduced by up to 9%, highlighting the effectiveness of our approach.
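As a toy illustration of dual decomposition, the Python sketch below couples simple quadratic per-corner subproblems through a shared delay variable and Lagrange-multiplier (price) updates; it conveys only the coordination mechanism and is not the paper's network-flow formulation.

```python
# Toy dual-decomposition sketch (illustration only): quadratic per-corner
# costs stand in for the per-corner network-flow subproblems, and the shared
# variable z plays the role of a padding delay that must agree across corners.
import numpy as np

targets = np.array([1.0, 3.0, 2.0])        # each corner's locally preferred delay
lam = np.zeros_like(targets)               # one Lagrange multiplier per corner
step = 0.5

for _ in range(100):
    # Each corner solves its own subproblem: argmin_x 0.5*(x - t)^2 + lam*x
    x = targets - lam
    z = x.mean()                           # consensus value of the shared delay
    lam += step * (x - z)                  # price update drives every x_c -> z

print("consensus delay:", z, "corner solutions:", x)
```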
3E-3
16:45-17:10

GNN-Based Timing Yield Prediction From Statistical Static Timing Analysis

*Chenbo Xi, Liang Zhang (ShanghaiTech University), Biwei Xie (Chinese Academy of Sciences), Pingqiang Zhou (ShanghaiTech University)
Keywords
Timing yield, Graph neural network, Statistical static timing analysis
Abstract
As one of the key indicators in digital IC design, timing yield is closely related to the topological correlation among individual timing paths. Timing yield analysis slows down the design cycle, as this time-consuming step needs to be performed repeatedly during post-routing optimization iterations. To improve the design efficiency, in this work, we propose a fast yet accurate GNN-based framework to predict the timing yield for the entire design from SSTA (statistical static timing analysis) results. The experimental results show that our method can achieve a speedup of more than one order of magnitude when compared with the conventional analysis flow.
3E-4
17:10-17:35

MIMIC: Machine Intelligence for Scalable Generation of Synthetic Timing Cone Datasets

*Ajay Yadav, Vinodh Kumar Ramasamy (Arizona State University), Juan Arturo Garza, Taylor Hannan, Mark Lee, Mahesh Sharma (Tenstorrent), Vidya A. Chhabria (Arizona State University)
Keywords
AI, synthetic data generation, timing cones, STA, dataset
Abstract
Many machine learning (ML)-based approaches have been proposed to predict and optimize timing. However, the effectiveness of these techniques is often limited by the lack of large-scale, diverse, and realistic datasets—particularly those that reflect the structural and timing complexities of industrial designs. In this work, we introduce MIMIC (Machine intelligence for scalable generation of synthetic timing cone datasets), a methodology for generating high-quality synthetic timing cones that mimic industry-scale netlist topologies and timing characteristics. The framework synthesizes a diverse dataset using a three-stage ML pipeline: (1) timing cone shape generation, (2) node type prediction for technology mapping, and (3) edge prediction to model internal cone connectivity. The generated cones closely resemble real designs. MIMIC produces synthetic timing cones for a large-scale netlist within a few seconds. We evaluate the dataset for realism and structural diversity using quantitative statistical metrics from both the EDA and ML worlds. We also demonstrate the effectiveness and generalizability of the MIMIC dataset for an ML timing prediction task.
3E-5
17:35-18:00

Differentiable Tier Assignment for Timing and Congestion-Aware Routing in 3D ICs

*Yuan-Hsiang Lu, Hao-Hsiang Hsiao (Georgia Institute of Technology), Yi-Chen Lu, Haoxing Ren (NVIDIA), Sung Kyu Lim (Georgia Institute of Technology)
Keywords
3D ICs, Congestion, Concurrent optimization, Metal layer sharing, Tier assignment
Abstract
State-of-the-art (SOTA) 3D physical design (PD) flows extend commercial 2D place-and-route (P&R) tools to enable signoff-quality 3D IC implementation through double metal stacking and inter-die metal layer sharing. While metal layer sharing introduces additional routing resources, the substantially higher manufacturing cost of face-to-face (F2F) inter-die vias compared to intra-die vias necessitates 3D-aware routing strategies to manage routability-cost trade-offs. To address this, we propose differentiable routing guidance for 3D ICs (DRG-3D), a GPU-accelerated differentiable optimization framework that provides routing guidance for 3D ICs. DRG-3D formulates a fully differentiable objective that simultaneously optimizes key 3D design metrics: routing congestion, wirelength, via cost, and F2F-via cost, which enables efficient and scalable gradient-based optimization over large-scale netlists. Experimental results show that DRG-3D outperforms the SOTA Pin-3D flow, achieving up to 8.37% reduction in routing overflow, 23.99% reduction in total negative slack (TNS), and 18.05% reduction in post-route timing violations.

Session 3F

(T8-B) Advanced Performance Optimization for High-level Synthesis and Scheduling
15:55-18:00 | Tuesday, January 20, 2026 | Sleeping Beauty 5
Chair(s):
Wei Zhang (The Hong Kong University of Science and Technology)
Zeke Wang (Zhejiang University)
3F-1
15:55-16:20

Automatic Recursion Elimination using Recurrence Relations for Synthesis of Stack-free Hardware

*Adam Musa, Christophe Dubach (McGill University)
Keywords
Recursion, High-Level Synthesis, Incrementalization, Automated Refactoring, Static Analysis, FPGA
Abstract
HLS eases hardware design by offering a higher level of abstraction. However, high-level programming concepts, such as recursion, are costly to synthesize, if at all possible. Recursion typically relies on a dynamic call stack, whose hardware implementation is resource-intensive and inefficient. Existing approaches solve this issue by replacing recursion with iteration using explicit stack arrays or by detecting specific patterns (e.g., tail recursion) to avoid using the stack. This paper introduces a novel technique for transforming recursive functions into equivalent stack-free iterative implementations. Using static analysis, a recurrence relation is extracted from the function, representing the function as a sequence bounded by the order of the recurrence relation. This relation is then used to optimize the process of incrementalization, constructing an iterative, synthesizable, and stack-free version of the function that uses a bounded static array. This approach is evaluated on a set of recursive benchmarks used in prior work. It eliminates recursion from 9 out of 19 benchmarks and achieves a 2.0x performance speedup over state-of-the-art solutions. Additionally, it removes the need for BRAM and reduces LUT usage by 12% over prior work.
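A minimal Python illustration of the underlying idea follows: a recursive function governed by an order-2 linear recurrence is rewritten as a loop over a bounded window of previous values, removing the call stack. The example is ours, not the paper's transformation pass.

```python
# Simplified illustration of the idea (not the paper's static-analysis pass):
# a recursive function whose calls follow a linear recurrence of order k can
# be rewritten as a loop over a bounded window of k previous values, removing
# the call stack and the unbounded storage, which makes it synthesizable.

def f_recursive(n):                 # order-2 recurrence: f(n) = f(n-1) + f(n-2)
    if n < 2:
        return 1
    return f_recursive(n - 1) + f_recursive(n - 2)

def f_iterative(n):
    window = [1, 1]                 # bounded static array, size = recurrence order
    for _ in range(n - 1):
        window[0], window[1] = window[1], window[0] + window[1]
    return window[1] if n >= 1 else window[0]

assert all(f_recursive(n) == f_iterative(n) for n in range(15))
```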
3F-2
16:20-16:45

FIFOAdvisor: A DSE Framework for Automated FIFO Sizing of High-Level Synthesis Designs

Stefan Abi-Karam, Rishov Sarkar (Georgia Institute of Technology), Suhail Basalama, Jason Cong (University of California, Los Angeles), *Callie Hao (Georgia Institute of Technology)
Keywords
FPGA, HLS, DSE, Dataflow Architecture, Streaming, Optimization
Abstract
Dataflow hardware designs are important for efficient algorithm implementations across various domains using high-level synthesis (HLS) targeting FPGAs. However, these designs pose a challenge: correctly and optimally sizing first-in-first-out (FIFO) channel buffers. FIFO sizes are user-defined parameters, introducing a trade-off between latency and area—undersized FIFOs cause stalls and increase latency, while oversized FIFOs waste on-chip memory. In many cases, insufficient FIFO sizes can also lead to deadlocks. Deciding the best FIFO sizes is non-trivial. Existing methods make limiting assumptions about FIFO access patterns, overallocate FIFOs conservatively, or use time-consuming RTL simulations to evaluate different FIFO sizes. Furthermore, we highlight that runtime-based analyses (i.e., simulation) are the only way to solve the FIFO optimization problem while ensuring a deadlock-free solution for designs with data-dependent control flow. To tackle this challenge, we propose FIFOAdvisor, a framework to automatically decide FIFO sizes in HLS designs. Our approach is powered by LightningSim, a fast simulator that is 99.9% cycle-accurate and supports millisecond-scale incremental simulations with new FIFO configurations. We formulate FIFO sizing as a dual-objective black-box optimization problem and explore various heuristic and search-based methods to analyze the latency-resource trade-off. We also integrate FIFOAdvisor with Stream-HLS, a recent framework for optimizing affine dataflow designs lowered from C++, MLIR, or PyTorch, enabling deeper optimization of the heavily-used FIFOs in these workloads. We evaluate FIFOAdvisor on a suite of Stream-HLS benchmarks, including linear algebra and deep learning workloads, to demonstrate our approach's ability to optimize large and dynamic dataflow patterns. Our results show Pareto-optimal latency-memory usage frontiers for FIFO configurations generated via different optimization strategies. Compared to baseline designs with naïvely-sized FIFOs, FIFOAdvisor identifies configurations with much lower memory usage and minimal delay overhead. Additionally, we measure the runtime of our optimization process and demonstrate significant speedups compared to traditional HLS/RTL co-simulation-based approaches, making FIFOAdvisor practical for rapid design space exploration. Finally, we present a case study using FIFOAdvisor to optimize a complex hardware accelerator with non-trivial data-dependent control flow. Code and results open-sourced at https://anonymous.4open.science/r/fifo-advisor.
3F-3
16:45-17:10

HLS-Timer: Fine-Grained Path-Level Timing Estimation for High-Level Synthesis

*Zibo Hu (Beijing University of Posts and Telecommunications), Zhe Lin (Sun Yat-sen University), Renjing Hou, Xingyu Qin, Jianwang Zhai, Kang Zhao (Beijing University of Posts and Telecommunications)
Keywords
High-Level Synthesis, Timing Estimation, Data Delay, Timing Path
Abstract
Electronic Design Automation (EDA) requires early-stage timing guidance to maximize optimization potential. Accurate timing estimation is essential in the High-Level Synthesis (HLS) stage. However, current HLS tools often produce inaccurate timing predictions, resulting in unmet performance targets. While post-synthesis EDA toolchains can provide precise timing analysis, their exhaustive methodologies are prohibitively time-consuming and computationally expensive. Recent machine learning approaches have demonstrated promising results in predicting design-level timing metrics in HLS designs, such as Worst Negative Slack (WNS) and Critical Path (CP) delay. Nevertheless, fine-grained, path-level timing estimation remains an unresolved challenge. In this work, we present HLS-Timer, the first path-level timing estimator for HLS. The proposed framework employs a graph-based representation of the HLS design, integrating local structural features with global contextual information to model timing paths and provide accurate, fine-grained delay predictions. Experimental results demonstrate that on previously unseen designs HLS-Timer achieves exceptional accuracy in path-level delay estimation (Pearson R = 0.94, R² = 0.93, MAPE = 18.96%), highlighting its strong generalization capability. Furthermore, it surpasses state-of-the-art baselines in design-level timing prediction, reducing MAPE to 9.97% for WNS and 6.79% for CP delays.
3F-4
17:10-17:35

FESTAL: Dataflow Accelerator Synthesis Framework with Graph-Based Fusion for FPGA

*Ruifan Xu, Yuyang Zou, Yun Liang (Peking University)
Keywords
High-level Synthesis, MLIR, Dataflow Architecture, FPGA, Streaming
Abstract
High-Level Synthesis (HLS) provides a promising approach to design hardware at the software level. However, recent research efforts primarily focus on computational optimization while assuming a perfect memory system. As a result, issues such as limited on-chip buffer capacity and high-latency off-chip memory access frequently become performance bottlenecks. Dataflow architectures address this by enabling parallel task execution with direct on-chip communication, reducing the need for external memory access. However, dataflow implementation presents significant challenges, such as determining inter-task communication and balancing compute and memory resources. A comprehensive modeling approach is necessary to fully leverage the benefits of dataflow for enhanced hardware performance. In this paper, we present Festal, a holistic FPGA synthesis framework that automatically generates efficient dataflow accelerators. Festal introduces a novel graph-based algorithm that systematically explores task fusion opportunities, optimizing inter-task communication patterns entirely on-chip and thereby reducing the need for off-chip memory access. By explicitly modeling memory constraints, the framework achieves a critical balance between computational workload and memory resources. Built on the MLIR infrastructure, Festal proposes a two-level IR to model memory management and streaming channels, providing an efficient solution for dataflow designs. Experimental results show that Festal achieves an average speedup of 2.06X on standard benchmark suites, outperforming the state-of-the-art synthesis framework. For real-world applications, Festal demonstrates performance comparable to custom FPGA accelerators, underscoring its practical effectiveness.
3F-5
17:35-18:00

ARCS: Architecture-Responsive CGRA Scheduling

*Omar Ragheb (Fujitsu Consulting (Canada) Inc., University of Toronto), Jason Anderson (University of Toronto)
Keywords
CGRAs, Mapping, Scheduling
Abstract
Scheduling is a key aspect of mapping applications to coarse-grained reconfigurable architectures (CGRAs). During scheduling, the number of pipeline registers on each path is determined to ensure that the data for an operation arrives at the correct cycle. Traditional scheduling methods, such as as-soon-as-possible (ASAP) and as-late-as-possible (ALAP), determine the pipeline registers without considering the placement of operations. This can restrict the mapping algorithm, forcing placement to accommodate scheduling and limiting routing options to satisfy both scheduling and placement. To overcome the limitations of traditional schedulers, we propose a method that adaptively adjusts the schedule during the initial routing phase. In this approach, the mapping algorithm begins with an ASAP schedule to establish initial schedule constraints. We then utilize simulated annealing for placement, and we employ an architecture-responsive scheduling algorithm post-placement to update the schedule of each edge based on the placement and generate an initial routing solution with overlaps. Afterwards, the PathFinder algorithm is applied to the initial routing solution, along with the generated schedule, to find a valid routing with no overlaps. Our results demonstrate that the architecture-responsive scheduling approach maintains quality comparable to that of conventional ASAP scheduling. Furthermore, architecture-responsive scheduling enables generic mapping of applications onto restricted architectures that do not allow routes to bypass pipeline registers, a challenge that traditional schedulers do not address.
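For context, the short Python sketch below computes a plain ASAP schedule over a small dataflow graph (each operation starts one cycle after its latest predecessor); it shows the baseline the proposed scheduler adapts, not the architecture-responsive algorithm itself.

```python
# Minimal ASAP scheduling of a dataflow graph, the baseline the proposed
# architecture-responsive scheduler starts from (illustrative sketch; the
# paper's CGRA-specific pipeline-register handling is not modeled).
from graphlib import TopologicalSorter

def asap_schedule(edges, nodes):
    """Cycle of each op = 1 + latest cycle among its predecessors."""
    preds = {n: [] for n in nodes}
    deps = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        deps[v].add(u)
    cycle = {}
    for n in TopologicalSorter(deps).static_order():
        cycle[n] = 0 if not preds[n] else 1 + max(cycle[p] for p in preds[n])
    return cycle

nodes = ["load_a", "load_b", "mul", "add", "store"]
edges = [("load_a", "mul"), ("load_b", "mul"), ("mul", "add"),
         ("load_b", "add"), ("add", "store")]
print(asap_schedule(edges, nodes))
# {'load_a': 0, 'load_b': 0, 'mul': 1, 'add': 2, 'store': 3}
```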

Keynote Session II

Keynote Addresses
08:20-09:50 | Wednesday, January 21, 2026 | Cinderella Ballroom 1/6/7/8
Yuan Xie
Fang Professor of Engineering
Chair Professor, Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology
08:20-09:05
Keynote Address

Déjà Vu: From 3D to Chiplet and PIM/NDP — A Historical Perspective

Biography
Dr. Yuan Xie is currently with The Hong Kong University of Science and Technology as Chair Professor in the ECE department and FANG Professor of Engineering. He received a B.S. degree in Electronic Engineering from Tsinghua University and a Ph.D. degree in Computer Engineering from Princeton University. Before joining HKUST, he gained rich industry and academic experience: he was with Alibaba DAMO Academy and T-Head Semiconductor, was a Professor at the University of California, Santa Barbara (UCSB) and at Pennsylvania State University, worked at AMD Research, and was an Advisory Engineer with IBM Microelectronics. Yuan Xie is a Fellow of IEEE, ACM, and AAAS, and a recipient of many awards.
Abstract
In this talk, the speaker reflects on a career journey marked by exploration at the intersection of technology and architecture, transitioning between academia and industry. The discussion highlights how technological advancements can drive architectural innovation, while architectural choices, in turn, influence the adoption and evolution of new technologies. Drawing from personal experience, the speaker offers a historical perspective on the dynamic interplay between 3D integration, chiplet-based design, and Processing-in-Memory (PIM) / Near-Data Processing (NDP) paradigms.
Charles Alpert
Cadence AI Fellow
Cadence Design Systems, Inc.
09:05-09:50
Keynote Address

Harnessing Agentic AI to Accelerate Designer Productivity

Biography
Dr. Charles (Chuck) Alpert is Cadence's AI Fellow and drives cross-functional Agentic AI solutions throughout Cadence's software stack. Prior to this, he led various pioneering teams in digital implementation, including Global Routing, Clock Tree Synthesis, Genus Synthesis, and Cerebrus AI. Charles has published over 100 papers and received over 100 patents in the EDA space. He is a Cadence Master Inventor. He has served as Deputy Editor-in-Chief for IEEE Transactions on Computer-Aided Design, chaired the IEEE/ACM Design Automation Conference, and is an IEEE Fellow. He received B.S. and B.A. degrees from Stanford University and a Ph.D. in Computer Science from UCLA.
Abstract
As the complexity of chip designs continues to escalate and design cycles shrink, the demand for enhancing designer productivity becomes imperative. Currently, designers are entrenched in manual tasks such as writing RTL, creating verification test plans, and arduously debugging physical design flows. The industry is eagerly turning to agentic AI to elevate the abstraction level for design engineers. This talk delves into how to leverage frontier models to address vexing EDA problems and outlines the challenges ahead. By harnessing the power of agentic AI, we can accelerate the design process, reduce manual effort, and optimize outcomes, meeting the growing demands of the industry.

Session 4A

(T4-B) AI for Hardware, Systems, and Verification
10:20-12:00 | Wednesday, January 21, 2026 | Snow White 1
Chair(s):
Ting-Jung Lin (Ningbo Institute of Digital Twin, EIT)
Heechun Park (UNIST)
4A-1
10:20-10:45

CoLoRA: A Collaborative Scheduling Framework for Multi-Tenant LoRA LLM Inference

*Zechao Lin, Xingbin Wang, Yiming Xie, Dan Meng, Rui Hou (Institute of Information Engineering, Chinese Academy of Sciences)
Keywords
Multi-Tenant LoRA LLM Inference, Adaptive Priority Scheduling, Adapter-Aware Scheduling, Load-Aware Batch Scheduling, Unified Scheduler
Abstract
Large Language Models (LLMs) incur substantial resource costs during inference, driving widespread interest in Parameter-Efficient Fine-Tuning (PEFT) techniques. Among these, LoRA dramatically reduces overhead by updating only a few low-rank adapters. However, multi-tenant LoRA LLM inference faces challenges from heterogeneous requests and latency-throughput trade-offs. Moreover, inefficient adapter reuse, poor cache management, and non-adaptive batching strategies severely restrict inference efficiency, service quality, resource utilization, and fairness. To address these challenges, we propose CoLoRA—a collaborative scheduling framework for multi-tenant LoRA LLM inference, comprising four core modules: (1) Adaptive Priority Scheduling (APS), which dynamically integrates queue waiting time, adapter residency status, and SLA urgency to compute task priorities; (2) Adapter-Aware Scheduling (AAS), which enhances cache management by prioritizing SLA-critical, frequently used, and fairly shared adapters, thus reducing cold-start latency and fragmentation; (3) Load-Aware Batch Scheduling (LBS), which combines real-time GPU utilization and queue depth to adaptively form batches and coalesce tasks targeting the same adapter, thereby improving parallelism while controlling latency; and (4) a Unified Scheduler (US), which periodically gathers system metadata to orchestrate the submodules collaboratively and employs a feedback loop to optimize global strategies online. Evaluation on realistic multi-tenant workloads and popular open-source LLMs shows that CoLoRA, compared to conventional baselines, increases overall system throughput by 56.5%, reduces P95 latency of online requests by 34%, and significantly enhances GPU utilization and tenant-level fairness, demonstrating its promise for large-scale inference services.
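As a rough illustration of such priority scheduling, the Python sketch below combines waiting time, adapter residency, and SLA slack into a single score and dispatches the most urgent request first; the weights and formula are invented for illustration and are not CoLoRA's.

```python
# Hypothetical sketch of the kind of priority score an adaptive scheduler
# might compute from waiting time, adapter residency, and SLA urgency (the
# weights and formula here are made up for illustration, not CoLoRA's).
import heapq, time

def priority(req, now, w_wait=1.0, w_resident=0.5, w_sla=2.0):
    wait = now - req["arrival"]                       # seconds spent in queue
    resident = 1.0 if req["adapter_cached"] else 0.0  # adapter already on GPU?
    slack = max(req["sla_deadline"] - now, 1e-3)      # time left before SLA miss
    return w_wait * wait + w_resident * resident + w_sla / slack

now = time.time()
requests = [
    {"id": "A", "arrival": now - 2.0, "adapter_cached": True,  "sla_deadline": now + 5.0},
    {"id": "B", "arrival": now - 0.5, "adapter_cached": False, "sla_deadline": now + 0.8},
]
# heapq is a min-heap, so push negative priorities to pop the most urgent first.
heap = [(-priority(r, now), r["id"]) for r in requests]
heapq.heapify(heap)
print("dispatch order:", [heapq.heappop(heap)[1] for _ in range(len(heap))])
```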
4A-2
10:45-11:10

AutoVeriFix: Automatically Correcting Errors and Enhancing Functional Correctness in LLM-Generated Verilog Code

*Yan Tan (The Hong Kong University of Science and Technology (Guangzhou)), Xiangchen Meng, Zijun Jiang, Yangdi Lyu (The Hong Kong University of Science and Technology (Guangzhou))
Keywords
LLM-Generated Verilog, Functional Correctness, Automated Testing
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in generating software code for high-level languages like Python and C++. However, their application to hardware description languages (HDLs), such as Verilog, is challenging due to the scarcity of high-quality training data. Current approaches to Verilog code generation with LLMs often focus on syntactic correctness, resulting in code with functional errors. To address these challenges, we present AutoVeriFix, a novel Python-assisted two-stage framework designed to enhance the functional correctness of LLM-generated Verilog code. In the first stage, LLMs are employed to generate high-level Python reference models that define the intended circuit behavior. In the second stage, these Python models facilitate the creation of automated tests that guide the generation of Verilog RTL implementations. Simulation discrepancies between the reference model and the Verilog code are iteratively used to fix errors and improve the LLM-generated Verilog code, thereby improving functional accuracy and reliability. Experimental results demonstrate that our approach significantly outperforms existing state-of-the-art methods in improving the functional correctness of Verilog code.
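The Python sketch below conveys the reference-model idea in miniature: a golden model of a saturating 8-bit adder is checked against a placeholder `simulate_rtl` function over random vectors; in the actual framework that placeholder would be an RTL simulation of the LLM-generated Verilog.

```python
# Tiny sketch of the "Python reference model checks the Verilog" idea: a
# golden model of a saturating 8-bit adder and a comparison loop over test
# vectors.  `simulate_rtl` is a placeholder for an actual Verilog simulation
# run, which this sketch does not perform.
import random

def golden_sat_add8(a, b):
    """Reference behaviour: 8-bit add that saturates at 255."""
    return min(a + b, 255)

def simulate_rtl(a, b):
    # Placeholder: in a real flow this would invoke the simulator on the
    # LLM-generated RTL and read back the result.
    return min(a + b, 255)

mismatches = []
for _ in range(1000):
    a, b = random.randrange(256), random.randrange(256)
    if simulate_rtl(a, b) != golden_sat_add8(a, b):
        mismatches.append((a, b))
print("mismatching vectors:", mismatches[:5], "total:", len(mismatches))
```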
4A-3
11:10-11:35

HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases

*Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Niraj Chitla (University of Minnesota, Twin Cities), Zishen Wan (Georgia Institute of Technology), Shang Wu (Northwestern University), Yu Cao, Caiwen Ding, Yang Zhao (University of Minnesota, Twin Cities)
Keywords
Graph RAG, Hardware description language, Semantic code search
Abstract
Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both code graph view and hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks by a task-specific retrieval adaption. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. HDLxGraph improves search, debugging, and completion accuracy by 12.04%, 12.22%, 5.04% and 11.59%, 8.18%, 4.07% over similarity-based RAG and SOTA Graph RAG, respectively. The code of HDLxGraph and HDLSearch benchmark are available at https://anonymous.4open.science/r/HDLxGraph-87CC/.
4A-4
11:35-12:00

Chat-A^2: An LLM-aided Design Space Exploration Framework for High-Performance CPU Design

*Zhantong Zhu, Kangbo Bai, Tianyu Jia (Peking University)
Keywords
Design space exploration, High-performance CPU, LLM-aided methodology, CPU microarchitecture
Abstract
Multi-objective design space exploration (DSE) for complex high-performance CPUs presents a significant challenge due to extensive parameter ranges and the vastness of the design space. Prior works either overlook microarchitectural information during DSE or necessitate intricate analysis for power, performance and area (PPA) evaluations. In this work, we present a novel DSE methodology for CPU, which leverages large language models (LLMs) as an assistive tool to accelerate and automate the DSE flow. We develop an LLM-aided architecture-oriented DSE framework, i.e. Chat-A^2 with satisfactory effectiveness and flexibility. Chat-A^2 is able to quickly explore 98% of the Pareto hypervolume covered by the real Pareto optimal set without utilizing pre-sampled datasets or complex, customized analytical mechanisms. Experiments on an open-source high-performance RISC-V XiangShan CPU conclude that Chat-A^2 can obtain up to 19% normalized performance improvement while achieving 21.3% normalized area reduction compared to default CPU architecture optimized by professional human architects.

Session 4B

(T3-D) Accelerating LLMs with Near- and In-Memory Computing
10:20-12:00 | Wednesday, January 21, 2026 | Snow White 2
Chair(s):
Shanshi Huang (The Hong Kong University of Science and Technology (Guangzhou))
Shanlin Xiao (Sun Yat-sen University)
4B-1
10:20-10:45

BitROM: Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference

*Wenlun Zhang (Keio University), Xinyu Li (Nanjing University), Shimpei Ando, Kentaro Yoshioka (Keio University)
Keywords
Compute-in-Memory, Read-Only-Memory, BitNet, eDRAM, KV-Cache, Large Language Model
Abstract
Compute-in-Read-Only-Memory (CiROM) accelerators offer outstanding energy efficiency for CNNs by eliminating runtime weight updates. However, their scalability to Large Language Models (LLMs) is fundamentally constrained by their vast parameter sizes. Notably, LLaMA-7B—the smallest model in LLaMA series—demands more than 1,000 cm2 of silicon area even in advanced CMOS nodes. This paper presents BitROM, the first CiROM-based accelerator that overcomes this limitation through co-design with BitNet’s 1.58-bit quantization model, enabling practical and efficient LLM inference at the edge. BitROM introduces three key innovations: 1) a novel Bidirectional ROM Array that stores two ternary weights per transistor; 2) a Tri-Mode Local Accumulator optimized for ternary-weight computations; and 3) an integrated Decode-Refresh (DR) eDRAM that supports on-die KV-cache management, significantly reducing external memory access during decoding. In addition, BitROM integrates LoRA-based adapters to enable efficient transfer learning across various downstream tasks. Evaluated in 65nm CMOS, BitROM achieves 20.8 TOPS/W and a bit density of 4,967 kB/mm2—offering a 10x improvement in area efficiency over prior digital CiROM designs. Moreover, the DR eDRAM contributes to a 43.6% reduction in external DRAM access, further enhancing deployment efficiency for LLMs in edge applications.
4B-2
10:45-11:10

BALANCE: Bit and Layer-Aware Lightweight ECC Design Method for In-Flash Computing Based LLM Inference Accelerator

*Changwei Yan, Lishuo Deng, Mingbo Hao, Cai Li, Weiwei Shan (Southeast University)
Keywords
In-Flash Computing, Large Language Model Accelerator, Error Correction Code, NAND flash errors
Abstract
The shift of Large Language Model (LLM) inference to edge devices demands efficient hardware solutions that overcome memory, power, and computational constraints. In-Flash Computing (IFC) using NAND flash provides an energy-efficient alternative but requires robust Error Correction Codes (ECC), severely limiting logic area and power budgets. This paper presents BALANCE, a lightweight ECC co-design strategy tailored for IFC-based LLM accelerators. BALANCE introduces (1) a fine-grained bit-level significance evaluation, (2) a layer-level sensitivity assessment leveraging cosine similarity of residual connections, and (3) a stratified ECC allocation algorithm optimizing flash memory placement. Experimental evaluations on LLaMA3-8B demonstrate that, compared to state-of-the-art solutions Lincoln and AiF, BALANCE reduces ECC overhead, achieving area savings of 275x and 62x, and power reductions of 899x and 321x, respectively. This is accomplished while delivering 6% and 5.7% higher code rates and maintaining accuracy degradation within 0.5%. To our knowledge, BALANCE is the first work to systematically integrate LLM error sensitivity with NAND flash physics, enabling a new class of highly efficient and powerful IFC accelerators for the edge.
4B-3
11:10-11:35

BLADE: Boosting LLM Decoding's Communication Efficiency in DRAM-based PIM

*Yilong Zhao, Fangxin Liu, Zongwu Wang, Mingjian Li (Shanghai Jiao Tong University), Mingxing Zhang (Tsinghua University), Chixiao Chen (Fudan University), Li Jiang (Shanghai Jiao Tong University)
Keywords
Processing-in-Memory (PIM), Large Language Models (LLMs), Dynamic Parallelism, Transpose
Abstract
In recent years, the application of Large Language Models (LLMs) has grown rapidly. LLM inference consists of two stages: the prefill stage and the decoding stage. The prefill stage benefits from high data reuse, allowing GPUs to efficiently utilize computational resources. In contrast, the decoding stage is memory-bound and is more suited for Processing-in-Memory (PIM) techniques. PIM integrates computation units into memory banks to optimize the usage of internal memory bandwidth. However, the limited external bandwidth of PIM creates bottlenecks in two ways. First, PIM systems require high parallelism to fully utilize internal bandwidth, resulting in significant bank-to-bank communication. Second, the value cache must be arranged contiguously along the sequence length dimension to maximize DRAM row-buffer hits, which introduces additional transpose overhead. In this work, we propose BLADE, a novel PIM-based architecture designed to accelerate LLM decoding. First, we introduce a task division strategy for multi-head attention (MHA) layers and dynamic PIM parallelism scaling to optimize the balance between computation and communication time. This approach adapts to the increasing sequence length during the decoding process. Second, we leverage the differing DRAM access granularities of CPUs and PIM units to automatically arrange the transposed matrix contiguously in DRAM rows during value cache transfers. Extensive experiments demonstrate that our architecture can significantly reduce the communication overhead and achieve a 105.7x speedup and 41.6x energy efficiency compared to the GPU baseline.
4B-4
11:35-12:00

SpAct-NDP: Efficient LLM Inference via Sparse Activation on NDP-GPU Heterogeneous Architecture

*Jiaming Xu (Shanghai Jiao Tong University), Tongxin Xie (Tsinghua University), Yongkang Zhou, Jinhao Li, Yaoxiu Lian (Shanghai Jiao Tong University), Zhenhua Zhu, Yu Wang (Tsinghua University), Guohao Dai (Shanghai Jiao Tong University)
Keywords
Near-data Processing, LLM, GPU
Abstract
Sparse activation is caused by the activation function (e.g., ReLU) in the feed-forward network (FFN) of large language models (LLMs), and has recently emerged as a promising method for accelerating LLM inference in resource-constrained scenarios by effectively reducing computational workload and memory requirements with >80% predicted dynamic sparsity. In this paper, we identify that heavy and dynamic data transfer is the primary reason for the significant synchronization and poor GPU utilization during the decoding phase of LLM inference with sparse activation, and propose to apply the near-data-processing (NDP) architecture to handle the dynamic sparse activation, while addressing three critical challenges for further NDP-GPU collaboration optimization: (1) under-utilization of DRAM bandwidth during memory access of NDP; (2) workload imbalance across channels during computation of NDP; (3) time-consuming parsing of the predicted sparse pattern during NDP-GPU collaboration. To tackle the above challenges, we present SpAct-NDP, an NDP-GPU heterogeneous architecture for efficient LLM inference with sparse activation. (1) For memory access during NDP, we design a sparsity-aware weight mapping strategy that considers the characteristics of sparse activation to improve DRAM bandwidth utilization by balancing bank workload and eliminating redundant memory access. (2) For computation during NDP, we propose a two-level heuristic scheduling system to achieve channel-wise workload balance. (3) For NDP-GPU collaboration, we point out that parsing the predicted sparse pattern is better suited to GPUs with high parallelism and propose a request-weight pair parsing mechanism based on the input requests and sparse pattern on the GPU, reducing execution time by ∼3x and memory usage by ∼9x. Experiments show that SpAct-NDP achieves up to 2.17x and 1.92x end-to-end speedup and 1.53x and 1.45x energy efficiency compared with the SOTA software frameworks for LLMs with sparse activation on NVIDIA RTX 3090 and NVIDIA Tesla A100.

Session 4C

(T11-C) Tackling Reliability Issues across the Layers
10:20-11:35 | Wednesday, January 21, 2026 | Snow White 3
Chair(s):
Michihiro Shintani (Kyoto Institute of Technology)
Yutaka Masuda (Nagoya University)
4C-1
10:20-10:45

Quantifying Compiler-induced Reliability Loss in Software-Implemented Hardware Fault Tolerance

*Davide Baroffio (Politecnico di Milano), Johannes Geier (Technical University of Munich), Federico Reghenzani (Politecnico di Milano), Ulf Schlichtmann (Technical University of Munich), William Fornaciari (Politecnico di Milano)
Keywords
SIHFT, reliability, optimizations, RTL
Abstract
Compiler mechanisms for Software-Implemented Hardware Fault Tolerance (SIHFT) offer a cost-effective solution for reliability, paving the way towards the adoption of Commercial Off-The-Shelf (COTS) components in safety-critical environments. However, default compiler optimizations can remove the SIHFT-induced redundancy and checks. For this reason, the use of compiler optimizations has been discouraged in the literature. This article presents a comprehensive study of the reliability degradation introduced by LLVM's O2 optimization pipeline when using a state-of-the-art SIHFT tool. We quantify, via RTL fault injection, the impact of O2 at different optimization stages, identifying an increase in data corruption rate of up to 48x. We also propose a static exploration methodology to identify the LLVM passes that harm reliability. We then remove these harmful passes from the optimization pipeline, demonstrating how to tune optimization pipelines to make SIHFT successful even in the presence of compiler optimizations.
4C-2
10:45-11:10

WARP: Workload-Aware Reference Prediction for Reliable Multi-Bit FeFET Readout under Charge-Trapping Degradation

*Dhruv Thapar, Ashish Reddy Bommana, Arjun Chaudhuri (Arizona State University), Kai Ni (University of Notre Dame), Krishnendu Chakrabarty (Arizona State University)
Keywords
Ferroelectric FET (FeFET), Multi-level Cell, Read Reliability, Charge-Trapping, In-field Monitoring
Abstract
Ferroelectric FET (FeFET)-based arrays are promising candidates for energy-efficient, high-density non-volatile memory in data-intensive applications. However, charge-trapping-induced degradation and process variations pose significant reliability challenges. These effects lead to reduced memory window and degraded read accuracy over time. We propose a workload-aware degradation modeling and readout framework for FeFET arrays. First, we select a small set of representative workloads to efficiently capture degradation trends across a large workload space. We apply a two-step method to reduce read error: (a) adjust intermediate state currents to widen the separation between states; (b) select optimum reference thresholds based on the shifted distributions. Next, we perform detailed trade-off analysis involving degradation improvement, on-chip area, and the overhead of a memory-mapped CPU polling system for in-field workload tracking. Our framework improves read reliability with minimum hardware overhead and enables scalable in-field monitoring for future FeFET-based systems.
4C-3
11:10-11:35

PV-ReCAM: Process Variation-Aware Testing for ReRAM-based Content Addressable Memory

*Haneen G. Hezayyin, Mahta Mayahinia, Mehdi Tahoori (Karlsruhe Institute of Technology)
Keywords
Computation-in-Memory, Non-volatile memories, Redox-based RAM, Process variations, Testing
Abstract
Computation-in-Memory (CiM) is a promising solution to reduce the energy and latency caused by frequent data transfers between the processor and memory, a problem commonly referred to as the memory wall. This problem becomes even more serious in data-intensive applications, where comparing binary patterns to measure similarity is a common and demanding task. This can be implemented efficiently using Content Addressable Memory (CAM), which is well-suited for CiM-based acceleration of such tasks. To improve energy efficiency and performance, nonvolatile memories (NVM) such as ReRAM (Redox-based RAM) can be utilized for the realization of CiM-based CAM. However, ReRAM is highly susceptible to process variations (PV), due to the immaturity of its process and inherent stochasticity. Moreover, the analog realization of CAM functionality using NVMs makes it more sensitive to these non-idealities. Conventional March tests, originally designed for memory fault detection, become ineffective in the presence of PV, which can alter ReCAM behavior and lead to test escapes. To address these challenges, this work systematically analyzes the impact of PV on ReCAM functionality. It proposes a generalized PV-aware March test that optimizes test patterns for both hard defects and PV-induced soft defects, achieving 100% defect coverage.

Session 4D

(SS-1a) Mixed-precision: Silicon-to-Model Turbo Knob
10:20-11:35 | Wednesday, January 21, 2026 | Sleeping Beauty 1/2
Chair:
Li Jiang (Shanghai Jiao Tong University & Shanghai Qi Zhi Institute)
4D-1
10:20-10:45

When Posit Meets Microscaling: Energy Efficient Posit-Based Processing Element for Edge AI Computation

Yulin Wang (Ocean University of China), Seok-Bum Ko (University of Saskatchewan), Qi Wen (Ocean University of China; Shandong Key Laboratory of Intelligent Sensing Chips and System), Zhiqiang Wei (Ocean University of China; Qingdao University), *Hao Zhang (Ocean University of China; Shandong Key Laboratory of Intelligent Sensing Chips and System)
Keywords
Posit arithmetic, microscaling (MX), dot-product operation, AI accelerator, edge computation
Abstract
Low-precision computation is an effective method to improve energy efficiency when processing AI models at the edge. The design of the numeric format is important to maintain good accuracy while reducing energy consumption. However, current fixed-point or floating-point based formats are limited in either representation range or precision, and thus efficiency or accuracy is compromised. Posit formats can achieve both a large dynamic range and high precision; however, their computation overhead is too high. Inspired by the recent microscaling format, in this paper, a novel microscaling posit format and its corresponding dot-product based processing element are proposed. By designing a specific format, the dot-product computation overhead of the original posit format is significantly reduced. Implementation results show that the proposed processing element can achieve up to 79% area reduction and 74% power reduction when compared with other designs available in the literature, which makes the proposed designs especially suitable for edge AI computation.
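As a point of reference for the microscaling idea the abstract builds on, the sketch below shows a generic MX-style block dot product with a shared block scale and narrow integer elements. It is an assumption for illustration (the helpers mx_quantize and mx_dot are made up) and does not reproduce the paper's posit-based element format.

```python
# Minimal numpy sketch (assumption): a microscaling-style dot product in which
# each block shares one power-of-two scale and per-element payloads are narrow
# integers, so the inner MAC loop is pure integer arithmetic.
import numpy as np

def mx_quantize(block, elem_bits=8):
    """Quantize a block to (per-block scale, narrow integers)."""
    max_abs = np.max(np.abs(block)) + 1e-12
    scale = 2.0 ** np.ceil(np.log2(max_abs))        # shared power-of-two block exponent
    qmax = 2 ** (elem_bits - 1) - 1
    q = np.clip(np.round(block / scale * qmax), -qmax, qmax).astype(np.int32)
    return scale / qmax, q

def mx_dot(a_block, b_block, elem_bits=8):
    sa, qa = mx_quantize(a_block, elem_bits)
    sb, qb = mx_quantize(b_block, elem_bits)
    return (sa * sb) * int(np.dot(qa, qb))          # integer MACs, one scale multiply per block

a, b = np.random.randn(32), np.random.randn(32)
print(mx_dot(a, b), np.dot(a, b))                   # close, up to quantization error
```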
4D-2
10:45-11:10

When Low-Rank Meets Mixed-Precision: A Unified, Training-Free Framework for Efficient LLM Compression

Junjie Wang (Shanghai Jiao Tong University; Shanghai Qi Zhi Institute; Northeastern University at Qinhuangdao), *Fangxin Liu (Shanghai Jiao Tong University; Shanghai Qi Zhi Institute), Jinqi Zhu (Northeastern University at Qinhuangdao), Chenyang Guan (Shanghai Jiao Tong University), Tao Yang (Huawei Technologies Ltd.), Li Jiang (Shanghai Jiao Tong University; Shanghai Qi Zhi Institute), Haibing Guan (Shanghai Jiao Tong University)
Keywords
Low-precision, model compression, LLM acceleration
Abstract
The rapid growth of Large Language Models (LLMs) raises significant challenges for deployment in resource-constrained environments. Existing compression approaches, such as low-rank decomposition and quantization, are typically applied independently, which limits their effectiveness and fails to exploit their complementarity. To address this issue, we present a training-free framework for joint compression that integrates low-rank decomposition with mixed-precision quantization. We formulate the allocation of layer-wise rank and bit-width as a combinatorial optimization problem, guided by an input-aware sensitivity metric to allocate resources where they yield the highest accuracy retention. We further develop a sample-aware low-rank decomposition scheme with theoretical guarantees, and introduce a unified difference matrix to mitigate the coupled errors from structural approximation and quantization. Extensive experiments on diverse LLM architectures and datasets demonstrate that our method achieves state-of-the-art compression, reducing model size to 20% of the original while preserving inference accuracy. The code is available at https://github.com/zzzzzjq0126/HALO.git
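The joint scheme the abstract describes rests on combining a truncated low-rank factor with a quantized residual. The sketch below is an illustrative assumption, not the authors' HALO code; the helpers low_rank, quantize, and compress are hypothetical. It shows the basic form W ≈ UV + Q(W − UV), with rank and bit-width as the per-layer knobs a sensitivity-guided allocator would tune.

```python
# Minimal numpy sketch (assumption): joint low-rank + quantized-residual
# compression of one weight matrix.
import numpy as np

def low_rank(w, rank):
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]            # U (m x r), V (r x n)

def quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1
    step = np.max(np.abs(x)) / qmax + 1e-12
    return np.clip(np.round(x / step), -qmax, qmax) * step  # fake-quantized residual

def compress(w, rank=8, bits=4):
    u, v = low_rank(w, rank)
    residual = w - u @ v
    return u @ v + quantize(residual, bits)

w = np.random.randn(64, 64)
w_hat = compress(w)
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))    # relative reconstruction error
```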
4D-3
11:10-11:35

Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration

Stef Cuyckens, Xiaoling Yi, Robin Geens, Joren Dumoulin, Martin Wiesner, *Chao Fang, Marian Verhelst (ESAT-MICAS, KU Leuven)
Keywords
neural processing unit
Abstract
Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design both on MAC and system level and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W, respectively, for MXINT8, MXFP8/6, and MXFP4, with a throughput of 64, 256, and 512 GOPS.

Session 4E

(T10-B) Explainable and Generative AI for Lithography and Yield Optimization
10:20-12:00 | Wednesday, January 21, 2026 | Sleeping Beauty 3
Chair(s):
Yuzhe Ma (HKUST(GZ))
Binwu Zhu (Southeast University)
4E-1
10:20-10:45

Understand and Detect: Lithographic Hotspot Detection by the Interpretable Graph Attention Network

Andy Liu, Silin Chen (Nanjing University, Suzhou), Guohao Wang, Wenzheng Zhao (ZetaTech Co., Ltd., Shanghai), Yuxiang Fu, *Ningmu Zou (Nanjing University, Suzhou)
Keywords
Design for manufacturability, Lithography hotspot detection, Graph neural networks, Interpretable AI
Abstract
Lithography hotspot detection plays a crucial role in the design-for-manufacturing (DFM) process. Recent developments in machine learning have demonstrated significant advantages in improving feature extraction capabilities, computational efficiency, and reducing false alarms in hotspot detection. However, deep learning models remain black-box approaches, with the interpretability challenge yet to be addressed. The topological features of the local patterns causing hotspot classification results also remain unknown. In this paper, we propose the first interpretable GNN framework for lithography hotspot detection, which achieves both high detection accuracy and precise hotspot localization within the layout. Our framework maps the geometric structure of layouts into graph representation. Then, we introduce a novel graph attention network (GAT) framework, encoding local topological features through attention queries on neighbors. Additionally, a novel graph interpretability method is designed by leveraging latent variables in edge distributions and subgraphs optimization, enabling the extraction of local topological features and providing detailed explanations of hotspot localization. Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance on the ICCAD-2012 and ICCAD-2019 benchmarks. Moreover, we validate the interpretability of our GNN model on the ICCAD-2016 benchmark, accurately identifying hotspot locations within the lithographic design.
4E-2
10:45-11:10

Beyond Labels: Data-Efficient Wafer Yield Prediction with TabESA

*Pang Guo, Yining Chen (Zhejiang University)
Keywords
Wafer Yield Prediction, Semi-Supervised Learning, Smart manufacturing
Abstract
Accurate wafer yield prediction is vital for design-for-manufacturability and yield optimization in semiconductor production, enabling early defect detection and proactive process control. However, existing methods are constrained by their heavy dependence on large quantities of yield-labeled data—incurring high costs and limiting scalability. Meanwhile, vast amounts of unlabeled wafer test data remain untapped. We present TabESA (Tabular Enhanced Semi-supervised Architecture), a novel two-stage AI framework tailored for manufacturing-aware yield prediction with minimal supervision. In Stage 1, TabESA employs dual self-supervised learning tasks to uncover intrinsic patterns in unlabeled tabular data. In Stage 2, it introduces a consistency-based semi-supervised training scheme that integrates labeled and unlabeled samples to boost prediction robustness. Tested on real-world manufacturing datasets, TabESA achieves over 0.95 in accuracy, precision, F1-score, and AUC using only 128 labeled samples. It surpasses conventional supervised models by 15.7% in F1-score and outperforms state-of-the-art semi-supervised techniques by 19.1% in AUC. By leveraging unlabeled process data for yield estimation, TabESA provides a label-efficient, scalable, and industry-relevant solution for smart semiconductor manufacturing.
4E-3
11:10-11:35

Code, Not Canvas: Multi-Agent Layout Generation Beyond Vision Models

*Haoyu Yang, Haoxing Ren (NVIDIA)
Keywords
Technology Development, DFM, Test Layout Generation, LLM, Multi-AI Agent
Abstract
Rule-constrained chip layout generation is crucial in the semiconductor manufacturing industry, providing significant resources for technology development and data-driven methodologies. Generative AI has become the mainstream solution for layout generation, backboned by GAN, ViT, or Diffusion models. These methods leverage a two-phase flow comprising squish topology generation and DRC-aware geometry filling. However, the vision-model-based approach is sub-optimal and lacks controllability of the generated layouts. To address this, we reformulate layout generation as a coding problem, which offers the user maximal controllability of layout generation and bridges the gap between vision-based geometry generation and the coding capability of LLMs. Specifically, we deliver a multi-agent framework that writes Python code to create diverse GDSII layouts following given design rule constraints. We demonstrate superior performance for layout generation over SOTA vision-based models, GPT-4o, and Cursor in terms of DRC-clean pattern diversity.
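To make the "layout as code" reformulation concrete, the sketch below writes a few rule-compliant rectangles to GDSII. It assumes the open-source gdstk package and made-up toy design-rule constants (MIN_WIDTH, MIN_SPACE); it is not output of the paper's multi-agent framework.

```python
# Minimal sketch (assumption): programmatic GDSII generation with the open-source
# gdstk package, illustrating the "layout as code" idea behind the paper.
import gdstk

MIN_WIDTH, MIN_SPACE = 0.05, 0.05                       # toy design-rule values (assumed)

lib = gdstk.Library("toy_patterns")
cell = lib.new_cell("TOP")
x = 0.0
for i in range(4):                                      # a row of rule-compliant rectangles
    width = MIN_WIDTH * (i + 1)
    cell.add(gdstk.rectangle((x, 0.0), (x + width, 1.0), layer=1))
    x += width + MIN_SPACE                              # respect minimum spacing
lib.write_gds("toy_patterns.gds")
```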
4E-4
11:35-12:00

Integrated Re-Fragmentation and Curve Correction for Curvilinear Optical Proximity Correction

*Seohyun Kim, Shilong Zhang, Junha Jang, Youngsoo Shin (Korea Advanced Institute of Science and Technology)
Keywords
Curvilinear OPC, Mask fragmentation, Bayesian optimization
Abstract
Curvilinear optical proximity correction (OPC) treats segments as curves rather than lines, and offers improved correction accuracy. Once a set of segments is identified through fragmentation, it remains the same throughout OPC iterations, limiting OPC performance in runtime and accuracy. We propose an additional step of re-fragmentation that can be integrated with curve correction inside the OPC iteration loop. Two machine learning (ML) models are applied for quick re-fragmentation: (1) a U-Net is used to detect the critical segments, and (2) an MLP identifies the point along the critical curve where the curve should be divided. The reference samples to train the MLP are generated through Bayesian optimization. Compared to standard OPC, the proposed method yields a substantial reduction in OPC iterations (from 20 to 15) and runtime (from 188s to 152s), on average over test clips, when a target vertex placement error (VPE) is given. If the number of OPC iterations is fixed (to 25), the proposed method yields an average VPE of 1.12nm, much smaller than the 1.65nm achievable through standard OPC.

Session 4F

(T6-B) Smart Techniques for Analog & Mixed-Signal Design
10:20-11:35 | Wednesday, January 21, 2026 | Sleeping Beauty 5
Chair(s):
Fan Yang (Fudan University)
Pingqiang Zhou (ShanghaiTech University)
4F-1
10:20-10:45

MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation

*Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad (University of Utah), Ali Ghazizadeh Ghalati (University of Michigan), Morteza Fayazi (University of Utah)
Keywords
Circuit design automation, LLM, multimodal design assistant, RAG, agentic workflow, Reasoning and Act (ReAct) framework, database updating, scalable, open-source
Abstract
Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives (e.g. power consumption) make this task significantly challenging. Traditional manual search methods are inefficient, time-consuming, and lack the reasoning capabilities required for synthesizing complex circuits. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG-250, highlighting strong multimodal retrieval and citation accuracy. On Reas-100, it reaches 86.8% accuracy, demonstrating robust reasoning capabilities on complex design queries.
4F-2
10:45-11:10

MOSTAR: Multi-Stage Hierarchical Bayesian Optimization for Substructure-Aware High-Dimensional Analog Circuit Sizing

*Weijian Fan, Haoyi Zhang (Peking University), Weibin Lin (Shenzhen University), Runsheng Wang, Yibo Lin (Peking University)
Keywords
Bayesian Optimization, high-dimensional circuits, L2G-GNN, MOSTAR
Abstract
Analog circuit sizing is a critical challenge due to increasing circuit complexity and diverse performance requirements. Existing algorithms struggle with poor scalability in high-dimensional spaces and frequent convergence to local optima. To address these limitations, we propose MOSTAR, a multi-stage hierarchical Bayesian optimization framework that integrates a local-to-global GNN (L2G-GNN). L2G-GNN identifies circuit substructures and adds symmetric constraints to the circuit. MOSTAR employs additive Gaussian processes and a stage-adaptive constrained acquisition function to improve scalability in high-dimensional circuits. Furthermore, its dynamic search space adjustment strategy helps avoid local optima during optimization. Experiments show that our L2G-GNN achieves a substructure identification accuracy of 97.22%, and MOSTAR achieves average performance improvements of 2.16x on three basic circuits and 1.62x on two high-dimensional modular circuits, highlighting its efficacy in automating complex analog circuit sizing.
4F-3
11:10-11:35

Effective RC Reduction via Graph Sparsification for Accurate Post-Simulation of Mixed-Signal ICs

*Yibin Zhang, Zhiqiang Liu (Tsinghua University), Shan Shen (Nanjing University of Science and Technology), Chao Hu (EXCEEDA Inc.), Wenjian Yu (Tsinghua University)
Keywords
Mixed-Signal IC, Circuit Simulation, RC Reduction, Effective Resistance, Graph Spectral Sparsification
Abstract
Circuit simulation after parasitic extraction is crucial for designing sensitive mixed-signal integrated circuits (ICs). To speed up this time-consuming post-simulation task, a fast and effective RC reduction technique is demanded. In this work, we propose a node-elimination-plus-graph-sparsification framework to realize RC reduction for mixed-signal ICs. The proposed method combines the Time-Constant Equilibration Reduction (TICER) algorithm, efficient graph spectral sparsification, and a novel graph sparsification technique that keeps the most important off-tree edges, to separately sparsify the capacitance and resistance networks in a circuit. Experiments on realistic mixed-signal circuits have validated the efficiency and effectiveness of the proposed techniques, demonstrating a remarkable advantage over existing methods. The results show that the proposed method can bring up to nearly 7x speedup to the post-simulation while ensuring good accuracy of the critical performance metrics.

Invited Talk I

12:30-12:50 | Wednesday, January 21, 2026 | Cinderella Ballroom 1/6/7/8
Shulin Zeng
General Manager of Shanghai Company, Infinigence-AI
12:30-12:50
Invited Talk

The Creativity Revolution in the Age of Intelligent Agents

Biography
Dr. Shulin Zeng is a founding member of Infinigence-AI and serves as General Manager of its Shanghai company, leading the intelligent terminal business. He focuses on hardware-software co-optimization and AI accelerator design.
He received his B.Eng. (2018) and Ph.D. (2023) degrees from Tsinghua University, under Prof. Yu Wang. His first-author work won the FPGA 2025 Best Paper Award, marking the first Asia-Pacific team to receive this honor.
In industry, he led the development of an edge inference optimization engine achieving 3x end-to-end acceleration of large models on AI PCs, planned for deployment on 10M+ devices. He also proposed the world's first multimodal LPU IP, enabling single-FPGA inference of 7B models and text-to-video generation, delivering 4-6x energy-efficiency gains.
Abstract
The evolution of AI is shifting from model-centric intelligence toward agentic systems capable of autonomous reasoning and action. This talk examines how intelligent agents are transforming creativity into a scalable capability, reshaping how teams and organizations create value. Based on real-world production experience, it discusses the emerging infrastructure challenges of the Agentic AI era, including effectiveness, reliability, cost control, and closed-loop optimization. The talk highlights how Infinigence-AI's Agent Platform enables "Super Teams" to achieve outsized impact and supports the transition toward a scalable, agent-driven creative economy.

Session 5A

(T3-B) Emerging Memory Architectures and Their Applications
13:30-15:35 | Wednesday, January 21, 2026 | Snow White 1
Chair(s):
Yuhong Liang (Great Bay University)
Chun-Yi Lee (National Taiwan University)
5A-1
13:30-13:55

CADC: Crossbar-Aware Dendritic Convolution for Efficient In-memory Computing

*Shuai Dong, Junyi Yang, Ye Ke, Hongyang Shang, Arindam Basu (City University of Hong Kong)
Keywords
Dendritic computing, Convolution, In-memory computing, Crossbar, Sparsity
Abstract
Convolutional neural networks (CNNs) are computationally intensive and often accelerated using crossbar-based in-memory computing (IMC) architectures. However, large convolutional layers must be partitioned across multiple crossbars, generating numerous partial sums (psums) that require additional buffering, transfer, and accumulation, thus introducing significant system-level overhead. Inspired by dendritic computing principles from neuroscience, we propose crossbar-aware dendritic convolution (CADC), a novel approach that dramatically increases sparsity in psums by embedding a nonlinear dendritic function (zeroing negative values) directly within crossbar computations. Experimental results demonstrate that CADC significantly reduces psums, eliminating 80% in LeNet-5 on MNIST, 54% in ResNet-18 on CIFAR-10, 66% in VGG-16 on CIFAR-100, and up to 88% in spiking neural networks (SNN) on the DVS Gesture dataset. The induced sparsity from CADC provides two key benefits: (1) enabling zero-compression and zero-skipping, thus reducing buffer and transfer overhead by 29.3% and accumulation overhead by 47.9%; (2) minimizing ADC quantization noise accumulation, resulting in small accuracy degradation: only 0.01% for LeNet-5, 0.1% for ResNet-18, 0.5% for VGG-16, and 0.9% for SNN. Compared to vanilla convolution (vConv), CADC exhibits accuracy changes ranging from +0.11% to +0.19% for LeNet-5, -0.04% to -0.27% for ResNet-18, +0.99% to +1.60% for VGG-16, and -0.57% to +1.32% for SNN, across crossbar sizes from 64x64 to 256x256. Ultimately, an SRAM-based IMC implementation of CADC achieves 2.15 TOPS and 40.8 TOPS/W for ResNet-18 (4/2/4b), realizing an 11x-18x speedup and 1.9x-22.9x improvement in energy efficiency compared to existing IMC accelerators.
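A functional sketch of the dendritic idea follows: each crossbar-sized tile zeroes its negative partial sums before accumulation, which is where the reported psum sparsity comes from. The code is an assumption for illustration (the tile size and the helper dendritic_tiled_matvec are made up), not the CADC circuit itself.

```python
# Minimal numpy sketch (assumption): an im2col-lowered convolution computed as
# tiled matrix-vector products, where each crossbar-sized tile applies the
# dendritic nonlinearity max(psum, 0) in-array, so many partial sums become zero.
import numpy as np

def dendritic_tiled_matvec(weights, x, tile=64):
    """weights: [out, in] lowered conv weights; x: [in] lowered input patch."""
    out = np.zeros(weights.shape[0])
    zero_psums, total_psums = 0, 0
    for start in range(0, weights.shape[1], tile):       # one crossbar per input tile
        psum = np.maximum(weights[:, start:start + tile] @ x[start:start + tile], 0.0)
        zero_psums += int(np.sum(psum == 0.0))
        total_psums += psum.size
        out += psum                                      # only sparse, non-negative psums accumulate
    return out, zero_psums / total_psums

w, x = np.random.randn(128, 256), np.random.randn(256)
_, sparsity = dendritic_tiled_matvec(w, x)
print(f"zeroed partial sums: {sparsity:.0%}")            # roughly half for random data
```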
5A-2
13:55-14:20

M3DKV: Monolithic 3D Gain Cell Memory Enabled Efficient KV Cache & Processing

*Jiaqi Yang (Peking University), Yanbo Su (Tsinghua University), Yihan Fu (Peking University), Jianshi Tang (Tsinghua University), Bonan Yan (Peking University)
Keywords
Gain cell memory, Near-memory computing, KV cache, Hardware acceleration for large language models, Monolithic three-dimensional integration
Abstract
Transformer-based generative large language models (LLMs) have revolutionized natural language processing, yet their quadratic computational complexity growth with context length creates severe inference bottlenecks. While the LLM key-value cache (KV cache) enhances decoding efficiency, prolonged context processing inflicts frequent KV cache reloads that exacerbate memory bandwidth constraints. To address this hardware challenge, we propose M3DKV, a monolithic three-dimensional (3D) gain cell near-memory-computing accelerator featuring back-end-of-line (BEOL) cache layers for in-situ KV matrix buffering/computation and a front-end-of-line (FEOL) base layer for full self-attention operations. Through optimized 3D data organization, inter-layer dataflow management, and intelligent computation scheduling, our design achieves 0.29 TB/s/core on-die bandwidth while demonstrating 97.03x/268.01x speedup over GPU/CPU in decoding phases and 1.72x-262.16x better area efficiency per parameter versus state-of-the-art accelerators.
5A-3
14:20-14:45

MemSearch: An Efficient Memristive In-memory Search Engine with Configurable Similarity Measures

*Yingjie Yu, Houji Zhou, Tong Hu, Zhiwei Zhou (Huazhong University of Science and Technology), Jia Chen (AI Chip Center for Emerging Smart Systems (ACCESS), University of Science and Technology), Jiancong Li (Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology), Yi Li, Xiangshui Miao (Huazhong University of Science and Technology)
Keywords
In-memory computing, Vector database, MemSearch, Similarity measurement
Abstract
In-memory search has emerged as a promising solution for efficient discovery of nearest-neighbor vectors in general-purpose vector databases. However, the templated storage-in-array structure and the VMM-based computational form of in-memory search pose challenges in supporting generic distance computations. In this work, for the first time, we introduce a novel memristive in-memory similarity measure engine, MemSearch, for configurable distance calculations, including dot distance, ED, and CD. MemSearch highlights two aspects: data storage and distance computing. For data storage, we propose a Unified Similarity Element Mapping (USEM) scheme based on a pair array to accommodate various similarity calculations. For distance computing, we introduce a Reconfigurable Current Computing (RCC) circuit designed to process the multiple arithmetic rules in similarity calculations, with a slight increase of 4.4% and 9.9% in energy consumption for ED and CD, respectively. We have tested various datasets with different modalities, including images, voice, human activity, and text. Experimental results demonstrate that the MemSearch engine achieves improvements of 864x, 802x, and 1474x in energy efficiency over CMOS-based engines for dot distance, ED, and CD calculations, respectively. The MemSearch engine highlights its potential for future highly efficient general-purpose in-memory vector databases.
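For reference, the three similarity measures named above can be expressed functionally as below. This sketch is only an assumption for illustration (the search helper and the toy data are made up) and says nothing about the USEM mapping or RCC circuit that realize them in the memristive array.

```python
# Minimal numpy sketch (assumption): the three configurable similarity measures
# named in the abstract, expressed as a plain functional reference search.
import numpy as np

def search(query, database, metric="dot"):
    if metric == "dot":
        return int(np.argmax(database @ query))          # larger dot product = more similar
    if metric == "ed":                                   # Euclidean distance
        return int(np.argmin(np.linalg.norm(database - query, axis=1)))
    if metric == "cd":                                   # cosine distance
        sims = (database @ query) / (np.linalg.norm(database, axis=1)
                                     * np.linalg.norm(query) + 1e-12)
        return int(np.argmax(sims))
    raise ValueError(metric)

db = np.random.randn(1000, 64)
q = db[42] + 0.01 * np.random.randn(64)
print(search(q, db, "dot"), search(q, db, "ed"), search(q, db, "cd"))
```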
5A-4
14:45-15:10

SpANNS: Optimizing Approximate Nearest Neighbor Search for Sparse Vectors Using Near Memory Processing

Tianqi Zhang, Flavio Ponzina, *Tajana Rosing (UCSD)
Keywords
Approximate Nearest Neighbor Search, Near Memory Processing, CXL
Abstract
Approximate Nearest Neighbor Search (ANNS) is a fundamental operation in vector databases, enabling efficient similarity search in high-dimensional spaces. While dense ANNS has been optimized using specialized hardware accelerators, sparse ANNS remains limited by CPU-based implementations, hindering scalability. We propose SpANNS, a near-memory processing architecture for sparse ANNS. SpANNS combines a hybrid inverted index with efficient query management and runtime optimizations, achieving 15.2x to 21.6x faster execution over the state-of-the-art CPU baselines, offering scalable and efficient solutions for sparse vector search.
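The hybrid inverted index mentioned above builds on the classic sparse-vector inverted index. The sketch below is an illustrative assumption (build_index and query are hypothetical helpers) showing the CPU-style baseline structure, not the SpANNS near-memory design.

```python
# Minimal sketch (assumption): an inverted index for sparse-vector similarity
# search; each nonzero dimension maps to the postings of vectors containing it.
from collections import defaultdict

def build_index(vectors):
    """vectors: list of {dim: weight} sparse maps."""
    index = defaultdict(list)
    for vid, vec in enumerate(vectors):
        for dim, w in vec.items():
            index[dim].append((vid, w))                  # posting list per nonzero dimension
    return index

def query(index, q, top_k=3):
    scores = defaultdict(float)
    for dim, qw in q.items():                            # only shared nonzero dims contribute
        for vid, w in index.get(dim, []):
            scores[vid] += qw * w                        # accumulate sparse dot products
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

docs = [{0: 1.0, 5: 2.0}, {5: 1.5, 9: 0.5}, {2: 3.0}]
print(query(build_index(docs), {5: 1.0, 9: 2.0}))        # doc 1 scores highest here
```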
5A-5
15:10-15:35
Paper ID 2869

NBCache: An Efficient and Scalable Non-Blocking Cache for Coherent Multi-Chiplet Systems

*Zhirong Ye (Sun Yat-sen University), Yongchang Zhang (Tsinghua University), Peilin Wang, Tao Lu (Sun Yat-sen University), Zhaolin Li (Tsinghua University), Zhiyi Yu, Mingyu Wang (Sun Yat-sen University)
Keywords
Multi-chiplet, Memory-level parallelism, Non-blocking cache, Directory-based coherence, MSHR
Abstract
With the advancement of chiplet technology, a large number of processing cores can be modularly integrated to achieve enhanced parallelism, which introduces increased memory-level parallelism at the cost of unbalanced interconnect bandwidth. For recently proposed multi-chiplet systems, high cache bandwidth technologies and directory-based coherence have been widely employed to efficiently utilize limited inter-chiplet bandwidth while enhancing scalability in large-scale systems. Unfortunately, cache bandwidth improvement is limited by the current non-blocking cache, in which limited MSHR resources frequently result in cache blocking. Therefore, it is challenging to design an efficient cache architecture for directory-based coherence in multi-chiplet systems that minimizes the cache blocking time with reasonable overhead to leverage high memory-level parallelism. To address this problem, we propose NBCache, an efficient and scalable non-blocking cache architecture for directory-based coherence, in which the directory controller and MSHR are co-designed to support dynamic MSHR demands across heterogeneous workloads in multi-chiplet systems. Additionally, the modular design of NBCache enables it to adapt to different cache coherence protocols. Evaluation results show that NBCache achieves a geometric mean speedup of 1.30x over the prior design with comparable hardware overhead. Finally, NBCache performs close to the ideal unlimited-capacity MSHR with reasonable hardware overhead in different configurations of coherent multi-chiplet systems.

Session 5B

(SS-1b) From Uniform to Adaptive: The Precision-Scalable Computing Era for Edge Intelligence
13:30-14:45 | Wednesday, January 21, 2026 | Snow White 2
Chair:
Fangxin Liu (Shanghai Jiao Tong University)
5B-1
13:30-13:55

A Low-Power 12-lead Arrhythmia Detection SoC Featuring a Reconfigurable CNN and Mixed-Precision Computing

*Yuejun Zhang, Hanyu Shi, Qikang Li, Huihong Zhan, Xinyu Li, Qingxin Xie, Zhenkai Zhou (Ningbo University), Pengjun Wang (Wenzhou University)
Keywords
mixed-precision computing, reconfigurable convolutional neural networks, low-power, 12-lead ECG detection
Abstract
Cardiovascular diseases remain a leading global health threat, with arrhythmia being a key early indicator of cardiac abnormalities. The need for continuous cardiac monitoring has driven demand for portable, low-power arrhythmia detection systems. This paper presents a low-power mixed-precision System-on-Chip (SoC) solution designed for arrhythmia detection using 12-lead electrocardiogram (ECG) signals. The proposed approach employs a dynamically reconfigurable convolutional neural network (CNN) architecture with flexible hyperparameters, enhancing hardware adaptability while reducing resource overhead and power consumption. At the computation level, an 8-bit and 16-bit mixed-precision floating-point multiplier is introduced to effectively balance arithmetic accuracy and energy efficiency. Furthermore, clock gating and multi-threshold voltage techniques are employed at the digital back-end to further reduce the power consumption of the chip. Through system-level and module-level optimization, the proposed chip design is of great significance for enabling low-power arrhythmia detection in power-constrained portable medical devices.
5B-2
13:55-14:20

A Precision-Scalable Accelerator for Compressive Hyperspectral Image Reconstruction with a Lightweight DUN

*Shuo Zhang, Shengzhi Qiang, Wendong Mao, Zhongfeng Wang (Sun Yat-sen University)
Keywords
Precision-scalable, coded aperture snapshot, spectral imaging (CASSI), deep unfolding networks (DUNs)
Abstract
Hyperspectral images (HSIs) provide unparalleled spectral detail for material analysis across diverse fields, but their high data dimensionality challenges real-time processing. Coded Aperture Snapshot Spectral Imaging (CASSI) addresses this problem by compressing 3D spectral information into a single 2D snapshot, but it introduces an ill-posed reconstruction problem. In response, numerous effective methods employ Deep Unfolding Networks (DUNs), which combine data modules and prior modules to improve reconstruction quality. However, multi-head attention in prior modules causes significant storage and computational overhead, while redundant operations in data modules further reduce computational efficiency. To efficiently deploy DUNs on edge devices, this paper proposes an algorithm-hardware co-optimization framework for hyperspectral image reconstruction. First, a lightweight attention-free DUN prior module, Lightweight Spectral Prior (LSP), is designed. Second, a precision-scalable hardware architecture is developed to accelerate the DUN data module, which features a stage-wise bit-width allocation to enhance processing efficiency. Experimental results demonstrate that our algorithm achieves superior reconstruction quality compared with recent state-of-the-art methods. The hardware design for the data module is implemented on the Xilinx ZU19EG FPGA evaluation board. In FPGA-GPU heterogeneous system execution, we achieve up to 2.3x speedup in inference compared with GPU-only execution.
5B-3
14:20-14:45

Enhancing Trustworthiness Using Mixed Precision: Benchmarking, Opportunities and Challenges

Guanxi Lu, Hao (Mark) Chen, Zhiqiang Que, Wayne Luk, *Hongxiang Fan (Imperial College London)
Keywords
large language models, model quantization, mixed precision, model compression, natural language processing, low-bit inference, post-training quantization
Abstract
Large language models (LLMs) have shown promising performance across various tasks. However, their autoregressive decoding process poses significant challenges for efficient deployment on existing AI hardware. Quantization alleviates memory and compute pressure by compressing weights, activations, and KV caches to low precisions while preserving generation quality. However, existing quantization frameworks typically focus on perplexity or classification accuracy, often omitting critical trustworthiness metrics. This gap introduces risks when applying quantized LLMs to downstream high-stakes domains such as finance and healthcare. In this work, we systematically investigate the impact of quantization on four trustworthiness metrics (adversarial robustness, fairness, machine ethics, and out-of-distribution robustness) and identify the instability across compression ratios and quantization methods. Building on these observations, we develop a novel precision-ensemble voting approach that leverages predictions from mixed-precision variants of the same model and consistently improves performance by up to 5.8% on trustworthiness metrics. Our results highlight the importance of considering trustworthiness when developing model compression techniques and point to research opportunities at the intersection of compression and trustworthiness for safety-critical applications.
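The precision-ensemble voting idea can be stated in a few lines. The sketch below is an assumption for illustration (the variants list of predict callables stands in for quantized model instances) and is not the authors' implementation.

```python
# Minimal sketch (assumption): majority voting over predictions from several
# mixed-precision variants of the same model.
from collections import Counter

def precision_ensemble_vote(variants, prompt):
    votes = [predict(prompt) for predict in variants]    # one label per precision variant
    label, _ = Counter(votes).most_common(1)[0]          # majority decision
    return label

# Toy stand-ins for, e.g., INT4, INT8, and FP16 variants of one classifier.
variants = [lambda p: "safe", lambda p: "unsafe", lambda p: "safe"]
print(precision_ensemble_vote(variants, "example input"))  # -> "safe"
```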

Session 5C

(T12-C) System-Level Security and Secure Communication
13:30-15:35 | Wednesday, January 21, 2026 | Snow White 3
Chair(s):
Song Bian (Beihang University)
Danella Zhao (University of Arizona)
5C-1
13:30-13:55

GEmFuzz: Uncovering System-Level Vulnerabilities in SoCs via Emulation-Based Grey-Box Fuzzing

Shuvagata Saha, Ahmed Alhurubi, Tanvir Rahman, Hasan Al Shaikh, Sujan Kumar Saha, *Farimah Farahmandi, Mark Tehranipoor (University of Florida)
Keywords
System-on-Chip, Hardware Security, Emulation, Fuzzing, Security Verification
Abstract
Security verification of modern System-on-Chip (SoC) designs is becoming increasingly challenging due to the growing integration of third-party IPs and the complexity of hardware-software (HW/SW) interactions. This escalating complexity broadens the attack surface, leading to a higher number of potential vulnerabilities and longer detection times. Consequently, verification engineers face increasing pressure to ensure robust security within tight development schedules. Traditional techniques such as formal verification and information flow tracking often suffer from poor scalability, state space explosion, and significant manual effort, necessitating expert-level design knowledge. Fuzzing-based methodologies, while promising, typically rely on the availability of a golden reference model and struggle to scale effectively, which limits their applicability. Furthermore, the increasing intricacy of HW/SW stacks in modern SoCs introduces new classes of system-level vulnerabilities that remain largely unaddressed by existing approaches. To address these challenges, we propose GEmFuzz, a hardware emulation-based grey-box fuzzing framework for SoC security verification. GEmFuzz employs a hardware emulation server to execute the design under test (DUT) and leverages a cost-function-guided fuzzer to generate intelligent input patterns for vulnerability detection. We evaluate GEmFuzz on a RISC-V-based SoC and demonstrate its effectiveness in detecting a set of known system-level vulnerabilities. Additionally, it identifies two previously unknown vulnerabilities, highlighting the capability and promise of the proposed framework.
5C-2
13:55-14:20

Pack Defender: Proactive Defense Against Packet Attacks in NoCs Using an XGBoost-RNN Model

*Shengkai Hu (University of Southampton), Haoyu Wang (University of Oxford), Basel Halak, Boojoong Kang (University of Southampton)
Keywords
Hardware Security, Network-on-Chip (NoC), Multi-Processor System-on-Chip (SoC), Machine Learning
Abstract
The Network-on-Chip (NoC) serves as the critical communication backbone in modern Multi-Processor Systems-on-Chip (MPSoCs), particularly for Deep Learning (DL) hardware where it underpins the reliable execution of machine learning models by facilitating efficient data and weight exchange. However, the NoC is vulnerable to stealthy packet-based attacks initiated by malicious Intellectual Property (IP) cores. Such attacks can severely degrade NoC latency and throughput, which are critical for efficient DL inference, and even compromise the correctness of model execution. Current detection methods are inherently reactive; they identify anomalies by monitoring global system features only after an attack has manifested, lacking the foresight to anticipate impending threats. To address this limitation, we propose Pack Defender, a proactive NoC security framework based on temporal behavior modeling. Pack Defender accurately generates and forecasts future system states, enabling the identification of pre-attack warning signals. It also integrates the detection technique by reusing the partial prediction generative model for the first time, which eliminates the need for a separate detection module. Experimental results demonstrate that Pack Defender exhibits strong predictive power, yielding average/top-three similarities of 83%/92% for Source-Level Packet Dropping (SLPD) attacks and 90%/94% for In-Network Packet Diversion (INPD) attacks, respectively. The model's high accuracy is further validated by its low Mean Absolute Error (MAE), which was 0.05 for SLPD and 0.03 for INPD, confirming its ability to provide forward-looking security while maintaining high resource efficiency. The detection model (XGBoost) achieves 100% accuracy for both SLPD and INPD attacks, with recall rates of 96% and 99% respectively, significantly outperforming state-of-the-art methods that lack proactive prediction capabilities.
5C-3
14:20-14:45

Silentflow: Leveraging Trusted Execution for Resource-Limited MPC via Hardware-Algorithm Co-design

Zhuoran Li, Hanieh Totonchi Asl, Ebrahim Nouri (University of Arizona), Yifei Cai (Old Dominion University), *Danella Zhao (University of Arizona)
Keywords
Security & Privacy, Multiparty Computation, Trusted Execution Environment, FPGA acceleration
Abstract
Secure Multi-Party Computation (MPC) offers a practical foundation for privacy-preserving machine learning at the edge, with MPC commonly employed to support nonlinear operations. These MPC protocols fundamentally rely on Oblivious Transfer (OT), particularly Correlated OT (COT), to generate correlated randomness essential for secure computation. Although COT generation is efficient in conventional two-party settings with resource-rich participants, it becomes a critical bottleneck in real-world inference on resource-constrained devices (e.g., IoT sensors and wearables), due to both communication latency and limited computational capacity. To enable real-time secure inference, we introduce Silentflow, a highly efficient Trusted Execution Environment (TEE)-assisted protocol that eliminates communication in COT generation. We tackle the core performance bottleneck, low computational intensity, through structured algorithmic decomposition: kernel fusion for parallelism, Blocked On-chip eXpansion (BOX) to improve memory access patterns, and vectorized batch operations to maximize memory bandwidth utilization. Through design space exploration, we balance end-to-end latency and resource demands, achieving up to 39.51x speedup over state-of-the-art protocols. By offloading COT computations to a Zynq-7000 SoC, Silentflow accelerates PPMLaaS inference on the ImageNet dataset under resource constraints, achieving a 4.62x and 3.95x speedup over Cryptflow2 and Cheetah, respectively.
5C-4
14:45-15:10

ANIMo: Accelerating Nested Isolation with Monitor-free Domain Transition

*Yibin Xu, Han Wang, Yue Jin, Tianyi Huang, Tianyue Lu, Mingyu Chen (SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
Keywords
Hardware-based Security, Nested Isolation, Domain Transition
Abstract
Fine-grained domain isolation decomposes a user program into isolated domains, thus reducing the attack surface by preventing potential vulnerabilities from influencing other domains. However, as the monitor in the trusted domain handles domain transitions, the frequent permission verification and switching phases incur significant performance overheads, which prior studies have often overlooked. In this paper, we observe that the permissions of the child domain are inherently a subset of its parent domain. This relationship establishes a permission restriction that the hardware can quickly validate. Therefore, we designed an innovative monitor-free hardware mechanism named ANIMo, which enables an untrusted domain to directly manage permissions of its child domain under the hardware's supervision. As the hardware automatically performs permission verification and switching, we eliminate the associated overheads of interacting with the monitor during transitions between nested domains. We have implemented the prototype of ANIMo on the out-of-order superscalar RISC-V core XiangShan, with minimal hardware overhead (an extra 0.83% LUTs and 0.69% flip-flops) and evaluated it on an FPGA platform. Compared with state-of-the-art monitor-based approaches utilizing protection keys and bound registers, our ANIMo achieves speedups of 9.9x and 3.4x in domain transitions, respectively. Furthermore, we demonstrate the applicability of ANIMo by performing nested isolation over the sensitive module of NGINX.
5C-5
15:10-15:35

Reinforced Logic-Based Distributed Routing within Isolated Secure Zones

*Yogesh Verma, Mattis Hasler (Barkhausen Institut), Sebastian Haas, Friedrich Pauls (Barkhausen Institut)
Keywords
Logic-Based Distributed Routing, Secure Zone Routing, NoC Security, Trustworthy Design, Tiled Architecture
Abstract
Multiprocessor systems-on-chip (MPSoCs) have become the foundation of critical applications in AI, IoT, and autonomous systems. The Network-on-Chip (NoC) is the communication backbone of MPSoCs, enabling smooth and scalable communication between processing elements. Keeping it secure against evolving threats is imperative. Modern processor vulnerabilities expose MPSoCs to threats like timing side-channel attacks (TSCA) and denial-of-service (DoS) attacks, where attackers can infect IP blocks at runtime to target the NoC. This is especially concerning in shared-resource environments, where the lack of strict isolation between components enables wider propagation of such attacks. Routing within isolated secure zones can efficiently deal with such attacks. However, existing secure-zone-based approaches rely on routing tables, which become inefficient and unscalable as NoC sizes increase, necessitating an adaptive, logic-based approach. Although the latter offers better scalability, its coverage of complex topologies is limited. This paper presents a logic-based distributed routing strategy using the divide-and-conquer principle (LBDRdnc). It enhances the topological coverage of traditional LBDR methodologies by dividing irregular topologies into a combination of minimal-path sub-topologies. LBDRdnc is used to isolate sensitive data within secure zones, ensuring the system's integrity against inter-application attacks. Experimental results validate the efficacy of the proposed routing algorithm. Its performance was evaluated by comparison with a state-of-the-art routing-table-based approach, demonstrating improved performance, security, and scalability.

Session 5D

5D (Designer Forum 1) Toward Autonomous Chip Design: From Foundation Models to Agentic EDA
13:30-15:35 | Wednesday, January 21, 2026 | Sleeping Beauty 1/2
Chair(s):
Zhiyao Xie (The Hong Kong University of Science and Technology)
5D-1
13:30-13:55

Intelligent Chip Design with Agentic AI EDA

*James Gu (Cadence)
Biography
James Gu, Group Director of Digital Products, Digital and Verification R&D Department, Cadence Design Systems, Inc. He has worked in the chip design and electronic design automation industry for more than ten years. Currently, he is mainly responsible for deeply customizing the digital implementation process from RTL to GDS for customers in China at Cadence, to meet the needs of different products and industries, especially for chip design and tape-out under advanced processes for large customers. He has participated in the chip design and implementation of GPU/CPU for multiple important customers.
Abstract
To solve complex chip design challenges, we propose integrating agentic AI into the core of EDA systems. These AI agents can analyze design requirements, dynamically adjust design strategies across stages (from floorplanning and placement to clock tree synthesis and timing closure), and learn from historical design data to optimize PPA metrics in real time.
5D-2
13:55-14:20

The Al Transformation of Semiconductor EDA

*Muming Tang (Synopsys)
Biography
Muming Tang is a Senior AE Director at Synopsys. A 27-year IC design veteran, he brings a unique perspective from his prior role as a Senior Designer at Infineon. Now leading a team at Synopsys, he focuses on enabling customers in China to achieve success with advanced-node design implementation through flow optimization and high-performance IP integration.
Abstract
The intense competition in AI model development is fueling an unprecedented demand for advanced semiconductor chips. This surge, coupled with the critical need for shorter time-to-market, is fundamentally transforming EDA technologies and workflows. This presentation will outline the evolution of AI-driven EDA, demonstrate cutting-edge solutions and their performance in real-world designs, and conclude with a perspective on the future trajectory of EDA transformation.
5D-3
14:20-14:45

From Algorithmic Optimization to Autonomous Agents: Redefining EDA Workflows with AI

*Peng Zou (Shanghai LEDA Technology)
Biography
Peng Zou received his Ph.D. from Fudan University and is currently a Technical Expert at Shanghai LEDA Technology. He specializes in EDA placement and routing, spearheading PPA optimization for P&R tools. His current research focuses on the application of AI in EDA.
Abstract
As chip design scale and constraints intensify, traditional heuristic algorithms face significant efficiency bottlenecks in navigating huge search spaces. This talk traces the evolution of AI in EDA, transitioning from Reinforcement Learning-based design-space exploration to the LLM-driven Autonomous Agents. We will focus on how this evolution transcends the limitations of local optimization. By leveraging the reasoning, tool-use, and self-correction capabilities of AI Agents, we can achieve intelligent orchestration and closed-loop control of complex EDA toolchains, thereby defining the next generation of automated silicon design workflows.
5D-4
14:45-15:10

From Generation to Verified Synthesis: Bridging Industrial Reality via C-Guided Agents

*Min Li (Southeast University)
Biography
Min Li received the B.E. degree from the Department of Electronic Engineering, Shanghai Jiao Tong University in 2018, and the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong in 2023. He is currently a researcher in the School of Integrated Circuit, Southeast University. His research interests include hardware formal verification and circuit representation learning.
Abstract
Current Large Language Models (LLMs) often fall short of industrial reality, primarily due to ambiguous natural language specifications and the lack of formal correctness guarantees. In this talk, we demonstrate a multi-agent framework designed to bridge this gap by shifting from probabilistic generation to verified synthesis. We illustrate how leveraging software reference models (e.g., C/C++) as executable formal specifications can anchor the design process, utilizing static analysis for planning and formal equivalence checking for counterexample-guided debugging. Specifically, we present the efficient realization of complex, datapath-intensive modules, including IEEE-754 floating-point units and industrial Hifloat8 formats, proving that agentic workflows can ensure functional equivalence to golden models at scale.
5D-5
15:10-15:35

SLEG: A LLM-based SVA Evaluation and Generation System

*Tao Lin (Xepic Inc.)
Biography
Tao Lin is an R&D manager at Xepic Inc. He received a Ph.D. degree from the University of Cincinnati and currently focuses on formal methods and AI applications in formal verification.
Abstract
In the field of Electronic Design Automation (EDA), assertion-based validation is powerful and indispensable for hardware design verification. The auto-generation of SystemVerilog Assertions (SVA) is one of the promising applications of Large Language Models (LLMs). However, due to the inherent behavioral complexity of SVAs, evaluating the quality of generated SVAs poses a significant challenge. In this talk, we demonstrate a formal-checking-based evaluation system that bridges this gap. An exhaustive quantitative study reveals that the formal-proof loop improves LLM performance in real industrial settings. By incorporating formal methods, we can push the boundaries of autonomous code generation for LLMs in the realm of hardware design, with this expansion underpinned by the inherent confidence of formal verification.

Session 5E

(T9-D) Performance-Driven Floorplanning and Global Placement
13:30-15:35 | Wednesday, January 21, 2026 | Sleeping Beauty 3
Chair(s):
Wai Kei Mak (National Tsing Hua University)
Yi-Yu Liu (National Taiwan University of Science and Technology)
5E-1
13:30-13:55

MdpoPlanner: Mask-Driven Floorplan via Reinforcement Learning-Based Placement Order

Yue Wu, *Caiyu Chen, Xiaoyan Yang (Hangzhou Dianzi University)
Keywords
floorplanning, reinforcement learning, placement order, mask
Abstract
Floorplanning has long been a critical task in physical design due to the large search space. Recent approaches have made progress by leveraging reinforcement learning (RL) to guide the sequential placement of blocks on the chip canvas. However, these methods remain susceptible to suboptimal solutions, as the placement order is still determined by handcrafted heuristic rules rather than being jointly optimized with the placement decisions. This paper proposes a reinforcement learning based mask-driven floorplanner called MdpoPlanner, enabling joint optimization of placement sequence and spatial arrangement. Based on the connection between blocks and layout, an agent is trained to identify the optimal sequence of blocks for placement. Once a macro is selected, the proposed mask-driven floorplanner determines its optimal location by encoding spatial constraints and inter-macro connectivity into a dynamic mask, enabling legal placements while minimizing wire length and enhancing overall floorplan quality. Compared to the state-of-the-art floorplanner, MdpoPlanner yielded an average HPWL improvement of 42.61% on fixed-outline MCNC benchmarks and 6.75% on scaled-outline GSRC benchmarks.
5E-2
13:55-14:20

C3PO: Commercial-Quality Global Placement via Coherent, Concurrent Timing, Routability, and Wirelength Optimization

Yi-Chen Lu (Nvidia), Hao-Hsiang Hsiao (Georgia Institute of Technology), Rongjian Liang, *Wen-Hao Liu (Nvidia), Haoxing Ren (NVIDIA)
Keywords
Differentiable Timing Optimization, Differentiable Routability Optimization, GPU-accelerated Differentiable Multi-Objective Placement
Abstract
Despite achieving orders-of-magnitude runtime speedup, GPU-accelerated placers (GPU-Placers) still have extremely limited industrial adoption, largely due to the wide gaps in Power, Performance, and Area (PPA) metrics compared to those well-established CPU-centric commercial Physical Design (PD) tools. To overcome this issue, we introduce C3PO, the first commercial-quality, differentiable, multi-objective global placer that performs concurrent timing, routability, and wirelength optimization in a coherent manner with custom CUDA kernels. Particularly, we propose a convex-based framework that dynamically computes objective weights at each placement iteration by solving a quadratic problem, eliminating the need of manual parameter tuning. In the experiments, we rigorously validate C3PO with an industry-leading commercial PD tool and demonstrate that on 8 designs from TILOS [1] and IWLS [2] in ASAP 7nm [3], C3PO consistently outperforms the commercial tool by up to 16.7% in routed wirelength and 19.6% in switching power with complete full-flow validation.
5E-3
14:20-14:45

A Timing-Driven Hierarchical Macro Placement Framework for Large-Scale Complex IP Blocks

*Wei Fu, Lixin Chen, Jinghui Zhou, Ziran Zhu (Southeast University)
Keywords
physical design, macro placement, incremental refinement, timing
Abstract
Macro placement critically influences physical design quality, yet optimizing timing characteristics at this stage remains a challenging research frontier. In this paper, we propose a novel timing-driven macro placement framework comprising two major components: a global macro placement approach and an incremental macro placement refinement approach. Our global placement begins by applying Hier-RTLMP's multilevel autoclustering engine to group cells into clusters. Each cluster is abstracted as a density-inflated pseudo macro, and an analytical method is applied to generate the initial top-level layout. Then, intra-group macro placement is guided by dataflow-based virtual connection and solved via an integer-linear programming (ILP) approach. Each macro group (macros within the same cluster) is subsequently abstracted as a bounding box and placed via boundary-aware optimization followed by legalization. To further enhance macro placement quality, we develop an incremental refinement method applied after standard cell placement. It consists of two specialized strategies: array-constrained macro swapping and projected gradient descent (PGD)-based free macro shift, both targeting critical path slack improvement. We compare our framework with the state-of-the-art macro placer Hier-RTLMP on 15 benchmark suites from ChiPBench. For 7 designs employing both global macro placement and incremental refinement, our approach improves worst negative slack (WNS) and total negative slack (TNS) by 46.4% and 19.8%, respectively. For the remaining 8 designs that do not require incremental refinement, our approach improves the WNS and TNS by 24.5% and 28.7%, respectively.
5E-4
14:45-15:10

Accelerating Electrostatics-based Global Placement with Enhanced FFT Computation

Hangyu Zhang (University of Minnesota Twin Cities), *Sachin Sapatnekar (University of Minnesota)
Keywords
Electrostatics-based placement, FFT acceleration, Electrical field calculation, Potential calculation, Global placement
Abstract
Global placement is essential for high-quality and efficient circuit placement for complex modern VLSI designs. Recent advancements, such as electrostatics-based analytic placement, have improved scalability and solution quality. This work demonstrates that using an accelerated FFT technique, AccFFT, for electric field computation significantly reduces runtime. Experimental results on standard benchmarks show significant improvements when it is incorporated into the ePlace-MS and Pplace-MS algorithms, e.g., a 5.78x speedup in FFT computation and a 32% total runtime improvement against ePlace-MS, with a 1.0% reduction in scaled half-perimeter wirelength after detailed placement.
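Illustrative sketch (not from the paper): the FFT step being accelerated solves a Poisson equation that turns the cell-density map into a potential and an electric field. The simplified, periodic-boundary NumPy version below only illustrates that computation; ePlace-style placers actually use DCT/DST expansions, and AccFFT's implementation differs.

```python
# Simplified, periodic-boundary sketch of the FFT step in electrostatics-based
# placement: solve Poisson's equation for the potential of the cell-density map
# and differentiate in the frequency domain to obtain the electric field.
import numpy as np

def density_to_field(density, bin_size=1.0):
    n, m = density.shape
    kx = 2 * np.pi * np.fft.fftfreq(n, d=bin_size)
    ky = 2 * np.pi * np.fft.fftfreq(m, d=bin_size)
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0                                   # avoid division by zero at the DC term
    rho_hat = np.fft.fft2(density - density.mean())
    phi_hat = rho_hat / k2                           # potential: -laplacian(phi) = rho
    phi_hat[0, 0] = 0.0
    ex = np.real(np.fft.ifft2(-1j * KX * phi_hat))   # field E = -grad(phi)
    ey = np.real(np.fft.ifft2(-1j * KY * phi_hat))
    return ex, ey

ex, ey = density_to_field(np.random.rand(64, 64))    # toy 64x64 density map
```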
5E-5
15:10-15:35

Comprehensive Delay-Aware Net Weighting Framework for Timing-Driven Global Placement

*Lixin Chen, Keyu Peng, Jinghui Zhou, Hao Gu, Wei Fu (Southeast University), Shuting Cai (Guangdong University of Technology), Ziran Zhu (Southeast University)
Keywords
Net Weighting, Timing-Driven Global Placement, Timing Optimization, RC-Based Delay Estimation
Abstract
Timing optimization is a critical challenge in modern very large-scale integration (VLSI) design, where global placement plays a pivotal role in achieving timing closure. In this paper, we propose a comprehensive delay-aware net weighting framework for timing-driven global placement. Our framework systematically integrates a cell delay balancing factor, a net delay balancing factor, and a static timing analysis (STA)-based incremental timing factor to dynamically guide net weight adjustment throughout the global placement process. The cell delay balancing factor is computed by first analytically quantifying the drive strength of each cell to derive a reference wirelength for the net it drives, and then applying a sigmoid function to obtain a smooth weight adjustment, thereby promoting delay balance at the cell level. The net delay balancing factor leverages a lightweight RC-based delay estimation scheme to evaluate net delay efficiently, which is then used to balance the delays across different nets. Finally, the STA-based incremental timing factor targets critical nets identified by STA, with the aim of enhancing timing performance on critical paths. Experimental results on the ICCAD 2015 contest benchmarks show that our algorithm outperforms recent state-of-the-art timing-driven placers, achieving at least 40.4% and 8.6% average improvements in total negative slack (TNS) and worst negative slack (WNS), respectively, while maintaining comparable wirelength and runtime.
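Illustrative sketch (not from the paper): the exact sigmoid-based adjustment is not given in the abstract. A minimal assumed form, in which nets longer than their drive-strength-derived reference wirelength receive a smoothly increasing weight, might look as follows; all parameter names and values are assumptions.

```python
# Assumed, illustrative sigmoid-smoothed net weight; NOT the paper's exact formula.
import math

def cell_balance_weight(net_wl, ref_wl, alpha=1.0, tau=0.25):
    """net_wl: current net wirelength; ref_wl: reference wirelength derived
    from the driving cell's drive strength."""
    x = (net_wl - ref_wl) / (tau * ref_wl)        # normalized deviation from the reference
    return 1.0 + alpha / (1.0 + math.exp(-x))     # smooth adjustment in (1, 1 + alpha)

print(cell_balance_weight(net_wl=120.0, ref_wl=100.0))
```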

Session 5F

(T8-A) Advances in Logic Optimization and Technology Mapping
13:30-15:35 | Wednesday, January 21, 2026 | Sleeping Beauty 5
Chair(s):
Hongce Zhang (The Hong Kong University of Science and Technology (Guangzhou))
Yuanqing Cheng (Beihang University)
5F-1
13:30-13:55

DCLOG: Don't Cares-based Logic Optimization using Pre-training Graph Neural Networks

*Rongliang Fu, Libo Shen, Ziyi Wang (The Chinese University of Hong Kong), Zhengxing Lei (Xi'an Jiaotong University), Zixiao Wang (The Chinese University of Hong Kong), Junying Huang (Institute of Computing Technology, Chinese Academy of Sciences), Bei Yu, Tsung-Yi Ho (The Chinese University of Hong Kong)
Keywords
logic optimization, don't cares, graph neural network, majority-inverter graph
Abstract
Logic rewriting serves as a robust optimization technique that enhances Boolean networks by substituting small segments with more effective implementations. The incorporation of don't cares in this process often yields superior optimization results. Nevertheless, the calculation of don't cares within a Boolean network can be resource-intensive. Therefore, it is crucial to develop effective strategies that mitigate the computational costs associated with don't cares while simultaneously facilitating the exploration of improved optimization outcomes. To address these challenges, this paper proposes DCLOG, a don't cares-based logic optimization framework, to efficiently and effectively optimize a given Boolean network. DCLOG leverages a pre-trained graph neural network model to filter out cuts without don't cares and then performs an incremental window simulation to calculate don't cares for each cut. Experimental results demonstrate the effectiveness and efficiency of DCLOG on large Boolean networks, specifically average size reductions of 15.64% and 1.44% while requiring less than 23.84% and 44.70% of the average runtime compared with state-of-the-art methods for the majority-inverter graph (MIG), respectively.
5F-2
13:55-14:20

SOFA-H: Post-Synthesis Area Optimization via Functionally Encoded, Net-Driven Subgraph Mining and SAT-Based Hypercell Remapping

*Jimmy Y.-C. Lee, Yen-Ju Su (National Yang Ming Chiao Tung University), Jiun-Cheng Tsai (Mediatek), Aaron C.-W. Liang, Charles H.-P. Wen (National Yang Ming Chiao Tung University), Hsuan-Ming Huang (Mediatek)
Keywords
Post-synthesis optimization, area optimization, Frequent subgraph mining, circuit encoding, SAT
Abstract
Synthesized netlists often leave substantial room for area optimization due to the limited function diversity in standard cell libraries, which frequently results in recurring logic patterns that could be compacted through cell combination—referred to as hypercells in this work. While prior studies have demonstrated the potential of hypercell-based optimization, most lack efficient and scalable mining strategies. We present SOFA-H, a post-synthesis framework that extracts and remaps hypercells for maximum area reduction. SOFA-H (i) mines fanout-induced subgraphs and canonically encodes them using P-Representatives, (ii) selects an optimal set of hypercells with non-overlapping replacements via a one-shot weighted MaxSAT formulation, and (iii) supports high-input, multi-output cells with scalable runtime. Evaluated on the EPFL benchmark suite synthesized with FreePDK45 and ASAP7, SOFA-H achieves average area reductions of 12.2% and 7.4%, respectively, and runs 380x faster on average at ASAP7 compared to the state-of-the-art TeMACLE. These results demonstrate that the extracted hypercells offer a scalable and effective path to closing the area gap left by conventional synthesis.
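Illustrative sketch (not from the paper): the one-shot weighted MaxSAT selection of non-overlapping replacements can be pictured with a toy encoding like the one below, solved here with the PySAT RC2 solver. The candidate data and the encoding details are assumptions, not SOFA-H's.

```python
# Toy weighted-MaxSAT selection of non-overlapping hypercell replacements.
# Requires the python-sat package (PySAT). Candidates and weights are made up.
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

# candidate id -> (area saved if applied, set of netlist gates it would replace)
candidates = {1: (5, {"g1", "g2"}), 2: (3, {"g2", "g3"}), 3: (4, {"g4"})}

wcnf = WCNF()
for var, (saving, _) in candidates.items():
    wcnf.append([var], weight=saving)          # soft: prefer applying each candidate
ids = list(candidates)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        a, b = ids[i], ids[j]
        if candidates[a][1] & candidates[b][1]:
            wcnf.append([-a, -b])              # hard: overlapping replacements exclude each other

rc2 = RC2(wcnf)
model = rc2.compute()
rc2.delete()
print("selected hypercells:", [v for v in model if v > 0])   # expected: [1, 3]
```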
5F-3
14:20-14:45

Formalization of Rectification Learning for Economic Design Updates

*Victor N. Kravets (IBM Research), Jie-Hong R. Jiang (National Taiwan University)
Keywords
Engineering Change Order, Boolean functional synthesis, Design rectification
Abstract
Engineering Change Order (ECO) is the task of finding non-intrusive design implementation updates that comply with a specification revision. This paper states the rectification problem in quantified Boolean logic, giving a sound and complete capture of the update choices for an ECO. Its closed-form statement offers an analytical search for small patches that maximize logic sharing in the implementation. With the abstraction-refinement paradigm assisted by relevance classification, we effectively generalize the sampled knowledge of a revision, enabling the identification of compact updates without undue computational cost. Our experimental evaluation demonstrates synthesized patches with almost half as many gates as the reported state-of-the-art results.
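Illustrative sketch (not from the paper): single-point rectification problems are commonly captured by a quantified formula of roughly the following shape, where S is the revised specification, I is the implementation with the rectification point exposed as a fresh variable y, and the patch is a Skolem function f for y. The paper's exact closed-form statement may differ.

```latex
% Assumed 2QBF shape of single-point rectification (illustrative only):
%   a rectification at point y exists iff
\forall x\, \exists y\; \bigl( I(x, y) \leftrightarrow S(x) \bigr)
%   and any Skolem function f realizing y, i.e. satisfying
\forall x\; \bigl( I(x, f(x)) \leftrightarrow S(x) \bigr),
%   constitutes a valid ECO patch.
```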
5F-4
14:45-15:10

PhyMap: A Physically-Aware Incremental Mapping Framework with On-the-fly Post-Layout Critical Path Tracking

*Hongyang Pan, Cunqing Lan (Fudan University), Zhiang Wang (University of California, San Diego), Xuan Zeng, Fan Yang, Keren Zhu (Fudan University)
Keywords
Physically aware Synthesis, Technology Mapping, Monte Carlo tree search
Abstract
Physically aware synthesis aims to address power, performance, and area (PPA) challenges, but existing methods are suboptimal because they fail to predict the true post-route critical path. We argue that the objective should shift from pursuing absolute timing accuracy to ensuring correct critical path identification. In this paper, we propose PigMap3, a novel framework for incremental physically aware technology mapping. We introduce a lightweight placement oracle that dynamically estimates cell locations for accurate interconnect delay modeling. Furthermore, we propose a hybrid mapping algorithm: a Monte Carlo tree search algorithm robustly handles critical paths, while a dynamic programming algorithm optimizes the area of non-critical paths. Evaluated using the OpenROAD tool flow, PigMap3 demonstrates significant improvements over state-of-the-art methods. On the IWLS'05 sequential benchmarks, PigMap3 improves the WNS by 32.7% and reduces power consumption by 15.4%. On the EPFL combinational benchmarks, our framework achieves an average 7.0% reduction in the post-route PPA product.
5F-5
15:10-15:35

CombRewriter: Enabling Combinational Logic Simplification in MLIR-Based Hardware Compiler

*Haisheng Zheng (Shanghai AI Laboratory), Zhuolun He, Shuo Yin (The Chinese University of Hong Kong), Yuzhe Ma (The Hong Kong University of Science and Technology (Guangzhou)), Bei Yu (The Chinese University of Hong Kong)
Keywords
Combinational Logic Simplification, Hardware Compiler, Logic Synthesis
Abstract
Modern Hardware Description Languages (HDLs) play a pivotal role in enabling swift and adaptable hardware development. A hardware compiler translates high-level designer intents into a concrete hardware implementation, the quality of which directly determines ultimate circuit performance. However, current hardware compilers may overlook opportunities for combinational logic simplification, leading to RTL code that contains redundant logic and degrades the Quality of Results (QoR) of the synthesized netlist. This paper presents CombRewriter, a novel approach that incorporates compilation-level optimization techniques into combinational logic simplification. Experimental results demonstrate that the proposed method effectively reduces netlist area.

Session 6A

(T3-C) Circuit- and Device-Aware Design for CIM
15:55-18:00 | Wednesday, January 21, 2026 | Snow White 1
Chair(s):
Hiromitsu Awano (Kyoto University)
Arindam BASU (City University of Hong Kong)
6A-1
15:55-16:20

SCION: A Comprehensive Simulation Framework for Charge-based In-Memory Computing for Rapid Evaluation of Hardware Non-idealities and DNN Accuracy

*Doug Hyun Kim, Akul Malhotra, Sumeet Kumar Gupta (Purdue University)
Keywords
Charge-based computing, Deep neural networks (DNNs), Hardware non-idealities, In-memory computing (IMC), Simulation framework
Abstract
As artificial intelligence advances at a rapid pace, the demand for computational resources has grown significantly, dominated by matrix-vector multiplications (MVMs). In-memory computing (IMC) is a promising approach that addresses the major data transfer bottleneck in these computations. Among various IMC designs, charge-based sensing offers robustness against challenges like IR drops that severely affect common current-based IMC approaches. However, charge-based IMC is vulnerable to its own non-idealities, particularly parasitic capacitive coupling, which can degrade computational accuracy. Accurately integrating the effects of these non-idealities into DNN inference evaluation requires time-intensive transient SPICE simulations, making comprehensive design space exploration and hardware-software co-optimization impractical. To overcome these limitations, we propose a comprehensive framework, SCION, which rigorously models the hardware non-idealities in charge-based IMC designs, rapidly predicts the IMC output charge and directly integrates the effect of hardware non-idealities in PyTorch-based DNN inference simulations. We show that our framework predicts the IMC output charge with more than 99% accuracy with respect to SPICE while offering more than 2000x speedup. We demonstrate the capability of our framework by conducting experiments on a 9T-SRAM-based IMC design, showing how and to what extent the parasitic capacitive coupling impairs the inference accuracy. Furthermore, we propose mitigation techniques based on layout optimizations and array biasing that alleviate the impact of these non-idealities. Our results show that the mitigation techniques significantly improve the sense margin and restore inference accuracy on CIFAR-100 for both ResNet-50 and ViT-small. We also show how the technology/circuit-level knobs affect the system-level accuracy. This general framework is compatible with other charge-based IMC designs and DNN workloads, highlighting its potential to enable cross-layer design of charge-based IMC accelerators.
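Illustrative sketch (not from the paper): one way non-ideality models are folded into PyTorch-based inference is to wrap a layer so that its ideal per-column MAC result is perturbed by a coupling term before being returned. The toy wrap-around coupling model and all parameters below are assumptions, not SCION's calibrated model.

```python
# Assumed, illustrative injection of a column-coupling non-ideality into a
# PyTorch forward pass; NOT SCION's actual model.
import torch
import torch.nn as nn

class CouplingAwareLinear(nn.Linear):
    def __init__(self, in_features, out_features, coupling=0.02):
        super().__init__(in_features, out_features, bias=False)
        self.coupling = coupling   # assumed fraction of charge exchanged with adjacent columns

    def forward(self, x):
        ideal = super().forward(x)                     # ideal per-column MAC "charge"
        left = torch.roll(ideal, shifts=1, dims=-1)    # neighboring columns (wraps at edges: toy only)
        right = torch.roll(ideal, shifts=-1, dims=-1)
        return ideal + self.coupling * (left + right - 2 * ideal)

layer = CouplingAwareLinear(128, 64)
print(layer(torch.randn(4, 128)).shape)                # torch.Size([4, 64])
```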
6A-2
16:20-16:45

SABCIM: Self-Adaptive Biasing Scheme for Accurate and Efficient Analog Compute-in-Memory

*Yashvardhan Biyani, Abhairaj Singh, Rajendra Bishnoi, Said Hamdioui (CE, TU Delft)
Keywords
Computation-in-Memory (CIM), In Memory Computing, Edge AI, Neural Networks, Vector-Vector Multiplication (VVM), Multiply-and-Accumulate (MAC), Vector-Matrix Multiplication (VMM), Analog CIM, CIM Architecture, Array-Periphery co-design, Low-power, Emerging non-volatile memories, Memristor, Resistive Random Access Memory (RRAM), Voltage-to-Time Converter (VTC), Analog-to-Digital Converter (ADC), VTC-based ADC
Abstract
Analog Compute-in-Memory (CIM), leveraging non-volatile memristive devices to perform in-place computations in the analog domain, holds great potential to efficiently accelerate vector-matrix multiplications (VMM) and realize AI (Artificial Intelligence) at the edge. However, the data converters in such architectures often trade off accuracy against high energy and area overheads, practically limiting the benefits of CIM. In this work, we present SABCIM, an array-periphery co-design approach for CIM that enables accurate computation as well as digitization of analog VMM outputs with high energy efficiency and competitive area overhead. By leveraging complementary input activations and data storage, each crossbar column generates a differential analog output corresponding to the vector-vector multiplication (VVM) result, while inherently addressing underlying non-idealities. This output is digitized using a compact, dual-ramp voltage-to-time converter (VTC)-based analog-to-digital converter (ADC). Benchmark results indicate that our work achieves up to 19.6x higher energy efficiency compared to the state-of-the-art (SOTA), while maintaining comparable accuracies.
6A-3
16:45-17:10

CDACiM: A Charge-Domain Compute-in-Memory Macro for FP/INT MAC Operations with Reconfigurable Capacitor Digital-Analog-Converter

*Jinting Yao, Zeyu Yang, Yuxiao Jiang, Yuxiao Yang, Zheyu Yan, Cheng Zhuo, Xunzhao Yin (Zhejiang University)
Keywords
Compute-in-Memory(CiM), floating point multiply-and-accumulate(MAC), dual-mode macro, static random access memory(SRAM)
Abstract
Advanced artificial intelligence (AI) edge chips need to balance flexible computation, high energy efficiency, and sufficient inference accuracy across diverse workloads. Many compute-in-memory (CiM) designs enable efficient neural network acceleration but focus solely on integer (INT) multiply-and-accumulate (MAC) operations, limiting precision. Some CiM macros add extra circuitry to support floating point (FP) MACs, but these dedicated exponent-handling blocks often waste area when running INT workloads. In this paper, we propose CDACiM, a charge-domain CiM macro that supports both FP and INT MAC operations with minimal overhead. CDACiM introduces a reconfigurable capacitor digital-to-analog converter (RCDAC) that performs both exponent summation and bitwise AND for mantissa multiplication. To calculate exponent offsets, we develop a shared single-slope ADC (SS-ADC) that finds the maximum exponent and computes the differences in the time domain simultaneously. Our design includes a sparsity-aware computation scheme with tunable thresholds that skips low-importance input-weight pairs, boosting energy efficiency through higher input sparsity. We also introduce a multi-bit input accumulation method that leverages ADC redundancy during quantization and normalization to improve performance. Fabricated in a 40nm CMOS process, CDACiM demonstrates excellent flexibility and a favorable trade-off between accuracy and resource usage. Notably, it is the first CiM design to reconfigure a capacitor-based INT macro for parallel exponent computation. CDACiM achieves 16.2 TOPS/W for INT MACs and 15.9 TFLOPS/W for FP MACs. It delivers a 1.36-1.48x improvement in energy efficiency with minimal accuracy loss compared to recent FP CiM macros.
6A-4
17:10-17:35

Learnable Center-Based Quantization for Efficient Analog PIM with Reduced ADC Precision

*Sangheum Yeon, Jonghwan Ko (Sungkyunkwan University)
Keywords
Quantization, In-Memory Computing, Analog-to-Digital Converter
Abstract
Processing-in-memory (PIM) architectures have shown significant potential for accelerating deep neural network (DNN) inference by performing matrix-vector multiplications directly within memory. However, achieving high precision often requires high-resolution analog-to-digital converters (ADCs), which can increase energy consumption and limit overall efficiency. To address this, we propose a learnable center-based quantization (LCQ) technique that minimizes the range of partial sums in PIM arrays. This reduction in the range of partial sums decreases the ADC resolution requirements, enabling accurate low-bit quantization while maintaining energy efficiency. Our framework directly models ADC precision constraints within the training process without requiring extensive retraining. Experimental results on DNN models such as ResNet20 and ResNet18 with CIFAR-10/ImageNet datasets demonstrate that LCQ significantly enhances energy efficiency while maintaining competitive accuracy compared to previous techniques for efficient analog PIM. LCQ improves both accuracy and energy efficiency, reducing ADC resolution requirements and enabling practical low-bit quantization.
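Illustrative sketch (not from the paper): the quantity LCQ constrains is the per-array partial sum seen by the ADC. The toy routine below simulates a low-resolution ADC clamping and quantizing those partial sums during a crossbar-mapped matrix multiplication; the array size, range, and bit-width are assumptions, not LCQ's learned values.

```python
# Assumed, illustrative ADC model applied to crossbar partial sums; NOT LCQ's method.
import torch

def pim_matmul(x, w, rows_per_array=64, adc_bits=4, psum_range=8.0):
    """Matrix multiply mapped onto crossbar arrays of `rows_per_array` rows, with
    each array's partial sum clipped and quantized by a low-bit ADC."""
    levels = 2 ** adc_bits - 1
    step = 2 * psum_range / levels
    out = torch.zeros(x.shape[0], w.shape[1])
    for start in range(0, w.shape[0], rows_per_array):
        psum = x[:, start:start + rows_per_array] @ w[start:start + rows_per_array]
        psum = psum.clamp(-psum_range, psum_range)   # ADC input range; LCQ's training narrows this
        out += torch.round(psum / step) * step       # uniform ADC quantization
    return out

x = torch.randint(0, 2, (2, 256)).float()   # toy binary activations
w = torch.randn(256, 32).sign()             # toy +/-1 weights
print(pim_matmul(x, w).shape)               # torch.Size([2, 32])
```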
6A-5
17:35-18:00

RL-Guided Thermal-Aware Quantization for Efficient and Robust ReRAM CIM Systems

*Lihua An, Jiayi Li, Pingqiang Zhou (ShanghaiTech University)
Keywords
Neural Network, Computing in Memory, Thermal Effect, Reinforcement Learning, Quantization, ReRAM
Abstract
Resistive RAM (ReRAM)-based Computing-in-Memory (CIM) systems present significant advantages in energy efficiency and computational throughput for neural network acceleration. However, their performance is highly constrained by thermal-induced conductance drift, especially under aggressive quantization strategies. This work presents a reinforcement learning (RL)-guided thermal-aware layer-wise quantization framework optimized for ReRAM-based CIM systems. The proposed method encourages sparse and low-magnitude weight representations, while adaptively exploring layer-wise bit-width configurations guided by direct hardware evaluation feedback and a thermal-aware reward. Experiments on CIFAR-10 and ImageNet benchmarks with ResNet and VGG show that the proposed method achieves up to a 10.2% peak temperature reduction and an average top-1 accuracy improvement of 46.37% over fixed 8-bit baselines. Compared to prior layer-wise quantization methods without thermal considerations, our approach improves accuracy by 6.36%-15.06% on average.

Session 6B

(T1-B) Accelerator and Mapping Innovations for LLMs and Neural Networks
15:55-18:00 | Wednesday, January 21, 2026 | Snow White 2
Chair(s):
Zhe Lin (Sun Yat-sen University)
Jianwang Zhai (Beijing University of Posts and Telecommunications)
6B-1
15:55-16:20

SnipSnap: A Joint Compression Format and Dataflow Co-Optimization Framework for Efficient Sparse LLM Accelerator Design

*Junyi Wu, Chao Fang, Zhongfeng Wang (Nanjing University)
Keywords
Design Space Exploration (DSE), Sparse Accelerators, Compression Format, Large Language Model (LLM)
Abstract
The growing scale of large language models (LLMs) has intensified demands on computation and memory, making efficient inference a key challenge. While sparsity can reduce these costs, existing design space exploration (DSE) frameworks often overlook compression formats, a key factor for leveraging sparsity on accelerators. This paper proposes SnipSnap, a joint compression format and dataflow co-optimization framework for efficient sparse LLM accelerator design. SnipSnap introduces: (1) a hierarchical compression format encoding to expand the design space; (2) an adaptive compression engine for selecting formats under diverse sparsity; and (3) a progressive co-search workflow that jointly optimizes dataflow and compression formats. SnipSnap achieves 18.24% average memory energy savings via compression format optimization, along with 2248.3x and 21.0x speedups over Sparseloop and DiMO-Sparse frameworks, respectively.
6B-2
16:20-16:45

ALMA: Adaptive Co-optimization of Loop-Memory Uneven Mappings and Architectures for DNN Accelerators

*Xiaodong Liu, Zhihui Wang, Xiangcong Kong, Weixin Zhou, Xiaofei Xia, Yuqi Jiang (Vivo Mobile Communication (Shenzhen) Co., Ltd)
Keywords
Transformers, DNN, design space exploration, genetic algorithm, multi-objective optimization, uneven mapping
Abstract
Efficient design of Deep Neural Network (DNN) accelerators requires the joint optimization of hardware architecture and dataflow mapping, a task complicated by a vast, non-convex, and tightly-coupled design space. Existing methods typically decouple these two stages or restrict mapping flexibility, leading to suboptimal solutions. This paper presents ALMA (Adaptive Loop-Memory Uneven Mappings and Architecture co-optimization), a unified framework that holistically co-optimizes hardware architectures and flexible, uneven mapping strategies. ALMA's core contributions are threefold: 1) A unified search space that integrates hardware parameters with an uneven mapping strategy on unbalanced memory hierarchies and a novel mixed-dimensional parallelism approach. 2) Two specialized genetic operators, Factor Compensation and Position-Sensitive Loop Crossover (PSLC), specifically designed to maintain solution validity within this complex, unified chromosome structure. 3) A Reinforcement Learning (RL) agent that dynamically adapts the genetic algorithm's hyperparameters, improving search efficiency and robustness. We evaluate ALMA on a variety of Convolutional Neural Networks (CNNs) and Transformer models. Experimental results demonstrate that ALMA identifies solutions with superior Performance, Power, and Area (PPA) trade-offs, significantly outperforming state-of-the-art frameworks like MEDEA and SENNA MO. On key benchmarks, ALMA reduces latency by up to 95% while achieving comparable or superior energy efficiency and area costs.
6B-3
16:45-17:10

MARCO: Hardware-Aware Neural Architecture Search for Edge Devices with Multi-Agent Reinforcement Learning and Conformal Filtering

Arya Fayyazi, Mehdi Kamal, *Massoud Pedram (University of Southern California)
Keywords
Hardware/Software Co-Design, Conformal Prediction, Multi Agent Reinforcement Learning, Network Architecture Search, ARM, Edge AI
Abstract
We present MARCO (Multi-Agent Reinforcement learning with Conformal Optimization), a hardware-aware neural architecture search (NAS) framework for resource-constrained edge devices. MARCO combines multi-agent reinforcement learning (MARL) with Conformal Prediction (CP) to efficiently explore architectures under strict memory and latency budgets. Unlike once-for-all (OFA) supernets that require expensive pretraining, MARCO separates the NAS task into a Hardware Configuration Agent and a Quantization Agent, coordinated via a centralized-critic, decentralized-execution (CTDE) paradigm. A calibrated CP surrogate model offers distribution-free guarantees to filter low-reward candidates before costly training or simulation, significantly accelerating the search. Experiments on MNIST, CIFAR-10, and CIFAR-100 show MARCO achieves 3-4x faster search than OFA while maintaining accuracy within 0.3% and reducing latency. Validation on the MAX78000 confirms simulator fidelity with less than 5% error.
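Illustrative sketch (not from the paper): split conformal prediction supplies a distribution-free residual bound for a cheap surrogate, which can then be used to discard low-reward candidates before any costly training. The surrogate, data, and threshold below are hypothetical stand-ins, not MARCO's calibrated model.

```python
# Assumed, illustrative split-conformal filtering of candidate architectures.
import numpy as np

rng = np.random.default_rng(0)
def surrogate(z):              # hypothetical cheap reward predictor
    return 0.95 - 0.1 * z
def measure(z):                # hypothetical "expensive", noisy measured reward
    return 1.0 - 0.1 * z + rng.normal(0.0, 0.05, size=np.shape(z))

# Calibration: residuals between measured and predicted reward on known candidates.
z_cal = rng.uniform(0, 5, size=200)
residuals = np.abs(measure(z_cal) - surrogate(z_cal))
alpha = 0.1                    # target 90% coverage
n = len(residuals)
q_hat = np.quantile(residuals, np.ceil((1 - alpha) * (n + 1)) / n)

# Filtering: discard candidates whose conformal upper bound misses a reward target.
z_new = rng.uniform(0, 5, size=10)
keep = surrogate(z_new) + q_hat >= 0.6
print(np.column_stack([np.round(z_new, 2), keep]))
```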
6B-4
17:10-17:35

NetTLM-DSE: Design Space Exploration for DNN Layer-Pipeline Spatial Mappings

*Shinyoung Kim, Junsu Heo, Hyeseong Shin, Jaesuk Lee (Konkuk University), Sungkyung Park (Pusan National University), Chester Sungchung Park (Konkuk University)
Keywords
Deep Neural Networks Accelerator, Spatial Mapping, Design Space Exploration
Abstract
This paper presents NetTLM-DSE, a novel framework for exploring the design space of layer-pipeline spatial mapping (LP-SM) for a large-scale accelerator network based on SystemC transaction-level modeling (TLM). In detail, the proposed SystemC-TLM simulator, NetTLMSim, is used to predict the network delay and optimize the mapping of deep neural network (DNN) layers to multiple accelerator cores. Conventional analytical delay models fail to accurately predict the delay of a large-scale accelerator network because they lack the ability to capture the dynamic traffic of an accelerator network, such as network contention. Such inaccurate delay prediction makes it difficult to find the optimal or near-optimal LP-SM in the corresponding design space exploration (DSE). In contrast, NetTLMSim reduces the prediction error by up to 52.8% through its cycle-accurate modeling of dynamic traffic in accelerator networks. When applied to the DSE for Transformer LP-SM, the proposed cycle-accurate simulator can find the optimal or near-optimal design options more efficiently than the conventional analytical delay models. Specifically, the proposed DSE framework, NetTLM-DSE, is shown to achieve up to 37.8% lower delay than the conventional DSE framework. In addition, a set of simulated annealing (SA) operators using dynamic traffic profiles is newly proposed to improve the efficiency of DSE. In detail, it is shown that, when applied to the DSE for Transformer LP-SM, the proposed SA operators improve the DSE speed by up to 684x over conventional SA operators. The NetTLM-DSE framework is open-sourced at https://github.com/SDL-KU/NetTLM-DSE.
6B-5
17:35-18:00

BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU

*Junyi Wu, Jiaming Xu, Jinhao Li, Yongkang Zhou, Jiayi Pan, Xingyang Li, Guohao Dai (Shanghai Jiao Tong University)
Keywords
Algorithm, system, efficiency
Abstract
3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification. The adaptive densification strategy in 3DGS makes skewed Gaussian allocation across dense and sparse regions. The number of Gaussians in dense regions can be 100x that of sparse regions, leading to Gaussian redundancy. (2) Imbalanced computation workload during Gaussian projection. The traditional one-to-one allocation mechanism between threads and pixels results in execution time discrepancies between threads, leading to ~20% latency overhead. (3) Fragmented memory access during color splatting. Discrete storage of colors in memory fails to take advantage of data locality with fragmented memory access, resulting in ~2.0x color memory access time. To tackle the above challenges, we introduce BalanceGS, an algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose a reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that, compared with 3DGS, our approach achieves a 1.44x training speedup on an NVIDIA Tesla A100 GPU with negligible quality degradation.

Session 6C

(T12-B) Logic Locking and Hardware Trojan Detection
15:55-18:00 | Wednesday, January 21, 2026 | Snow White 3
Chair(s):
Amin Rezaei (California State University Long Beach)
Qiaoyan Yu (University of New Hampshire)
6C-1
15:55-16:20

LumiLock: LUT-based Multi-Key Logic Locking

*Tsunato Nakai, Takuya Higashi (Mitsubishi Electric Corporation)
Keywords
logic locking, SAT, multi-key logic locking
Abstract
Logic locking, which ensures that a logic circuit operates correctly only when the correct key is input, has gained attention as a countermeasure against the infringement of hardware intellectual property and reverse engineering threats. However, many existing methods have been compromised by SAT attacks. Recent research has reported that multi-key logic locking methods are effective as a fundamental countermeasure against SAT attacks. However, these methods mainly face challenges such as limitations in their application scope and increased overhead when applied. In this paper, we propose a new logic locking method, LUT-based multi-key logic locking, to address these challenges. The proposed method was evaluated through ten SAT attacks on three benchmark circuits. The results demonstrate that, compared to existing methods, our method achieves both flexibility in its applicability to diverse circuits and robust resistance against SAT attacks, even with a smaller key size.
6C-2
16:20-16:45

SCPrompt: Semantic Compression and Prompt-Guided LLM Reasoning for RTL Trojan Detection

Jiaji He, *Jiansheng Chen (Tianjin University), Fei Zhao (Peking University), Yaohua Wang (National University of Defense Technology), Yongqiang Lyu (Tsinghua University)
Keywords
Hardware security, RTL features compression, Large language models, Prompt engineering
Abstract
The increasing scale and complexity of integrated circuit (IC) designs present significant challenges to hardware security. Hardware Trojans (HTs)—stealthy, malicious alterations to hardware logic—are especially difficult to detect at the register-transfer level (RTL), where both structural and semantic understanding are essential. While recent advances have explored the use of large language models (LLMs) for RTL analysis, their performance is often constrained by token limits and a lack of targeted semantic abstraction. To overcome these limitations, we propose SCPrompt, a novel and extensible framework that, for the first time, integrates RTL instance-level compression with prompt-guided LLM reasoning for hardware Trojan detection. SCPrompt compresses RTL inputs by over 90%, significantly reducing prompt length while retaining critical control and data flow semantics. This enables LLMs to perform accurate Trojan analysis without task-specific fine-tuning. Unlike prior binary classification approaches, SCPrompt supports instance-grained localization of suspicious logic components, identifying both trigger and payload modules within a design. Evaluated on 51 RTL designs that had hardware Trojans inserted, our method achieves up to 100% precision, 90% recall, and an F1 score of 94.74%, demonstrating strong generalization across diverse circuits. These results validate SCPrompt as the first effective and scalable framework to unify RTL feature engineering with general-purpose language model reasoning for advanced hardware security analysis.
6C-3
16:45-17:10

Can Large Language Models Unlock Logic Locking?

*Takuya Higashi, Tsunato Nakai (Mitsubishi Electric Corporation)
Keywords
Logic Locking, SAT Attack, Multi-Key Locking, Large Language Model
Abstract
In the realm of integrated circuit (IC) manufacturing, the industry faces security threats such as the theft of intellectual property and reverse engineering. In response to these threats, logic locking has gained significant attention as a viable countermeasure. Logic locking is a technique that obfuscates the operation of an IC by integrating specific gates or components into the circuit design, allowing the IC to function only under certain conditions. Traditional logic locking methods have been vulnerable to SAT attacks; however, recent research has shown that multi-key logic locking serves as an effective countermeasure against these attacks. Currently, there is a growing body of research in the field of security that explores sophisticated attacks utilizing large language models (LLMs), with numerous studies documented. Nevertheless, to the best of the authors' knowledge, there have been no reports on attack methodologies that utilize LLMs to target logic locking specifically. This study presents, for the first time, an attack using LLMs against circuits employing logic locking, particularly those enhanced with multi-key logic locking that is resistant to SAT attacks. Through the proposed attack methodology utilizing LLMs, we successfully executed an attack against the state-of-the-art multi-key logic locking scheme known as K-Gate Lock.
6C-4
17:10-17:35

GALA: An Explainable GNN-based Approach for Enhancing Oracle-Less Logic Locking Attacks Using Functional and Behavioral Features

Yeganeh Aghamohammadi, Henry Jin (University of California Santa Barbara), *Amin Rezaei (California State University Long Beach)
Keywords
Logic Locking, Logic Encryption, Machine Learning, Graph Neural Networks, Explainability
Abstract
With the rise of fabless manufacturing, the risks of piracy and overproduction in integrated circuits have become more pressing, making it crucial to analyze and prevent hardware-based attacks. Although existing machine learning oracle-less attacks on logic-locked circuits are able to report approximate keys, they often struggle to produce operationally effective keys because they focus mainly on the structural topology of the circuits. This paper addresses this limitation by incorporating both functional features, such as output corruptibility, and behavioral features, like power consumption and area overhead, into graph neural network-based circuit modeling attacks. With the help of both subgraph-level and graph-level attack strategies, we achieve notable improvements in rendering a meaningful key compared to existing oracle-less methods. In addition, our graph-level model is explainable, providing insights into the learning process and how the attack is executed. These findings are critical for chip design houses looking to identify and address security vulnerabilities, ultimately safeguarding hardware intellectual property.
6C-5
17:35-18:00

SAND: A Self-supervised and Adaptive NAS-Driven Framework for Hardware Trojan Detection

*Zhixin Pan (Florida State University), Ziyu Shu (Stony Brook University), Linh Nguyen, Amberbir Alemayoh (Florida State University)
Keywords
Hardware Trojan, Machine Learning, Self-supervised Learning, Neural Architecture Search
Abstract
The globalized semiconductor supply chain has made Hardware Trojans (HT) a significant security threat to embedded systems, necessitating the design of efficient and adaptable detection mechanisms. Despite promising machine learning-based HT detection techniques in the literature, there are three major limitations of existing works: ad hoc feature selection, vulnerability to obfuscation, and lack of generalizability, all of which hinder their effectiveness across diverse HT attacks. In this paper, we propose SAND, a self-supervised and adaptive NAS-driven framework for efficient HT detection. Specifically, this paper makes three key contributions. (1) We leverage self-supervised learning (SSL) to enable automated feature extraction, eliminating the dependency on manually engineered features and improving the robustness. (2) SAND integrates neural architecture search (NAS) to dynamically optimize the downstream classifier, allowing for seamless adaptation to unseen benchmarks with minimal fine-tuning. (3) Experimental results show that SAND achieves a significant improvement in detection accuracy (up to 18.3%) over state-of-the-art methods, exhibits high resilience against evasive Trojans, and demonstrates strong generalization.

Session 6D

6D (Designer Forum 2) AI in Production EDA: Digital, Custom, and Manufacturing Use Cases
15:55-17:35 | Wednesday, January 21, 2026 | Sleeping Beauty 1/2
Chair(s):
Min Li (Southeast University)
6D-1
15:55-16:20

How AI is Supercharging Digital Implementation

*Ko-Lung Yuan (Siemens)
Biography
Ko-Lung Yuan is a Senior Manager of Software Engineering at Siemens EDA, currently working on Aprisa, a digital implementation solution. With around 11 years of experience in the electronic design automation (EDA) industry, his expertise includes RTL optimization, placement optimization, clock tree synthesis, design exploration, and machine learning-based optimization. In recent years, Ko-Lung has focused on integrating generative AI, AI agents, and intelligent design automation techniques into production-grade EDA workflows, bridging traditional algorithms with emerging AI capabilities. He is also the author of "It's Django", a practical guide to web design using Python—recognized as the first native Chinese-language book on the topic.
Abstract
In addition to meeting power, performance, and area (PPA) targets, semiconductor design teams are currently challenged to accelerate design cycle time significantly.
Today, AI in EDA has delivered results on productivity and scalability for RTL-to-GDS, enabling digital implementation teams to achieve better results within much shorter schedules.
This presentation explores how AI-driven techniques, ranging from machine learning and reinforcement learning-based optimization to generative AI-powered natural language interfaces and the streamlining and automation of design tasks with AI agents, can significantly scale productivity and engineering effort to achieve PPA targets faster and more consistently.
6D-2
16:20-16:45

Shaping a Full-custom Design Ecosystem with Industry-Academia-Research Collaboration

*Michael Liu (Empyrean Technology)
Biography
Michael Liu is the senior product director of Empyrean Technology. He has more than 10 years of experience in ASIC chip design, manufacturing, and packaging EDA software product development and management, focusing on the planning, development, and promotion of EDA products. He has helped Empyrean build a mature analog/mixed-signal design flow and expand it to other full-custom design fields such as flat-panel display, signal chain, memory, RF, and optoelectronics. He is building a reliability design methodology for design-manufacturing collaboration and a PPAC-oriented design-manufacturing-packaging collaborative design solution. These solutions are widely adopted by leading national and international design houses.
Abstract
As the complexity of integrated circuit design continues to grow, industry-academia-research collaborative innovation has emerged as a critical pathway to overcome core EDA challenges and accelerate technology adoption. Empyrean Technology is building an ecosystem around a full-custom design platform powered by Python APIs—leveraging Python’s widespread use in algorithm development, script extensibility, and the academic ecosystem—to equip researchers with flexible, efficient tools and support for secondary development. We have already launched AI-driven design automation initiatives on this platform, spanning use cases such as circuit topology optimization and layout generation. These early efforts have demonstrated the potential of intelligent tools to significantly boost design efficiency and quality. We sincerely invite academic partners to join us in growing this ecosystem and injecting new momentum into design innovation.
6D-3
16:45-17:10

AI-driven Analog and Custom Design Solution

*Yutao Ma (PRIMARIUS)
Biography
Dr. Yutao Ma received his bachelor's degree in 1996 and his Ph.D. in 2001, both with honors, from Tsinghua University, majoring in Microelectronics. Since then, Dr. Ma has worked in the EDA industry for over 20 years, from Celestry to Cadence and then Primarius. His expertise is in semiconductor device modeling, circuit simulation, and yield analysis. Dr. Ma is now VP of R&D at Primarius Technologies, leading technology innovation and new product development as well as the R&D engineering infrastructure team. He also serves as the director of the EDA Innovation Key Laboratory of Shandong Province and the co-director of the Primarius-Peking University DTCO innovation laboratory. Dr. Ma has published dozens of technical papers in leading journals and conferences and holds multiple patents in device modeling, simulation algorithms, yield analysis, and hardware acceleration.
Abstract
The rapid convergence of artificial intelligence and electronic design automation is reshaping analog and custom circuit design. This talk discusses how AI—ranging from traditional machine learning to large foundation models—can enhance modeling accuracy, design efficiency, and reliability analysis across advanced technology nodes. It highlights emerging research directions, evolving tool architectures, and practical design workflows that integrate AI into circuit synthesis, layout, and verification. Emphasis is placed on methodological innovation, cross-domain collaboration, and the pathway toward intelligent, data-driven design ecosystems enabling faster development cycles and improved design quality in the era of AI-empowered semiconductor engineering.
6D-4
17:10-17:35

Exploring the Application of Machine Learning in Accelerating Model and Mask Optimization

*Xiaodong Meng (Amedac)
Biography
Meng Xiaodong is currently the Vice President of Advanced Manufacturing EDA Co., Ltd. He graduated from the Department of Physics of Tsinghua University and holds a master's degree from the Institute of Physics at the Chinese Academy of Sciences. As an expert in computational lithography, he has over 20 years of extensive experience in its R&D and application. Mr. Meng Xiaodong joined Semiconductor Manufacturing International Corporation (SMIC) in 2001 and held positions such as lithography process engineer and senior process integration engineer. In 2005, he joined Synopsys (Shanghai) and held positions such as senior engineer, project manager, senior manager, and the head of OPC in China, managing the OPC, precise lithography simulation, and mask data preparation for customers in the Asia-Pacific region. In 2019, Mr. Meng Xiaodong joined Advanced Manufacturing EDA Co., Ltd. as the vice president of industrial promotion, responsible for promoting the deep integration of industry, academia, and research, facilitating the transformation of scientific and technological achievements, and building a cooperation platform between enterprises and universities and research institutions.
Abstract
With the increasing demand for chips at 7nm and smaller technology nodes, the requirements on the accuracy of chip process models and mask designs in chip fabs are becoming ever more stringent. At the same time, machine learning algorithms, big data processing technologies, and the related computing power have made rapid progress in recent years, providing powerful new tools for optimizing each step of the chip manufacturing process. Both the academic and industrial communities have made initial attempts to apply machine learning in this field. In this report, we will introduce our roadmap and progress in this area.

Session 6E

(T9-C) Cell Placement and Generation for Advanced Technologies
15:55-18:00 | Wednesday, January 21, 2026 | Sleeping Beauty 3
Chair(s):
Pingqiang Zhou (ShanghaiTech University)
Yibo Lin (Peking University)
6E-1
15:55-16:20

An Effective Placement Framework for Designs with Half-Row-Extended Cells

Po-Yi Wu (National Tsing Hua University), *Tzu-Sheng Hung (National Tsing Hua University), Wai-Kei Mak, Ting-Chi Wang (National Tsing Hua University)
Keywords
Physical design, Half-row-extended cell, Quadratic placement, Multi-row-height legalization
Abstract
In advanced technology nodes, it is becoming common to mix standard cells with different heights in the same design to optimize timing, power, and routability simultaneously. In this paper, we explore the placement problem of mixed-cell-height designs that include half-row-extended (HRE) cells and conventional single-row-height cells. An HRE cell is like a conventional single-row-height cell extended by half a row both upward and downward, resulting in a double-row-height cell. The advantage of an HRE cell is its superior driving strength compared to a conventional double-row-height cell, where both the top and bottom cell boundaries are aligned with power or ground rails. But mixing HRE cells with conventional single-row-height cells may result in a less compact placement, which adversely affects the wirelength and/or the chip size. To address the placement challenges introduced by HRE cells, we propose a two-level placement framework: (1) HRE cell-aware global placement and (2) half-row-fragment-aware legalization. During the global placement, we estimate the potential area overhead caused by the HRE cells and identify groups of HRE cells that should be placed close to one another to control the dead space distribution while minimizing the wirelength. Then, we invoke FragmentSaver, a legalizer that leverages dead-space cost curves to reduce row fragmentation with minimal displacement. Experimental results show that, compared to the state-of-the-art mixed-cell-height placer RePlAce [8], our method resulted in significant reductions in routed wirelength and cell displacement.
6E-2
16:20-16:45

DUALPlace: Reinforcement Learning based Mixed-size Placement with Multi Modal Cross Attention

Yue Wu, *Hu Liu, Jiayi Ding, Junwei Li, Tieming Han, Xiaoyan Yang (Hangzhou Dianzi University)
Keywords
Reinforcement learning, Physical design, Mixed-size placement, Electronic design automation
Abstract
As modern circuits scale up, chip placement becomes increasingly challenging. Reinforcement learning (RL) algorithms have shown promise in achieving high-performance placement outcomes. However, existing RL approaches struggle with mixed-size placement, primarily due to their inability to model interdependencies between heterogeneous components during co-optimization. This study introduces an RL framework specifically designed for mixed-size placement called DUALPlace. DUALPlace leverages cross-attention for multi-modal fusion, combining a multi-scale cross-attention vision transformer (Cross-ViT) and a graph attention network (GAT) for feature extraction from the different modalities. DUALPlace introduces pin density as part of the global vision feature to enhance congestion analysis, providing a detailed view of layout congestion. Additionally, to reduce diffusion iterations, the netlist graph is partitioned with fixed macro vertices and initially placed together by an RL agent. Compared to the state-of-the-art online RL-based placement method DeepTH-finetune, DUALPlace demonstrates significant improvements of 93.58% and 64.50% in iteration and runtime efficiency, along with 4.01% and 3.65% improvements in half-perimeter wirelength (HPWL) and rectangular uniform wire density (RUDY) metrics.
6E-3
16:45-17:10

SMT-Based Optimal Transistor Folding and Placement for Standard Cell Layout Generation

*Junghyun Yoon, Heechun Park (Ulsan National Institute of Science and Technology)
Keywords
standard cell layout generation, Satisfiability Modulo Theories, transistor placement
Abstract
As semiconductor technology continues to scale, standard cell layout generation becomes increasingly challenging, yet remains critical for Design Technology Co-Optimization (DTCO). This paper presents a Satisfiability Modulo Theories (SMT)-based methodology that tightly integrates transistor folding and placement to optimize standard cell layouts. Our approach offers two key contributions: (1) a unified SMT-based framework for simultaneous transistor folding and placement, which explores folding configurations beyond mandatory constraints to achieve globally optimal placement with improved area and routability; and (2) a mono-dummy insertion strategy based on the Longest Common Subsequence (LCS) algorithm, which aligns PMOS and NMOS transistor chains to enhance diffusion sharing and further reduce layout area. Experimental results using the ASAP7 7nm PDK show that our SMT-based placement framework produces fully routable layouts with 4.12% smaller average cell area compared to manually crafted counterparts. Moreover, when combined with heuristic in-cell routing, our method successfully generates layouts for all 172 standard cells, including complex cells such as high fan-in gates and flip-flops with asynchronous reset, with placement optimization completing in 836.70 seconds. This work demonstrates that our simultaneous optimization of transistor folding and placement enables both higher layout quality and practical automation for advanced standard cell design.
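Illustrative sketch (not from the paper): the flavor of an SMT placement model can be seen in the toy Z3 program below, which places a few (possibly folded) transistors on a row without overlap while minimizing cell width. The paper's folding choices, diffusion sharing, and routability constraints are omitted, and all data are made up.

```python
# Toy Z3 (SMT) model in the spirit of transistor placement; NOT the paper's formulation.
from z3 import Int, Optimize, Or, sat

widths = {"M1": 2, "M2": 1, "M3": 2}                 # toy transistor (or folded-finger) widths
x = {name: Int(f"x_{name}") for name in widths}      # left edge of each device
cell_w = Int("cell_width")

opt = Optimize()
for name, w in widths.items():
    opt.add(x[name] >= 0, x[name] + w <= cell_w)     # stay inside the cell
names = list(widths)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        opt.add(Or(x[a] + widths[a] <= x[b],         # no horizontal overlap
                   x[b] + widths[b] <= x[a]))
opt.minimize(cell_w)
assert opt.check() == sat
m = opt.model()
print({n: m[x[n]].as_long() for n in names}, "width =", m[cell_w].as_long())
```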
6E-4
17:10-17:35

Synthesis of CFET Standard Cells Utilizing Backside Interconnects Towards Improving Pin Accessibility

*Hyunbum Park, Taewhan Kim (Seoul National University)
Keywords
standard cells, Complementary FET (CFET), backside metals, automatic cell layout generation
Abstract
Complementary FET (CFET), which stacks an n-FET (or p-FET) on top of a p-FET (or n-FET), is known to be one of the most promising technologies for next-generation transistors. However, one inherent limitation of CFET-based standard cells is reduced pin accessibility due to the reduced footprint. As a result, synthesizing CFET-based standard cells that exploit the backside metals on the wafer is essential for enhancing pin accessibility. In this regard, this work addresses the problem of automatic synthesis of CFET standard cells with maximal use of the backside metals, so that the pin accessibility on the frontside is maximally improved. Specifically, we propose optimal solutions to two subproblems: (1) pruning partial solutions in the transistor placement phase and (2) in-cell routing. For (1), we employ a graph-based exhaustive exploration with fast front/backside metal connectivity checking for pruning, while for (2), we develop a satisfiability modulo theory (SMT)-based formulation with maximal use of backside metals as the utmost priority. Through experiments, it is shown that CFET cells generated by our method use on average 24.4% less frontside metal than conventional CFET cells, while significantly reducing runtime by employing a pruning technique based on our fast metal accessibility checking. Furthermore, when chip placement and routing are performed using our CFET cells, we produce final implementations with on average 67.5% fewer DRVs and 2.1% less wirelength than those produced using conventional CFET cells.
6E-5
17:35-18:00

Standard Cell Layout Synthesis for Dual-Sided 3D-Stacked Transistors

*Kairong Guo, Haoran Lu, Rui Guo, Jiarui Wang, Chunyuan Zhao, Heng Wu, Runsheng Wang, Yibo Lin (Peking University)
Keywords
standard cell, layout synthesis, transistor-level placement and routing
Abstract
As transistor scaling approaches physical limits, Flip FET (FFET) emerges as a promising 3D-stacked transistor architecture, featuring back-to-back-stacked N/P transistors and dual-sided interconnects. This unique structure demands novel design solutions, including drain/gate merge for dual-side connectivity and flexible frontside/backside I/O pin assignment. In this paper, we propose a standard cell synthesis framework for dual-sided 3D-stacked transistors (FFETs) comprising SMT-based merge-aware placement that ensures dual-side connectivity via dynamic field drain merge insertion, and SAT-based dual-side routing supporting automated or specified I/O pin assignment. Experimental results show that our flow achieves, on average, a 4% reduction in cell area, a 4% reduction in via usage, and a 7% reduction in M0 metal usage compared to previous 3.5T FFET designs, while efficiently generating all 2^n pin assignment variants for each cell. The support for multi-row placement and FDM insertion in our flow allows it to identify layouts surpassing manual designs, such as an AOI22xp5 variant with 6.3% better performance and 4.3% lower power than the manual design. At the chip level, our generated library with all 2^n pin assignment variants can further reduce wirelength by 10% and eliminate DS-nets. These results demonstrate the effectiveness and flexibility of our framework for advanced FFET cell design.

Session 6F

(T5-E) Hardware-Software Co-Design and Optimization Frameworks
15:55-18:00 | Wednesday, January 21, 2026 | Sleeping Beauty 5
Chair(s):
Quan Chen (The Southern University of Science and Technology)
Yueting Li (Hangzhou International Innovation Institute, Beihang University)
6F-1
15:55-16:20

pHNSW: PCA-Based Filtering to Accelerate HNSW Approximate Nearest Neighbor Search

*Zheng Li, Guangyi Zeng (South China University of Technology), Paul Delestrac (KTH Royal Institute of Technology, Sweden), Enyi Yao, Simei Yang (South China University of Technology)
Keywords
HNSW, PCA Filtering, Nearest Neighbor Search, Algorithm-Hardware Co-optimization, QPS, Energy Efficiency
Abstract
Hierarchical Navigable Small World (HNSW) has demonstrated impressive accuracy and low latency for high-dimensional nearest neighbor searches. However, its high computational demands and irregular, large-volume data access patterns present significant challenges to search efficiency. To address these challenges, we introduce pHNSW, an algorithm-hardware co-optimized solution that accelerates HNSW through Principal Component Analysis (PCA) filtering. On the algorithm side, we apply PCA filtering to reduce the dimensionality of the dataset, thereby lowering the volume of neighbor accesses and decreasing the computational load of distance calculations. On the hardware side, we design the pHNSW processor with custom instructions to optimize search throughput and energy efficiency. In the experiments, we synthesized the pHNSW processor RTL design with a 65nm technology node and evaluated it using DDR4 and HBM1.0 DRAM standards. The results show that pHNSW boosts Queries per Second (QPS) by 14.47x - 21.37x on a CPU and 5.37x - 8.46x on a GPU, while reducing energy consumption by up to 57.4% compared to a standard HNSW implementation.
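Illustrative sketch (not from the paper): PCA filtering estimates distances in a low-dimensional projection to prune most candidates cheaply and re-ranks only the survivors in full dimension. The NumPy toy below shows the idea; the projection size and prune ratio are chosen arbitrarily rather than taken from pHNSW.

```python
# PCA-based candidate filtering for nearest-neighbor search (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(10000, 128)).astype(np.float32)   # toy database vectors
query = rng.normal(size=(128,)).astype(np.float32)

# Fit PCA on the base set and keep the top-16 principal components.
mean = base.mean(axis=0)
_, _, vt = np.linalg.svd(base - mean, full_matrices=False)
proj = vt[:16].T                                          # 128-D -> 16-D projection

base_low = (base - mean) @ proj
q_low = (query - mean) @ proj

# Cheap filtering pass in 16-D, then exact re-ranking of the survivors in 128-D.
approx = np.linalg.norm(base_low - q_low, axis=1)
survivors = np.argsort(approx)[:200]                      # keep the 200 best-looking candidates
exact = np.linalg.norm(base[survivors] - query, axis=1)
top10 = survivors[np.argsort(exact)[:10]]
print(top10)
```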
6F-2
16:20-16:45

Boosting Scalability and Performance: Macro Placement for Flexible 3D-Stacked ML Accelerators

*Jiawei Hu, Canlin Zhang (Georgia Institute of Technology), Pruek Vanna-Iampikul (Burapha University), Tushar Krishna, Sung Kyu Lim (Georgia Institute of Technology)
Keywords
Flexible ML accelerators, 3D integration, Physical design, Macro placement, Convex optimization, Energy-delay product, Scalability, Memory access latency
Abstract
Flexible ML accelerators promise significant advantages in mapping efficiency over rigid architectures; however, fully harnessing these benefits remains challenging due to scalability bottlenecks in physical design. In this paper, we propose a unified multi-tier 3D integration optimization framework to systematically identify and address key architectural constraints limiting scalability in area, performance, and power (PPA). Building upon these insights, our approach strategically partitions memory macros based on their architectural characteristics, leverages convex optimization for efficient macro placement, and employs simulated annealing for overlap resolution and legalization. Experimental results demonstrate that our framework achieves substantial improvements, including up to 3.4x faster runtime, a 4.4x reduction in energy-delay product, significantly reduced footprint, and decreased memory access latency, thus fully unlocking the scalability potential of flexible ML accelerators.
6F-3
16:45-17:10

AutoCT: Hybrid Compressor Tree Optimization via Reinforcement Learning with Graph Modeling

Shangshang Yao (National University of Defense Technology), *Kunlong Li (Fudan University), Li Shen, Ruoxi Wang, Jingyu Liu (National University of Defense Technology)
Keywords
Compressor Tree, Multiplier Design, Reinforcement Learning, Graph Neural Networks, Hardware Optimization
Abstract
Compressor tree optimization is a critical step in the design of high-performance arithmetic circuits, such as multipliers or Multiply-Accumulate (MAC) units, where the goal is to minimize area and delay. Traditional methods often rely on heuristic rules or manual design, which struggle to achieve optimal results across diverse multiplier scales. In this paper, we propose AutoCT, a novel framework for hybrid compressor tree optimization that leverages reinforcement learning (RL) with graph neural networks (GNNs) to model and optimize compressor trees. By representing the compressor tree as a graph and employing a deep Q-network (DQN) enhanced with graph attention networks (GAT), AutoCT dynamically selects compressor types and configurations to minimize a combined cost of area and delay. Integrated with design automation tools for synthesis feedback, AutoCT outperforms conventional approaches in terms of performance and adaptability. Experimental results demonstrate significant improvements in cost metrics, with reductions in area-delay product by up to 15% compared to baseline designs.
6F-4
17:10-17:35

Chiplet-NAS: Chiplet-aware Neural Architecture Search for Efficient AI Inference on 2.5D Integration

Cheng Guo (Arizona State University), Pragnya Sudershan Nalla, Nikhil Kumar Cherukuri (University of Minnesota Twin Cities), Rui Xue (University of Chinese Academy of Sciences), Sachin S. Sapatnekar (University of Minnesota Twin Cities), Chaitali Chakrabarti (Arizona State University), *Yu Cao (University of Minnesota Twin Cities), Jeff Zhang (Arizona State University)
Keywords
Chiplet-based systems, Neural architecture search, AI workloads
Abstract
The co-design of neural network architectures and their target chiplet-based hardware systems presents a significant challenge due to the vast and combinatorial design space. Identifying solutions that are Pareto-optimal across competing objectives of task accuracy, system latency, and power consumption requires approaches beyond manual design and brute-force methods. This paper proposes a closed-loop chiplet-aware neural architecture search (Chiplet-NAS) framework to automate the exploration and discover hardware-optimized models for efficient AI inference on 2.5D chiplet-based systems. Our framework integrates a Tree-structured Parzen Estimator (TPE) for sample-efficient search with CLAIRE, a chiplet-based library and fast performance benchmarking tool, to provide direct hardware feedback on latency and energy consumption, along with accuracy optimization. We evaluate our framework's versatility by co-designing ResNet-based model architectures. Compared to a baseline NAS that optimizes the task accuracy only, our Chiplet-NAS achieves significant power and performance benefits at iso-accuracy.
6F-5
17:35-18:00

MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

*Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin (Tsinghua University)
Keywords
Large language model, Mixture-of-experts, Offloading, Algorithm-system co-design
Abstract
Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with mixture of big-little experts. It reduces the number of experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens to guarantee model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.

Keynote Session III

Keynote Addresses
08:20-09:50 | Thursday, January 22, 2026 | Cinderella Ballroom 1/6/7/8
Jim Chang
TSMC Academician/Deputy Director
3DIC Design Methodology Development, TSMC
08:20-09:05
Keynote Address

Unlocking Hyper-Scale AI: Navigating the Future of 3DIC Design Solutions

Biography
Dr. Jim Chang leads the 3DIC design methodology development efforts in TSMC. With over two decades of semiconductor experience, Jim is a recognized expert in synthesis, physical optimization, detailed routing, and timing analysis. Prior to his current role, he spearheaded TSMC's design flow development and EDA certification program for advanced 7nm to 3nm technology nodes. His extensive background also includes R&D leadership positions at prominent EDA companies such as Plato, Cadence, Extreme DA, and Synopsys. Dr. Chang holds a Ph.D. in Electrical and Computer Engineering from the University of California, Santa Barbara.
Abstract
The era of hyper-scale AI demands a radical rethinking of 3DIC design, a paradigm shift unlocking unprecedented opportunities for architectural innovation and superior system performance. Yet, this explosion of possibility brings an exponential surge in design complexity, challenging even the most seasoned engineers.
This presentation will delve into the forefront of this revolution. We begin with a review of the TSMC 3DFabric™ family of solutions, specifically engineered to power the most advanced AI systems on the market today. We will then pivot to a comprehensive exploration of the critical 3DIC design challenges that emerge at this bleeding edge: from intricate 3D integration and feasibility assessment to robust implementation, power integrity, physical verification, thermal analysis, and substrate design optimization.
Join us to discover how a cohesive suite of solutions is forming the foundation for designing the AI systems of tomorrow - systems that will redefine what's possible. This is your essential guide to conquering complexity and harnessing the full potential of 3DIC for the next generation of intelligent machines.
Takefumi Miyoshi
Director at e-trees.Japan, Inc.
Adjunct Professor, The University of Osaka
Founder, QuEL, Inc.
09:05-09:50
Keynote Address

Design and Implementation of Control System for Quantum Computers

Biography
Dr. Takefumi Miyoshi received his Ph.D. from the Interdisciplinary Graduate School of Science and Engineering at Tokyo Institute of Technology in 2007. He is a Director at e-trees.Japan, Inc. and an Adjunct Professor at the Center for Quantum Information and Quantum Biology at The University of Osaka. Dr. Miyoshi is also one of the founders of QuEL, Inc., where he works as the CTO. His research interests include reconfigurable systems, computer architecture, compilers, and quantum computing.
Abstract
Quantum computing has advanced rapidly in recent years, raising strong expectations for its transformative impact on information processing. As quantum processors scale in size and complexity, progress in their control systems becomes increasingly essential.
This talk first introduces the roles of control systems in quantum computing and then presents our efforts in developing scalable and precise quantum computer controllers featuring high-accuracy microwave transceivers and synchronization mechanisms for reliable qubit manipulation and measurement. As we approach fault-tolerant quantum computing (FTQC), the scalability and efficiency of control electronics emerge as major challenges.
To address these challenges, compact, high-performance, and energy-efficient LSI-based control systems are required, supported by advanced design methodologies and electronic design automation (EDA) tools. Promising directions such as cryogenic CMOS integration may also open new possibilities for co-design between quantum and classical electronics. The talk will highlight how innovations in digital and analog integrated circuit design and system integration can accelerate the realization of large-scale quantum computing.

Session 7A

(T4-A) Efficient AI Model Design and Training
10:20-12:00 | Thursday, January 22, 2026 | Snow White 1
Chair(s):
Masanori Hashimoto (Kyoto University)
Chen Wu (Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo)
7A-1
10:20-10:45

StabiFreeze: Early Stopping for Training Binary Neural Networks via Internal Dynamics Stabilization

*Jisu Kim, Tae-Hwan Kim (Korea Aerospace University)
Keywords
efficient training, binary neural networks, early stopping, low resource, edge computing
Abstract
Binary neural networks (BNNs) offer significant efficiency in memory and computation during inference, making them promising for on-device applications. However, training BNNs remains computationally intensive, which limits their primary use to inference rather than on-device training. This paper proposes StabiFreeze, a layer-wise freezing framework that detects convergence in convolution and batch normalization layers using a stability-based criterion. On CIFAR-100, StabiFreeze achieves up to 67.63% reduction in total training computation, completing training with 56.75% fewer epochs, while incurring accuracy degradation of only 0.30% and 0.54% for BinaryNet and XNOR-Net, respectively. Evaluation results on Raspberry Pi 4 and Jetson Nano show that StabiFreeze reduces training time by 80.46% and 80.63%, with accuracy degradation of 0.20% and 0.25%, respectively. This demonstrates that StabiFreeze enables efficient and practical BNN training in resource-constrained edge scenarios.
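As a rough illustration of layer-wise freezing driven by a stability criterion, the sketch below stops updating a layer's parameters once their relative change stays under a threshold for several consecutive epochs. The metric, threshold, and patience window are assumptions for illustration, not the StabiFreeze criterion itself.

```python
# Minimal sketch of stability-based layer-wise freezing (illustrative assumptions only).
import torch
import torch.nn as nn

class FreezeMonitor:
    def __init__(self, model, threshold=1e-3, patience=3):
        self.prev = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.stable = {n: 0 for n in self.prev}
        self.threshold, self.patience = threshold, patience

    def step(self, model):
        """Call once per epoch; freezes parameters whose relative change stays small."""
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            change = (param.detach() - self.prev[name]).norm() / (self.prev[name].norm() + 1e-12)
            self.prev[name] = param.detach().clone()
            self.stable[name] = self.stable[name] + 1 if change < self.threshold else 0
            if self.stable[name] >= self.patience:
                param.requires_grad_(False)        # stop training this converged parameter

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
monitor = FreezeMonitor(model)
monitor.step(model)                                # invoked after each training epoch
```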
7A-2
10:45-11:10

Constrained NAS via Symbolic Expressions in Declarative Hierarchical Search Spaces

*Moritz Reiber, Christoph Gerum, Oliver Bringmann (University of Tübingen)
Keywords
neural architecture search, search space design, design space exploration, constraints, hardware-aware NAS
Abstract
Neural Architecture Search (NAS) automates the design of deep neural networks (DNNs), but the design of the search space remains crucial: manually designed spaces require significant engineering effort, while overly flexible designs often lead to invalid or inefficient architectures. This paper introduces a novel NAS search space design that centers on symbolic constraint modeling, enabling fine-grained parametrization while ensuring architecture validity and resource efficiency. By representing parameter dependencies as symbolic expressions, our method supports automatic resolution of interdependent attributes and the specification of hard constraints, such as limits on parameter count or MAC operations, directly within the search space. This mechanism allows efficient exploration of valid architectures under strict deployment budgets. The search space itself is constructed declaratively using hierarchical, composable topology patterns, drawing from common DNN motifs and enabling intuitive and scalable definition. We demonstrate the effectiveness of our approach through evolutionary NAS under multiple resource constraints, showing that symbolic constraint enforcement improves search efficiency and robustness without sacrificing accuracy.
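The snippet below shows, in a much simplified form, how symbolic expressions can tie architecture parameters together and enforce a hard MAC budget during search. The symbols and the 50M-MAC limit are illustrative assumptions, not the paper's declarative search-space format.

```python
# Sketch of symbolic constraint checking for a NAS search space (illustrative only).
import sympy as sp

c_in, c_out, k, h, w = sp.symbols("c_in c_out k h w", positive=True, integer=True)

params = c_in * c_out * k**2          # parameter count of one convolution
macs = params * h * w                 # MACs for that convolution on an h x w feature map
budget = sp.Le(macs, 50_000_000)      # hard deployment constraint

candidate = {c_in: 64, c_out: 128, k: 3, h: 56, w: 56}
print("MACs:", macs.subs(candidate))
print("within budget:", bool(budget.subs(candidate)))
```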
7A-3
11:10-11:35

HERO: Hardware-Efficient RL-based Optimization Framework for NeRF Quantization

*Yipu Zhang, Chaofang Ma, Jinming Ge (The Hong Kong University of Science and Technology), Lin Jiang (Northeastern University), Jiang Xu (The Hong Kong University of Science and Technology (Guangzhou)), Wei Zhang (The Hong Kong University of Science and Technology)
Keywords
Neural Radiance Field (NeRF), Quantization, Hardware-software co-design, Reinforcement learning
Abstract
Neural Radiance Field (NeRF) has emerged as a promising 3D reconstruction method, delivering high-quality results for AR/VR applications. While quantization methods and hardware accelerators have been proposed to enhance NeRF's computational efficiency, existing approaches face crucial limitations. Current quantization methods operate without considering hardware architecture, resulting in sub-optimal solutions within the vast design space encompassing accuracy, latency, and model size. Additionally, existing NeRF accelerators heavily rely on human experts to explore this design space, making the optimization process time-consuming, inefficient, and unlikely to discover optimal solutions. To address these challenges, we introduce HERO, a reinforcement learning framework performing hardware-aware quantization for NeRF. Our framework integrates a NeRF accelerator simulator to generate real-time hardware feedback, enabling fully automated adaptation to hardware constraints. Experimental results demonstrate that HERO achieves 1.31-1.33x better latency, 1.29-1.33x improved cost efficiency, and a more compact model size compared to CAQ, a previous state-of-the-art NeRF quantization framework. These results validate our framework's capability to effectively navigate the complex design space between hardware and algorithm requirements, discovering superior quantization policies for NeRF implementation. Our code will be available upon the paper's acceptance.
7A-4
11:35-12:00

Gundam: A Generalized Unified Design and Analysis Model for Matrix Multiplication on Edge

Quan Cheng, Haoyuan Li, Weirong Dong (Kyoto University), Mingqiang Huang, Longyang Lin (Southern University of Science and Technology), *Masanori Hashimoto (Kyoto University)
Keywords
Resource Analysis, PE array modeling, AI accelerator configuration, Matrix Multiplication
Abstract
Matrix multiplication is the core operation in many edge AI applications, yet its efficient implementation requires balancing compute throughput with strict area and resource constraints. To address this, we propose Gundam, a generalized unified design and analysis model that enables agile and structured estimation and configuration of AI accelerator architectures. Gundam provides analytical modeling of matrix operations, supporting rapid evaluation of processing element counts, data reuse patterns, buffer sizing, and computation latency under diverse hardware constraints. Unlike conventional models, Gundam jointly captures both hardware mapping and dataflow behaviors within a unified model, facilitating fast and resource-aware design space exploration. To validate its accuracy and utility, we apply Gundam to guide accelerator generation across 16nm, 22nm, and 28nm process nodes. Results show that Gundam's estimated configurations differ by less than 6% from post-layout implementations, while automatically identifying optimal AI accelerator configurations under fixed resource constraints. Gundam offers a lightweight yet powerful tool for early-stage deployment and optimization of matrix processors for edge-AI applications.
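For a flavor of this kind of analytical estimation, the snippet below gives a first-order cycle count for a tiled matrix multiply on a PE array. The one-MAC-per-PE-per-cycle assumption and simple tiling are illustrative; Gundam's model additionally covers data reuse patterns, buffer sizing, and memory behavior.

```python
# First-order analytical latency estimate for a tiled matrix multiply on a PE array.
import math

def matmul_latency(M, N, K, pe_rows, pe_cols, freq_ghz=1.0):
    """Estimate cycles for C[M,N] = A[M,K] @ B[K,N] on a pe_rows x pe_cols array."""
    tiles = math.ceil(M / pe_rows) * math.ceil(N / pe_cols)   # output tiles to compute
    cycles = tiles * K                                        # K accumulation steps per tile
    return cycles, cycles / (freq_ghz * 1e9)                  # cycles and seconds

cycles, seconds = matmul_latency(M=512, N=512, K=512, pe_rows=16, pe_cols=16)
print(f"{cycles} cycles, about {seconds * 1e6:.1f} us at 1 GHz")
```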

Session 7B

(T2-B) Designing Predictable and Reliable Real-Time Systems
10:20-12:00 | Thursday, January 22, 2026 | Snow White 2
Chair(s):
Zhenyu Yan (The Chinese University of Hong Kong)
Zhiding Liang (The Chinese University of Hong Kong)
7B-1
10:20-10:45

TTI: An Instruction Set Supporting Priority-Inversion-Free Time-Triggered Preemptive Scheduling in Real-Time Embedded Systems

*Yinkang Gao, Yixuan Zhu, Bo Zhang, Haoyuan Ren, Cheng Tang, Xi Li (University of Science and Technology of China)
Keywords
Time-Triggered Scheduling, Priority Inversion, Instruction Set Architecture
Abstract
In real-time embedded systems, traditional timer-based time-triggered preemptive scheduling suffers from priority inversion, where the release of a lower-priority task at its activation time interferes with the execution of a higher-priority task. This not only complicates the worst-case response time (WCRT) analysis for higher-priority tasks but also increases their actual WCRT, thereby degrading predictability and schedulability. Addressing this challenge requires enabling the processor to be aware of the priority of tasks reaching their activation time relative to the currently executing task and to make scheduling decisions accordingly. Since the instruction set serves as the software/hardware interface, the most direct approach to support this capability is to incorporate timing- and priority-aware semantics into the instruction set architecture. In this paper, we propose a novel instruction set, the Time-Triggered Instruction Set (TTI), which introduces priority-aware timed operations to allow the processor to perform operations based on time and priority. We then design a TTI-supported hardware microarchitecture and develop TTI-based priority-inversion-free time-triggered preemptive scheduling. Experimental results demonstrate that, compared to timer-based scheduling, TTI-based scheduling achieves lower and more stable WCRT for high-priority tasks, improving predictability and schedulability.
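The pseudocode-style sketch below conveys the intended release semantics: a task reaching its activation time is admitted to execution only if its priority exceeds that of the running task, so lower-priority releases cannot disturb a higher-priority task. The queue layout and the smaller-value-is-higher-priority convention are assumptions for illustration, not the TTI instruction encoding.

```python
# Priority-aware timed release decision (conceptual sketch, not the TTI ISA).
import heapq

def on_timer_tick(now, release_queue, ready_queue, running):
    """release_queue: heap of (activation_time, priority, task_id); running: (priority, task_id) or None."""
    while release_queue and release_queue[0][0] <= now:
        _, prio, task = heapq.heappop(release_queue)
        if running is None or prio < running[0]:
            if running is not None:
                heapq.heappush(ready_queue, running)      # preempt only for a higher-priority release
            running = (prio, task)
        else:
            heapq.heappush(ready_queue, (prio, task))     # defer the release: no priority inversion
    return running

releases = [(5, 2, "tau_low"), (5, 0, "tau_high")]
heapq.heapify(releases)
print(on_timer_tick(now=5, release_queue=releases, ready_queue=[], running=(1, "tau_mid")))
```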
7B-2
10:45-11:10

DRAPP: An end-to-end Latency Evaluation Tool for Containerized ROS Applications

*Sicheng Guan (Northeastern University), Haiyang Wang, Jinghao Sun (Dalian University of Technology), Qingxu Deng (Northeastern University)
Keywords
Container-Crossed Evaluation, Edge Computing, ROS 2
Abstract
The advent of Software-Defined Vehicles (SDVs) has revolutionized the automotive industry by enabling rapid innovation through the integration of software-driven functionalities. The Scalable Open Architecture for Embedded Edge (SOAFEE) framework, an innovative open software architecture, integrates cloud-native methodologies into in-vehicle environments. By leveraging this framework, the containerization of Robot Operating System (ROS) applications has gained widespread adoption for deploying and maintaining autonomous driving applications. However, real-time performance remains a critical challenge, particularly in addressing end-to-end (E2E) latency within ROS systems to ensure responsiveness and timely task execution. Existing tools, such as the Chain-Aware ROS Evaluation Tool (CARET), have contributed significantly to real-time performance evaluation. Nevertheless, these tools exhibit limitations when assessing E2E latency in containerized environments involving multiple containers. To address this gap, we introduce the Distributed ROS Application Performance Profiler (DRAPP), a solution that builds on CARET with enhancements, designed specifically to evaluate E2E latency in ROS-based applications. Our approach includes experiments using Autoware, deployed across multiple containers within the SOAFEE framework, with a primary focus on analyzing E2E processing latency. Preliminary empirical results demonstrate that DRAPP outperforms existing tools in both usability and effectiveness for latency evaluation, representing a significant step forward in the performance assessment of in-vehicle software for autonomous systems.
7B-3
11:10-11:35

LaPOD: Latency Prediction for Real-Time LiDAR Object Detection

Wenjing Xie, Tianchi Ren (City University of Hong Kong), Jen-Ming Wu (Hong Hai Research Institute), Chun Jason Xue (Mohamed bin Zayed University of Artificial Intelligence), *Nan Guan (City University of Hong Kong)
Keywords
Latency prediction, LiDAR point cloud, Object detection
Abstract
LiDAR-based object detection is widely used in real-time systems such as autonomous driving vehicles and robotics. In these applications, it is essential to achieve high accuracy while ensuring detection is completed within the timing constraints. However, the latency of LiDAR-based object detection is highly variable and unpredictable, making it difficult to meet the timing constraints. To address this challenge, an essential capability is accurately predicting the inference latency of the LiDAR-based object detection algorithm before execution begins. Intuitively, the processing time of LiDAR-based object detection depends on the input point cloud size. However, our experiments reveal that even with the same number of input points, the processing time can still vary significantly. We analyze this phenomenon and find that the spatial distribution of points also plays a critical role in determining the inference latency. Based on these insights, we propose LaPOD, a lightweight latency predictor that rapidly estimates inference latency based on point cloud inputs. Specifically, we develop a representation to capture key features of the point cloud that influence inference latency. Moreover, LaPOD can be quickly adapted to different hardware platforms through fine-tuning with a limited number of samples. We evaluate LaPOD on the widely-used KITTI dataset, demonstrating its effectiveness and robustness.
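As an illustration of the kind of input representation such a predictor can use, the snippet below derives both a size feature (point count) and simple spatial-distribution features (occupied-voxel count, bounding-box extent) from a point cloud. The specific features and voxel size are assumptions, not LaPOD's representation.

```python
# Sketch of size and spatial-distribution features for latency prediction
# from a LiDAR point cloud (illustrative feature choice only).
import numpy as np

def cloud_features(points, voxel=0.5):
    """points: (N, 3) array of x/y/z coordinates in meters."""
    n_points = len(points)
    voxel_ids = np.floor(points / voxel).astype(np.int64)
    occupied = len(np.unique(voxel_ids, axis=0))          # how widely the points spread
    extent = points.max(axis=0) - points.min(axis=0)      # bounding-box size per axis
    return np.array([n_points, occupied, *extent])

rng = np.random.default_rng(1)
cloud = rng.uniform(-40.0, 40.0, size=(20000, 3))
print(cloud_features(cloud))
# A lightweight regressor (e.g., a small MLP) would map such features to latency.
```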
7B-4
11:35-12:00

Flowmap Overapproximation for Linear Time-Varying Stochastic Systems

*Xin Chen (University of New Mexico)
Keywords
Formal verification, Reachability analysis, Time-varying system, Stochastic differential equation
Abstract
We present an approach to compute overapproximate abstractions for cyber-physical systems defined by linear time-varying stochastic differential equations. The abstractions are represented by a set of univariate Taylor models that only overapproximate the reachability mapping of the system and are independent of the system's initial state and constant parameters. Based on the Taylor model abstractions, a guaranteed upper bound for the probability of reaching an unsafe set can be obtained efficiently, and the approach can be used to investigate the probabilistic safety and robustness of the system. We evaluate the effectiveness of our approach based on a group of challenging benchmarks and compare it with the state of the art.

Session 7C

(T11-B) Scaling up Physical Design Optimization to Heterogeneous Systems
10:20-11:35 | Thursday, January 22, 2026 | Snow White 3
Chair(s):
Senling Wang (Ehime University)
Michihiro Shintani (Kyoto Institute of Technology)
7C-1
10:20-10:45

Timing-Aware Optimization of Die-Level Routing and TDM Assignment for Multi-FPGA Systems

*Yijun Chen (Beijing University of Posts and Telecommunications), Haoyuan Li, Chunyan Pei (Tsinghua University), Jianwang Zhai, Kang Zhao (Beijing University of Posts and Telecommunications), Wenjian Yu (Tsinghua University)
Keywords
Timing path, Timing-aware, Die-level, System routing, TDM assignment
Abstract
The escalating scale and complexity of modern circuits demand multi-FPGA emulation platforms that incorporate multi-die architectures. However, most existing routers remain FPGA-level, optimizing wire-length or total Time-Division Multiplexing (TDM) ratios while disregarding die-level load imbalance and path-level slack. The result is suboptimal performance and timing violations. This paper introduces the first timing-aware co-optimization framework for die-level routing and TDM assignment, explicitly linking physical constraints to critical path slack. The proposed flow features a timing-aware load-balanced die-level router with timing path compression and a timing graph-based TDM assignment. Experiments on industrial designs show that the proposed flow improves the worst-path slack by 98% over existing methods.
7C-2
10:45-11:10

Φ-BO: Physics-Informed Bayesian Optimization for Multi-Port Decoupling Capacitor Placement in 2.5-D Chiplets

Quansen Wang (Beihang University), *Yuchuan Lin (Wuhan University of Technology), Zhuohua Liu (Beihang University), Wei Xing (The University of Sheffield), Wei Zhang (The Hong Kong University of Science and Technology), Ning Xu (Wuhan University of Technology), Yuanqing Cheng (Beihang University)
Keywords
Power Distribution Network Optimization, Decoupling Capacitor Placement, 2.5D Chiplet, Bayesian Optimization, Multi-Port Aware Transformation
Abstract
Power distribution network (PDN) optimization in 2.5-D chiplet architectures represents a critical bottleneck as designs scale to 100+ integrated chiplets, where decoupling capacitor placement becomes a multi-port optimization challenge requiring millions of expensive electromagnetic (EM) simulations. Current state-of-the-art (SOTA) methods—from genetic algorithms (GA) to reinforcement learning (RL)—treat PDN as black-box functions, failing to exploit inherent physical structure and scaling exponentially with problem complexity. We introduce Φ-BO, the first physics-informed Bayesian optimization (BO) framework specifically designed for multi-port decoupling capacitor placement in 2.5-D chiplet PDN. Our key innovation systematically integrates EM field theory into machine learning (ML) optimization through novel spatial feature transformations and Multi-Port Aware Transformation (MPAT), enabling a paradigm shift from black-box to physics-aware optimization. This approach captures spatial dependencies and port coupling effects, dramatically reducing effective problem dimensionality while enabling intelligent exploration of discrete placement configurations. Demonstrated on a 22-chiplet RISC-V processor design, Φ-BO achieves a 23% impedance improvement and 3x faster convergence compared to SOTA methods.
7C-3
11:10-11:35

Automated Parameter Tuning for Multi-FPGA Partitioning: A Preference-Guided Approach

*Yutao Dai (Beihang University), Shengbo Tong, Chunyan Pei (Tsinghua University), Zhuohua Liu (Beihang University), Wei Xing (University of Sheffield), Yi Liu (Beihang University), Rui Wang (Beihang University), Wenjian Yu (Tsinghua University)
Keywords
Preference Bayesian Optimization, Multi-FPGA partitioning, Parameter tuning, Design automation
Abstract
Parameter tuning for multi-FPGA partitioning algorithms represents a bottleneck in modern chip emulation and verification workflows. Current multilevel partitioning tools require manual configuration of various parameters, where each evaluation can take tens of seconds to minutes, making exhaustive search impractical and expert-driven tuning both time-consuming and suboptimal. To automate this process, we propose a preference-guided Bayesian optimization framework specifically designed for industrial FPGA partitioning parameter tuning under limited evaluation budgets. Our approach maximizes the minimum timing slack by incorporating domain-specific insights: we exploit the strong correlation between cutsize and timing performance through a priority-based ranking scheme that guides a pairwise Gaussian process to learn configuration preferences. Additionally, we introduce a kernel input transformation that properly handles the mixed discrete-continuous parameter space typical in EDA tools. Our method converges faster with fewer evaluations and achieves the best timing slack in 60-70% of cases on industrial circuit benchmarks compared to existing methods including standard Bayesian optimization, quasi-random sampling, and state-of-the-art preference learning techniques. The proposed framework reduces parameter tuning from days of manual effort to hours of automated optimization, offering practitioners a deployment-ready solution that improves both design quality and engineering productivity.
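A toy version of the cutsize-to-preference idea is sketched below: observed cutsizes are turned into (winner, loser) pairs that a pairwise surrogate, such as a preference Gaussian process, can learn from. The simple all-pairs rule and the parameter names are assumptions, not the paper's priority-based ranking scheme.

```python
# Toy construction of preference pairs from cutsize observations (illustrative only).
def preference_pairs(num_configs, cutsizes):
    """Return (winner, loser) index pairs; a lower cutsize is preferred."""
    pairs = []
    for i in range(num_configs):
        for j in range(i + 1, num_configs):
            if cutsizes[i] != cutsizes[j]:
                winner, loser = (i, j) if cutsizes[i] < cutsizes[j] else (j, i)
                pairs.append((winner, loser))
    return pairs

configs = [{"coarsen_ratio": 0.4, "refine_iters": 10},
           {"coarsen_ratio": 0.6, "refine_iters": 20},
           {"coarsen_ratio": 0.5, "refine_iters": 15}]
print(preference_pairs(len(configs), cutsizes=[1320, 1180, 1245]))
```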

Session 7D

(SS-5) From Solvers to Layout: Cross-Layer Approaches for Reliable and Scalable IC Design
10:20-12:00 | Thursday, January 22, 2026 | Sleeping Beauty 1/2
Chair(s):
Tony Geng (Rice University)
7D-1
10:20-10:45

Advancing General Sparse Linear-Equation Solvers via Nested-Dissection-Based Parallel Scheduling and Randomized Linear Algebra

*Wenjian Yu, Jiawen Cheng, Baiyu Chen (Tsinghua University)
Keywords
Circuit Simulation, Sparse Matrix, Parallel LU Factorization, Random Embedding, GMRES Algorithm
Abstract
Sparse linear-equation solvers are indispensable to circuit simulation. They provide the mathematical engine that faithfully forecasts the dynamic response of analog circuits. In this invited paper, we present two novel techniques to speed up these solvers. The first is a parallel LU factorization driven by a new task scheduling strategy. Based on the nested dissection approach for matrix reordering, we derive a task assignment/scheduling strategy which largely reduces synchronization and exposes more parallelism. Thus, a more efficient parallel sparse LU factorization algorithm (named SubtreeLU) is obtained. It outperforms both PARDISO and CKTSO in computational speed while maintaining similar robustness. The second technique is a practical randomized GMRES algorithm. By implementing the Gram-Schmidt process with an extremely efficient random sketched linear-least-squares kernel, we obtain a fast randomized Arnoldi procedure that orthonormalizes the Krylov subspace basis. Coupled with on-the-fly residual-error estimates, this yields a practical randomized GMRES that is provably stable and runs remarkably faster than the standard GMRES on a wide range of circuit and field simulation benchmarks.
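The core randomized ingredient, a sketched linear-least-squares solve, can be illustrated in a few lines: a random embedding compresses the tall problem before an ordinary solve. The dense Gaussian sketch below is purely for clarity; the paper's kernel and its use inside the Arnoldi process are considerably more efficient.

```python
# Minimal illustration of a randomly sketched least-squares solve (toy Gaussian sketch).
import numpy as np

def sketched_lstsq(A, b, sketch_rows, seed=0):
    """Approximately solve min ||Ax - b|| via a random embedding S."""
    m = A.shape[0]
    S = np.random.default_rng(seed).standard_normal((sketch_rows, m)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)     # solve the much smaller sketched problem
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((5000, 50))
b = rng.standard_normal(5000)
x_sketch = sketched_lstsq(A, b, sketch_rows=400)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))  # close to 1
```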
7D-2
10:45-11:10

HeteroSTA: A CPU-GPU Heterogeneous Static Timing Analysis Engine with Holistic Industrial Design Support

Zizheng Guo, Haichuan Liu, Xizhe Shi, Shenglu Hua, Zuodong Zhang, Chunyuan Zhao (Peking University), Runsheng Wang, *Yibo Lin (Peking University; Beijing Advanced Innovation Center for Integrated Circuits)
Keywords
HeteroSTA
Abstract
In this paper, we introduce HeteroSTA, the first CPU-GPU heterogeneous timing analysis engine that efficiently supports: (1) a set of delay calculation models providing versatile accuracy-speed choices without relying on an external golden tool, (2) robust support for industry formats, especially .sdc constraints containing all common timing exceptions, clock domains, and case analysis modes, and (3) end-to-end GPU acceleration for both graph-based and path-based timing queries, all exposed as a zero-overhead flattened heterogeneous application programming interface (API). HeteroSTA is publicly available as both a standalone binary executable and an embeddable shared library targeting ubiquitous academic and industry applications. Example use cases as a standalone tool, a timing-driven DREAMPlace 4.0 integration, and a timing-driven global routing integration have all demonstrated remarkable runtime speed-up and comparable quality.
7D-3
11:10-11:35

IR Drop-Aware ECO: A Fast Approach to Minimize Layout and Timing Disturbance

Jingchao Hu (Zhejiang University), Yibo Lin (Peking University), Hao Yu, *Quan Chen (Southern University of Science and Technology), Zhou Jin, Cheng Zhuo (Zhejiang University)
Keywords
IR-Drop, ECO, Timing, Cell Displacement
Abstract
Ensuring power integrity in advanced IC design is increasingly challenging, as excessive IR drop can severely impact circuit performance and reliability, especially during the late-stage Engineering Change Order (ECO) process. In this work, we propose a novel IR drop-aware ECO framework that addresses IR drop violations through targeted cell displacement while minimizing timing and layout disruption. Our approach incorporates vertical IR drop mitigation and horizontal timing fix, and employs a rail severity scoring mechanism that combines current correlation and spatial proximity to evaluate IR drop severity. Experimental results on three post-routed benchmark designs demonstrate that our method achieves significant reductions in worst-case dynamic voltage drop for certain designs and mitigates local timing degradation. Additionally, the proposed severity score accurately reflects trends in IR drop risk, providing valuable guidance for ECO optimization.
7D-4
11:35-12:00

iPCL: Pre-training for Chip Layout

*Xingquan Li, Weiguo Li (Pengcheng Laboratory), Xinhua Lai (University of Chinese Academy of Sciences), Junfeng Liu (Pengcheng Laboratory), Rui Wang (Shenzhen University)
Keywords
Chip layout, pre-training, foundation model, symbol representation, placement and routing
Abstract
As chip complexity continues to grow, traditional rule-based EDA tools face increasing challenges in optimizing power, performance, and area (PPA). There is a pressing need for a foundation model in chip layout to leverage large-scale historical design data for unified and intelligent physical design. We propose iPCL (pre-training for chip layout), a comprehensive framework that integrates placement and routing generation, metric evaluation and optimization. iPCL consists of layout symbolization, multimodal pre-training, solution generation and selection, and post-processing stages, forming a scalable and automated design pipeline. iPCL reduces design iteration time, supports multi-layout generation, and automatically selects optimal solutions through lightweight ECO refinement. Two versions are developed: iPCL-R, which generates routing layouts with performance comparable to commercial tools while reducing design time by 55.7%; and iPCL-M, which delivers 336x faster and more accurate metric evaluation than open-source EDA, achieving about 5% better optimization results than commercial tools.

Session 7E

(T7-B) Emerging Computing Architectures and Learning Systems
10:20-12:00 | Thursday, January 22, 2026 | Sleeping Beauty 3
Chair(s):
Ke Chen (NUAA)
Hao Geng (ShanghaiTech University)
7E-1
10:20-10:45

NEURAL: An Elastic Neuromorphic Architecture with Hybrid Data-Event Execution and On-the-fly Attention Dataflow

*Yuehai Chen, Farhad Merchant (Bernoulli Institute and CogniGron, University of Groningen)
Keywords
Spiking neural network, elastic computing, sparsity-aware, knowledge distillation, spiking transformer
Abstract
Spiking neural networks (SNNs) have emerged as a promising alternative to artificial neural networks (ANNs), offering improved energy efficiency by leveraging sparse and event-driven computation. However, existing hardware implementations of SNNs still suffer from inherent spike sparsity and multi-timestep execution, which significantly increase latency and reduce energy efficiency. This study presents NEURAL, a novel neuromorphic architecture based on a hybrid data-event execution paradigm that decouples sparsity-aware processing from neuron computation and uses elastic first-in-first-out (FIFO) buffers. NEURAL supports on-the-fly execution of spiking QKFormer by embedding its operations within the baseline computing flow without requiring dedicated hardware units. It also integrates a novel window-to-time-to-first-spike (W2TTFS) mechanism to replace average pooling and enable full-spike execution. Furthermore, we introduce a knowledge distillation (KD)-based training framework to construct single-timestep SNN models with competitive accuracy. NEURAL is implemented on a Xilinx Virtex-7 FPGA and evaluated using ResNet-11, QKFResNet-11, and VGG-11. Experimental results show that NEURAL achieves accuracy improvements of up to 3.2% and 5.13% on the VGG-11 model using CIFAR-10 and CIFAR-100, respectively, while reducing resource usage by 50% and delivering up to 1.97x higher energy efficiency compared to existing SNN accelerators.
7E-2
10:45-11:10

Mask-based Meta-Learning for Stuck-at Faults Tolerance in ReRAM Computing Systems

*Zhan Shen, Yi Qi Zhou, Tian Yi Xu, Shan Shen, Zhen Mei, Da Ying Sun (Nanjing University of Science and Technology), Zi Chen Zhang (Beihang University)
Keywords
ReRAM crossbar array, Stuck-at Faults (SAFs), Meta-Learning, Robustness, Generalization, Neural Network
Abstract
ReRAM crossbar-based computing-in-memory (CIM) systems offer computational efficiency but suffer from significant accuracy degradation under stuck-at faults (SAFs). Conventional approaches like retraining-based methods fail to effectively generalize across diverse SAFs ratios. To address this challenge, we propose the Mask-based Meta Learning (MML) framework, leveraging meta-learning's multi-task generalization capability to achieve robust performance across varying SAFs scenarios. First, within the MML framework, a SAFs mask-based task formulation creates meta-learning tasks from SAFs masks without dataset dependency. Second, we define a new meta-learning objective by integrating different SAFs masks into the meta loss. Finally, we develop a SAFs-sensitivity-guided weight importance search algorithm that dynamically expands the adjustment range of crucial weights using the ReRAM array's redundant cells to further enhance model performance. Experimental results show that our method demonstrates superior robustness and generalization performance over state-of-the-art robust training approaches. Moreover, our method maintains high accuracy across various SAFs ratios, whereas the accuracy of other approaches fluctuates widely.
7E-3
11:10-11:35

LMESN: A Leakage-Driven MOSFET Reservoir for Scalable and Ultra-Low-Power Temporal Inference

*Haoyuan Li (Xi'an Jiaotong University, Kyoto University), Masami Utsunomiya, Ryuto Seki, Quan Cheng, Weirong Dong (Kyoto University), Feng Liang (Xi'an Jiaotong University), Takashi Sato (Kyoto University)
Keywords
Reservoir Computing, Echo State Network, Hardware Reservoir, Leakage-Current MOSFETs, Hardware-Software Co-Optimization
Abstract
Edge-based temporal inference demands energy-efficient and scalable computing architectures, but existing analog reservoir computing models often face high energy costs and limited reconfigurability. We present LMESN, a leakage-current-driven, pulse-based reservoir computing architecture that exploits intrinsic threshold-voltage variation in standard CMOS to realize ultra-low-power stochastic dynamics. To overcome physical array size constraints, we propose a Shift-Multi-Mask (SMM) technique that emulates large virtual reservoirs through cyclic mask shifts, reducing update energy by over 100x and enabling single-cycle reconfiguration. To further boost task-level performance, we develop a hardware-software co-optimization framework that jointly tunes the ADC quantization range and reservoir mask structure via a discrete genetic algorithm. Post-layout simulations in 22 nm CMOS and evaluations on eight time-series datasets demonstrate up to 13.7% accuracy improvement and 5x variance reduction over unoptimized LMESN baselines. Compared to prior analog and neural network models, LMESN achieves 3-7 orders of magnitude lower energy consumption while delivering competitive or superior accuracy. Together, these innovations make LMESN a scalable, energy-efficient, and task-adaptive platform for edge temporal processing, setting a new direction in physical reservoir computing.
7E-4
11:35-12:00

BIHDC: A Retrainable Fully-Binary Hyperdimensional Computing Accelerator for Edge FPGAs

*Changzhen Han, Ke Chen (Nanjing University of Aeronautics and Astronautics), Bi Wu (Nanjing University of Aeronautics and Astronautics (NUAA)), Chenggang Yan, Weiqiang Liu (Nanjing University of Aeronautics and Astronautics)
Keywords
Hyperdimensional Computing, Image Classification, Edge Computing, Hardware Acceleration
Abstract
Hyperdimensional Computing (HDC) is a lightweight machine learning paradigm characterized by low complexity, efficient learning, and strong interpretability, making it well suited for edge intelligence. However, its accuracy on 2D image tasks still lags behind deep neural networks (DNNs), and existing HDC hardware often relies on in-memory computing or high-end FPGAs, which fail to meet the strict low-power and small-area requirements of edge devices and generally lack complete retraining capabilities. To address these limitations, we propose BIHDC, a lightweight HDC framework that introduces a learnable preprocessing scheme to enhance feature extraction and implements an edge FPGA-based fully binary accelerator supporting end-to-end processing, including preprocessing, encoding, training, retraining, and testing. Experimental results show that BIHDC improves classification accuracy by up to 5% compared to baseline HDC with binarized encoding and by 1.5% over baseline HDC with original encoding across four image datasets. Implemented on the Xilinx Zynq-7000 platform, BIHDC reduces resource utilization and power consumption by 90% and 70%, respectively, providing an efficient and scalable solution for resource-constrained edge applications.

Session 7F

(T6-A) RF and Photonic IC
10:20-12:00 | Thursday, January 22, 2026 | Sleeping Beauty 5
Chair(s):
Yuanqing Cheng (Beihang University)
Yibo Lin (Peking University)
7F-1
10:20-10:45

MOTIF-RF: Multi-template On-chip Transformer Synthesis Incorporating Frequency-domain Self-transfer Learning for RFIC Design Automation

Houbo He, Yizhou Xu, Lei Xia, Yaolong Hu (Rice University), Fan Cai (Keysight Technologies), *Taiyun Chi (Rice University)
Keywords
inverse design, surrogate model, transfer learning, impedance matching, transformer (XFMR), UNet, CNN, graph transformer
Abstract
This paper presents a systematic study on developing multi-template machine learning (ML) surrogate models and applying them to the inverse design of transformers (XFMRs) in radio-frequency integrated circuits (RFICs). Our study starts with benchmarking four widely used ML architectures, including MLP-, CNN-, UNet-, and GT-based models, using the same datasets across different XFMR topologies. To improve modeling accuracy beyond these baselines, we then propose a new frequency-domain self-transfer learning technique that exploits correlations between adjacent frequency bands, leading to ~30%-50% accuracy improvement in the S-parameters prediction. Building on these models, we further develop an inverse design framework based on the covariance matrix adaptation evolutionary strategy (CMA-ES) algorithm. This framework is validated using multiple impedance-matching tasks, all demonstrating fast convergence and trustworthy performance. These results advance the goal of AI-assisted “specs-to-GDS” automation for RFICs and provide RFIC designers with actionable tools for integrating AI into their workflows.
7F-2
10:45-11:10

Efficient RF Passive Components Modeling with Bayesian Online Learning and Uncertainty Aware Sampling

*Huifan Zhang, Pingqiang Zhou (ShanghaiTech University)
Keywords
RF Modeling, Bayesian Neural Networks, Online Learning, Vector Fitting
Abstract
Conventional radio frequency (RF) passive component modeling based on machine learning requires extensive electromagnetic (EM) simulations to cover geometric and frequency design spaces, creating computational bottlenecks. In this paper, we introduce an uncertainty-aware Bayesian online learning framework for efficient parametric modeling of RF passive components, which includes: 1) a Bayesian neural network with reconfigurable heads for joint geometric-frequency domain modeling while quantifying uncertainty; 2) an adaptive sampling strategy that simultaneously optimizes training data sampling across geometric parameters and the frequency domain using uncertainty guidance. Validated on three RF passive components, the framework achieves accurate modeling while using only 2.86% of the EM simulation time of a traditional ML-based flow, achieving a 35x speedup.
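The loop below gives a toy picture of uncertainty-guided adaptive sampling: a bootstrap ensemble stands in for the Bayesian neural network, and the next EM simulation is requested at the frequency where the ensemble disagrees most. The polynomial surrogate, the pool size, and the sin-based stand-in for the EM solver are illustrative assumptions only.

```python
# Toy uncertainty-guided adaptive sampling loop (bootstrap ensemble as a stand-in
# for a Bayesian neural network; all components here are illustrative).
import numpy as np

rng = np.random.default_rng(0)

def simulate(f):
    """Stand-in for an expensive EM simulation at frequency f."""
    return np.sin(f) + 0.05 * rng.standard_normal()

freq_pool = np.linspace(0.0, 10.0, 200)                    # candidate frequency points
sampled_f = list(rng.choice(freq_pool, 8, replace=False))
sampled_y = [simulate(f) for f in sampled_f]

for _ in range(20):                                        # adaptive sampling loop
    preds = []
    for _ in range(16):                                    # bootstrap ensemble of cubic fits
        idx = rng.integers(0, len(sampled_f), len(sampled_f))
        coef = np.polyfit(np.array(sampled_f)[idx], np.array(sampled_y)[idx], deg=3)
        preds.append(np.polyval(coef, freq_pool))
    uncertainty = np.std(preds, axis=0)
    f_next = freq_pool[np.argmax(uncertainty)]             # query the most uncertain frequency
    sampled_f.append(f_next)
    sampled_y.append(simulate(f_next))
print(f"collected {len(sampled_f)} samples")
```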
7F-3
11:10-11:35

PrometheusFree: Concurrent Detection of Laser Fault Injection Attacks in Optical Neural Networks

*Kota Nishida, Yoshihiro Midoh, Noriyuki Miura (The University of Osaka), Satoshi Kawakami (Kyushu University), Alex Orailoglu (University of California, San Diego), Jun Shiomi (The University of Osaka)
Keywords
Silicon Photonics-based AI Accelerator (SPAA), Optical Neural Network (ONN), Laser Fault Injection Attack
Abstract
Silicon Photonics-based AI Accelerators (SPAAs) have been considered promising AI accelerators achieving high energy efficiency and low latency. While many researchers focus on improving SPAAs' energy efficiency and latency, their physical security has only recently received attention. While it is essential to deliver strong optical neural network inferencing approaches, their success and adoption are predicated on their ability to deliver a secure execution environment. This paper first presents the threat of laser fault injection attacks on SPAAs, which are capable of subjecting the optical neural network to misclassifications. To address this threat, the paper proposes PrometheusFree, an optical neural network framework capable of concurrent detection of such attacks. Furthermore, this paper introduces a novel application of the Wavelength Division Perturbation (WDP) technique, where wavelength-dependent Vector Matrix Multiplication (VMM) results are utilized to boost fault attack detection accuracy. Simulation results show that PrometheusFree achieves over 96% attack-caused misprediction recall, while the use of the WDP technique reduces the attack success rate by 38.6% on average. Compared with prior art, PrometheusFree achieves an average attack success ratio of 0.019, corresponding to a 95.3% reduction. The experimental results confirm the superiority of the concurrent detection and the boost in fault detection abilities imparted by the WDP approach.
7F-4
11:35-12:00

BEAM: Bidirectional MEEF-Driven Mask Optimization for Curvilinear Photonic Design

*Xiaoxiao Liang, Yang Luo (HKUST(GZ)), Bei Yu (The Chinese University of Hong Kong), Yuzhe Ma (The Hong Kong University of Science and Technology (Guangzhou))
Keywords
Photonic integrated circuit, Curvilinear photonic design, Computational lithography, Optical proximity correction
Abstract
The photonic integrated circuit (PIC) is a promising direction for future computing and interconnect, which involves many curvilinear geometries to modulate and transmit signals. To ensure the functionality, the PIC manufacturing requires very meticulous optimization to refrain from geometry distortion resulting from lithography process. While conventional optical proximity correction (OPC) methods can handle curvilinear features, they face challenges in mask manufacturability, computational cost, and the ability to correct any-angle edge placement error (EPE). This paper proposes BEAM, a native framework designed for photonic designs with curvilinear patterns, including lossless curvilinear pattern representation and a powerful OPC solver. BEAM uses control points to represent curvilinear mask shapes directly, avoiding Manhattanization and approximation errors. Instead of manually specifying movement directions, control points are bidirectionally updated along two orthogonal basis directions, ensuring versatile corrections. To further enhance efficiency, we propose a fast batch-based sensitivity measurement strategy that effectively guides the movement of control points while substantially reducing the computational overhead. The effectiveness of BEAM is demonstrated on multiple fundamental layout components of photonic designs, achieving state-of-the-art correction performance in terms of mask quality and computational efficiency.

Session 8A

(T3-A) Hybrid CIM Architectures and Flexible Dataflows
13:30-15:35 | Thursday, January 22, 2026 | Snow White 1
Chair(s):
Ren-Shuo Liu (National Tsing Hua University)
Atsutake Kosuge (University of Tokyo)
8A-1
13:30-13:55

OAH-CIM: Outlier-Aware Hybrid RRAM-SRAM CIM Accelerator with Variation-Robust Sparsity

*Zhiwei Zhou, Tong Hu, Han Bao, Houji Zhou, Yuyang Fu (School of Integrated Circuits, Huazhong University of Science and Technology), Jiancong Li (Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology), Jia Chen (AI Chip Center for Emerging Smart Systems (ACCESS) University of Science and Technology), Yi Li, Xiangshui Miao (School of Integrated Circuits, Huazhong University of Science and Technology)
Keywords
RRAM-SRAM hybrid architecture, Computing-in-memory, Transformer, outlier-aware quantization, sparsity
Abstract
Hybrid RRAM-SRAM Compute-in-Memory (CIM) architectures offer a promising solution for accelerating Transformers. However, their efficiency is severely constrained by a fundamental challenge: the prevalence of outlier values inherent to these models. These outliers create a severe quantization dilemma: hardware-friendly Block Floating-Point (BFP) suffers catastrophic accuracy loss, while Integer (INT) quantization incurs prohibitive hardware overhead. Furthermore, poor quantization directly undermines sparsity exploitation. The resulting numerical imprecision renders analog RRAM-CIM highly vulnerable to device variation noise, which is amplified during naive sparse computation. Addressing these deeply intertwined challenges requires a holistic co-design focused on robust data representation. We propose an outlier-aware hybrid CIM (OAH-CIM) accelerator built on a synergistic hardware-algorithm co-design. First, we introduce an outlier-aware BFP quantization that achieves near-FP32 accuracy at 5-bit by efficiently representing both outliers and non-outliers. Second, leveraging this high-fidelity representation, we propose balanced bit sparsity, a hardware scheduling technique that equalizes workloads to ensure reliable, low-noise sparse computation in variation-prone RRAM. Evaluations on ViT-base show OAH-CIM improves 5-bit accuracy by 5.21% over BFP and 2.94% over INT. Furthermore, OAH-CIM achieves up to 8.6 TOPS/W, yielding a 2.3-39.1x energy efficiency improvement over state-of-the-art CIM accelerators.
8A-2
13:55-14:20

FlexMem: High-Parallel Near-Memory Architecture for Flexible Dataflow in Fully Homomorphic Encryption

*Shangyi Shi, Husheng Han, Jianan Mu, Xinyao Zheng (Institute of Computing Technology), Ling Liang (Peking University), Hang Lu, Zidong Du, Xiaowei Li, Xing Hu (Institute of Computing Technology)
Keywords
Fully Homomorphic Encryption (FHE), accelerator, Near-Memory Processing (NMP), DRAM
Abstract
Fully Homomorphic Encryption (FHE) imposes substantial memory demands, presenting significant challenges for efficient hardware acceleration. Near-Memory Processing (NMP) has emerged as a promising architectural solution to alleviate the memory bottleneck. However, the irregular memory access patterns and flexible dataflows inherent to FHE limit the effectiveness of existing NMP accelerators, which fail to fully utilize the available near-memory bandwidth. In this work, we propose FlexMem, a near-memory accelerator featuring high-parallel computational units with varying memory access strides and interconnect topologies to effectively handle irregular memory access patterns. Furthermore, we design polynomial- and ciphertext-level dataflows to efficiently utilize near-memory bandwidth under varying degrees of polynomial parallelism and enhance parallel performance. Experimental results demonstrate that FlexMem achieves 1.26x performance improvement over the state-of-the-art near-memory architectures in end-to-end benchmarks, with on average 95.7% of near-memory bandwidth utilization.
8A-3
14:20-14:45

Data Flow-Aware Weight Remapping for Efficient Fault Tolerance in ReRAM-Based Accelerators

*Hyeonsu Bang (Sungkyunkwan University), Kang Eun Jeon, Jong Hwan Ko (Sungkyunkwan University)
Keywords
In-memory computing (IMC), ReRAM, Stuck-at fault (SAF)
Abstract
Resistive random-access memory (ReRAM)-based in-memory computing (IMC) systems offer significant advantages for efficient neural network inference. However, these systems are vulnerable to stuck-at faults (SAFs), which degrade inference accuracy—a challenge that becomes more pronounced in multi-level cell (MLC) configurations. A conventional fault mitigation technique, array-wise weight remapping (AWR), addresses SAFs but incurs significant hardware overhead. To overcome these limitations, we propose pseudo-array-wise weight remapping (PAWR), a novel method that integrates mux-wise weight remapping (MWR) and mux group remapping (MGR) to achieve cost-efficient fault tolerance. Experimental results demonstrate that even at a high 20% SAF rate, PAWR achieves accuracies within 0.7% of AWR, while significantly reducing area overhead by 86.5% and energy overhead by 72.4% compared to AWR.
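A greedy toy version of fault-aware weight remapping is shown below: within a small group, the logical weight columns carrying the most significant values are steered onto the physical columns with the fewest stuck-at cells, and the permutation is undone digitally at readout. This is a conceptual stand-in, not the proposed MWR/MGR hardware scheme.

```python
# Greedy toy sketch of fault-aware weight remapping within a small column group.
import numpy as np

def remap_group(weights, fault_mask):
    """weights, fault_mask: (rows, cols) arrays; fault_mask is True where a cell is stuck."""
    significance = np.abs(weights).sum(axis=0)            # how much each logical column matters
    fault_count = fault_mask.sum(axis=0)                  # faults per physical column
    logical_order = np.argsort(-significance)             # most significant logical columns first
    physical_order = np.argsort(fault_count)              # least faulty physical columns first
    perm = np.empty(weights.shape[1], dtype=int)
    perm[physical_order] = logical_order                  # pair them up greedily
    return weights[:, perm], perm

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
faults = rng.random((8, 8)) < 0.2
remapped, perm = remap_group(w, faults)
print(perm)
```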
8A-4
14:45-15:10

MIREDO: MIP-Driven Resource-Efficient Dataflow Optimization for Computing-in-Memory Accelerator

*Xiaolin He, Cenlin Duan, Yingjie Qi, Xiao Ma, Jianlei Yang (Beihang University)
Keywords
Computing-in-Memory, Dataflow Optimization, Mixed-Integer Programming, DNN Accelerator
Abstract
Computing-in-Memory (CIM) architectures have emerged as a promising solution for accelerating Deep Neural Networks (DNNs) by mitigating data movement bottlenecks. However, realizing the potential of CIM requires specialized dataflow optimizations, which are challenged by an expansive design space and strict architectural constraints. Existing optimization approaches often fail to fully exploit CIM accelerators, leading to noticeable gaps between theoretical and actual system-level efficiency. To address these limitations, we propose the MIREDO framework, which formulates dataflow optimization as a Mixed-Integer Programming (MIP) problem. MIREDO introduces a hierarchical hardware abstraction coupled with an analytical latency model designed to accurately reflect the complex data transfer behaviors within CIM systems. By jointly modeling workload characteristics, dataflow strategies, and CIM-specific constraints, MIREDO systematically navigates the vast design space to determine the optimal dataflow configurations. Evaluation results demonstrate that MIREDO significantly enhances performance, achieving up to 3.2x improvement across various DNN models and hardware setups.
8A-5
15:10-15:35

Hy2S-CIM: Hybrid-Cache-LUT FP/INT-CIM with 2-Stage Alignment and Area-efficient LUT for High Precision Vision AI Tasks

*Xiang Li (Tsinghua Shenzhen International Graduate School), Wenbin Jia (Tsinghua University), Sheng Zhang (Tsinghua Shenzhen International Graduate School), Yongpan Liu (Tsinghua University)
Keywords
Computing in memory, Vision AI, LUT, Cache
Abstract
Computing-in-Memory (CIM) is a promising solution for Deep Neural Network (DNN) accelerators, improving performance and energy efficiency by addressing the "memory wall" challenge. Recent Look-Up Table (LUT)-based CIM and Cache-based CIM architectures leverage weight-stationary dataflow and feature-map data locality for energy efficiency. However, applying existing Cache-CIM to high bit-width computation faces two main challenges: (1) High Bit-Width Data Incompatibility: Directly processing high bit-width inputs (e.g., INT12, FP16) lowers cache hit rates and energy efficiency. Maintaining hit rates would require an exponential, and thus prohibitive, increase in cache lines. Bit-serial approaches are also inefficient due to poor data locality in LSBs. (2) Cache-Unfriendly Floating-Point Alignment: Essential floating-point alignment operations inherently degrade data locality or prevent resource sharing across output channels. To overcome these, we propose Hy2S-CIM, featuring three key contributions: (1) A Hybrid-Cache-LUT CIM architecture that efficiently handles high bit-width computations by decoupling the multiplication of MSBs and LSBs to MSB Cache-based and LSB LUT-based multipliers, respectively. (2) A Cache-Friendly 2-Stage Alignment Scheme that preserves data locality and enables controller sharing. (3) An Area-efficient LUT multiplier that achieves a better trade-off between the overhead of LUT and decoder. Experimental results show that Hy2S-CIM achieves 1.4x-2.4x energy efficiency improvement over prior CIM architectures on typical vision AI tasks.

Session 8B

(T1-A) Exploring interconnect design, power modeling, and system prototyping
13:30-15:35 | Thursday, January 22, 2026 | Snow White 2
Chair(s):
Chester Park (Konkuk University)
Fitzgerald Sungkyung Park (Pusan National Univeristy)
8B-1
13:30-13:55

DOME: A Domain-Orchestrated Multi-GPU Optical Network for Rack-Scale Systems

*Chongyi Yang, Peiyu Chen, Bohan Hu (The Hong Kong University of Science and Technology (Guangzhou)), Yinyi Liu, Wei Zhang (The Hong Kong University of Science and Technology), Jiang Xu (The Hong Kong University of Science and Technology (Guangzhou))
Keywords
Optical interconnection networks, GPUs, path reservation, rack-scale systems
Abstract
Modern data centers increasingly use multi-GPU systems for AI and high-performance computing, where growing data transfer demands lead to high energy consumption and performance bottlenecks in electrical networks. Optical interconnects offer compelling advantages to address these challenges, including high bandwidth, distance-independent latency, and better energy efficiency. This paper presents DOME, a rack-scale optical interconnection network that connects multiple GPUs using high-radix optical switches and extends optical interfaces into GPU packages close to memories and multiprocessors, forming distinct in-GPU and GPU-to-GPU network domains. To efficiently manage the paths across network domains, we develop a time-slotted path reservation scheme that quickly identifies the earliest time when the path is available in all network domains, reducing unnecessary reservation retries. Evaluations reveal that DOME achieves 14% speedup while maintaining comparable energy consumption compared to the state-of-the-art preemptive chain feedback control scheme.
8B-2
13:55-14:20

H²NoC: A Hybrid NoC Architecture for FPGAs with Hardened Interconnects

Jinhyeong Park, Younggil Jeong, *Jeongwoo Park (Sungkyunkwan University)
Keywords
Hybrid Noc with Hardened Interconnect(H²NoC), Field Programmable Gate Array(FPGA), High-Bandwidth Memory (HBM), Network-on-Chip (NoC), Butterfly-Fat Tree (BFT), Out-of-Order Generating Buffers (OGB), Quad-cluster Partitioning
Abstract
Modern FPGA platforms with high-bandwidth memory (HBM) subsystems often incorporate hardened interconnects in their static regions to alleviate routing congestion in programmable logic. However, these interconnects typically suffer from up to 67% bandwidth degradation in platforms such as the Alveo U280 under specific access patterns due to their inherent structural limitations. Previous works introduce custom Networks-on-Chip (NoC) in FPGA programmable logic but fail to fully utilize all 32 HBM channels due to excessive resource overhead and severe routing congestion. In this paper, we propose H²NoC, a hybrid NoC architecture that integrates the hardened interconnect embedded in the static HBM region with a custom Butterfly-Fat Tree (BFT) NoC in programmable logic, enabling full 32-channel utilization with minimum routing congestion. The custom 3-stage NoC is partitioned into four clusters, enabling flexible floorplanning strategies that facilitate optimal implementation across SLRs. The out-of-order generating buffer (OGB) mitigates read bandwidth loss by stage reduction from five to three through lightweight AXI reordering, thereby restoring bandwidth and sustaining performance. Compared to prior state-of-the-art NoCs on Alveo U280, H²NoC demonstrates an improvement of 134.8% in bandwidth-to-resource efficiency. H²NoC is evaluated using a set of synthetic benchmarks designed to emulate various traffic patterns. It achieves a peak bandwidth of 391 GB/s under bit permutation traffic and sustains over 200 GB/s across most traffic patterns. The proposed hybrid architecture delivers performance comparable to a full custom 5-stage BFT network while requiring significantly fewer FPGA resources.
8B-3
14:20-14:45

CommILP: Synthesizing Communication Infrastructure for Domain Computing Platforms

*Qucheng Jiang, Jacob Ginesin, Oscar Kellner (Northeastern University), Tianrui Ma (Washington University in St. Louis), Gunar Schirner (Northeastern University)
Keywords
Domain-specific platform, communication synthesis, integer linear programming (ILP), accelerator-rich system, interconnect design, design space exploration (DSE), streaming application, hardware-software co-design
Abstract
Domain-specific accelerator-rich platforms have emerged as a promising solution for edge applications with high-throughput and low-power requirements. Prior design-space exploration (DSE) methods efficiently allocate processing elements (PEs) and bind applications but largely assume simplified, centralized communication models. This oversimplification limits scalability and wastes resources in hardware tiles with sparse, statically known communication patterns. We propose CommILP, an Integer Linear Programming (ILP)-based framework for synthesizing intra-tile communication infrastructure in domain platforms. Given a platform allocation and PE-to-PE communication graphs for multiple applications, CommILP instantiates modular Communication Elements (CEs), determines a minimal interconnect topology, and assigns exact routing paths per connection. It guarantees that application throughput is preserved while minimizing area and energy. CommILP introduces a hierarchical ILP formulation across platform, application, and hop-stack levels, and supports both App-Level and Domain-Aggregated modeling modes to balance solution quality and runtime. Evaluated on ASIC and FPGA backends using both real (OpenVX-40) and synthetic (RNDComm-100) domains, CommILP achieves up to 69.4% area and 73% energy savings over centralized baselines, while remaining scalable to over 100 applications. It complements existing platform DSE tools by bridging the gap between computation binding and hardware-efficient communication synthesis.
8B-4
14:45-15:10

ReadyPower: A Reliable, Interpretable, and Handy Architectural Power Model Based on Analytical Framework

*Qijun Zhang, Shang Liu, Yao Lu, Mengming Li, Zhiyao Xie (The Hong Kong University of Science and Technology)
Keywords
Power model, CPU modeling, Data-driven model
Abstract
Power is a primary objective in modern processor design, requiring accurate yet efficient power modeling techniques. Architecture-level power models are necessary for early power optimization and design space exploration. However, classical analytical architecture-level power models (e.g., McPAT) suffer from significant inaccuracies. Emerging machine learning (ML)-based power models, despite their superior accuracy in research papers, are not widely adopted in the industry. In this work, we point out three inherent limitations of ML-based power models: unreliability, limited interpretability, and difficulty in usage. This work proposes a new analytical power modeling framework named ReadyPower, which is ready-for-use by being reliable, interpretable, and handy. We observe that the root cause of the low accuracy of classical analytical power models is the discrepancies between the real processor implementation and the processor’s analytical model. To bridge the discrepancies, we introduce architecture-level, implementation-level, and technology-level parameters into the widely adopted McPAT analytical model to build ReadyPower. The parameters at three different levels are decided in different ways. In our experiment, averaged across different training scenarios, ReadyPower achieves >20% lower mean absolute percentage error (MAPE) and >0.2 higher correlation coefficient R compared with the ML-based baselines, on both BOOM and XiangShan CPU architectures.
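The abstract does not state how the implementation-level and technology-level parameters are determined, so the following is only a generic illustration of the broader idea of calibrating an analytical power model against measured data, not ReadyPower's actual procedure: per-component scaling factors fitted by ordinary least squares with NumPy (all numbers are made up).

    import numpy as np

    # Analytical per-component power estimates (rows = workloads,
    # columns = components such as fetch, ALU, LSU, cache).
    analytical = np.array([[0.8, 1.5, 0.6, 2.1],
                           [0.5, 2.2, 0.9, 1.8],
                           [1.1, 1.0, 0.4, 2.5],
                           [0.9, 1.8, 0.7, 2.0]])
    measured_total = np.array([6.1, 6.4, 5.8, 6.3])  # measured total power per workload

    # Fit one calibration factor per component so that the scaled analytical
    # components best explain the measured totals.
    factors, *_ = np.linalg.lstsq(analytical, measured_total, rcond=None)
    calibrated = analytical @ factors
    mape = np.mean(np.abs(calibrated - measured_total) / measured_total) * 100
    print(factors, f"MAPE = {mape:.2f}%")

Because the fitted factors attach to named architectural components, a calibration of this general kind keeps the model interpretable, which is the property the abstract emphasizes over black-box ML regressors.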
8B-5
15:10-15:35

A RISC-V CHERI VP: Enabling System-Level Evaluation of the Capability-Based CHERI Architecture

*Manfred Schlägl, Andreas Hinterdorfer (Johannes Kepler University Linz), Daniel Große (Johannes Kepler University Linz & DFKI Bremen)
Keywords
Virtual Prototyping, System-level Evaluation, SystemC, CHERI, RISC-V
Abstract
Despite decades of mitigation efforts, memory corruption bugs remain a dominant source of security vulnerabilities. CHERI, a capability-based architecture, directly targets this problem by replacing traditional pointers with Capabilities that encode bounds, permissions, and tamper-protection tags. However, CHERI represents a significant architectural intervention that impacts not only the processor core, but the entire Hardware/Software platform. System-level evaluation methods, such as Virtual Prototypes (VPs), have proven highly valuable for exploring, validating, and optimizing such complex Hardware/Software systems. This paper introduces the first open-source SystemC/TLM-based CHERI-enhanced RISC-V VP. The VP comes with support for Virtual Memory Management (VMM) and is capable of executing complex software stacks, such as the general-purpose and memory-safe CheriBSD operating system. A verification using TestRIG demonstrates the VP's robustness, passing 2.15 million test cases. A case study with CheriBSD and 10 representative, demanding benchmark workloads highlights the VP's capability to simulate complex CHERI-enabled systems and to provide valuable insights for Hardware/Software co-design. The CHERI-enhanced VP, along with the software used in our case study, will be released as open source on GitHub.

Session 8C

(T12-A) Privacy-Preserving & Secure AI Computation
13:30-15:35 | Thursday, January 22, 2026 | Snow White 3
Chair(s):
Danella Zhao (University of Arizona)
Song Bian (Beihang University)
8C-1
13:30-13:55

EP-HDC: Hyperdimensional Computing with Encrypted Parameters for High-Throughput Privacy-Preserving Inference

Jaewoo Park, Chenghao Quan, *Jongeun Lee (Ulsan National Institute of Science and Technology)
Keywords
PPML, HDC, FHE
Abstract
While homomorphic encryption (HE) provides strong privacy protection, its high computational cost has restricted its application to simple tasks. Recently, hyperdimensional computing (HDC) applied to HE has shown promising performance for privacy-preserving machine learning (PPML). However, when applied to more realistic scenarios such as batch inference, HDC-based HE still incurs very high compute time as well as high encryption and data transmission overheads. To address this problem, we propose HDC with encrypted parameters (EP-HDC), a novel PPML approach featuring client-side HE, i.e., inference is performed on a client using a homomorphically encrypted model. EP-HDC effectively mitigates the encryption and data transmission overhead and scales to many clients, while providing strong protection for user data and model parameters. In addition to application examples for our client-side PPML, we also present a design space exploration involving quantization, architecture, and HE-related parameters. Our experimental results using the BFV scheme and the Face/Emotion datasets demonstrate that our method can improve the throughput and latency of batch inference by orders of magnitude over previous PPML methods (36.52~1068x and 6.45~733x, respectively) with <1% accuracy degradation.
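Independent of the encrypted-parameter scheme above, the HDC primitives it builds on (binding, bundling, and similarity search over high-dimensional vectors) are compact enough to sketch in NumPy. The following is a generic, unencrypted illustration with made-up codebooks, not the authors' pipeline.

    import numpy as np

    D = 10_000                                                # hypervector dimensionality
    rng = np.random.default_rng(0)
    feat_hv = {f: rng.choice([-1, 1], D) for f in range(4)}   # random HV per feature id
    val_hv = {v: rng.choice([-1, 1], D) for v in range(8)}    # random HV per quantized value

    def encode(sample):
        """Bind each feature id with its quantized value, then bundle (sum + sign)."""
        bound = [feat_hv[f] * val_hv[v] for f, v in enumerate(sample)]
        return np.sign(np.sum(bound, axis=0))

    def classify(query, prototypes):
        """Return the class whose prototype hypervector is most similar (dot product)."""
        return max(prototypes, key=lambda c: np.dot(query, prototypes[c]))

    train = {"A": [[0, 1, 2, 3], [0, 1, 2, 2]], "B": [[7, 6, 5, 4], [7, 7, 5, 4]]}
    protos = {c: np.sign(np.sum([encode(s) for s in xs], axis=0)) for c, xs in train.items()}
    print(classify(encode([0, 1, 2, 3]), protos))             # expected: "A"

Because inference reduces to element-wise products, additions, and dot products, it maps naturally onto SIMD-style homomorphic operations, which is what makes the HDC-plus-HE combination attractive in the first place.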
8C-2
13:55-14:20

RTL Verification for Secure Speculation Using Cascaded Two-Phase Information Flow Tracking

*Yuhao Liu (Institute of Microelectronics of the Chinese Academy of Sciences, University of Chinese Academy of Sciences), Ying Li, Yinhao Zhou, Chang Guo, Zhenfeng Li, Yanyun Lu (Institute of Microelectronics of the Chinese Academy of Sciences)
Keywords
Speculative Execution Attack, Hardware Security, Information Flow Tracking, Formal Property Checking
Abstract
Modern out-of-order processors have exposed critical vulnerabilities to speculative execution attacks such as Spectre and Meltdown. Despite various countermeasures taken in recent years, new attacks continue to emerge, necessitating a formal understanding of the underlying security flaws. Existing formal approaches face fundamental limitations: those targeting abstract models overlook crucial microarchitectural details, while RTL-level methods struggle with scalability issues. This paper proposes Cascaded Two-Phase Information Flow Tracking (CTP-IFT), an RTL-level verification technique towards a formal security evaluation of speculative execution. We design a novel cascade-structured IFT model to track data flows in the access phase and timing flows in the transmission phase of speculative execution attacks. Specifically, we propose a new tracking model for timing flows, called the shadow pipeline, to improve the tag-based tracking model. Consequently, the CTP-IFT model markedly reduces the complexity of checking the noninterference-based property used for secure speculation, with minimal expertise in formal methods required. We evaluate CTP-IFT on three out-of-order processors. The results show that CTP-IFT exhibits a significant advantage in detecting attacks on insecure designs and obtaining proofs on secure designs. Notably, in the BOOM processor, CTP-IFT can achieve over 50x speedup in finding attacks over Contract Shadow Logic, a state-of-the-art verification scheme, and uncovers a new contention-based attack variant.
8C-3
14:20-14:45

Exploiting Feature-driven Approximation to Preserve Privacy in Machine Learning based Health Monitoring Systems

*Nishanth Chennagouni (University of New Hampshire), Wei Lu (Keene State College), Qiaoyan Yu (University of New Hampshire)
Keywords
Approximation, Privacy Preservation, Health monitoring, sensor, human activity recognition, pain detection
Abstract
Real-time health-monitoring systems generate massive biometric sensor data, placing substantial demands on memory, computation, and power resources. Additionally, the sensitive nature of such data raises critical privacy concerns related to attributes like gender, age, and body mass index (BMI). To address these challenges, this work proposes FAxC—a novel feature-driven approximation framework that performs multi-dimensional data reduction by leveraging the biometric features of sensor signals. Unlike prior approximation techniques that operate on isolated signals or uniformly sample data, FAxC intelligently selects and masks segments to preserve human activity recognition (HAR) performance while enhancing user privacy. Our case study shows that, compared to existing privacy-aware approaches, FAxC reduces the disparity in gender-distinguishable biometric features in human activity recognition, decreasing Z-direction peak acceleration from 70% to 22% and stride root mean square (RMS) from 29% to 3%. Our FAxC enhances the privacy-preserving rate by up to 5.6x over the baseline and outperforms existing methods by a factor ranging from 1.2x to 5.4x. The proposed method has also been validated on the TAME pain dataset for a voice-based pain level detection system. FAxC reduces the risk of gender leakage by 50% compared to using raw data, while maintaining pain level detection accuracy comparable to existing methods.
8C-4
14:45-15:10

Safeguarding Neural Network IPs from Scan Chain based Model Extraction Attacks

*E Bhawani Eswar Reddy, Gundameedi Sai Ram Mohan, Sukanta Bhattacharjee, Chandan Karfa (Indian Institute of Technology Guwahati)
Keywords
Scan Chain, Neural Network Model Extraction Attacks, SAT Attack
Abstract
This work addresses the vulnerability of trained neural network models to scan-chain-based attacks, which exploit the accessibility of activation-function outputs to illicitly acquire model information such as weights and biases. To counter this threat, we propose a novel defense mechanism inspired by logic locking. The idea is to obfuscate the interconnections among the layers of the ML model, controlled by secret keys stored in tamper-proof memory. Our scheme ensures normal functionality with the correct key while disrupting data flow and producing erroneous outputs with an incorrect key. Extensive experiments demonstrate a significant drop in accuracy when incorrect keys are used, with the accuracy drop being more pronounced when the later layers of the model are locked. The protection technique incurs 0.3% area overhead and 0.1% latency overhead.
8C-5
15:10-15:35

FedBit: Accelerating Privacy-Preserving Federated Learning via Bit-Interleaved Packing and Cross-Layer Co-Design

Xiangchen Meng, *Yangdi Lyu (The Hong Kong University of Science and Technology (Guangzhou))
Keywords
Federated Learning, Bit-Interleaved Packing, Homomorphic Encryption, Cross-Layer Co-Design
Abstract
Federated Learning (FL) with Fully Homomorphic Encryption (FHE) effectively safeguards data privacy during model aggregation by encrypting local model updates before transmission, mitigating threats from compromised or untrusted servers. However, the computational burden and ciphertext expansion inherent to homomorphic encryption significantly increase resource and communication overhead. To address this, we propose FedBit, a hardware/software co-designed framework optimized for the Brakerski-Fan-Vercauteren (BFV) scheme. FedBit employs bit-interleaved data packing to embed multiple model parameters into a single ciphertext slot, minimizing ciphertext expansion and maximizing computational parallelism. Additionally, we integrate a dedicated FPGA accelerator to offload cryptographic operations, significantly outperforming software-only implementations. Experimental evaluation shows that FedBit delivers a two-orders-of-magnitude speedup on encryption, reduces communication overhead by 29%, and improves model accuracy by 10% compared to existing HE-based FL frameworks.
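FedBit's exact packing layout is not given in the abstract; the sketch below only illustrates the general idea behind bit-level packing for additive aggregation: several quantized parameters share one wide integer, with guard bits per field so that summing the packed words of many clients never overflows a field. All widths and values are illustrative.

    def pack(params, field_bits, guard_bits):
        """Pack non-negative quantized parameters into one wide integer,
        leaving guard bits per field so sums of many packed words stay separable."""
        stride = field_bits + guard_bits
        word = 0
        for i, p in enumerate(params):
            assert 0 <= p < (1 << field_bits)
            word |= p << (i * stride)
        return word

    def unpack(word, n, field_bits, guard_bits):
        """Extract the n accumulated fields (e.g., after summing packed words)."""
        stride = field_bits + guard_bits
        mask = (1 << stride) - 1
        return [(word >> (i * stride)) & mask for i in range(n)]

    # Aggregate three clients' updates: 4 parameters per word, 8-bit fields,
    # 4 guard bits (headroom for up to 16 clients without field overflow).
    clients = [[10, 20, 30, 40], [11, 19, 29, 41], [12, 21, 31, 39]]
    aggregate = sum(pack(c, 8, 4) for c in clients)  # this addition is what the FHE scheme performs homomorphically
    print(unpack(aggregate, 4, 8, 4))                # -> [33, 60, 90, 120]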

Session 8D

8D (Designer Forum 3) Chiplets Go Mainstream: Design Automation for 2.5D/3D Systems
13:30-15:35 | Thursday, January 22, 2026 | Sleeping Beauty 1/2
Chair(s):
Hailong Yao (University of Science & Technology Beijing)
8D-1
13:30-13:55

Implementation of Shift-Left Integrated STCO Design Method for Advanced Packaging

*Kai Zhao (Huatian Technology (Jiangsu) Co., Ltd.)
Biography
Kai Zhao is Director of Design & Simulation at Huatian Technology (Jiangsu) Co., Ltd. He earned his PhD in Microelectronics and Solid-State Electronics from Peking University in 2012 and holds the academic title of Professor. His research focuses on 2.5D advanced packaging technology and multi-physics collaborative design and simulation. He has led national-level scientific research projects, including the Key R&D Program of the Ministry of Science and Technology and projects funded by the National Natural Science Foundation of China (NSFC). He has published more than 50 SCI/EI-indexed papers and filed over 30 invention patent applications.
Abstract
With AI hardware aggressively driving the evolution of advanced packaging toward larger form factors, 2.5D packaging based on heterogeneous materials (e.g., polyimide and Si bridge) is emerging as the mainstream. Due to innovations in the material system, traditional rectilinear routing rules are no longer applicable, while more complex structures and material combinations impose higher requirements on System Technology Co-Optimization (STCO). From the perspective of backend packaging, this talk presents a multi-physics coupling-aware automatic routing tool for advanced packaging built through shift-left integration: stress, delamination, and warpage constraints from large-size composite material systems are incorporated into the routing phase upfront, and octilinear routing rules are adopted. A comparison with two mainstream commercial tools demonstrates that the developed shift-left integrated design tool offers comprehensive advantages.
8D-2
13:55-14:20

Integrated Full-custom 3DIC Design Methodology and Flow

*Michael Liu (Empyrean Technology)
Biography
Michael Liu is the Senior Product Director of Empyrean Technology. He has more than 10 years of experience in EDA software product development and management for ASIC chip design, manufacturing, and packaging, focusing on the planning, development, and promotion of EDA products. He has helped Empyrean build a mature Analog/Mixed-Signal design flow and expand it to other full-custom design fields such as flat panel display, signal chain, memory, RF, and optoelectronics. He is building a reliability design methodology for design-manufacturing collaboration and a PPAC-oriented design-manufacturing-packaging collaborative design solution. These solutions are widely adopted by leading national and international design houses.
Abstract
Emerging application scenarios such as AI, data centers, 5G communications, and autonomous driving have imposed comprehensive and stringent requirements on the functions, computing power, and cost of chips. The advanced 3D IC packaging technology represented by Chiplet has become one of the best solutions to address this challenge. The Chiplet 3D IC technology has given rise to new business models and design-manufacturing paradigms. In the process of implementing related technologies, EDA tools that carry design methodologies play a very important role. This report will introduce the latest advancements and practices in 3D IC design methodologies and provide a vision for their future development trends.
8D-3
14:20-14:45

AI for EDA: Chiplets Simulation New Era

*Jimin Wen (Xpeedic)
Biography
Dr. Jimin Wen is the R&D Director at Xpeedic, where he leads advanced initiatives in circuit and system-level simulation, High-Performance Computing (HPC), and the application of AI in EDA. With 17 years of R&D management experience, he was recruited to Shanghai as a high-level overseas talent to drive innovation in next-generation simulation technologies. Dr. Wen is a recipient of global technology innovation awards from both Cadence and Ansys. He holds a Ph.D. from the Chinese Academy of Sciences, has authored over 40 IEEE papers—including a HOST Best Paper award—and holds 5 international patents.
Abstract
As the semiconductor industry moves decisively into the "Beyond Moore" era, 2.5D/3D Chiplet technology has emerged as the primary driver for performance scaling. However, the transition from monolithic SoCs to heterogeneous 3D interconnected systems introduces unprecedented challenges in multi-physics simulation, particularly regarding thermal management, mechanical stress, and high-speed die-to-die (D2D) signal integrity. Traditional simulation methodologies struggle to balance accuracy and convergence speed when facing the massive parameter spaces of advanced packaging.
In this session, Xpeedic presents a new era of Chiplet simulation powered by XAI, a full-stack multi-agent AI platform designed to break the simulation bottleneck in 2.5D/3D system design. We will demonstrate how XAI.Sim, leveraging Physics-Informed AI and reinforcement learning, achieves 100x-1000x acceleration in thermal and multi-physics analysis—critical for predicting heat dissipation in dense 3D stacks without sacrificing high-fidelity results.
Furthermore, we will discuss XAI.Opt, which utilizes surrogate modeling to optimize Signal Integrity (SI) for complex RF and high-speed links, reducing iteration cycles significantly compared to traditional methods. By integrating these specialized agents with XAI.Copilot for automated workflow orchestration, we propose a pathway toward intelligent System Technology Co-Optimization (STCO), enabling designers to efficiently navigate the complexities of next-generation Chiplet architectures.
8D-4
14:45-15:10

Breakthroughs in Chiplet-Based Electromagnetic Designs: Efficient Simulation and AI-Driven Optimization

*Peng Zhao (Faraday Dynamics)
Biography
Peng Zhao is Chief Scientist of Faraday Dynamics Co., Ltd. He received his B.Eng. and M.Phil. degrees in electronic engineering from Zhejiang University and earned his Ph.D. in electronic engineering from the City University of Hong Kong. He joined Hangzhou Dianzi University in 2014. His research interests include Electronic Design Automation (EDA), Radio-Frequency (RF) circuits, computational electromagnetics, and antennas. In 2017, he co-founded Faraday Dynamics Co., Ltd., which is mainly engaged in the development of efficient EDA software. He is a member of a leading innovation and entrepreneurship team in Zhejiang Province. He has led multiple EDA research projects, including grants from the National Natural Science Foundation of China and subprojects of the National Key R&D Program, and has published over 60 research papers and obtained over 20 authorized patents.
Abstract
With the slowing down of Moore's Law, Chiplet technology has emerged as a core pathway to break through the performance bottlenecks of traditional monolithic chips, enabling the heterogeneous integration of multi-functional chips. However, the dense interconnections and heterogeneous integration in Chiplet architectures give rise to complex electromagnetic (EM) effects, such as signal crosstalk, impedance mismatch, and multi-physics field coupling, which severely restrict the reliability and efficiency of system operation. Conventional electromagnetic simulation methods face the dilemma of poor efficiency when dealing with large-scale Chiplet systems, while traditional optimization methods need to continuously call rigorous time-consuming electromagnetic simulations during the optimization process, further deteriorating the design efficiency. To address these challenges, this talk focuses on breakthroughs in rapid electromagnetic simulation and AI-driven optimization for Chiplet electromagnetic designs. Firstly, a fast integral equation electromagnetic simulation algorithm based on the layered Green's function is proposed. This algorithm has been implemented in a commercial EDA tool, called UltraEM, achieving a computational complexity of O(NlogN) for full-wave electromagnetic simulation and enabling large-scale Chiplet-based designs. Secondly, an AI-based parametric modeling technique is introduced, leveraging deep learning and multi-objective optimization, which has been implemented in another commercial EDA tool, called EMOptimizer, significantly improving the efficiency of Chiplet-based chip design and optimization. This research provides a solid foundation to overcome electromagnetic design challenges in Chiplet systems, paving the way for widespread adoption of Chiplet technology.
8D-5
15:10-15:35

A Shift-Left Design Methodology for Chiplet Architecture: Automation, Multi-Physics Modeling and Convergence

*Chen Wu (BTD Technology/EIT)
Biography
Chen Wu received his PhD from UCLA. He is now an assistant researcher at the Ningbo Institute of Digital Twin, Eastern Institute of Technology, and the director of the Chiplet team at BTD. His research interests include AI chips, covering architecture, instruction-set, and compiler designs, and EDA for Chiplets, covering shift-left design methodologies, AI-driven multi-physics evaluation, and early-stage design space exploration for chiplet-based design. He has published 30 papers, received one best paper award and two best paper nominations, and holds 10 patents.
Abstract
Chiplet technology has emerged as a key architectural pathway in the post-Moore era, enabling continued scaling of functionality and performance through heterogeneous integration of multiple specialized dies. Despite this promise, the design and verification of chiplet-based systems remain constrained by significant methodological and tooling limitations.
Automation Challenges: Existing EDA solutions, both domestic and international, exhibit limited automation capability and poor routability, often requiring extensive manual tuning. There is an urgent need for next-generation automated tools that support system-level generation, physical planning, placement, and routing tailored for chiplet architectures.
Multi-Physics Simulation Bottlenecks: Traditional field-solver-based multi-physics tools demand layout truncation and long simulation runtimes, and struggle to operate on incomplete designs. This incompatibility severely constrains early-stage design exploration. Accelerated analytical and AI-enhanced methods for signal integrity (SI), power integrity (PI), and electro-thermal coupling are critically needed.
Design-Verification Disconnect: Fragmented workflows between design and verification tools result in long and inefficient iteration cycles (design, verify, redesign), often with uncertain convergence. A "shift-left" verification paradigm is essential to provide early guidance and reduce overall design latency.
This talk presents an integrated workflow that directly addresses these challenges and advances the state of chiplet physical design and verification.
Physical Planning: We introduce a systematic methodology for physical-level planning that co-optimizes thermal management, power delivery, and routing feasibility while incorporating SI/PI constraints.
AI-Driven Multi-Physics Evaluation: A data-driven and AI-accelerated multi-physics evaluation framework is proposed to rapidly predict electrical, thermal, and mechanical behaviors, enabling fast design-space exploration without reliance on time-consuming field solvers.
Convergent Design Flow: By unifying the proposed planning and evaluation models, we establish a convergent, shift-left-oriented design-verification flow that supports early validation, reduces the number of design iterations, and markedly improves time-to-closure.

Session 8E

(T9-B) Advances in Routing for Chips and Advanced Packaging
13:30-15:35 | Thursday, January 22, 2026 | Sleeping Beauty 3
Chair(s):
Siting Liu (The Chinese University of Hong Kong)
Hung-Ming Chen (National Yang Ming Chiao Tung University)
8E-1
13:30-13:55

DPO-3D: Differentiable Power Delivery Network Optimization via Flexible Modeling for Routability and IR-Drop Tradeoff in Face-to-Face 3D ICs

*Zhen Zhuang (The Chinese University of Hong Kong), Zheng Yang (Georgia Institute of Technology), Yuxuan Zhao (The Chinese University of Hong Kong), Jiawei Hu (Georgia Institute of Technology), Bei Yu (The Chinese University of Hong Kong), Sung Kyu Lim (Georgia Institute of Technology), Tsung-Yi Ho (The Chinese University of Hong Kong)
Keywords
3D IC, PDN optimization, Differentiable optimization, GPU acceleration, IR drop, Routability
Abstract
With the advancement of technology nodes, increasing device density leads to serious routability challenges, especially for the promising face-to-face 3D ICs, because most metal-layer resources near the inter-die bonding layer are reserved for the stripes of power delivery networks (PDNs). Consequently, signal routing performance degrades as numerous nets must connect devices on both the bottom and top dies through power-reserved metal layers. There are two primary challenges for 3D IC PDN optimization: flexible power-stripe optimization with global search capacity, and inter-die routability. Traditional 3D IC optimization methods using unified power stripes limit the search space, while single-die optimization methods cannot maintain inter-die coherence. To address these challenges, we propose a differentiable PDN optimization framework, DPO-3D, for the routability and IR-drop tradeoff in 3D ICs. First, we propose a novel modeling method that supports flexible 3D IC PDN optimization without compromising global search capability; this modeling method also effectively reduces design complexity. We further formulate the problem as an integer linear programming (ILP) model to address inter-die routability, achieving a good tradeoff between routability and IR-drop. Finally, we propose a GPU-accelerated differentiable method to solve the problem, overcoming the scalability issue of the ILP model. Compared with the state-of-the-art baseline, DPO-3D achieves a 7.10% DRC-violation reduction, an 11.61% IR-drop reduction, and a 7.30x speedup on average.
8E-2
13:55-14:20

Enhancing Pin Accessibility Through Pin Pattern Migration and Optimization Across Cell Boundaries

*Hyunbae Seo, Sehyeon Chung, Taewhan Kim (Seoul National University)
Keywords
pin accessibility, cell layout generation, routing
Abstract
This paper presents a new approach to the problem of improving pin accessibility, which has become an important task for physical design at advanced technology nodes. To this end, we propose a new concept, called disclosed-pins, which deliberately exposes some pins of standard cells across the cell boundary; otherwise, there would be no way to boost pin accessibility further. However, two key issues must be resolved to make the disclosed-pin concept fully effective: (1) how can we systematically synthesize diverse structures of standard cells with disclosed-pins? and (2) how can we effectively exploit those cells in the course of chip implementation? For issue 1, we propose a new in-cell routing method combined with disclosed-pin generation, developed on the basis of an SMT (satisfiability modulo theories) formulation, while for issue 2, we develop a post-place optimization in which we optimally replace, on a row basis, the cell instances with low pin accessibility by cells with disclosed-pins, formulating it as an instance of dynamic programming (DP). Through experiments with benchmark circuits, it is shown that our proposed methodology utilizing the synthesized cells with disclosed-pins can considerably relieve the burden of block-level routing, reducing the number of design rule violations by 26.05% and 22.80% on average over the state-of-the-art prior methods of (internal) pin pattern extension and dummy poly insertion, respectively.
8E-3
14:20-14:45

Topological Optimization-Based Layer Assignment Method for Fan-Out Wafer-Level Packaging

Haoying Wu, *Guanxian Zhu (Wuhan University of Technology), Xu He (Hunan University), Mingyu Liu (Huazhong University of Science and Technology)
Keywords
Fan-Out wafer-level packaging (FOWLP), inter-chip connections, multi-directional projection circular model, layer assignment
Abstract
Fan-Out Wafer-Level Packaging (FOWLP) technology supports the integration of heterogeneous chips within a single package and enables interchip connections through redistribution layers (RDLs). This provides critical support for constructing highly integrated heterogeneous systems. However, the rapid increase in interconnect density has significantly heightened the topological complexity of net crossings. In light of manufacturing cost constraints and limited routing resources, it has become crucial to effectively minimize the number of topologically conflicting nets to achieve high-quality layer assignment solutions. State-of-the-art layer assignment methods predominantly utilize a circular model based on fixed access points, onto which net connections are projected to evaluate topological crossings. However, existing circular models have limitations in identifying topologically compatible net combinations and often fail to encompass the full range of possibilities. Moreover, the use of fixed access points limits the reduction of topological conflicts, which adversely affects overall routability. To remedy these disadvantages, we propose a topological optimization-based layer assignment method. The proposed method includes the following key techniques: (1) a multi-directional projection circular (MDPC) model that expands the solution space; (2) a global topology optimization strategy based on flexible access points to significantly reduce net crossings; and (3) a wirelength-driven access point allocation algorithm aimed at minimizing total wirelength. Experimental results indicate that, compared to Wen's layer assignment algorithm, the proposed method achieves a 15.7% reduction in the number of routing layers when the number of layers is not fixed, and improves the net assignment ratio by 31.6% under fixed-layer conditions.
8E-4
14:45-15:10

Au-MEDAL: Adaptable Grid Router with Metal Edge Detection And Layer Integration

Andrew B. Kahng (University of California, San Diego), Seokhyeong Kang, Sehyeon Kim, *Jakang Lee (Pohang University of Science and Technology), Dooseok Yoon (University of California, San Diego)
Keywords
Standard Cell Design Automation, SMT-based Router, Design Technology Co-optimization
Abstract
Standard cell layout routing faces several challenges due to stringent design rules. In this paper, we propose Au-MEDAL, a new SMT-based standard cell router that supports various design rules at the nanometer scale. Au-MEDAL implements extended design rules to enable bidirectional routing and proposes methods that: (i) ensure at least one pin-access approach for block-level routing; (ii) integrate Middle-of-Line (MOL) and Back-End-of-Line (BEOL) routing; (iii) support variable routing grid spacings; and (iv) perform off-grid design rule checks. As a result, based on the ASAP7 PDK, Au-MEDAL not only achieves cell-level DRC-clean routing, a feat unattained by existing in-cell routers, but also improves pin accessibility, reduces obstructive routing length, and minimizes the use of vias and M2 in cell design. Across the evaluated block designs, Au-MEDAL achieves improvements of 0.87%, 1.19%, 2.34%, and 25.05% in total core area, total power, effective clock, and total negative slack, respectively. Its block-level DRC performance is comparable to that of manually designed layouts and shows significant improvements over previous in-cell router results.
8E-5
15:10-15:35

GPU-Accelerated Global Routing with Balanced Timing and Congestion Optimization

*Jinghui Zhou, Fuxing Huang, Lixin Chen, Xinglin Zheng, Ziran Zhu (Southeast University)
Keywords
global routing, timing driven, pattern routing, GPU acceleration
Abstract
As integrated circuit (IC) designs continue to scale in complexity, global routing faces increasing challenges in managing timing and congestion simultaneously. This paper proposes a GPU-accelerated global routing framework that effectively balances timing optimization and congestion mitigation. The proposed framework begins with a preprocessing stage that partitions ultra-large nets and decomposes nets based on estimated pin slack to enhance scalability and timing sensitivity. For critical nets, we propose a timing- and congestion-driven GPU-accelerated hybrid 3D pattern routing method. Specifically, an Elmore-based timing weight calculation method is proposed to efficiently capture the timing criticality of routing paths, and the resulting weights are ordered for more targeted and effective timing optimization. Then, a well-designed cost scheme is proposed to better balance timing and congestion. Finally, we develop a GPU-accelerated hybrid 3D pattern routing strategy that combines L-shape and sparse Z-shape patterns to improve routing efficiency. After routing critical nets, the remaining non-critical nets are routed using a congestion-driven GPU-accelerated routing engine that supports flexible detours to alleviate congestion and utilize residual routing resources. Experimental results on the ISPD 2025 contest benchmarks show that, compared with the contest champion, our algorithm achieves 19.4% better weighted scores and 16.2% faster runtime.
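For context on the Elmore-based weights mentioned above: the Elmore delay of a sink in an RC routing tree is the sum, over each resistance on the driver-to-sink path, of that resistance multiplied by the total capacitance downstream of it. A minimal sketch of that computation follows; the tree topology and RC values are hypothetical, and this is textbook background rather than the paper's weighting scheme.

    def elmore_delays(children, root, r, c):
        """children: dict node -> list of child nodes; r[n]: resistance of the edge
        into node n; c[n]: capacitance at node n. Returns dict node -> Elmore delay."""
        # Post-order accumulation of downstream capacitance per subtree.
        downstream = dict(c)
        order, stack = [], [root]
        while stack:
            n = stack.pop()
            order.append(n)
            stack.extend(children.get(n, []))
        for n in reversed(order):
            for ch in children.get(n, []):
                downstream[n] += downstream[ch]
        # Delay of a node = parent delay + R(edge into node) * downstream cap of node.
        delay = {root: 0.0}
        for n in order:                      # parents always precede children here
            for ch in children.get(n, []):
                delay[ch] = delay[n] + r[ch] * downstream[ch]
        return delay

    # Illustrative net: driver -> a -> {s1, s2}.
    children = {"drv": ["a"], "a": ["s1", "s2"]}
    r = {"a": 2.0, "s1": 1.0, "s2": 3.0}
    c = {"drv": 0.0, "a": 1.0, "s1": 2.0, "s2": 1.0}
    print(elmore_delays(children, "drv", r, c))  # {'drv': 0.0, 'a': 8.0, 's1': 10.0, 's2': 11.0}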

Session 8F

(T5-D) Memory-Centric and Compute-in-Memory Innovations
13:30-15:35 | Thursday, January 22, 2026 | Sleeping Beauty 5
Chair(s):
Zhengwu Liu (The University of Hong Kong)
Hongwu Jiang (The Hong Kong University of Science and Technology (Guangzhou))
8F-1
13:30-13:55

Fine-Grained Parallelization of FHE Workloads in Multi-GPU Systems

*Homer Gamil (New York University), Michail Maniatakos (New York University Abu Dhabi)
Keywords
multi-GPU, FHE, Hardware Acceleration
Abstract
Fully Homomorphic Encryption (FHE) enables computation on encrypted data but remains computationally expensive, especially for large-scale workloads. Multi-GPU systems offer significant acceleration potential, but performance is highly sensitive to factors such as the FHE parameter set, chosen algorithms, number and type of GPUs, and the communication medium. In this work, we propose a methodology to automatically parallelize and distribute FHE operations across GPUs to maximize performance. Our methodology adapts execution based on system configuration, workload characteristics and FHE parameters. Experimental results across varied FHE parameters and hardware setups demonstrate efficient scaling, achieving up to 95.6%, 91.3%, and 82.4% of ideal speedup with 2, 4, and 8 GPUs respectively. Furthermore, compared to coarse grained parallelization, our approach exhibits up to a 4.2x speedup, emphasizing the importance of fine-grained parallelization in efficient FHE execution. Lastly, compared to state-of-the-art multi-GPU parallelization methodologies, our work achieves up to a 2.81x improvement for comparable GPUs.
8F-2
13:55-14:20

Exploiting the Irregular Input Sparsity in Systolic Array-based DNN Accelerators via Local Soft Pooling

Desheng Fu (College of Informatics, Huazhong Agricultural University), Jingbo Jiang, Xiaomeng Wang (The Hong Kong University of Science and Technology), Jingyang Zhu (NVIDIA Corporation), *Xizi Chen (College of Informatics, Huazhong Agricultural University), Chi-Ying Tsui (The Hong Kong University of Science and Technology)
Keywords
Neural network, Input Sparsity, Soft Pooling
Abstract
One promising approach to mitigating the computational complexity of deep neural networks is to leverage the sparsity of input activations that results from the application of the ReLU function. However, the irregular distribution of zero-valued inputs poses a challenge for efficient implementation in existing regular architectures, such as systolic arrays. Previous works usually depend on specialized architectures to bypass the dummy computations during runtime. In contrast to these prior strategies, we propose a local soft pooling method to efficiently exploit the irregular input sparsity in systolic array-based architectures. Through local soft pooling, adjacent input rows can be safely merged at runtime, compressing the sparse input matrix into a compact format that is only 1/3 to 1/2 of its original size. The compact matrix can then be directly fed into the systolic array for computation. A computation saving of over 67.78% is achieved across various networks on both CIFAR-10 and ImageNet with negligible accuracy loss. As a result, the throughput and energy efficiency are improved by 2.72 and 2.07 times, respectively.
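The exact merge rule of local soft pooling is not spelled out in the abstract; the NumPy sketch below only illustrates the underlying observation that adjacent sparse input rows whose non-zero columns do not collide can be folded into one row before being streamed into a systolic array. The bookkeeping needed to attribute partial sums back to their original rows is omitted, and the example is hypothetical rather than the paper's algorithm.

    import numpy as np

    def merge_adjacent_rows(x):
        """Greedily fold each row into the previous merged row when their
        non-zero column sets are disjoint; otherwise start a new merged row."""
        merged = [x[0].copy()]
        for row in x[1:]:
            if not np.any((merged[-1] != 0) & (row != 0)):   # no column collision
                merged[-1] += row
            else:
                merged.append(row.copy())
        return np.array(merged)

    x = np.array([[0, 3, 0, 0],
                  [5, 0, 0, 0],    # disjoint with row 0 -> merged
                  [0, 0, 7, 0],    # still disjoint      -> merged again
                  [2, 0, 0, 0]])   # collides in column 0 -> new merged row
    print(merge_adjacent_rows(x))  # 2 rows instead of 4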
8F-3
14:20-14:45

Input Reuse, Weight-Stationary Dataflow and Mapping Strategy for Depthwise Convolution in Computing-in-Memory Neural Network Accelerators

*Chia-Chun Wang (National Tsing-Hua University), Yu-Chih Tsai, Ren-Shuo Liu (National Tsing Hua University)
Keywords
AI accelerators, Accelerator architectures, In-memory computing, Convolutional neural networks
Abstract
Computing-in-Memory (CIM) is a promising solution to address the data movement bottleneck of the traditional von Neumann architecture by performing in-situ computation in the memory. However, naively mapping depthwise convolution onto general CIM leads to excessive computation latency or data storage requirements due to the limited reuse of input features. Unlike previous works that focus on designing CIM hardware to support depthwise convolution effectively, we propose an input-reuse dataflow and mapping strategy that supports depthwise convolution without modifying the macro, achieving a good balance between computation latency and storage size. The key aspect of our dataflow and mapping is to maximize the reuse of input features by allowing the CIM to produce partial sums of output features in the same cycle. Additionally, we propose a detailed hardware design of a partial-sum processing unit to handle the reconstruction of output features from the partial sums. The experimental results show that our approach achieves up to 5.19x speedup with less than 1.74x required storage size compared to the baseline design, and only 6.3% area overhead compared to the CIM macro. Our work also achieves up to 2.87x better model-wise energy efficiency.
8F-4
14:45-15:10

A Unified Compute-In-Memory Framework for Multisensory Emotion Recognition

*He Xiao (The University of Hong Kong), Yue Zhou, Dirui Xie, Qi Cheng, Xiaofang Hu (Southwest University), Zhengwu Liu, Ngai Wong (The University of Hong Kong)
Keywords
Speech emotion recognition, textual emotion recognition, multisensory, Transformer, RRAM, edge devices
Abstract
Transformer-based emotion recognition is becoming increasingly vital for human-computer interaction systems, yet faces dual challenges: software models frequently overlook the hierarchical interaction between localized semantic cues and global contextual dependencies, while hardware implementations struggle with the computational demands of parameter-heavy architectures in resource-constrained edge environments. To address these co-dependent limitations, we propose MEMO, a resistive random access memory (RRAM)-based Multisensory EMOtion recognition system featuring adaptive resource allocation through integrated hardware-software co-design. Our software framework, LiteERN, enables unified cross-modal feature processing by encoding modality-specific characteristics into hardware-compatible representations, eliminating redundant processing pipelines for heterogeneous sensory input (speech/text). Complementing this, our hardware architecture leverages RRAM-based compute-in-memory (CIM) to efficiently implement the proposed computational methods, resolving the inflexibility of existing accelerators that require modality-specific redesign. Benchmark validation demonstrates MEMO's significant efficiency improvements, achieving 14.9x to 53.5x energy reduction compared to GPU and CPU implementations while maintaining state-of-the-art accuracy (73.06% SER, 0.79 TER F1-score).
8F-5
15:10-15:35

Scope: A Scalable Merged Pipeline Framework for Multi-Chip-Module NN Accelerators

*Zongle Huang, Hongyang Jia (Tsinghua University), Kaiwei Zou (Capital Normal University), Yongpan Liu (Tsinghua University)
Keywords
NN inference, Multi-Chip-Module, Scheduling, Design space exploration
Abstract
Neural network (NN) accelerators with multi-chip-module (MCM) architectures enable the integration of massive computation capability; however, they face challenges of computing resource underutilization and off-chip communication overheads. Traditional parallelization schemes for NN inference on MCM architectures, such as intra-layer parallelism and inter-layer pipelining, fail to overcome both challenges simultaneously, limiting the scalability of MCM architectures. We observe that existing works typically deploy layers separately rather than considering them jointly. This underexploited dimension leads to compromises between system computation and communication, thus hindering optimal utilization, especially as hardware and software scale. To address this limitation, we propose Scope, a merged pipeline framework that incorporates this overlooked multi-layer dimension, thereby achieving improved throughput and scalability by relaxing the tradeoffs between system computation, communication, and memory costs. This new dimension, however, adds to the complexity of design space exploration (DSE). To tackle this, we develop a series of search algorithms that achieve an exponential-to-linear complexity reduction while identifying solutions that rank in the top 0.05% of performance. Experiments demonstrate that Scope achieves up to 1.73x throughput improvement while maintaining similar energy consumption for ResNet-152 inference compared to state-of-the-art approaches.

Session 9A

(T2-A) Advanced Embedded System Design and Optimization
15:55-18:00 | Thursday, January 22, 2026 | Snow White 1
Chair(s):
Xin Chen (University of New Mexico)
Caiwen Ding (University of Minnesota - Twin Cities)
9A-1
15:55-16:20

UltraMalloc: Efficient FPGA-based Memory Allocation Framework Optimized for HBM

*Yuwei Qu, Yiqing Mao, Jin Yanxing, Wai-Shing Luk, Lingli Wang (State Key Laboratory of Integrated Chips and Systems, Fudan University)
Keywords
Hardware acceleration, High Bandwidth Memory (HBM), Compiler, Memory Management
Abstract
Memory allocation efficiency remains a significant challenge in High-Level Synthesis (HLS) frameworks. Current dynamic memory management (DMM) techniques suffer from issues such as inefficiency, fragmentation, considerable hardware overhead, and difficulties in handling complex workloads. Conversely, existing static memory approaches often exhibit poor efficiency when addressing large-scale applications. Moreover, both dynamic and static methods lack sufficient support for High Bandwidth Memory (HBM), thereby limiting their effectiveness in complex neural network scenarios. To overcome these challenges, we propose transforming dynamic memory allocation into an optimized static memory allocation strategy specifically designed for FPGA systems. Our approach leverages the computational characteristic of neural network applications, which typically exhibit cyclic and fixed-bound behaviors. We tightly integrate an MLIR-based front-end compiler with an efficient static memory allocator, eliminating the need for dedicated allocator hardware while ensuring efficient runtime memory access. Furthermore, we introduce a customized AXI bus distribution mechanism and an address mapping strategy optimized for the multi-port and multi-bank architecture of HBM. This design significantly enhances bandwidth utilization and reduces latency. Experimental results confirm that our proposed methodology substantially improves allocation efficiency, achieves superior spatial utilization, effectively manages complex memory scenarios, and fully exploits HBM bandwidth, thereby outperforming existing state-of-the-art solutions.
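The central move described above is replacing dynamic allocation with static offsets derived from statically known buffer lifetimes. As a generic illustration of that transformation (not UltraMalloc's allocator or its HBM bank mapping), a greedy first-fit over live intervals assigns offsets so that buffers with overlapping lifetimes never overlap in space; all sizes and lifetimes below are invented.

    def static_allocate(buffers):
        """buffers: list of (name, size, start, end) with half-open live interval [start, end).
        Returns {name: offset} such that lifetime-overlapping buffers never overlap in space."""
        placed, offsets = [], {}
        for name, size, start, end in sorted(buffers, key=lambda b: b[2]):
            # Space already taken by buffers whose lifetimes overlap this one.
            conflicts = sorted((o, s) for o, s, s0, e0 in placed if start < e0 and s0 < end)
            offset = 0
            for o, s in conflicts:           # first-fit: take the first gap that is large enough
                if offset + size <= o:
                    break
                offset = max(offset, o + s)
            offsets[name] = offset
            placed.append((offset, size, start, end))
        return offsets

    bufs = [("act0", 1024, 0, 2), ("w0", 512, 0, 5), ("act1", 1024, 1, 3), ("act2", 2048, 3, 5)]
    print(static_allocate(bufs))  # act1 and act2 share an offset because their lifetimes are disjoint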
9A-2
16:20-16:45

Viper: An ILP-Based Vectorization Framework for Fully Homomorphic Encryption

*Weidong Yang, Xinmo Li, Xiangmin Guo, Jianfei Jiang, Naifeng Jing, Qin Wang, Zhigang Mao, Weiguang Sheng (Shanghai Jiao Tong University)
Keywords
Homomorphic Encryption, SIMD, Vectorization, Integer Linear Programming, Compiler
Abstract
Fully Homomorphic Encryption (FHE) enables computation on encrypted data, facilitating the offloading of computation tasks to untrusted servers. However, it suffers from extremely high computational overhead. Fortunately, modern FHE schemes support packing multiple data elements into a single ciphertext, enabling a SIMD-style computation paradigm that amortizes the heavy computing overhead, leading to a growing demand for efficient vectorization methods. Nevertheless, existing vectorization approaches focus on regular programs or exhibit suboptimal performance due to local and limited optimization. In this paper, we present Viper, a novel vectorization framework for FHE that leverages a global view of the program to enable intelligent scheduling and efficient rotation schemes. First, we identify the critical role of data replication in FHE vectorization and integrate it into the problem definition to unlock new optimization opportunities. Viper formulates vectorization as an integer linear programming (ILP) model, enabling global optimization of instruction packing and data layout. To improve solving efficiency, we propose a series of optimizations including schedule window pruning and rotation-aware lane assignment. Experimental results demonstrate that Viper achieves a geometric mean speedup of 3.05x on CPU and 4.19x on GPU compared to the state-of-the-art method Coyote.
9A-3
16:45-17:10

WPU: A Pipelined WebAssembly Processing Unit for Embedded IoT Systems

*Jinyeol Kim, Chaebin Lee, Jongwon Oh, Joungmin Park, Seung Eun Lee (Seoul National University of Science and Technology)
Keywords
Processor, WASM, Embedded System, IoT
Abstract
WebAssembly (WASM) defines a portable, stack-based virtual machine designed to execute code compiled from high-level languages such as C, C++, and Rust on modern computing platforms. However, its performance on embedded IoT systems remains significantly limited due to software interpretation overhead and resource constraints. This paper presents the WebAssembly Processing Unit (WPU), a dedicated hardware accelerator tailored for WASM execution in embedded environments. The WPU features a five-stage pipelined processor integrated with a Cortex-M0 core, supporting 94 WASM instructions across integer, floating-point, and control flow operations. The architecture supports both 32-bit and 64-bit data types, and includes LEB128 decoding with a specialized register file structure to match WASM's stack-based execution model. To address pipeline hazards, a custom forwarding mechanism and branch resolution logic are introduced, optimized for WASM's control structure. The design is implemented on an FPGA and achieves an average 2.9x speedup over interpreter-based runtimes, and up to 95x acceleration over JIT-based runtimes in selected benchmarks. Evaluation across diverse benchmark applications confirms that the WPU provides reliable, low-latency performance. This work establishes the WPU as a scalable and practical solution for accelerating WASM workloads in embedded systems.
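LEB128 is WASM's standard variable-length integer encoding, so the hardware decoding mentioned above replaces loops of the following shape. The snippet is a plain software reference for unsigned LEB128, not the WPU's hardware design.

    def decode_uleb128(data, pos=0):
        """Decode an unsigned LEB128 integer from `data` starting at `pos`.
        Each byte carries 7 payload bits; the MSB flags a continuation byte.
        Returns (value, next_pos)."""
        result, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if (byte & 0x80) == 0:
                return result, pos
            shift += 7

    # The classic example: 624485 is encoded as the bytes E5 8E 26.
    print(decode_uleb128(bytes([0xE5, 0x8E, 0x26])))  # -> (624485, 3)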
9A-4
17:10-17:35

SARA: A Stall-Aware Memory Allocation Strategy for Mixed-Criticality Systems

*Meng-Chia Lee, Wen Sheng Lim, Yuan-Hao Chang, Tei-Wei Kuo (National Taiwan University)
Keywords
Memory Allocation, Resource Allocation, Mixed-Criticality Systems, Memory-Constrained Systems
Abstract
The memory capacity in edge devices is often limited due to constraints on cost, capacity, and power. Consequently, memory competition leads to inevitable page swapping in memory-constrained mixed-criticality edge devices, causing slow storage I/O and thus performance degradation. In such scenarios, inefficient memory allocation disrupts the performance balance between applications, causing soft real-time (soft RT) tasks to miss deadlines or preventing non-real-time (non-RT) applications from optimizing throughput. Meanwhile, we observe unpredictable, long system-level stalls (called long stalls) under high memory and I/O pressure, which further degrade performance. In this work, we propose a Stall-Aware Real-Time Memory Allocator (SARA), which discovers opportunities for performance balance by allocating just enough memory to soft RT tasks to meet deadlines and, at the same time, optimizing the remaining memory for non-RT applications. To determine this minimal required memory, SARA leverages our insight into how latency, caused by memory insufficiency and measured by our proposed PSI-based metric, affects the execution time of each soft RT job, where a job runs per period and a soft RT task consists of multiple periods. Moreover, SARA detects long stalls using our definition and proactively drops affected jobs, minimizing stalls in task execution. Experiments show that SARA achieves an average of 97.13% deadline hit ratio for soft RT tasks and improves non-RT application throughput by up to 22.32x over existing approaches, even with memory capacity limited to 60% of peak demand.
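SARA's latency metric is the paper's own contribution; for background only, the Linux pressure-stall information (PSI) it builds on is exposed under /proc/pressure/memory, and a minimal reader of those counters could look like the sketch below. Field names follow the kernel's PSI interface, while the sampling and thresholding policy around it is entirely hypothetical.

    def read_memory_psi(path="/proc/pressure/memory"):
        """Parse Linux PSI memory pressure into {'some': {...}, 'full': {...}},
        each with avg10/avg60/avg300 percentages and the cumulative stall
        time 'total' in microseconds."""
        psi = {}
        with open(path) as f:
            for line in f:  # e.g. "full avg10=1.23 avg60=0.80 avg300=0.15 total=123456"
                kind, *fields = line.split()
                psi[kind] = {k: float(v) for k, v in (fld.split("=") for fld in fields)}
        return psi

    # Differencing the cumulative 'full' total between two samples of a job's
    # execution window gives the stall time that window spent waiting on memory.
    if __name__ == "__main__":
        print(read_memory_psi())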
9A-5
17:35-18:00

Zone-aware metadata placement in B-tree filesystem

*Ming-Feng Wei (National Taiwan University), Yun-Chih Chen (National Tsing Hua University), Yuan-Hao Chang, Tei-Wei Kuo (National Taiwan University)
Keywords
ZNS SSDs, Zoned Namespaces, copy-on-write file systems, B-tree filesystem
Abstract
While copy-on-write file systems like Btrfs might appear well-suited for zoned namespace (ZNS) SSDs due to their out-of-place data updates, we find that Btrfs actually exacerbates the performance interference that ZNS aims to mitigate. Btrfs's proactive zone space reclamation introduces substantial CPU and I/O overhead, leading to severe performance degradation under heavy workloads. To address these issues, we propose a new space management design for Btrfs that includes a metadata marking strategy to predict near-term metadata deletions, a recycling urgency metric to prioritize block groups for reclamation, and a space allocation algorithm that dynamically places metadata based on its marked state and block group urgency. Our approach achieves an average performance improvement of over 22% compared to the original Btrfs under sustained write conditions. Additionally, it delays the onset of performance degradation from a write volume of 23% to 31% of the device capacity. Our approach even delivers a 65% performance gain compared to Btrfs with an aggressive reclamation configuration during the entire performance deterioration period.

Session 9C

(T10-C) Advanced Thermal Analysis for Advanced Chips
15:55-18:00 | Thursday, January 22, 2026 | Snow White 3
Chair(s):
Zhiyao Xie (HKUST)
Tianshu Hou (The Chinese University of Hong Kong)
9C-1
15:55-16:20

Full-Chip Thermal Map Estimation by Multimodal Data Fusion via Denoising Diffusion

*Jing Li (Beihang University), Wei Xing (The University of Sheffield), Yuquan Sun, Yuanqing Cheng (Beihang University)
Keywords
Thermal Map Estimation, Thermal Sensors, Performance Counters, Multimodal Data Fusion, Diffusion, Probabilistic Model
Abstract
As on-chip integration density increases, thermal challenges become more prominent, and the key to addressing them lies in effective and efficient thermal analysis. Current thermal estimation methods face a fundamental paradigm limitation: they rely on a single data source that captures only partial aspects of complex thermal behavior. Performance-counter-based methods capture computational activity but lack accurate thermal readouts, while sensor-based approaches provide local temperature measurements but lack comprehensive spatial coverage. This single-source paradigm has created an insurmountable accuracy ceiling in thermal map estimation, limiting the effectiveness of modern thermal management systems. We introduce the first multimodal data fusion framework for full-chip thermal estimation, leveraging denoising diffusion models to synergistically combine performance counters and thermal sensors. Our approach treats thermal mapping as a conditional generation problem, where complementary data modalities guide the reconstruction process through progressive denoising. The key insight is that thermal behavior is inherently multimodal, requiring both activity context (performance counters) and temperature ground truth (thermal sensors) for accurate estimation. Our multimodal fusion delivers transformative results: over 90% improvement over single-source methods (0.347 vs. 4.862-6.302 average RMSE), a 41.8% average RMSE improvement over state-of-the-art approaches, and robust performance across diverse sensor configurations (9-25 sensors). Furthermore, our framework effectively captures hotspot locations, with errors typically within 0.5 K. More importantly, this work establishes multimodal data fusion as a new paradigm for thermal analysis, opening new research directions and enabling next-generation thermal management systems with unprecedented accuracy and reliability, fundamentally changing how we approach thermal analysis in modern processors.
9C-2
16:20-16:45

Graph Attention-Based Current Crowding Analysis at TSV Interfaces in 3D Power Delivery Networks

*Zheng Yang (Georgia Institute of Technology), Zhen Zhuang (The Chinese University of Hong Kong), Tsung-Yi Ho (National Tsing Hua University), Sung Kyu Lim (Georgia Institute of Technology)
Keywords
3D Integrated Circuits, Through Silicon Via, Graph Attention Network
Abstract
The increased device density and chip stacking in 3D Integrated Circuits (ICs) result in significantly higher currents in the Power Delivery Network (PDN) than in 2D ICs, posing greater challenges for maintaining power integrity and reliability. The currents in the connections between Through Silicon Via (TSV) and power wires tend to exhibit crowding phenomena. Consequently, current crowding phenomena induce excessive IR-drop and cause electromigration in 3D IC PDNs. However, to the best of our knowledge, there is no PDN analysis flow specifically considering the current crowding effect for rapid 3D IC design iterations. To address these challenges, this paper first proposes a commercial tool-based framework for 3D IC PDN current crowding analysis. Additionally, a Graph Attention Network-based (GAT-based) framework with novel PDN and TSV graph modeling methods is proposed to accelerate the commercial tool-based framework for rapid 3D IC design iterations. Both flows can capture current crowding effects and detailed current distribution analysis in TSVs for 3D ICs. For instance-level IR-drop analysis, the proposed GAT framework achieves an R^2 score of 0.971, while for TSV input current predictions, it reaches an R^2 score of 0.997. Overall, our GAT framework demonstrates a speedup of 211x over the commercial tool-based flow.
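The PDN and TSV graph modeling is the paper's contribution and is not reproduced here; for readers unfamiliar with graph attention itself, the standard single-head GAT update that such frameworks build on (Veličković et al.) is

\[
\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{\top}[\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]\big)\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{\top}[\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_k]\big)\big)},
\qquad
\mathbf{h}_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\,\mathbf{W}\mathbf{h}_j\Big),
\]

where \(\mathcal{N}(i)\) is the neighborhood of node \(i\) (here, for example, electrically adjacent PDN nodes or TSV connections), \(\mathbf{W}\) and \(\mathbf{a}\) are learned parameters, and \(\|\) denotes concatenation. Attention lets the network weight neighboring PDN nodes unevenly, which is a natural fit for localized effects such as current crowding.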
9C-3
16:45-17:10

Efficient Simulation of IC Packages with TEC Based on Adaptive Segmented Method and Spatially-Aware Thermal Neural Network

*Longfei Li, Xinyue Li, Shunxiang Lan (Shanghai Jiao Tong University), Liang Chen (Shanghai University), Haibao Chen, Min Tang (Shanghai Jiao Tong University)
Keywords
integrated circuit package, thermoelectric cooler, spatially-aware thermal neural network, thermal simulation
Abstract
In this paper, an efficient method is proposed for the simulation of integrated circuit (IC) packages with thermoelectric coolers (TECs). The computational domain is first partitioned into two distinct regions: the thermoelectric device region (TDR) and the non-thermoelectric region (NTR). For the TDR, a novel adaptive segmented method (ASM) is proposed to model the TEC legs with a 1-D model. The computational domain is adaptively meshed into multiple segments, and the temperature-dependent parameters are approximated as constants within each segment, which enables efficient solution of the temperature distribution within the TDR. For the NTR, a spatially-aware thermal neural network (SATNN) is presented to predict the temperature at the interface between the NTR and the TDR. A dual-branch architecture is proposed for extracting the spatial heat flux feature, and a fully connected branch is introduced for global feature coupling in the SATNN. Finally, the TDR and NTR are coupled via the continuity of temperature and heat flux, and the temperature distribution within the NTR is reconstructed from the resulting heat flux density. Experimental results show that the proposed method achieves a 434x computational speedup over the commercial software while maintaining good accuracy.
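The segment-wise freezing of temperature-dependent parameters can be illustrated with a plain 1-D finite-difference sketch (hypothetical material fits; fixed rather than adaptively chosen segments; Peltier and Thomson junction terms omitted for brevity):

    # Minimal 1-D sketch of a segmented TEC-leg solve: within each segment the
    # temperature-dependent properties are frozen at the segment's mean temperature,
    # and the piecewise-constant problem is re-solved until the profile settles.
    import numpy as np

    def solve_leg(T_hot=350.0, T_cold=300.0, J=1e6, L=1e-3, n_seg=8, pts=10, iters=6):
        n = n_seg * pts
        dx = L / (n - 1)
        T = np.linspace(T_hot, T_cold, n)                       # initial guess
        k_of_T   = lambda T: 1.5 + 1e-3 * (T - 300.0)           # W/(m*K), hypothetical fit
        rho_of_T = lambda T: 1e-5 * (1.0 + 2e-3 * (T - 300.0))  # ohm*m, hypothetical fit
        for _ in range(iters):
            seg_T = T.reshape(n_seg, pts).mean(axis=1)          # freeze per segment
            k   = np.repeat(k_of_T(seg_T), pts)
            rho = np.repeat(rho_of_T(seg_T), pts)
            # k_i * (T_{i-1} - 2*T_i + T_{i+1}) / dx^2 = -J^2 * rho_i   (Joule heating)
            A = np.diag(-2.0 * k) + np.diag(k[1:], -1) + np.diag(k[:-1], 1)
            b = -(J ** 2) * rho * dx ** 2
            A[0, :], A[-1, :] = 0.0, 0.0                        # Dirichlet ends
            A[0, 0] = A[-1, -1] = 1.0
            b[0], b[-1] = T_hot, T_cold
            T = np.linalg.solve(A, b)
        return T

    T_profile = solve_leg()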
9C-4
17:10-17:35

Domain Transformation and Decomposition Method for Composable Thermal Modeling and Simulation of Chiplet-Based 2.5D Integrated System

*Shunxiang Lan, Min Tang (Shanghai Jiao Tong University)
Keywords
2.5D integration, chiplet, thermal modeling and simulation, domain transformation and decomposition method
Abstract
Dynamic thermal analysis is imperative to ensure the reliability of chiplet-based 2.5D integrated systems. However, conventional simulation techniques typically incur high computational cost, especially in transient scenarios. In this paper, we propose a domain transformation and decomposition method (DTDM) to perform composable thermal modeling and simulation with high efficiency and accuracy. Temporally, the DTDM expands the time-variant temperature response into a sum of weighted Laguerre polynomials (WLPs). By leveraging Galerkin’s method and the orthogonality of WLPs, the heat conduction equation is transformed from the time domain to the Laguerre domain, thus breaking through the bottleneck of conventional marching-on-in-time approaches in terms of stability and efficiency. Spatially, the DTDM decomposes the system into multiple composable modules (CMs) according to the structural and functional characteristics. For each CM, a novel Laguerre-based macromodel is constructed to represent its internal heat transfer properties by mapping the relationship between the temperature and heat flux at the interface. Finally, the CMs are assembled together as a complete system by treating the macromodels as equivalent thermal boundary conditions, which reduces the computational complexity in both the temporal and spatial domains and thus facilitates efficient transient simulation. Moreover, since the resulting Laguerre-based macromodels are reusable in different system configurations, the DTDM is particularly suitable for composable thermal modeling and simulation. Numerical experiments on typical 2.5D chiplet-based integrated systems show that the proposed method delivers a 485x speedup compared to the commercial software, while maintaining a maximum absolute error below 0.02 K.
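For reference, the temporal part of the transformation can be summarized as follows (illustrative notation; the paper's scaling and weighting conventions may differ). With weighted Laguerre basis functions and their orthogonality,

    \varphi_q(st) = e^{-st/2} L_q(st), \qquad
    \int_0^{\infty} \varphi_p(\bar t)\,\varphi_q(\bar t)\,\mathrm{d}\bar t = \delta_{pq}, \qquad
    T(\mathbf{r},t) \approx \sum_{q=0}^{Q} T_q(\mathbf{r})\,\varphi_q(st),

Galerkin testing of the heat conduction equation \rho c\,\partial T/\partial t = \nabla\cdot(\kappa\nabla T) + g yields one time-independent equation per expansion order,

    \nabla\cdot\bigl(\kappa\nabla T_q(\mathbf{r})\bigr)
      - \rho c\, s\Bigl(\tfrac{1}{2}T_q(\mathbf{r}) + \sum_{p=0}^{q-1} T_p(\mathbf{r})\Bigr)
      = -\,g_q(\mathbf{r}), \qquad q = 0,1,\dots,Q,

where s is the time-scaling factor and g_q is the Laguerre coefficient of the heat source. The coefficients T_q are obtained order by order (marching on in degree), which is what removes the time-step stability constraint of conventional marching-on-in-time schemes.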
9C-5
17:35-18:00

ThPA: Thermal Simulation for Advanced ICs

Bo-Wen Chen, *Yong-Han Lin, Chien-Yu Lin, Yu-Min Lee (National Yang Ming Chiao Tung University)
Keywords
Thermal simulation, Finite difference method, Algebraic multigrid method, Advanced integrated circuits
Abstract
With the continuous scaling of advanced integrated circuits, thermal management has emerged as a critical design challenge due to increasing power densities and structural complexity. To ensure system reliability, especially during the sign-off stage, thermal simulation tools must provide both high spatial resolution and computational efficiency. In this work, we present ThPA, a high-performance thermal simulation framework based on finite-difference discretization and an efficient fine-grained iterative solver. Specifically, ThPA leverages an aggregation-based algebraic multigrid (AgAMG) preconditioned conjugate gradient (PCG) solver, which incorporates a novel double-pairing aggregation strategy to reduce the AgAMG setup overhead and accelerates convergence by using Krylov-subspace-based multigrid cycles as the preconditioner. Experimental results demonstrate that ThPA achieves up to a 46x speedup over commercial solvers, while maintaining a mean absolute temperature difference below 0.08 ℃ and a root-mean-square temperature difference under 0.11 ℃. These results validate ThPA as a fast and accurate simulator for advanced IC designs.
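An analogous, much simpler aggregation-based AMG-preconditioned CG can be set up with the open-source pyamg/SciPy stack; this is a stand-in for illustration, not ThPA's double-pairing aggregation or its Krylov-cycle preconditioner:

    # Aggregation-based AMG-preconditioned CG for a thermal finite-difference system,
    # using pyamg's smoothed-aggregation solver as a stand-in for AgAMG.
    import numpy as np
    import scipy.sparse.linalg as spla
    import pyamg

    def solve_thermal(G, p):
        """G: sparse SPD thermal conductance matrix; p: power (heat source) vector."""
        ml = pyamg.smoothed_aggregation_solver(G.tocsr())  # aggregation-based AMG hierarchy
        M = ml.aspreconditioner(cycle='V')                 # one V-cycle as preconditioner
        T, info = spla.cg(G, p, M=M, maxiter=200)
        assert info == 0, "CG did not converge"
        return T

    # small demo: 2-D Laplacian as a stand-in thermal stencil
    A = pyamg.gallery.poisson((200, 200), format='csr')
    b = np.ones(A.shape[0])
    T = solve_thermal(A, b)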

Session 9D

Panel: AI Accelerators at a Crossroads: Who Will Power the Next Decade of AI?
15:55-18:00 | Thursday, January 22, 2026 | Sleeping Beauty 1/2
Moderator:
Li Jiang (Shanghai Jiao Tong University)
Panelists:
15:55-16:20

Yuan Xie (The Hong Kong University of Science and Technology)
Yu Wang (Tsinghua University)
Gang Chen (Shanghai Houmo Technology Co.,Ltd.)
Shaodi Wang (Hangzhou Zhicun (Witmem) Technology Co., Ltd.)
Guangyu Sun (Peking University)

Session 9E

(T9-A) Machine Learning and Optimization in Physical Design
15:55-18:00 | Thursday, January 22, 2026 | Sleeping Beauty 3
Chair(s):
Heechun Park (UNIST)
Rupesh Karn (NYU)
9E-1
15:55-16:20

DRLPlace: A Deep Reinforcement Learning-based Irregular and High-Density Printed Circuit Board Placement Method

*Lei Cai, Ke Cheng, Jixin Zhang (Hubei University of Technology), Haiyun Li (Tsinghua University), Zhiwei Ye (Hubei University of Technology)
Keywords
Printed Circuit Board, Placement, Deep Reinforcement Learning, Proximal Policy Optimization
Abstract
The placement of components on a printed circuit board (PCB) plays a critical role in modern PCB design workflows. Modern PCBs are characterized by irregular board outlines, significant variations in component sizes, and high component density, which make it challenging for traditional methods to accomplish the placement task effectively. To address these challenges, we propose a deep reinforcement learning-based PCB placement method capable of handling irregular, high-density PCBs with various real-world placement constraints. Because accurate reward signals are difficult to obtain before a placement is complete, which hinders effective optimization guidance, we propose a proximal policy optimization (PPO)-based framework that predicts the reward at each step of the placement process. Because the limited training samples constrain reward prediction accuracy in the PPO process, we extract layout features, net features, and density maps and feed them into a pre-trained ResNet to train a policy-value model that improves this accuracy. To speed up training convergence, we introduce a chip-centric clustering method for better initialization. We also propose a push-and-reflection-based legalization method to ensure legality within irregularly shaped boundaries. Experimental results show that the proposed method decreases half-perimeter wirelength by up to 16% compared to manual designs while handling multiple PCB placement constraints, significantly exceeding state-of-the-art PCB placement methods.
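The PPO update at the heart of such a framework uses the standard clipped-surrogate objective; the sketch below shows only that generic loss (the paper-specific reward predictor, ResNet features, clustering, and legalization are omitted):

    # Generic PPO clipped-surrogate loss used to train a placement policy
    # (standard PPO; not the paper's full training pipeline).
    import torch

    def ppo_loss(logp_new, logp_old, advantages, values, returns, eps=0.2, c_v=0.5):
        ratio = torch.exp(logp_new - logp_old)                 # pi_new / pi_old per action
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        value_loss = torch.nn.functional.mse_loss(values, returns)
        return policy_loss + c_v * value_loss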
9E-2
16:20-16:45

CNN-Assisted Low-Power Clock Tree Synthesis for 3D ICs

*Chenbo Xi, Jindong Zhou, Pingqiang Zhou (ShanghaiTech University)
Keywords
Clock tree synthesis, 3D-IC, Convolutional neural network
Abstract
In this work, a CNN-assisted low-power clock topology generation method for 3D clock tree synthesis (3D-CTS) is proposed. Our approach considers both local and global costs in each merging step to avoid getting stuck in local optima. To explore the trade-off between local and global costs, we use a CNN to set two important weighting factors, resulting in an optimized clock tree with reduced power consumption. Experimental results on the ISPD09 benchmarks show that, compared with conventional methods, our approach reduces wirelength by up to 33.33% and power consumption by 20.28%. We also demonstrate the transferability of our method on the larger-scale ISPD10 benchmarks.
9E-3
16:45-17:10

DALI-PD: Diffusion-based Synthetic Layout Heatmap Generation for ML in Physical Design

*Bing-Yue Wu, Vidya Chhabria (Arizona State University)
Keywords
Physical Design, Diffusion Model, Synthetic Data Generation
Abstract
Machine learning (ML) has demonstrated significant promise in various physical design (PD) tasks. However, model generalizability remains limited by the availability of high-quality, large-scale training datasets. Creating such datasets is often computationally expensive and constrained by IP restrictions. The few public datasets that exist are typically static, slow to generate, and in need of frequent updates. To address these limitations, we present DALI-PD, a scalable framework for generating synthetic layout heatmaps to accelerate ML research in PD. DALI-PD uses a diffusion model to generate diverse layout heatmaps via fast inference in seconds. The heatmaps include power, IR drop, congestion, macro placement, and cell density maps. Using DALI-PD, we created a dataset comprising over 20,000 layout configurations with varying macro counts and placements. These heatmaps closely resemble real layouts and improve accuracy on downstream ML tasks such as IR drop and congestion prediction.
9E-4
17:10-17:35

Partitioning-free 3D-IC Floorplanning

*Shuo Ren, Zhen Zhuang, Rongliang Fu, Leilei Jin, Libo Shen, Bei Yu, Tsung-Yi Ho (The Chinese University of Hong Kong)
Keywords
3DIC, 3D Floorplanning, Physical Design, Partitioning-free, Semi-definite Programming
Abstract
While 3D IC integration offers a promising path to alleviate interconnect bottlenecks in 2D designs, efficient 3D floorplanning remains challenging due to its increased spatial complexity. Prior approaches that directly extend 2D representations into 3D suffer from exponential solution spaces, while pre-partitioning strategies constrain the global optimization landscape by fixing block-to-die assignments early. We propose Great3D, a partitioning-free 3D floorplanning framework that jointly optimizes block placement and die assignment without relying on predefined die partitioning. By iteratively selecting critical subsets of blocks based on hybrid-aware connectivity and refining them through local 3D SDP backtracking optimization, Great3D effectively minimizes inter-die wirelength while satisfying outline constraints. Experimental results on GSRC benchmarks show that Great3D consistently achieves lower wirelength than all baseline combinations, with up to 60% reduction on large-scale designs. Furthermore, the method maintains competitive runtime performance while demonstrating better scalability and robustness across diverse benchmark sizes. These results establish Great3D as a scalable and effective partitioning-free solution for high-quality 3D IC floorplanning.
9E-5
17:35-18:00

A Heterogeneous Graph-based Gate Sizer Integrating Graph Attention Network and Transformer

*Jinmo Ahn, Jinoh Cho, Jaemin Seo, Jaeseung Lee, Jakang Lee, Seokhyeong Kang (Pohang University of Science and Technology)
Keywords
Gate Sizing, AI-EDA, Post Placement Optimization
Abstract
Gate sizing is a critical step in achieving the target power, performance, and area (PPA) in chip design. In recent years, machine learning (ML) methods have emerged as a new paradigm for gate sizing. Their promising results have gained attention; however, the practical applicability and performance of existing works are limited by at least one of the following factors: (1) long runtime due to test-time optimization or autoregressive prediction; (2) limited exploration of architectural choices; and (3) an overly simplified data representation, a homogeneous graph, which merges pins and cells into a single node. To improve both practicality and performance, we introduce a novel ML-based gate sizer, dubbed DPH-Sizer, which directly predicts appropriate gate sizes using a heterogeneous graph that separates cells and their pins into distinct nodes. This heterogeneous graph explicitly captures the relationships between different circuit elements, leading to enhanced performance. We further propose Inter-Cell and Intra-Cell GAT blocks to explicitly capture inter-cell and intra-cell information, followed by transformer blocks at the end of the GAT stack to capture global path-level features. In our experiments, we validate each of the proposed components and demonstrate that DPH-Sizer keeps power consumption within 2.0% on average while achieving improvements of 54.3% in timing (WNS) and 1.3% in area.
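A minimal example of a pin/cell heterogeneous graph, expressed with PyTorch Geometric's HeteroData (node features, counts, and edge types here are placeholders, not the paper's construction):

    # Illustrative heterogeneous circuit graph with separate 'cell' and 'pin' nodes.
    import torch
    from torch_geometric.data import HeteroData

    data = HeteroData()
    data['cell'].x = torch.randn(1000, 16)     # e.g., size, drive strength, slack
    data['pin'].x  = torch.randn(4000, 8)      # e.g., capacitance, slew, arrival time
    # pin -> cell membership edges and pin -> pin net-connectivity edges
    pin2cell = torch.stack([torch.arange(4000), torch.randint(0, 1000, (4000,))])
    data['pin', 'of', 'cell'].edge_index = pin2cell
    data['pin', 'net', 'pin'].edge_index = torch.randint(0, 4000, (2, 8000))

Type-specific message passing (e.g., torch_geometric.nn.HeteroConv or to_hetero) can then apply separate attention blocks to pin-cell and pin-pin edges, in the spirit of the Intra-Cell and Inter-Cell GAT blocks described above.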

Session 9F

(T5-C) Ultra-Low Precision and Quantization Techniques
15:55-18:00 | Thursday, January 22, 2026 | Sleeping Beauty 5
Chair(s):
Ngai Wong (The University of Hong Kong)
Yuzhe Ma (The Hong Kong University of Science and Technology (Guangzhou))
9F-1
15:55-16:20

Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weights Matrix Multiplication

*Haoxuan Shan, Cong Guo, Chiyue Wei, Feng Cheng, Junyao Zhang, Hai Li, Yiran Chen (Duke University)
Keywords
Ultra-low-bit quantization, Ternary weights matrix multiplication, Lookup Table-based accelerator, LLM
Abstract
The rapid scaling of large language models demands increasingly efficient hardware. Quantization offers a promising trade-off between efficiency and performance. Ultra-low-bit quantization creates abundant opportunities for result reuse, which can be exploited through lookup-table (LUT)-based acceleration. However, existing LUT-based methods suffer from computation and hardware overheads for LUT construction and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. We propose Platinum, a lightweight ASIC accelerator for mixed-precision matrix multiplication (mpGEMM) using LUTs. Platinum reduces LUT construction overhead via offline-generated construction paths and supports both general bit-serial and optimized ternary-weight execution through adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6x, 4.09x, and 2.15x speedups over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, along with energy reductions of 32.4x, 3.23x, and 20.9x, all within a 0.96 mm^2 chip area. This demonstrates the potential of LUT-based ASICs as efficient, scalable solutions for ultra-low-bit neural networks on edge platforms.
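The result-reuse idea behind LUT-based mpGEMM can be sketched in a few lines: activations are grouped, all 2^g partial sums per group are tabulated once, and each ternary weight group then needs only two table lookups (one for its +1 positions, one for its -1 positions). This is a functional illustration with an assumed group size, not a model of Platinum's datapath or path switching:

    # LUT-based ternary dot product: tabulate subset sums once, then two lookups/group.
    import numpy as np

    G = 4  # activation group size

    def build_luts(x):
        """x: (K,) activations with K % G == 0. Returns (K//G, 2**G) subset-sum tables."""
        groups = x.reshape(-1, G)
        masks = np.arange(2 ** G)
        sel = ((masks[:, None] >> np.arange(G)) & 1).astype(x.dtype)  # subset indicators
        return groups @ sel.T                                         # all partial sums

    def ternary_dot(w, luts):
        """w: (K,) ternary weights in {-1, 0, +1}; two lookups per group."""
        wg = w.reshape(-1, G)
        bit = 1 << np.arange(G)
        pos = ((wg > 0).astype(int) * bit).sum(axis=1)   # index of +1 positions
        neg = ((wg < 0).astype(int) * bit).sum(axis=1)   # index of -1 positions
        rows = np.arange(luts.shape[0])
        return luts[rows, pos].sum() - luts[rows, neg].sum()

    x = np.random.randn(16).astype(np.float32)
    w = np.random.choice([-1, 0, 1], size=16)
    assert np.isclose(ternary_dot(w, build_luts(x)), float(w @ x), atol=1e-5)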
9F-2
16:20-16:45

ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

Wenqiang Wang, *Yijia Zhang (Shanghai Jiao Tong University), Zikai Zhang, Guanting Huo (Peking University), Hao Liang (Shanghai Jiao Tong University), Shijie Cao (Microsoft Research Asia), Ningyi Xu (Shanghai Jiao Tong University)
Keywords
LLM, Accelerator, ROM, SRAM
Abstract
As large language models (LLMs) demonstrate powerful capabilities, deploying them on edge devices has become increasingly crucial, offering advantages in privacy and real-time interaction. QLoRA has emerged as the standard approach for on-device LLMs, leveraging quantized models to reduce memory and computational costs while utilizing LoRA for task-specific adaptability. In this work, we propose ROMA, a QLoRA accelerator with a hybrid storage architecture that uses ROM for quantized base models and SRAM for LoRA weights and KV cache. Our insight is that the quantized base model is stable and converged, making it well-suited for ROM storage. Meanwhile, LoRA modules offer the flexibility to adapt to new data without requiring updates to the base model. To further reduce the area cost of ROM, we introduce a novel B-ROM design and integrate it with the compute unit to form a fused cell for efficient use of chip resources. ROMA can effectively store both a 4-bit 3B and a 2-bit 8B LLaMA model entirely on-chip, achieving a notable generation speed exceeding 20,000 tokens/s without requiring external memory.
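The storage split that ROMA exploits follows directly from the QLoRA computation pattern: a frozen, quantized base weight (ROM-resident in ROMA) plus small trainable low-rank adapters (SRAM-resident). The sketch below uses a simple per-channel 4-bit quantizer purely for illustration; it is not the paper's quantization scheme or hardware datapath:

    # QLoRA-style forward pass: frozen quantized base path plus LoRA adapter path.
    import numpy as np

    def quantize_4bit(W):                      # per-output-channel symmetric 4-bit
        scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
        q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
        return q, scale

    def qlora_forward(x, Wq, scale, A, B):     # y = dequant(Wq) @ x + B @ (A @ x)
        base = (Wq.astype(np.float32) * scale) @ x      # frozen base path (ROM)
        lora = B @ (A @ x)                              # task-specific adapter (SRAM)
        return base + lora

    W = np.random.randn(64, 64).astype(np.float32)
    Wq, s = quantize_4bit(W)
    A = 0.01 * np.random.randn(8, 64).astype(np.float32)   # rank-8 adapters
    B = 0.01 * np.random.randn(64, 8).astype(np.float32)
    y = qlora_forward(np.random.randn(64).astype(np.float32), Wq, s, A, B)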
9F-3
16:45-17:10

Noise-Agnostic One-Shot Training and Retraining for Robust DNN Inferencing on Analog Compute-in-Memory Systems

Ashish Reddy Bommana (Arizona State University), Ben Feinberg, Patrick Xiao, Christopher Bennet (Sandia National Laboratories), Matthew Marinella, *Krishnendu Chakrabarty (Arizona State University)
Keywords
Hardware-Aware Training, Analog Compute-in-Memory, One-Shot Training, Analog Noise Tolerance, ADC Error Tolerance
Abstract
Analog Compute-in-Memory (ACiM) architectures are promising alternatives to traditional von Neumann-based systems for accelerating deep neural networks (DNNs), as they alleviate the memory bottleneck by performing in-situ matrix-vector multiplications. However, the analog nature of computation in ACiM makes DNNs highly susceptible to noise and process variations. To mitigate the effects of analog noise, existing approaches rely on variation-aware or noise-aware training, retraining, or fine-tuning. These methods, however, are not scalable, as they require chip-specific retraining and typically involve separate training runs for different levels of noise tolerance. Moreover, they overlook the inherent fault tolerance of analog-to-digital converters (ADCs). To address these limitations, we propose a one-shot training and retraining strategy for robust DNN inferencing on ACiM platforms. Our method is guided by a detailed analysis of error propagation through ADCs, which reveals that robustness can be enhanced by strategically reshaping the weight distribution to better align with ADC resilience characteristics. Simulation results and experiments on fabricated chips show that the proposed method improves inference accuracy by 70%-90% for ResNet-18 and DenseNet-121 under 70% noise injection on CIFAR-10 and SVHN, and by 50%-80% for VGG-16 under 50% noise. These gains are achieved with only a 5% energy overhead due to the modified weight distribution.
9F-4
17:10-17:35

Activation-free Implicit Neural Representation via Finite-State-Machine Based Stochastic Computing

*Xincheng Feng, Wenyong Zhou, Taiqiang Wu (The University of Hong Kong), Meng Li (Peking University), Zhengwu Liu (The University of Hong Kong), Ngai Wong (The University of Hong Kong)
Keywords
Stochastic Computing, Finite-State-Machine, Implicit Neural Representation, Activation Free
Abstract
Implicit neural representations (INRs) have revolutionized signal encoding by using neural networks to map coordinates to signal attributes. Despite their success, INRs present significant hardware implementation challenges due to complex activation functions and floating-point operations. Unlike previous efforts, such as model pruning or quantization, we address these challenges by introducing AIRFSC, a novel activation-free stochastic computing (SC) architecture that leverages finite-state machines (FSMs). AIRFSC eliminates complex activation functions and processes data efficiently through stochastic bitstreams. Our approach decomposes the input signal into a series of Fourier basis functions, enabling the FSM-based architecture to learn smooth coordinate-to-attribute mappings for accurate signal reconstruction. Extensive experiments on diverse signal types demonstrate that AIRFSC achieves reconstruction quality comparable to state-of-the-art (SOTA) INRs implemented with multi-layer perceptrons (MLPs), while significantly improving hardware efficiency. Specifically, AIRFSC reduces power and area by 96.8% and 72.8% compared to Sinusoidal Representation Networks (SIREN), and by 97.9% and 81.3% compared to Wavelet Implicit Representation (WIRE).
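For readers unfamiliar with stochastic computing, the sketch below shows the bitstream encoding it relies on, multiplication as a bitwise AND, and a classic saturating up/down-counter FSM (Brown-Card style) that approximates a nonlinear function over the stream; the paper's activation-free FSM blocks and Fourier-basis decomposition are not reproduced here:

    # Basic stochastic-computing primitives underlying FSM-based SC designs.
    import numpy as np

    def to_stream(p, n=4096, rng=None):
        rng = rng or np.random.default_rng()
        return (rng.random(n) < p).astype(np.uint8)   # unipolar encoding of p in [0,1]

    def from_stream(s):
        return s.mean()

    a, b = 0.6, 0.3
    prod = from_stream(to_stream(a) & to_stream(b))   # approximately a * b = 0.18

    def fsm_saturating_counter(stream, n_states=16):
        # Classic FSM element: output 1 while the state is in the upper half;
        # over a bipolar-encoded input this approximates a tanh-like function.
        s, out = n_states // 2, np.empty_like(stream)
        for i, bit in enumerate(stream):
            s = min(s + 1, n_states - 1) if bit else max(s - 1, 0)
            out[i] = 1 if s >= n_states // 2 else 0
        return out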
9F-5
17:35-18:00

dLLM-OPU: An FPGA Overlay Processor for Accelerated Diffusion Large Language Models

*Yangbo Wei, Shaoqiang Lu (Shanghai Jiao Tong University), Junhong Qian (Southeast University), Lei He (Eastern Institute of Technology, Ningbo), Qin Dongge, Xiao Shi (Southeast University), Chen Wu (Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo), Linfeng Zhang (Shanghai Jiao Tong University)
Keywords
dLLM, LLaDA, FPGA Accelerator
Abstract
Large Language Models (LLMs) achieve unprecedented performance across diverse tasks, benefiting from autoregressive generation. However, this left-to-right decoding paradigm inherently limits contextual understanding. Diffusion-based LLMs (dLLMs) offer a promising alternative by iteratively refining sequences via denoising, enabling stronger bidirectional context modeling and improved generation quality. Nevertheless, dLLMs face two main challenges: redundant computation and memory overhead in multi-step denoising, and excessive inference cost from over-denoising under fixed-step schedules. To address these issues, we propose dLLM-OPU, an FPGA overlay processor that accelerates dLLMs. Our solution features two key innovations: (1) a Region-Adaptive Caching for Dynamic Column Sparsity framework that exploits temporal locality for selective recomputation without model retraining, and (2) a Token-Entropy-based Early Stopping strategy that dynamically terminates the denoising process based on token-level convergence metrics. We implement these innovations through a specialized sparse processing element (PE) array that maximizes top-k sparsity utilization by minimizing idle cycles via row-column concatenation, complemented by an efficient cache management system that reduces memory access latency and a flexible entropy-based decoding unit. Implemented on a U200 FPGA, dLLM-OPU achieves a 2.2x-5.1x speedup and 7.6x-20.3x higher energy efficiency than an RTX 4090 on LLaDA.
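The token-entropy-based early-stopping idea can be expressed compactly: denoising halts once every token's predictive entropy falls below a threshold (the threshold and metric here are assumptions, and the hardware decoding unit is not modeled):

    # Sketch of token-entropy-based early stopping for iterative denoising.
    import torch

    def should_stop(logits, tau=0.1):
        """logits: (seq_len, vocab). Stop when every token's predictive entropy is low."""
        p = torch.softmax(logits, dim=-1)
        entropy = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)   # per-token entropy
        return bool((entropy < tau).all())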
Agenda Overview