9B: Leading Edge Design Methodology for Processors


9B-1
Title Design Methodology for 2.4GHz Dual-Core Microprocessor
Author Noriyuki Ito, Hiroaki Komatsu, Akira Kanuma, Akihiro Yoshitake, Yoshiyasu Tanamura, Hiroyuki Sugiyama, Ryoichi Yamashita, *Ken-ichi Nabeya, Hironobu Yoshino, Hitoshi Yamanaka, Masahiro Yanagida, Yoshitomo Ozeki, Kinya Ishizaka, Takeshi Kono, Yutaka Isoda
Abstract This paper presents a design methodology that was applied to the design of a 2.4GHz dual-core SPARC64 microprocessor in 90nm CMOS technology. It focuses on the newly adopted techniques, such as efficient data management in dual-core design, fast delay calculation of the noise-immune clock distribution circuit, enhanced signal integrity analysis of a large-scale custom macro design, and enhanced diagnosis capability using a logic BIST circuit.
Slides (pdf file) 9B-1

9B-2
Title An Embedded Low Power/Cost 16-Bit Data/Instruction Microprocessor Compatible with ARM7 Software Tools
Author *Fu-Ching Yang, Ing-Jer Huang (National Sun Yat-Sen University, Taiwan)
Abstract A 16-bit THUMB instruction set microprocessor is proposed for low cost and low power in short-precision computing. Relative to ARM7, it achieves 40% of the gate count, 51% of the power consumption, and 160% of the clock frequency; its performance is also 67% better with narrow-width memory at the same clock frequency. It remains compatible with ARM7 software.
Slides (pdf file) 9B-2

9B-3
Title A Novel Reconfigurable Low Power Distributed Arithmetic Architecture for Multimedia Applications
Author *Zhenyu Liu, Tughrul Arslan, Ahmet T. Erdogan (The University of Edinburgh, Great Britain)
Abstract The use of reconfigurable cores in system-on-chip (SoC) designs is increasingly becoming a trend. Such cores are being used for their flexibility, powerful functionality and low power consumption. Distributed Arithmetic (DA) is a powerful algorithm widely used in many fields of multimedia for its efficiency. This paper presents a novel reconfigurable adder-based architecture for DA to realize the inner product, which is the key computation in many digital signal processing applications. A 1D DCT is mapped onto the architecture. Compared with some existing ASIC designs, the new architecture achieves good performance in area, speed and power.
Slides (pdf file) 9B-3
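
As background for the 9B-3 abstract, the following is a minimal sketch of the classic lookup-table form of distributed arithmetic for an inner product. It is not the adder-based architecture the paper proposes; it assumes unsigned inputs and integer coefficients, and the name da_inner_product is hypothetical.

```python
def da_inner_product(coeffs, xs, bits):
    """Compute sum(c * x) bit-serially, for unsigned xs with x < 2**bits.

    Illustrative LUT-based distributed arithmetic only; the paper's
    adder-based design is a different realization of the same computation.
    """
    n = len(coeffs)
    # lut[addr] holds the sum of the coefficients whose index bit is set in addr.
    lut = [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        # Weight the partial sum by the bit position and accumulate.
        acc += lut[addr] << b
    return acc
```

The point of the technique is that no multiplier is needed: each step is one table lookup, one shift and one addition.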

9B-4
Title Exploration of Low Power Adders for a SIMD Data Path
Author *Giacomo Paci (IMEC and DEIS, University of Bologna, Italy), Paul Marchal (IMEC, Belgium), Luca Benini (DEIS, University of Bologna, Italy)
Abstract Hardware for Ambient Intelligence needs to achieve extremely high computational efficiency (up to 40GOPS/W). An important way to reach this is to exploit parallelism, and more specifically the data-level parallelism enabled by SIMD. Whereas a large body of research exists on the benefits of, the architectural design of, and compilation onto SIMD, the design of energy-optimal functional units for SIMD has received limited attention. It appears that existing SIMD functional units are designed in an area-optimal, but not energy-optimal, way. By exploiting the difference in critical path length between the types of operations (e.g., 4x8/2x16/1x32), SIMD adders can be developed that save up to 40% of energy. In this paper, we present these adders, discuss the issues of building them, and quantify their benefits for different usage scenarios and operating frequencies.
Slides (pdf file) 9B-4
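
To make the 4x8/2x16/1x32 operation types in the 9B-4 abstract concrete, here is a small software model of a 32-bit adder whose carry chain is cut at lane boundaries. It only illustrates the functional behaviour of the three modes; it says nothing about the circuit or energy characteristics studied in the paper, and the name simd_add is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def simd_add(a, b, lane_bits):
    """Add two 32-bit words as independent lanes of 8, 16 or 32 bits.

    Each lane wraps around on overflow; no carry crosses a lane boundary.
    """
    assert lane_bits in (8, 16, 32)
    # msb marks the most significant bit of every lane.
    msb = 0
    for i in range(lane_bits - 1, 32, lane_bits):
        msb |= 1 << i
    # Add the lower bits of each lane; a carry can reach the lane's top
    # bit but cannot spill into the next lane.
    low = (a & ~msb & MASK32) + (b & ~msb & MASK32)
    # Fold the top bits in with XOR, which discards the carry out of each lane.
    return (low ^ ((a ^ b) & msb)) & MASK32
```

For example, adding 0x00FFFFFF and 0x00000001 gives 0x00FFFF00 in 4x8 mode, 0x00FF0000 in 2x16 mode, and 0x01000000 in 1x32 mode, which shows how the carry travels further as the lanes get wider.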

9B-5
Title Micro-architecture Pipelining Optimization with Throughput-Aware Floorplanning
Author *Yuchun Ma, Zhuoyuan Li (Tsinghua University, China), Jason Cong (University of California, Los Angeles, United States), Xianlong Hong (Tsinghua University, China), Glenn Reinman (University of California, Los Angeles, United States), Sheqin Dong, Qiang Z
Abstract For modern processor designs in nanometer technologies, both block and interconnect pipelining are needed to achieve multi-gigahertz clock frequency, but previous approaches consider block pipelining and interconnect pipelining separately. For example, all recent works on wire pipelining assume pre-pipelined components and consider only inserting pipeline stages on point-to-point wire or bus connections. To the best of our knowledge, this paper is the first that considers block pipelining and interconnect pipelining simultaneously. We optimize multiple critical paths or loops in the micro-architecture and insert the pipeline stages optimally in the blocks and wires of these loops to meet the clock frequency requirement. We propose two approaches to this problem. The first approach is based on mixed integer linear programming (MILP), which is theoretically guaranteed to produce the optimal solution, and the second is an efficient graph-based algorithm that produces near-optimal solutions. Experimental results show that simultaneous block and interconnect pipelining leads to more than 20% improvement over wire pipelining alone on the overall processor performance. Moreover, the graph-based approach gives solutions very close to the MILP results (2% more than MILP results on average) but in a much shorter runtime.
Slides (pdf file) 9B-5
Last Updated on: January 29, 2007