Title | A Comprehensive and Accurate Latency Model for Network-on-Chip Performance Analysis |
Author | *Zhiliang Qian (The Hong Kong University of Science and Technology, Hong Kong), Da-cheng Juan (Carnegie Mellon University, U.S.A.), Paul Bogdan (University of Southern California, U.S.A.), Chi-Ying Tsui (The Hong Kong University of Science and Technology, Hong Kong), Diana Marculescu, Radu Marculescu (Carnegie Mellon University, U.S.A.) |
Page | pp. 323 - 328 |
Keyword | Queuing model, Analytical model, Network on Chip, Latency |
Abstract | In this work, we propose a new, accurate, and comprehensive analytical model for Network-on-Chip (NoC) performance analysis. Given the application communication graph, the NoC architecture, the task mapping, and the routing algorithm, the proposed framework analyzes the link dependencies and then determines the ordering of the queuing analysis for accurate performance modeling. To this end, the channel waiting times in the links are estimated using a generalized G/G/1/K queuing model, which can handle bursty traffic and dependent arrival times with general service time distributions. The proposed model is general and can be used to analyze various traffic scenarios for NoC platforms with arbitrary buffer and packet lengths. Experimental results on both synthetic and real applications demonstrate the accuracy and scalability of the proposed model. |
Slides |
Title | A Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution |
Author | Georgios Faldamis (Cavium, Inc., U.S.A.), *Weiwei Jiang (Columbia University, U.S.A.), Gennette Gill (D.E. Shaw Research, U.S.A.), Steven M. Nowick (Columbia University, U.S.A.) |
Page | pp. 329 - 336 |
Keyword | asynchronous, network-on-chip, low-latency, arbitration, mesh-of-trees |
Abstract | A new asynchronous arbitration node is introduced for use as a building block in an asynchronous interconnection network. The target network topology is a variant Mesh-of-Trees (MoT), combining a binary fan-out (i.e. routing) network and a binary fan-in (i.e. arbitration) network, which is becoming widely used for multi-core shared-memory interfaces. The two key features are: (i) each fan-in node can resolve its arbitration and pre-allocate the corresponding input channel before the actual data arrives; and (ii) a lightweight shadow monitoring network fast-forwards information as soon as data enters the network, in continuous time, without synchronization to a fixed-rate clock, notifying each fan-in node on its path to enable the early arbitration. The router nodes were designed in IBM 90nm technology using an ARM standard cell library. SPECTRE simulations indicate that the new arbitration node provided significant reductions in latency of up to 54.4% over prior designs, while maintaining roughly comparable throughput. Network-level simulations were then performed on eight diverse synthetic benchmarks, comparing the new approach ("early arbitration") with two earlier alternative asynchronous MoT networks ("baseline" and "predictive"), using a mix of random and deterministic traffic. Considerable improvements in system latency were obtained on all benchmarks, ranging from 13.0% to 38.7%. The early arbitration strategy also showed direct benefits for the two most adversarial benchmarks, "uniform random traffic" and "hotspot8". |
Slides |
Title | A Vertically Integrated and Interoperable Multi-Vendor Synthesis Flow for Predictable NoC Design in Nanoscale Technologies |
Author | *Alberto Ghiribaldi, Herve Tatenguem Fankem (University of Ferrara, Italy), Federico Angiolini (iNoCs, Switzerland), Mikkel Stensgaard, Tobias Bjerregaard (Teklatech, Denmark), Davide Bertozzi (University of Ferrara, Italy) |
Page | pp. 337 - 342 |
Keyword | Network-on-Chip, Design Flow, EDA tool, Embedded Systems |
Abstract | We deliver a design flow for the synthesis and convergence of application-specific networks-on-chip. The flow comes with novel features that can better address nanoscale design challenges: front-end driven floorplanning, dynamic IR-drop minimization, fast and accurate system-level power grid models, and predictable link design. Above all, such features are addressed by different prototype engines, even from different vendors, that can be smoothly integrated into the flow by means of a common specification format, the Communication Exchange Format (CEF), which enables unprecedented tool interactions. This flow is validated through an extensive demonstration framework. |
Slides |
Title | Fuzzy Flow Regulation for Network-on-Chip Based Chip Multiprocessors Systems |
Author | *Yuan Yao, Zhonghai Lu (KTH Royal Institute of Technology, Sweden) |
Page | pp. 343 - 348 |
Keyword | Network-on-Chip, Chip Multiprocessor, Flow regulation, Fuzzy logic |
Abstract | Flow regulation is a traffic shaping technique that can be used to improve communication performance through better utilization of network resources in chip multi-processors (CMPs). This paper presents fuzzy flow regulation. Unlike the static flow regulation policy, our system makes regulation decisions fully dynamically according to traffic dynamism and the state of the interconnection network. The central idea is to use fuzzy logic to mimic the behavior of an expert that can recognize the network status and then intelligently control the admission of input flows. As the experimental results show, the maximum improvement in average delay reaches 53.0% against static regulation and 37.4% against no regulation. The maximum improvement in average throughput reaches 37.5% against static regulation and 23.8% against no regulation. |
Slides |
Title | Adjustable Contiguity of Run-Time Task Allocation in Networked Many-Core Systems |
Author | *Mohammad Fattah, Pasi Liljeberg, Juha Plosila, Hannu Tenhunen (University of Turku, Finland) |
Page | pp. 349 - 354 |
Keyword | Run-Time Application Mapping, Dynamic Many-Core Systems |
Abstract | In this paper, we propose a run-time mapping algorithm, CASqA, for networked many-core systems. In this algorithm, the level of contiguousness of the allocated processors (α) can be adjusted in a fine-grained fashion. A strictly contiguous allocation (α = 0) decreases the latency and power dissipation of the network and improves the applications' execution time. However, it limits the achievable throughput and increases the turnaround time of the applications. As a result, recent works consider non-contiguous allocation (α = 1) to improve the throughput at the expense of the applications' execution time and network metrics. Experimental results show that relentlessly allowing non-contiguous allocation not only cripples the network performance, but also degrades the achievable throughput compared to moderated cases (0<α<1). More precisely, a reduction of up to 35% in network costs can be achieved by adjusting the level of contiguity compared to non-contiguous cases, while the achieved throughput is kept constant. Moreover, CASqA provides at least 32% energy saving in the network compared to other works. |
Slides |