Thursday, August 23, 2007

From Hot Interconnects

I managed to find some time today to attend the second day of Hot Interconnects at Stanford. Here are some notes. Although some papers may not be directly related to the discussions in this blog, I decided to include them anyway.

Paper 1:

A Memory-Balanced Linear Pipeline Architecture for Trie-based IP Lookup on FPGA, USC

The trie data structure is commonly used for IP lookup, and trie traversal can be pipelined efficiently. In current approaches, however, memory is distributed unevenly across the pipeline stages. Their solution balances the memory across stages to improve pipelined trie lookup on FPGA.
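
For readers unfamiliar with trie-based lookup, here is a minimal software sketch of a binary trie doing longest-prefix match. The node layout and helper functions are my own illustration, not the authors' FPGA design; the point to note is that each trie level maps naturally to a pipeline stage, which is exactly where the memory-balancing problem comes from.

    /* Minimal binary-trie sketch of longest-prefix-match IPv4 lookup.
     * Purely illustrative: the node layout and helpers are my own, not the
     * authors' FPGA pipeline.  Each trie level corresponds to one pipeline
     * stage in such designs. */
    #include <stdint.h>
    #include <stdlib.h>

    struct trie_node {
        struct trie_node *child[2];  /* child[0] for bit 0, child[1] for bit 1 */
        int      has_prefix;         /* does a route terminate at this node?   */
        uint32_t next_hop;           /* next hop, valid if has_prefix is set   */
    };

    static struct trie_node *new_node(void)
    {
        return calloc(1, sizeof(struct trie_node));
    }

    /* Insert a route: prefix/len -> next_hop (prefix in host byte order). */
    static void trie_insert(struct trie_node *root, uint32_t prefix, int len,
                            uint32_t next_hop)
    {
        struct trie_node *n = root;
        for (int i = 0; i < len; i++) {
            int bit = (prefix >> (31 - i)) & 1;
            if (!n->child[bit])
                n->child[bit] = new_node();
            n = n->child[bit];
        }
        n->has_prefix = 1;
        n->next_hop = next_hop;
    }

    /* Walk the trie bit by bit, remembering the deepest node carrying a route. */
    static int trie_lookup(const struct trie_node *root, uint32_t addr,
                           uint32_t *next_hop)
    {
        const struct trie_node *n = root;
        int found = 0;
        for (int i = 0; i < 32 && n; i++) {
            if (n->has_prefix) { *next_hop = n->next_hop; found = 1; }
            n = n->child[(addr >> (31 - i)) & 1];
        }
        if (n && n->has_prefix) { *next_hop = n->next_hop; found = 1; }
        return found;
    }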

Paper 2:
Building a RCP (Rate Control Protocol) Test Network, Stanford
They built an experimental network to support RCP, using a NetFPGA implementation and a modified Linux stack with a shim layer added in between to support RCP.
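
To give a flavour of what an RCP implementation has to compute, the per-router rate update is roughly of the following form. I am reproducing it from memory of the RCP papers, so the gains and normalization below are approximate placeholders rather than the code used in their testbed.

    /* Rough sketch of the RCP per-router rate update, reproduced from memory
     * of the RCP literature -- the gains (alpha, beta) and exact normalization
     * are placeholders, so treat the details as approximate:
     *
     *   R <- R * (1 + (T/d0) * (alpha*(C - y) - beta*q/d0) / C)
     *
     * C  : link capacity (bits/s)
     * y  : measured aggregate input rate over the last interval (bits/s)
     * q  : persistent queue size (bits)
     * d0 : moving average of the flows' RTTs (s)
     * T  : rate update interval (s)
     */
    double rcp_update_rate(double R, double C, double y, double q,
                           double d0, double T)
    {
        const double alpha = 0.1, beta = 1.0;    /* illustrative gains only  */
        double spare   = alpha * (C - y);        /* use up spare capacity    */
        double backlog = beta * q / d0;          /* drain the standing queue */
        double R_new   = R * (1.0 + (T / d0) * (spare - backlog) / C);

        if (R_new < 0.0) R_new = 0.0;            /* keep the rate sane */
        if (R_new > C)   R_new = C;
        return R_new;
    }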

Paper 3:
ElephantTrap: A low cost device for identifying large flows, Stanford
ElephantTrap identifies and measures large flows in a network by randomly sampling packets and using simple selection logic to retain only the large flows.
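
The general flavour of sampling-based elephant detection is easy to sketch: sample a small fraction of packets, count the sampled flows in a tiny table, and evict the smallest counter when the table fills up. The toy code below is my own illustration of that idea, not the actual ElephantTrap algorithm or its hardware realization; the constants are arbitrary.

    /* Toy sketch of sampling-based large-flow ("elephant") detection.
     * Not the actual ElephantTrap design; constants are arbitrary. */
    #include <stdint.h>
    #include <stdlib.h>

    #define TABLE_SIZE   64        /* small flow cache                    */
    #define SAMPLE_PROB  0.01      /* sample roughly 1% of packets        */
    #define ELEPHANT_CNT 100       /* sampled-packet count to flag a flow */

    struct flow_entry {
        uint64_t flow_id;          /* e.g. a hash of the 5-tuple */
        uint32_t count;            /* sampled packets seen       */
        int      in_use;
    };

    static struct flow_entry table[TABLE_SIZE];

    /* Returns 1 if this packet's flow is currently flagged as an elephant. */
    int process_packet(uint64_t flow_id)
    {
        if ((double)rand() / RAND_MAX > SAMPLE_PROB)
            return 0;                              /* packet not sampled */

        int free_slot = -1, victim = 0;
        for (int i = 0; i < TABLE_SIZE; i++) {
            if (table[i].in_use && table[i].flow_id == flow_id) {
                table[i].count++;
                return table[i].count >= ELEPHANT_CNT;
            }
            if (!table[i].in_use && free_slot < 0)
                free_slot = i;
            if (table[i].in_use && table[i].count < table[victim].count)
                victim = i;
        }

        /* New flow: take a free slot, otherwise evict the smallest counter. */
        int slot = (free_slot >= 0) ? free_slot : victim;
        table[slot].flow_id = flow_id;
        table[slot].count   = 1;
        table[slot].in_use  = 1;
        return 0;
    }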

Intel invited talk titled On-Die Interconnect and Other Challenges for Chip-Level Multi-Processing

Chips will become increasingly multicore because performance tapers off as more transistors are added to a single core. However, the gain from parallelism is easily lost in communication overhead. The speaker advocated a high-speed on-chip interconnect that ties caches, memory, network I/O, etc. to the processors, and favored a ring-based topology for that interconnect. He also talked briefly about Intel's terascale processor with 80 cores.

Paper 4:
An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multicore Environments, Virginia Tech

They study how protocols interact with multicore architectures, considering TCP/IP and iWARP and the interplay between the application, the TCP/IP stack, and network I/O. They show that the best performance is obtained when the application is scheduled on a different core than the one doing interrupt processing, but within the same socket: the CPU load is distributed across cores and cache misses are reduced. They did not consider MSI/MSI-X or multithreaded applications.
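
As a side note, pinning a process to a chosen core on Linux is a one-call affair with sched_setaffinity(); a minimal sketch follows, with the core number picked arbitrarily. Which core actually shares a socket with the interrupt-handling core depends on the machine's topology, and the NIC's interrupt can in turn be steered through /proc/irq/<irq>/smp_affinity.

    /* Minimal sketch: pin the calling process to one core (core 1 here,
     * chosen arbitrarily) so the application runs on a different core than
     * the NIC's interrupt handling.  Whether that core shares a socket with
     * the interrupt core depends on the machine's topology. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(1, &mask);                       /* arbitrary: core 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pinned to core 1; run the network-intensive workload here\n");
        /* ... application / network I/O loop would go here ... */
        return 0;
    }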

Paper 5:
Assessing the Ability of Computation/Communication Overlap and Communication Progress in Modern Interconnects, Queen's University

They study the interplay between computation and I/O with the goal of minimizing stall cycles, evaluating various MPI-based applications over different interconnects: InfiniBand, Myricom, and 10 Gigabit Ethernet.
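
For context, the canonical MPI overlap pattern is to post non-blocking sends and receives, compute, and only then wait for completion; whether the interconnect actually makes progress during the compute phase is precisely what such studies measure. A minimal sketch (buffer size, the ring exchange, and the compute placeholder are all arbitrary):

    /* Canonical MPI computation/communication overlap pattern: post
     * non-blocking transfers, compute on an unrelated buffer while the
     * interconnect (ideally) makes progress, then wait.  Buffer size,
     * peer choice, and the compute loop are arbitrary placeholders. */
    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)

    static double sendbuf[N], recvbuf[N], work[N];

    static void do_useful_computation(void)
    {
        for (int i = 0; i < N; i++)              /* stand-in compute phase */
            work[i] = work[i] * 1.000001 + 0.5;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;            /* ring: send right...  */
        int prev = (rank + size - 1) % size;     /* ...receive from left */

        MPI_Irecv(recvbuf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

        do_useful_computation();                 /* the overlap window */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        if (rank == 0)
            printf("exchange complete\n");

        MPI_Finalize();
        return 0;
    }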

Paper 6:
Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms, Ohio State Univ

They demonstrate how ConnectX (the latest generation of InfiniBand technology) performs on multicore architectures.

Paper 7:
Memory Management Strategies for Data Serving with RDMA, Ohio Supercomputer Center

They show that virtual-to-physical memory translation is a considerable overhead in high-speed networks and demonstrate how to optimize memory registration to achieve better RDMA performance.
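
Registration is the step where the buffer is pinned and the adapter is handed the virtual-to-physical mapping, which is exactly why it is worth optimizing (for example by caching registrations). Below is a minimal libibverbs sketch of registering a buffer; error handling is trimmed and device selection is hard-wired to the first adapter found, so treat it as an illustration only.

    /* Minimal libibverbs sketch: open the first RDMA device, allocate a
     * protection domain, and register a buffer so the adapter can DMA to it
     * directly.  Registration pins the pages and sets up the virtual-to-
     * physical mapping -- the costly step worth amortizing or caching.
     * Error handling is trimmed for brevity. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BUF_SIZE (4 * 1024 * 1024)

    int main(void)
    {
        struct ibv_device **dev_list = ibv_get_device_list(NULL);
        if (!dev_list || !dev_list[0]) {
            fprintf(stderr, "no RDMA device found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(dev_list[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        void *buf = malloc(BUF_SIZE);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            perror("ibv_reg_mr");
            return 1;
        }
        printf("registered %d bytes, lkey=0x%x rkey=0x%x\n",
               BUF_SIZE, mr->lkey, mr->rkey);

        /* ... post RDMA work requests using mr->lkey / mr->rkey ... */

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(dev_list);
        free(buf);
        return 0;
    }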

Paper 8:
Reducing the Impact of the Memory Wall for I/O Using Cache Injection, University of New Mexico

Cache injection is a technique by which data is transferred directly from the device into the L2/L3 cache. They show the tradeoff between cache injection and prefetching and propose an algorithm for performing efficient cache injection for network I/O.

Thursday, August 16, 2007

UltraSPARC T2 with integrated Network I/O

Although I do not follow Hot Chips and have never blogged about microprocessors, I would love to talk about Sun's UltraSPARC T2, launched last week: the first microprocessor ever to integrate network I/O. Yes, the UltraSPARC T2, which I prefer to call by its moniker Niagara 2, supports multithreaded dual 10 Gigabit Ethernet. From the product brochures, this chip seems to be a true microsystem: just connect it to RAM and you get it running.

Although Sun has so far only mentioned CPU-intensive benchmarks such as SPEC int and SPEC float, maybe they will come up with some network I/O oriented benchmark numbers as well.

Also, Sun is releasing all the design documents under the GPL. I wonder what that means. Can someone take all the technical documentation and fab a chip independently? How does open sourcing work in the hardware market: does it spur additional growth, or is this simply propaganda by Sun to drum up more interest in the chip, and thereby new buyers?

Sun's CEO Jonathan Schwartz writes a very interesting blog, one of the best by a public-company CEO. Jonathan stresses that he thinks of Niagara 2 as a commodity, and that Sun's Microelectronics division is encouraged to sell the chip independently of Sun systems, even to Sun's competitors. Interesting and encouraging indeed :)

Saturday, August 04, 2007

About Network I/O, CPU cycles, and Multicore architectures

With the advent of 10 Gig Ethernet, network I/O has become unbelievably fast. This poses substantial challenges for kernel designers, because the erstwhile thinking that I/O is not CPU intensive does not apply any more. With 10 GigE, you can easily exhaust all the CPU in your box just transmitting or receiving raw packets without doing anything substantial.

This poses an interesting dilemma: (i) what is the best network I/O rate you can achieve on a box with 10 GigE, and (ii) given an application that is the producer/consumer of that network I/O, how do you allocate your limited CPU cycles between network I/O and the application? In my PhD dissertation, we tried to solve the above on a macro scale: we developed a stochastic model to predict the best rate at which network I/O can be performed given the application workload on the system. Our work, presented at PFLDnet 2007, is available here.

With the advent of multicore chips such as Intel's quad-core Xeon and AMD's quad-core Barcelona, besides Sun Microsystems' Niagara, which has existed for almost two years, there is an added dimension to this problem: how to coordinate the application and network I/O across cores. And with applications becoming increasingly multithreaded, the problem is definitely trickier, since the interplay between CPU cycles, cache misses, memory latencies, etc. becomes even more interesting.

I read a recent paper by authors from Virginia Tech and Argonne National Lab illustrating how a single-threaded application can be bound to a core to get higher network I/O throughput. It is a very interesting read; I would strongly recommend it.

Slides on evolution of 1 Gig to 10 Gig EPON

An excellent presentation by Dr. Glen Kramer, Chief Scientist at Teknovus, given at the ITU-T/IEEE Workshop on Carrier-Class Ethernet in Geneva, Switzerland, is available here.

The presentation motivates the need for a 10 Gig EPON standard by showing the high bandwidth requirement of a Multiple Dwelling Unit (MDU) with a large number of EPON subscribers. Challenges in the standardization activity, particularly meeting the power budget, are also documented.

Although Dr. Kramer puts some emphasis on the bandwidth requirements of broadcast HDTV and 4th-generation mobile, I believe that the greatest demand is going to come from the success of Video on Demand (VoD). I believe I have aired my views on the potential of VoD before; please read my thoughts here.