Nuclei: Parallel network coding on the GPU

Recent research has shown that network coding leads to more robust protocols with less overhead, and that it may be applied in large-scale content distribution and media streaming. Extensive simulation studies have shown that network coding delivers close to theoretically optimal performance. Unfortunately, to date, no commercial real-world systems that take advantage of the power of network coding have been reported in the literature.

We believe that the main cause of this gap, and the main disadvantage of network coding, is the high computational complexity of randomized network coding with random linear codes, especially as the number of blocks to be coded scales up. We believe that it is crucially important to design and implement random linear codes such that their real-world coding performance is maximized on modern off-the-shelf hardware platforms. This is particularly important for servers that use network coding, such as streaming servers in peer-assisted video-on-demand (VoD) systems, since they need to sustain the bandwidth required by hundreds of directly connected peers and to saturate the line speed of their Gigabit Ethernet interfaces.
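To make this cost concrete, here is a minimal scalar sketch, for illustration only and not code from our implementation, of how a single coded block is produced with random linear codes over GF(2^8): every output byte is a linear combination of the corresponding bytes of all n source blocks, so the work per coded byte grows linearly with the number of blocks n.

    #include <stdint.h>
    #include <stddef.h>

    /* Multiplication in GF(2^8) with field polynomial x^8 + x^4 + x^3 + x + 1,
       computed bit by bit; practical implementations use lookup tables instead. */
    uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        while (b) {
            if (b & 1) p ^= a;
            uint8_t carry = a & 0x80;
            a <<= 1;
            if (carry) a ^= 0x1B;   /* reduce modulo the field polynomial */
            b >>= 1;
        }
        return p;
    }

    /* coded[t] = sum over i of coeff[i] * src[i][t], with all sums and
       products taken in GF(2^8); addition in GF(2^8) is a plain XOR. */
    void encode_block(uint8_t *coded, const uint8_t *const *src,
                      const uint8_t *coeff, size_t n, size_t block_len)
    {
        for (size_t t = 0; t < block_len; ++t) {
            uint8_t acc = 0;
            for (size_t i = 0; i < n; ++i)
                acc ^= gf_mul(coeff[i], src[i][t]);   /* n multiplications per byte */
            coded[t] = acc;
        }
    }

Decoding additionally requires Gauss-Jordan elimination over the same field, which makes it even more demanding than encoding.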

Our previous work, completed in February 2007 and highlighted on this web site, presented an accelerated multi-threaded implementation of network coding that takes advantage of multiple CPU cores with aggressive multi-threading, as well as SSE2 and AltiVec SIMD vector instructions on x86 and PowerPC processors.
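As a rough illustration of where SIMD helps, and as a sketch under our own simplifying assumptions rather than the code of that implementation: addition in GF(2^8) is a plain XOR, so the accumulation of coefficient-scaled source blocks vectorizes naturally, for example 16 bytes at a time with SSE2, while the byte-wise multiplications can be served from a precomputed product table.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* gf_mul_table[a][b] holds the product a*b in GF(2^8); built once at startup. */
    static uint8_t gf_mul_table[256][256];

    void init_gf_mul_table(void)
    {
        for (int a = 0; a < 256; ++a)
            for (int b = 0; b < 256; ++b) {
                uint8_t p = 0, x = (uint8_t)a, y = (uint8_t)b;
                while (y) {
                    if (y & 1) p ^= x;
                    uint8_t carry = x & 0x80;
                    x <<= 1;
                    if (carry) x ^= 0x1B;
                    y >>= 1;
                }
                gf_mul_table[a][b] = p;
            }
    }

    /* Encode one coded block of block_len bytes (assumed to be a multiple of 16)
       as a GF(2^8) linear combination of n source blocks. The multiplications are
       scalar table lookups; the XOR accumulation runs 16 bytes per instruction. */
    void encode_block_sse2(uint8_t *coded, const uint8_t *const *src,
                           const uint8_t *coeff, size_t n, size_t block_len)
    {
        uint8_t temp[16];

        for (size_t off = 0; off < block_len; off += 16) {
            __m128i acc = _mm_setzero_si128();
            for (size_t i = 0; i < n; ++i) {
                const uint8_t *row = gf_mul_table[coeff[i]];
                for (int k = 0; k < 16; ++k)
                    temp[k] = row[src[i][off + k]];   /* scalar GF(2^8) multiply */
                acc = _mm_xor_si128(acc, _mm_loadu_si128((const __m128i *)temp));
            }
            _mm_storeu_si128((__m128i *)(coded + off), acc);
        }
    }

With multi-threading, different regions of a coded block, or different coded blocks, can simply be assigned to different CPU cores, since the output bytes are computed independently of one another.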

In our ongoing project, Nuclei, led by PhD candidate Hassan Shojania in the iQua group, we wish to take full advantage of modern Graphics Processing Units (GPUs) and the recent CUDA programming platform from NVidia, and to push the performance envelope of network coding by another substantial margin.

First Milestone — Network Coding on the NVidia 8800 GT GPU

In the first milestone of the Nuclei project, under way since June 2008, we have worked on a Mac Pro server with dual quad-core 2.8 GHz Intel Xeon processors and an NVidia GeForce 8800 GT GPU featuring 112 cores, a mainstream GPU that retails for about $100. We have implemented both the encoding and decoding processes of random network coding on the 8800 GT. Our performance results have clearly shown that it is feasible to saturate a Gigabit Ethernet interface by combining the 8-core Intel Xeon CPUs and the 8800 GT GPU. With respect to encoding on a streaming server, for example, the GPU alone is able to achieve an encoding rate of 67 MB/second with 128 blocks, outperforming all eight CPU cores combined.
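For readers curious how network coding maps onto the GPU, the kernel below is a minimal illustrative sketch, not the optimized design described in our paper: each CUDA thread computes one byte of one coded block as a GF(2^8) linear combination of the source blocks, using a bitwise multiply where a tuned implementation would rely on lookup tables held in shared or constant memory and on coarser per-thread workloads.

    #include <cuda_runtime.h>
    #include <stdint.h>

    /* Multiplication in GF(2^8), field polynomial x^8 + x^4 + x^3 + x + 1,
       computed bitwise in registers. */
    __device__ uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        for (int k = 0; k < 8; ++k) {
            if (b & 1) p ^= a;
            uint8_t carry = a & 0x80;
            a <<= 1;
            if (carry) a ^= 0x1B;
            b >>= 1;
        }
        return p;
    }

    /* Each thread computes one byte of one coded block:
         coded[j][t] = sum over i of coeff[j][i] * src[i][t]   (in GF(2^8))
       src:   n x block_len source bytes, row-major
       coeff: m x n random coefficients, row-major
       coded: m x block_len output bytes, row-major */
    __global__ void nc_encode(const uint8_t *src, const uint8_t *coeff,
                              uint8_t *coded, int n, int m, int block_len)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= m * block_len)
            return;

        int j = idx / block_len;   /* which coded block    */
        int t = idx % block_len;   /* which byte within it */

        uint8_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc ^= gf_mul(coeff[j * n + i], src[i * block_len + t]);
        coded[idx] = acc;
    }

With the source blocks and coefficient rows already resident in device memory, such a kernel would be launched with one thread per output byte, for instance nc_encode<<<(m * block_len + 255) / 256, 256>>>(d_src, d_coeff, d_coded, n, m, block_len); the names d_src, d_coeff and d_coded are placeholders for device buffers allocated with cudaMalloc.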

At the end of August 2008, we submitted a research paper to IEEE INFOCOM 2009, summarizing this milestone achievement with network coding performed on the 8800 GT. Here are two performance charts from the paper, for encoding and decoding on the 8800 GT, respectively.

8800 GT Encoding

8800 GT Decoding

To the best of our knowledge, our work in the first milestone of the Nuclei project, completed in August 2008, represents the first GPU-based implementation of network coding. In December 2008, this paper was accepted to appear in the main conference of IEEE INFOCOM 2009, to be held in Rio de Janeiro, Brazil, April 19-25, 2009.

Second Milestone — Network Coding on the NVidia GTX 280 GPU

In the second milestone of the Nuclei project, under way since September 2008, we have worked on the same Mac Pro server with dual quad-core 2.8 GHz Intel Xeon processors, but with a new NVidia GeForce GTX 280 GPU featuring 240 cores, a top-of-the-line GPU in Fall 2008 that retails for about $450. Our research objective is to design and implement novel techniques for performing network coding on the GPU, so that we can further improve upon the performance records we set in our first milestone.

With our new techniques, we have successfully demonstrated a further performance improvement of 30% for encoding, and of 1.6 to 10 times for decoding, across a range of practical configurations on the same GTX 280 GPU. A single GTX 280 performing network coding with 128 blocks achieves encoding rates of up to 172 MB/second and decoding rates of up to 146 MB/second, far beyond the computational bandwidth required to saturate a single Gigabit Ethernet interface. In fact, at such high rates, unlike in the first milestone, we believe that the GPU alone is sufficient on streaming servers, and the CPU cores can be freed to perform other computational tasks.

Our new performance optimization techniques are reported in a research paper submitted to IEEE ICDCS 2009 on November 22, 2008. Here are two performance charts from the paper, for encoding and decoding on the GTX 280 GPU, respectively:

GTX 280 Encoding

GTX 280 Decoding

Further Performance Optimizations of Network Coding on the GPU

Since the submission deadline of our ICDCS 2009 paper, we have continued to optimize the performance of our GPU-based implementation of network coding, and to develop a single implementation that supports all families of CUDA-enabled NVidia GPUs, including mobile versions. Here is a preview of the encoding performance after further optimizations were designed and implemented, again on the GTX 280:

GTX 280 Optimized Encoding

As one may observe, the encoding performance reaches 293 MB/second when encoding 128 blocks, a substantial 70% speedup over the results reported in our ICDCS 2009 submission without these optimizations. We intend to include these results in a forthcoming version of this paper.

To the best of our knowledge, our results represent the first, and the fastest to date, GPU-based implementation of network coding.