Archive for the ‘Network coding’ Category

Nuclei: Parallel Network Coding on the GPU

Monday, November 24th, 2008

It has been shown in recent research that network coding leads to more robust protocols with less overhead, and may be applied in large-scale content distribution and media streaming. Extensive simulation studies have shown that network coding delivers close to theoretically optimal performance levels. Unfortunately, to date, there has been no commercial real-world systems reported in the literature that take advantage of the power of network coding.

We believe that the main cause of this observation — and the main disadvantage of network coding — is the high computational complexity of randomized network coding with random linear codes, especially as the number of blocks to be coded scales up. We believe that it is crucially important to design and implement random linear codes such that its real-world coding performance is maximized, on modern off-the-shelf hardware platforms. This is particularly important for servers — such as streaming servers in peer-assisted video-on-demand (VoD) systems — that use network coding, since they need to sustain the bandwidth required for hundreds of directly connected peers, and to saturate the line speed of their Gigabit Ethernet interfaces.

Our previous work, completed in February 2007 and highlighted on this web site, has shown an accelerated multi-threaded implementation of network coding, that takes advantage of both multiple CPU cores with aggressive multi-threading, and SSE2 and AltiVec SIMD vector instructions on x86 and PowerPC processors as well.

In our ongoing project led by PhD candidate Hassan Shojania in the iQua group, called Nuclei, we wish to take full advantage of modern Graphical Processing Units (GPUs) and the recent CUDA programming platform from NVidia, and push the performance envelope of network coding by another substantial margin.

First Milestone — Network Coding on the NVidia 8800 GT GPU

In the first milestone of the Nuclei project since June 2008, we have worked on a Mac Pro Dual Quad-core 2.8 GHz Intel Xeon server, with a NVidia GeForce 8800 GT GPU featuring 112 cores, which is a mainstream GPU that retails for about $100. We have implemented both encoding and decoding processes of random network coding on the 8800 GT GPU. Our stellar performance results have clearly shown that, it is feasible to saturate a Gigabit Ethernet interface by combining 8-core Intel Xeon CPUs and the 8800 GT GPU. With respect to encoding on a streaming server, for example, the GPU alone is able to achieve an encoding rate of 67 MB/second with 128 blocks, outperforming all eight CPU cores combined.

At the end of August 2008, we submitted a research paper to IEEE INFOCOM 2009, summarizing this milestone achievement with network coding performed on the 8800 GT. Here are two performance charts in the paper, for encoding and decoding on the 8800 GT, respectively.

8800 GT Encoding

8800GT Decoding

To the best of our knowledge, our work in the first milestone of the Nuclei project, completed in August 2008, represents the first GPU-based implementation of network coding. In December 2008, this paper has been accepted to appear in the main conference of IEEE INFOCOM 2009, Rio de Janeiro, Brazil, April 19-25, 2009.

Second Milestone — Network Coding on the NVidia GTX 280 GPU

In the second milestone of the Nuclei project since September 2008, we have worked on the same Mac Pro Dual Quad-core 2.8 GHz Intel Xeon server, but with a new NVidia GeForce GTX 280 GPU featuring 240 cores, which is a top-of-the-line GPU in Fall 2008 that retails for about $450. Our research objective is to design and implement novel techniques to perform network coding on the GPU, so that we can further improve upon the performance records we have set in our first milestone.

With our new techniques, we have been successful to show a further performance improvement of 30% for encoding, and by 1.6 to 10 times for decoding across a range of practical configurations, on the same GTX 280 GPU. A single GTX 280 performing network coding at 128 blocks achieves encoding rates up to 172 MB/second and decoding rates up to 146 MB/second, far beyond the computation bandwidth required to saturate a single Gigabit Ethernet interface. In fact, at such high rates, unlike the first milestone, we believe that the GPU alone is sufficient on streaming servers, and the CPU cores can be set free to perform other computational tasks.

Our new performance optimization techniques are reported in a research paper submitted to IEEE ICDCS 2009 on November 22, 2009. Here are two performance charts in the paper, for encoding and decoding on the GTX 280 GPU, respectively:

GTX 280 Encoding

GTX 280 Decoding

Further performance optimizations of network coding on the GPU

After the submission deadline of our ICDCS 2009 submitted paper, we have continued to optimize the performance of our implementation of GPU-based network coding, and to develop one implementation to support all families of CUDA-enabled NVidia GPUs, including mobile versions. Here is a preview of the encoding performance after further optimizations are designed and implemented, again on the GTX 280:

GTX 280 Optimized Encoding

As one may observe, the encoding performance reaches 293 MB/second in the case of encoding 128 blocks, which represents a substantial 70% speedup as compared to our results reported in our ICDCS 2009 submission, without these optimizations. We intend to include these results in an upcoming manuscript of this paper.

To the best of our knowledge, our results represent the first and fastest possible GPU-based implementation of network coding.

On Large-Scale P2P Streaming Systems with Network Coding

Monday, June 30th, 2008

Recent years have witnessed real-world success of peer-to-peer (P2P) live streaming systems, especially in start-up companies. Given these success stories, it is not surprising that there is a large base of research aiming at designing “good” P2P streaming systems. Even with wide research attention, however, a number of fundamental performance metrics that characterize “good” P2P streaming systems are not yet well understood. Let us revisit a few performance metrics as examples.

With respect to playback quality, if media segments do not arrive in a timely fashion, they have to be skipped at playback, which degrades the playback quality. How do we consistently maintain high playback quality at all participating peers? With respect to the initial buffering delay, which is experienced by a peer when it first joins or switches to a new channel, how do we improve user experience with the shortest initial buffering delay? With respect to server bandwidth costs, how do we minimize such costs by maximizing bandwidth contribution from participating peers? Last but not the least important, how do we design a system that scales well to accommodate a large flash crowd and a high degree of peer dynamics?

We believe that these performance metrics should be given priority when evaluating a protocol that is designed specifically for peer-to-peer live streaming. The playback quality and the initial buffering delay matter most to the user experience, which determines the level of user satisfaction. The server bandwidth costs, however, matter most to the companies in operation, as they directly determine most of the ongoing operational costs.

While it may be possible to implement a protocol design in real-world systems without “paper and pencil” analytical studies, we believe in the values of such analytical studies of large-scale P2P streaming systems. Analytical studies help us to understand the fundamentals with mathematical rigor, rather than by resorting to intuition or trial-and-error. Analytical studies also help us to understand properties that are hard to evaluate with real-world studies without risks, such as the ability to scale well to a large number of peers with good performance.

At first glance, developing a theoretical framework that helps protocol designers to understand all of these performance metrics may appear too good to be true. Most existing analytical studies therefore focused on only one or two performance metrics instead. For instance, Y. Liu [ACM Multimedia 2007] provided a centralized streaming algorithm that achieves minimum delay. Massoulie et al. [INFOCOM 2007] proposed a streaming algorithm that approaches maximum streaming rate. Bonald et al. [SIGMETRICS 2008] identified several streaming algorithms that achieve near-optimal streaming rate and delay.

In contrast, in our recent research paper to appear in ACM Multimedia 2008, we seek to mathematically analyze and understand a new family of streaming protocols, simultaneously considering all these performance metrics. Our rescue comes from the use of network coding, which not only eliminates some of the mathematical difficulties associated with previous theoretical models, but also leads to simple and practical streaming protocols. This simplification allows us to take into account all of the important performance metrics, even under a more realistic system model.

We identify several design principles for streaming systems with network coding and show that any protocol following our design principles is sufficient to achieve provably good overall performance in realistic settings. In particular, we have provided sufficient conditions on the required server bandwidth costs to ensure smooth playback under peer dynamics, and with short initial buffering delays. A closer look at these conditions further suggests that our design principles are able to achieve near-optimal bandwidth utilization during flash crowds and to handle severe peer dynamics with ease.

Our design principles can be readily implemented in practice with low overhead. In fact, our previous work on R2 is a particular example of such implementations. Through such an example of a practical protocol design, combined with our theoretical framework of fundamental properties, we have provided a striking demonstration of an effective streaming protocol design with network coding. We expect to see real-world implementations of our design principles in the near future. Clearly, the design principles identified in our work might not be unique, and our analytical studies are concerned with a family of such protocols, focusing on important elements, rather than protocol details. We believe our theoretical framework will shed considerable light into the journey towards the best possible protocols in P2P streaming systems.

A slide presentation on this project is available for further reading, in both Adobe Flash and Adobe PDF formats.

R2: Practical Peer-to-Peer Live Streaming with Network Coding

Saturday, October 13th, 2007

Recent real-world success of peer-to-peer live streaming, such as UUSee and PPLive, has led to the question of what further improvements one can make to existing streaming protocol design. It is well known that Peer-to-Peer (P2P) streaming conserves server bandwidth by shifting the load to peers and their ISPs, and as such scales well to a larger population of subscribers, and offers better resilience to flash crowds, when server bandwidth supply is temporarily insufficient to meet demand. However, when pushed too far, it may be hard to monitor and control the quality of streaming playback, as routine peer departures are detrimental to streaming performance in a transient fashion. The delay in delivering a live event to all peers is much higher with relaying, and the initial buffering delay can be higher.

Previous research in the literature has focused on constructing high-quality peer-level topologies. Tree-based per-slice push, such as Chunkyspread [ICNP 2006], may shorten the delays between live events and their playback time, but they suffer overhead of maintaining trees and unavoidable implementation complexity. Mesh-based per-segment pull, such as CoolStreaming [INFOCOM 2005], may be simple to implement and offer better resilience to frequent peer departures, but the overhead of requests and buffer availability exchanges is quite significant, and the delays between live events and their playback time may be much longer.

With the recent surge of interest in practical network coding, it is interesting to see if network coding would be helpful in delivering performance gains in P2P streaming, with respect to three important performance metrics: (1) resilience to peer dynamics; (2) the utilization of scarce server bandwidth; and (3) the time required for initial buffering delays when switching to new channels.

In the R2 project, our objective is to design, analyze, and implement a new peer-to-peer live streaming protocol, that hopefully combines the benefits of tree-based per-slice push and mesh-based per-segment pull, but without much complexity in its implementation. In the first phase, we are interested in weather network coding can offer any tangible benefits in P2P streaming, without revising the traditional mesh-based per-segment pull protocol. We have performed an experimental evaluation of the benefits of network coding over a vanilla pull-based streaming protocol, with the conclusion that there exist some benefits offered by network coding, but they are nothing to write home about. Our papers summarizing results in this step have been published in IEEE Transactions on Multimedia (PDF), and IEEE INFOCOM 2007 (PDF).

In the second phase, we are naturally curious about the possibility of designing a much improved protocol that is tailored to take full advantage of network coding. Towards this objective, we have designed R2 (Random push with Random network coding), and believed that R2 enjoys the advantages of both push-based and pull-based streaming protocols. To show its practicality and effectiveness, our experimental results are based on a complete implementation of R2 with emulated peers on a server cluster, coupled with an optimized library of random network coding tuned for performance. Our paper on R2 is to appear in the December 2007 issue of the IEEE Journal on Selected Areas in Communications, Special Issue on Advances in P2P Streaming.

If you are still reading, you may also be interested in our presentation slides dedicated to the R2 project, available as an Adobe PDF document.

Tuning the implementation of network coding

Wednesday, June 20th, 2007

Since 2001, network coding has received a tremendous amount of research attention in the networking community. The essence of network coding is to allow coding at intermediate network nodes. Network coding is shown to help to resist dynamics over time in peer-to-peer (P2P) content distribution, or to improve throughput in wireless networks.

The lack of commercial use of network coding in any P2P applications may be due to the high coding complexity of random linear codes, as the number of blocks to code scales up. As random linear nodes are used, or implicitly assumed, in almost all practical proposals of using network coding in recent papers, we thought it would be a good idea to see how much coding — what may be the typical block length — that modern desktop processors are able to handle, by carefully designing a performanced-tuned implementation of random network coding. Ideally, such an implementation would use the best development tools available today, and take advantage of all available hardware support from modern processors. We thought it would be an important and challenging step towards commercial use of network coding, especially in P2P applications that use modern processors.

Our recent paper titled “Parallelized Progressive Network Coding With Hardware Acceleration,” to appear in the Proceedings of IEEE IWQoS 2007 (held in Chicago, from June 21-22, 2007), has presented results from such an implementation that we have completed. In this implementation, we have used a combination of three tricks: progressive decoding using Gauss-Jordan elimination so that blocks can be decoded on-the-fly as they are received, a loop-based implementation of GF(256) multiplication that takes advantage of SIMD vector instructions in modern processors, and a threaded implementation that aggressively utilizes multi-core CPUs and their L2 caches. The result has been a network coding implementation written in C++, with a strong focus on multi-core performance, optimized for Intel, AMD, and PowerPC processor families (AltiVec on PowerPC and SSE2 on x86). In addition, our code takes care of platform-specific details (e.g., byte-aligned memory allocation for SIMD) and readily compiles and runs on Microsoft Windows, Mac OS X, and Linux.

This graph shows the decoding performance of our implementation (where n is the number of blocks to code):

performance.gif

This table shows the performance of both encoding and decoding on various processors and operating systems:

comparison.gif

A slide presentation of this work is available as an Adobe PDF document or a Flash slide show.

This work has brought us some good news. We show that for practical block sizes and block lengths (such as 128 blocks of 4 KB each), it is practical to perform network coding on-the-fly in P2P applications, with little computational overhead (since most of the time required for coding overlaps with the time for transmitting or receiving coded data). If the network is the bottleneck, the load on processors should be light, as the coding performance we have shown is far better than typical DSL bandwidth (at most 500 KB per second).