Tuning the implementation of network coding

Since 2001, network coding has received a tremendous amount of research attention in the networking community. The essence of network coding is to allow coding at intermediate network nodes. Network coding is shown to help to resist dynamics over time in peer-to-peer (P2P) content distribution, or to improve throughput in wireless networks.

The lack of commercial use of network coding in any P2P applications may be due to the high coding complexity of random linear codes, as the number of blocks to code scales up. As random linear nodes are used, or implicitly assumed, in almost all practical proposals of using network coding in recent papers, we thought it would be a good idea to see how much coding — what may be the typical block length — that modern desktop processors are able to handle, by carefully designing a performanced-tuned implementation of random network coding. Ideally, such an implementation would use the best development tools available today, and take advantage of all available hardware support from modern processors. We thought it would be an important and challenging step towards commercial use of network coding, especially in P2P applications that use modern processors.

Our recent paper titled “Parallelized Progressive Network Coding With Hardware Acceleration,” to appear in the Proceedings of IEEE IWQoS 2007 (held in Chicago, from June 21-22, 2007), has presented results from such an implementation that we have completed. In this implementation, we have used a combination of three tricks: progressive decoding using Gauss-Jordan elimination so that blocks can be decoded on-the-fly as they are received, a loop-based implementation of GF(256) multiplication that takes advantage of SIMD vector instructions in modern processors, and a threaded implementation that aggressively utilizes multi-core CPUs and their L2 caches. The result has been a network coding implementation written in C++, with a strong focus on multi-core performance, optimized for Intel, AMD, and PowerPC processor families (AltiVec on PowerPC and SSE2 on x86). In addition, our code takes care of platform-specific details (e.g., byte-aligned memory allocation for SIMD) and readily compiles and runs on Microsoft Windows, Mac OS X, and Linux.

This graph shows the decoding performance of our implementation (where n is the number of blocks to code):

performance.gif

This table shows the performance of both encoding and decoding on various processors and operating systems:

comparison.gif

A slide presentation of this work is available as an Adobe PDF document or a Flash slide show.

This work has brought us some good news. We show that for practical block sizes and block lengths (such as 128 blocks of 4 KB each), it is practical to perform network coding on-the-fly in P2P applications, with little computational overhead (since most of the time required for coding overlaps with the time for transmitting or receiving coded data). If the network is the bottleneck, the load on processors should be light, as the coding performance we have shown is far better than typical DSL bandwidth (at most 500 KB per second).

Leave a Reply