Navigation

Research Statement
in PDF format


My research interests focus on large-scale measurement and statistical study of real-world peer-to-peer applications, and the analysis, design and optimization of distributed algorithms in distributed systems with a strong theoretical foundation.

In particular, my research efforts focus on two fronts of peer-to-peer research. First, I am interested in deriving in-depth insights from extensive measurements and statistical studies of large-scale commercial peer-to-peer streaming applications, over a long period of time. Second, I am also interested in seeking solutions to the important problems motivated by the insights, by designing distributed optimized algorithms based on a variety of theories, including optimization, game theory, graph theory, and network coding.

My representative research projects are as follows.

Magellan | rStream | PeerBid | Diverse | Echelon

Magellan

Live P2P streaming applications have been successfully and commercially deployed in the Internet with up to millions of users at any given time, e.g., CoolStreaming, PPLive, UUSee, TVAnts. As a commonly adopted design, they deliver blocks of live media contents over a mesh overlay topology, featuring BitTorrent-like block exchanges among the peers. Within the academic community, there has been an increasing level of attention on measurement studies with these real-world streaming applications, which treat them as a "blackbox" and monitor their traffic characteristics. However, we believe the best way to gain in-depth insights and a complete understanding of P2P streaming characteristics is to collaborate with commercial solution providers and add instrumentation facilities into the product itself. With the Magellan project, we wish to achieve exactly that, by collaborating with UUSee Inc., one of the leading P2P live streaming solution providers in mainland China, and implementing detailed measurement and report facilities in the UUSee client software. Starting September 2006, we have been collecting continuous-time reports of vital statistics from the peers to dedicated trace servers. Utilizing such continuous-time snapshots of the entire large-scale P2P system, our study represents the following highlights:

The first thorough investigation of large-scale P2P streaming topologies.

As mesh-based streaming topologies play an important role towards the commercial success of P2P streaming, it is critical to acquire a thorough and in-depth understanding of the topological characteristics of these large-scale P2P meshes. In particular, it is an intriguing research challenge to investigate how the actually formed topologies actually behave in practice, dynamically evolve over time, and react to extreme scenarios such as huge flash crowds. The first milestone of Magellan &mdash "Charting large-scale peer-to-peer live streaming topologies" (ICDCS07) &mdash presents our observations from exploring graph theoretical properties in actually formed live streaming topologies, based on over 120 GB of traces and 10 million unique IP addresses that we have collected in UUSee network over a two-month period (September to October 2006). Our valuable discoveries include the scalability of the commercial P2P streaming application even in case of large flash crowds, the clustering phenomenon of peers in each ISP, as well as the reciprocal behavior among peers, etc.

An exhaustive statistical characterization of P2P streaming flows.

The available bandwidth among peers is of pivotal importance to large-scale P2P streaming applications. It is therefore important for peers in a peer-to-peer streaming system to select an appropriate and small number of neighbouring peers with active connections, so that there is an abundant &mdash or at least satisfactory &mdash amount of available bandwidth between itself and each of its neighbours. To derive useful insights towards high-bandwidth peer selection without active probing, our second milestone of Magellan &mdash "Characterizing Peer-to-Peer Streaming Flows" (JSAC07-P2PStreaming) &mdash is to comprehensively study any statistical properties of the TCP throughputs of streaming flows among peers, using more than 230 GB of UUSee traces and 370 million live streaming flows over a four-month period of time (November 2006 to February 2007).

In particular, we have investigated TCP throughput distributions in various peer ISP/area/type categories, statistically tested the correlation between TCP throughput and its application-layer factors by modeling them into regression models, and studied the evolutionary properties of TCP throughput values over the trace period. We have made quite a number of interesting observations. First, we have discovered that the ISPs that peers belong to are highly significant in determining inter-peer bandwidth, even more important than their geographic locations. Second, we have also found excellent linear correlations between the availability of peer last-mile bandwidth and inter-peer bandwidth within the same ISP and between a subset of ISPs, with different linear regression coefficients for different pairs of ISPs. Finally, we have observed daily evolutionary patterns of inter-peer bandwidth.

Based on these insights, we have designed a "throughput expectation index" that makes it possible to select high-bandwidth peers without active probing. This index is computed based on the ISPs that peers belong to, a table of linear regression coefficients for different pairs of ISPs, the availability of peer last-mile bandwidth, and the time of the day. All linear regression coefficients are derived off-line using historical measurement traces. We have cross-checked the set of peers selected using such an index against the set of real-world top-ranked peers in the traces, and discovered a surprisingly good match between the two.

Online server capacity provisioning across multiple streaming channels.

While the essence of P2P streaming is the use of peer upload bandwidth to alleviate the server load, capacity provisioning on dedicated streaming servers still plays a significant role in guaranteeing the streaming quality in the live streaming channels, due to peer instability and time-varying peer upload bandwidths. In our third milestone study &mdash "Multi-channel Live P2P Streaming: Refocusing on Servers" (INFOCOM 2008), we have investigated the utilization of upload capacity on dedicated streaming servers based on 400 GB of UUSee traces over a seven-month period of time (September 2006 to March 2007), and have observed that the available capacities on streaming servers are not able to keep up with the rapidly increasing demand from an increasing number of channels deployed in the P2P streaming network. We believe it is a common critical question faced by all the current-generation P2P streaming applications that are expanding fast in scale: how shall we decide the minimum amount of server capacity to provision, such that the dynamic bandwidth demand in all the concurrent channels can be effectively satisfied? In particular, if we further consider the rising tension between P2P applications and ISPs, it is preferable that such server capacity provisioning can effectively constrain the P2P traffic within ISP boundaries.

To address these intriguing challenges, we have proposed a novel online server capacity provisioning algorithm, that dynamically allocates the available server capacity to each of the concurrent channels with full ISP awareness. We believe such an algorithm may only be most useful when it works in a completely proactively fashion, in that we decide at time t the bandwidth to allocate to each channel at the future time t + 1. This is achieved using a number of time series forecasting and dynamic regression techniques. At each time t, our algorithm predicts the number of peers in each channel at time t+1, using the time series ARIMA model. Then, it predicts the server capacity demand in each channel at time t+1 with a dynamic regression learning approach. Based on these predictions, the algorithm allocates server bandwidth to each channel for time t+1, respecting the predicted demand and the priority of channels. Our algorithm is examined to be effective, based on its implementation in a P2P streaming system that replays the scenarios captured by the traces using a server cluster.