The lab was created in 2002 to address emerging problems in computer networks, including scalable video streaming and coding, congestion control, active queue management (AQM), topology modeling, P2P systems, bandwidth estimation, Internet measurement, flow sampling, and general performance analysis.
Over
the years, research directions expanded to large-scale web crawling (IRLbot),
real-time domain ranking, spam avoidance, stochastic modeling of random graphs,
analysis of data staleness in distributed systems,
remote sampling of data churn under censored
observation, design of pull-based replication systems, high-performance
networking in Windows, and scalable OS fingerprinting.
The
latest research focuses on algorithms for managing large datasets in external
and distributed memory. The original motivation comes from IRLbot, where huge
input (100-200x RAM size) had to be manipulated on disk to support the crawl.
This led to the development of in-house C++ MapReduce tools for handling large
graphs, performing ranking and URL uniqueness checks, enforcing admission
control on pending links, and sorting vast amounts of data, years before Hadoop
was available. Later on, many students faced similar problems, but for one reason or another had to write their own versions of this framework, which led to duplicated effort, countless hours spent optimizing each implementation, and a steep learning curve. Our current approach aims to replace these ad-hoc solutions with a unifying platform: a new C++ architecture that simplifies high-performance data streaming and enables rapid development of data-intensive algorithms in external-memory scenarios.
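To illustrate the kind of primitive such a platform standardizes, below is a minimal external merge-sort sketch (the function names, file layout, and buffer size are hypothetical, not the actual framework API): input much larger than RAM is split into sorted runs, which a k-way merge then combines while keeping only one buffered key per run in memory.

    // Minimal external-memory sort sketch (hypothetical names, not the lab's
    // framework API): sort a file of 64-bit keys far larger than RAM.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Phase 1: read RAM-sized chunks, sort each, and write it out as a run.
    std::vector<std::string> MakeRuns(const char* input, size_t ramKeys) {
        std::vector<std::string> runs;
        FILE* in = fopen(input, "rb");
        if (!in) return runs;
        std::vector<uint64_t> buf(ramKeys);
        size_t n;
        while ((n = fread(buf.data(), sizeof(uint64_t), ramKeys, in)) > 0) {
            std::sort(buf.begin(), buf.begin() + n);
            std::string name = "run" + std::to_string(runs.size()) + ".bin";
            FILE* out = fopen(name.c_str(), "wb");
            fwrite(buf.data(), sizeof(uint64_t), n, out);
            fclose(out);
            runs.push_back(name);
        }
        fclose(in);
        return runs;
    }

    // Phase 2: k-way merge with a min-heap; a production system would read and
    // write in large blocks to amortize disk I/O rather than one key at a time.
    void MergeRuns(const std::vector<std::string>& runs, const char* output) {
        typedef std::pair<uint64_t, size_t> Item;  // (key, run index)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        std::vector<FILE*> f(runs.size());
        for (size_t i = 0; i < runs.size(); i++) {
            f[i] = fopen(runs[i].c_str(), "rb");
            uint64_t k;
            if (fread(&k, sizeof(k), 1, f[i]) == 1) heap.push({k, i});
        }
        FILE* out = fopen(output, "wb");
        while (!heap.empty()) {
            Item top = heap.top(); heap.pop();
            fwrite(&top.first, sizeof(top.first), 1, out);
            uint64_t next;
            if (fread(&next, sizeof(next), 1, f[top.second]) == 1)
                heap.push({next, top.second});
        }
        fclose(out);
        for (size_t i = 0; i < f.size(); i++) fclose(f[i]);
    }

    int main() {
        // With 1 GB of RAM devoted to keys (ramKeys = 1 << 27), a 200-GB
        // input, i.e., 200x RAM as in the IRLbot crawl, yields ~200 runs.
        MergeRuns(MakeRuns("input.bin", size_t(1) << 27), "sorted.bin");
        return 0;
    }

Many of the operations above (URL uniqueness checks, admission control on pending links) can be phrased as sort-and-merge passes of this form, which is what makes a single unified platform attractive.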
As it turned out, some of the graph-ranking algorithms created for IRLbot were closely related to motif listing in graphs. Perhaps the most commonly studied problem in this area is triangle listing, with numerous applications that span networking, biology, data mining, graphics, databases, and complexity theory. The main difficulty with this class of problems is that RAM is much smaller than the number of edges in the graph, which means that no partitioning scheme with reasonable I/O cost can guarantee that all three edges of a triangle end up in the same subgraph. Despite being a focus of research
for 40 years, triangle listing still has many open theoretical questions, such
as accurate modeling of CPU complexity for graphs with a given degree
distribution and designing algorithms that achieve optimal I/O in general
families of random graphs. Larger motifs (e.g., 4-cycles) promise to yield even
more interesting results. Besides traditional applications in biology and
network analysis, algorithms for quadrangle enumeration can be used for
computing the neighborhood function, all-node BFS, depth-2 supporter ranking,
and even multiplication of sparse matrices.
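For intuition, here is a minimal in-memory sketch of the classical degree-ordering approach to triangle listing (a toy illustration, not the lab's external-memory algorithms): each edge is oriented toward its higher-ranked endpoint, after which intersecting sorted out-neighbor lists reports every triangle exactly once. The open questions above concern precisely the regime where these adjacency lists no longer fit in RAM.

    // Toy in-RAM triangle listing via degree ordering (illustration only; the
    // hard case is when adj[] exceeds RAM and the graph must live on disk).
    #include <algorithm>
    #include <cstdio>
    #include <iterator>
    #include <utility>
    #include <vector>

    int main() {
        // Small undirected graph: two triangles sharing edge 1-2.
        int n = 5;
        std::vector<std::pair<int, int>> edges =
            {{0, 1}, {0, 2}, {1, 2}, {1, 3}, {2, 3}, {3, 4}};

        std::vector<int> deg(n, 0);
        for (size_t e = 0; e < edges.size(); e++) {
            deg[edges[e].first]++;
            deg[edges[e].second]++;
        }

        // Orient every edge from the lower-ranked endpoint (by degree, then
        // id) to the higher-ranked one; this lists each triangle exactly once.
        std::vector<std::vector<int>> adj(n);
        for (size_t e = 0; e < edges.size(); e++) {
            int u = edges[e].first, v = edges[e].second;
            if (deg[u] < deg[v] || (deg[u] == deg[v] && u < v))
                adj[u].push_back(v);
            else
                adj[v].push_back(u);
        }
        for (int u = 0; u < n; u++) std::sort(adj[u].begin(), adj[u].end());

        // For each directed edge (u,v), any w in both adj[u] and adj[v] closes
        // a triangle u-v-w. When RAM is smaller than the edge set, u's and v's
        // lists may reside in different disk partitions: the core difficulty.
        for (int u = 0; u < n; u++)
            for (size_t j = 0; j < adj[u].size(); j++) {
                int v = adj[u][j];
                std::vector<int> common;
                std::set_intersection(adj[u].begin(), adj[u].end(),
                                      adj[v].begin(), adj[v].end(),
                                      std::back_inserter(common));
                for (size_t k = 0; k < common.size(); k++)
                    printf("triangle %d-%d-%d\n", u, v, common[k]);
            }
        return 0;
    }

On this toy input the program prints triangles 0-1-2 and 1-2-3; the research question is how to partition the oriented graph across disk-resident subgraphs so that total I/O stays near optimal.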