The lab was created in 2002 to address emerging problems in computer networks, including scalable video streaming and coding, congestion control, active queue management (AQM), topology modeling, P2P systems, bandwidth estimation, Internet measurement, flow sampling, and general performance analysis.
Over
the years, research directions expanded to large-scale web crawling (IRLbot),
real-time domain ranking, spam avoidance, stochastic modeling of random graphs,
analysis of data staleness in distributed systems,
remote sampling of data churn under censored
observation, design of pull-based replication systems, high-performance
networking in Windows, and scalable OS fingerprinting.
The
latest research focuses on algorithms for managing large datasets in external
and distributed memory. The original motivation comes from IRLbot, where huge
input (100-200x RAM size) had to be manipulated on disk to support the crawl.
This led to the development of in-house C++ MapReduce tools for handling large
graphs, performing ranking and URL uniqueness checks, enforcing admission
control on pending links, and sorting vast amounts of data, years before Hadoop
was available. Later on, many students faced similar problems, but for one reason or another had to write their own versions of this framework, which led to duplicated effort, countless hours spent optimizing each implementation, and a steep learning curve. Our current approach aims to replace these ad-hoc solutions with a unifying platform: a new C++ architecture that simplifies high-performance data streaming and enables rapid development of data-intensive algorithms in external-memory scenarios.
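To illustrate the kind of primitive such a platform standardizes, below is a minimal external merge-sort sketch (the function names, file layout, and buffer size are hypothetical, not the actual framework API): input much larger than RAM is split into sorted runs, which a k-way merge then combines while keeping only one buffered key per run in memory.

    // Minimal external-memory sort sketch (hypothetical names, not the lab's
    // framework API): sort a file of 64-bit keys far larger than RAM.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    // Phase 1: read RAM-sized chunks, sort each, and write it out as a run.
    std::vector<std::string> MakeRuns(const char* input, size_t ramKeys) {
        std::vector<std::string> runs;
        FILE* in = fopen(input, "rb");
        if (!in) return runs;
        std::vector<uint64_t> buf(ramKeys);
        size_t n;
        while ((n = fread(buf.data(), sizeof(uint64_t), ramKeys, in)) > 0) {
            std::sort(buf.begin(), buf.begin() + n);
            std::string name = "run" + std::to_string(runs.size()) + ".bin";
            FILE* out = fopen(name.c_str(), "wb");
            fwrite(buf.data(), sizeof(uint64_t), n, out);
            fclose(out);
            runs.push_back(name);
        }
        fclose(in);
        return runs;
    }

    // Phase 2: k-way merge with a min-heap; a production system would read and
    // write in large blocks to amortize disk I/O rather than one key at a time.
    void MergeRuns(const std::vector<std::string>& runs, const char* output) {
        typedef std::pair<uint64_t, size_t> Item;  // (key, run index)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        std::vector<FILE*> f(runs.size());
        for (size_t i = 0; i < runs.size(); i++) {
            f[i] = fopen(runs[i].c_str(), "rb");
            uint64_t k;
            if (fread(&k, sizeof(k), 1, f[i]) == 1) heap.push({k, i});
        }
        FILE* out = fopen(output, "wb");
        while (!heap.empty()) {
            Item top = heap.top(); heap.pop();
            fwrite(&top.first, sizeof(top.first), 1, out);
            uint64_t next;
            if (fread(&next, sizeof(next), 1, f[top.second]) == 1)
                heap.push({next, top.second});
        }
        fclose(out);
        for (size_t i = 0; i < f.size(); i++) fclose(f[i]);
    }

    int main() {
        // With 1 GB of RAM devoted to keys (ramKeys = 1 << 27), a 200-GB
        // input, i.e., 200x RAM as in the IRLbot crawl, yields ~200 runs.
        MergeRuns(MakeRuns("input.bin", size_t(1) << 27), "sorted.bin");
        return 0;
    }

Many of the operations above (URL uniqueness checks, admission control on pending links) can be phrased as sort-and-merge passes of this form, which is what makes a single unified platform attractive.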
As it turned out, some of the graph-ranking algorithms created for IRLbot were closely related to motif listing in graphs. Perhaps the most commonly studied problem in this area is triangle listing, with numerous applications that span networking, biology, data mining, graphics, databases, and complexity theory. The main difficulty with this class of problems is that RAM is much smaller than the number of edges in the graph, which means that no partitioning scheme with reasonable I/O cost can guarantee that all three edges of a triangle end up in the same subgraph. Despite being a focus of research
for 40 years, triangle listing still has many open theoretical questions, such
as accurate modeling of CPU complexity for graphs with a given degree
distribution and designing algorithms that achieve optimal I/O in general
families of random graphs. Larger motifs (e.g., 4-cycles) promise to yield even
more interesting results. Besides traditional applications in biology and
network analysis, algorithms for quadrangle enumeration can be used for
computing the neighborhood function, all-node BFS, depth-2 supporter ranking,
and even multiplication of sparse matrices.
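For intuition, here is a minimal in-memory sketch of the classical degree-ordering approach to triangle listing (a toy illustration, not the lab's external-memory algorithms): each edge is oriented toward its higher-ranked endpoint, after which intersecting sorted out-neighbor lists reports every triangle exactly once. The open questions above concern precisely the regime where these adjacency lists no longer fit in RAM.

    // Toy in-RAM triangle listing via degree ordering (illustration only; the
    // hard case is when adj[] exceeds RAM and the graph must live on disk).
    #include <algorithm>
    #include <cstdio>
    #include <iterator>
    #include <utility>
    #include <vector>

    int main() {
        // Small undirected graph: two triangles sharing edge 1-2.
        int n = 5;
        std::vector<std::pair<int, int>> edges =
            {{0, 1}, {0, 2}, {1, 2}, {1, 3}, {2, 3}, {3, 4}};

        std::vector<int> deg(n, 0);
        for (size_t e = 0; e < edges.size(); e++) {
            deg[edges[e].first]++;
            deg[edges[e].second]++;
        }

        // Orient every edge from the lower-ranked endpoint (by degree, then
        // id) to the higher-ranked one; this lists each triangle exactly once.
        std::vector<std::vector<int>> adj(n);
        for (size_t e = 0; e < edges.size(); e++) {
            int u = edges[e].first, v = edges[e].second;
            if (deg[u] < deg[v] || (deg[u] == deg[v] && u < v))
                adj[u].push_back(v);
            else
                adj[v].push_back(u);
        }
        for (int u = 0; u < n; u++) std::sort(adj[u].begin(), adj[u].end());

        // For each directed edge (u,v), any w in both adj[u] and adj[v] closes
        // a triangle u-v-w. When RAM is smaller than the edge set, u's and v's
        // lists may reside in different disk partitions: the core difficulty.
        for (int u = 0; u < n; u++)
            for (size_t j = 0; j < adj[u].size(); j++) {
                int v = adj[u][j];
                std::vector<int> common;
                std::set_intersection(adj[u].begin(), adj[u].end(),
                                      adj[v].begin(), adj[v].end(),
                                      std::back_inserter(common));
                for (size_t k = 0; k < common.size(); k++)
                    printf("triangle %d-%d-%d\n", u, v, common[k]);
            }
        return 0;
    }

On this toy input the program prints triangles 0-1-2 and 1-2-3; the research question is how to partition the oriented graph across disk-resident subgraphs so that total I/O stays near optimal.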