| |
Streaming and
optimization of MapReduce
Sponsor: NSF
Abstract
As the field stands today, operating
systems provide a poor interface for data-intensive computing, requiring
programmers to engage in tedious, non-reconfigurable, and error-prone code
development. These software-engineering practices often lead to easily
exploitable vulnerabilities and devastating security breaches. This project
creates the algorithms and implementation for a virtual-stream interface that
enables development of big-data software that is more easily managed,
simpler to understand, inherently faster, and bug-free.
Conference
Papers
|
|
PDF,
PPT |
|
|
PDF,
PPT |
|
C. Hanel, A. Arman, D. Xiao, J. Keech, and D. Loguinov, "Vortex:
Extreme-Performance Memory Abstractions for Data-Intensive Streaming
Applications," ACM ASPLOS, March 2020.
|
PDF,
PPT |
Vortex
Vortex is a high-performance streaming produce-consumer framework for large
data workloads, with the reference implementation included below. The code also
contains such applications as file I/O, in-memory pipelines, and in-place
integer sorting. It can be compiled on Windows and Linux, each explained in a separate
section below.
- Windows
1) Visual Studio 2019 16.3.10 or earlier
is recommended. Later versions of the compiler have problems avoiding
conditional branches in sorting networks, for which we implemented a fix;
however, there is no guarantee that future versions of VS won't break. A
more long-term solution would be conversion of sorting networks to assembly,
which we may release in the future.
2) Make sure to check that the project is set for x64
Release.
3) To manage physical memory in user space, Windows requires that the
SE_LOCK privilege
be enabled for the current user. It is not sufficient to just be
an administrator.
4) The code itself must run with elevated privileges, which can be done by
any of the following: a) running Visual Studio as administrator; b) right
clicking the executable generated by Visual Studio and selecting "Run as
Administrator"; or c) disabling the
Run All Administrators in Admin Approval Mode local security policy.
Additionally, the project manifest can be modified to
automatically execute the program with admin permissions.
5) For maximum speed, Windows Server 2016 or older
is recommended. In newer kernels, MapUserPhysicalPages() (i.e., the kernel
API utilized by Vortex to map/unmap pages) is much slower, e.g., by 50% in
Server 2019 (build 17763) and 60% in Windows 10 (build 19035). Additionally,
both of these operating systems experience memory "rot" after some uptime,
which causes memory-intensive benchmarks (e.g., producer-consumer, sorting)
to take a ~30% performance hit. This can be fixed by rebooting the system.
- Linux
1) The code has been tested in
Ubuntu 18.10
and
gcc 9.3. If earlier versions of gcc are used, _xgetbv() may need to be
invoked
manually via inline assembly. This function is used to detect presence
of AVX, but is not critical to the rest of Vortex. It can thus be commented
out and cpuId.avx be hardcoded to either true or false.
2) Compilation is performed via the command line with the included makefile
in Vortex-1.0/Vortex. Type "make" and "./Vortex" upon success.
3) File I/O benchmarks have not been ported to
Linux due to the extra effort needed to rewrite the overlapped I/O model of
Windows.
4) Linux mremap() does not support remapping across VMAs. This leads to a
small increase in memory overhead during sorting compared to Windows.
- Additional considerations
1) Mitigation of Spectre/Meltdown bugs affects the speed of Vortex bucket
sort and various memory-intensive benchmarks. As such, it is recommended
that these features
be disabled to achieve peak performance.
2) The bucket sort prefers to use AVX offload during write-combine for a
small gain in speed; however, certain motherboards run AVX at lower
frequencies, which can make AVX slower than SSE. One option for overcoming
this is to set the AVX offset to zero in BIOS. The other is to assign
cpuId.avx = false and let Vortex fall back to SSE intrinsics.
3) If sorting large inputs (e.g., 256 GB), two levels of write-combine might
be beneficial, especially on multi-socket Xeons; however, this is currently
not implemented. Additionally, big arrays may need 5-level paging from the
OS and CPU to avoid running out of virtual memory. Future work will
look into this further.
4) While the main aim of Vortex sort is multi-GB datasets, small inputs
(e.g., a few hundred bytes) are supported; however, they are not meant to be
top-speed or memory-efficient.
5) For completeness, the code is templated to handle 8/16/32-bit keys in
addition to regular 64-bit keys, but these use cases are not well-developed,
extensively-tested, or super-efficient.
6) Vortex sort can handle non-uniform keys without overflowing memory or
running into problems; however, the nature of MSD radix sort guarantees that
certain skewed distributions will incur a performance drop. Future work will
examine how to overcome these issues and offer additional improvements
(e.g., sorting key-value pairs & variable-size keys).
The code is released under the
GPLv3 license. The
source can be downloaded here.
Origami
Origami is a fast SIMD mergesort framework for 32/64-bit keys and 128-bit
key-value pairs. It is released under the
GPLv3 license and its
source is available from
github or local download.
Tuxedo
Tuxedo is an efficient external-memory framework for bowtie streaming
(i.e., reading/writing concurrently from/to multiple files). Its main
contribution is to minimize the impact of seek delays on application throughput.
The Tuxedo source code is available from Dmitri
Loguinov.
|