Last year we looked at RAMCloud, an ultra-low latency key-value store combining DRAM and RDMA. (Also check out the team’s work on patterns for writing distributed, concurrent, fault-tolerant code and how to support linearizable multi-object transactions on RAMCloud with RIFL). Now the RAMCloud research group is looking to use RAMCloud as a platform on which to build other higher-level services, and this is where the simple key-value interface has started to become a limitation.
Developing new systems and applications on RAMCloud, we have repeatedly run into the need to push computation into storage servers.
Applications that chase inter-record dependencies (fetching one record, then using the returned value to fetch other records) need multiple round trips, and even the 5 µs round trip time RAMCloud offers is too high under these conditions. In fact, 5 µs turns out to be a very awkward amount of time: too long to spin without killing throughput, and too short to context switch without losing most of the gains in switching overhead.
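The round-trip arithmetic is easy to sketch. Assuming a chain of dependent lookups (the chain length of 4 below is hypothetical, just for illustration), the client-side approach pays one 5 µs round trip per hop, while a pushed-down procedure pays roughly one:

```python
# Back-of-envelope: chasing a chain of dependent records, one 5 us
# round trip per hop, versus a single pushed-down procedure.
RTT_US = 5.0          # RAMCloud round-trip time quoted in the article
CHAIN_LENGTH = 4      # hypothetical: record A points at B, B at C, ...

client_side_us = CHAIN_LENGTH * RTT_US   # one round trip per hop
pushed_down_us = RTT_US                  # one invocation walks the chain server-side
# (this ignores server-side execution time, small relative to the network)

print(client_side_us)                    # 20.0
print(pushed_down_us)                    # 5.0
print(client_side_us / pushed_down_us)   # 4.0x latency reduction
```

The ratio grows linearly with the chain length, which is why dependency-chasing workloads are the first to feel the pain.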
Applications that are throughput bound can’t push down operations like projection, selection, and aggregation, causing way too much data to be shipped to the client for filtering there.
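A minimal sketch of that bandwidth waste (the record layout, predicate, and selectivity below are all made up for illustration): with client-side filtering every record crosses the network; with server-side selection and projection only the matching field does.

```python
# Hypothetical table: 1000 records, 1 in 10 matches the predicate.
records = [
    {"id": i, "city": "SFO" if i % 10 == 0 else "NYC", "blob": "x" * 100}
    for i in range(1000)
]

# Client-side filtering: every record is shipped, then filtered locally.
shipped_client = len(records)
matches = [r["id"] for r in records if r["city"] == "SFO"]

# Pushed-down selection + projection: only matching ids are shipped.
shipped_pushdown = len(matches)

print(shipped_client, shipped_pushdown)  # 1000 100
```

With 10% selectivity and a projection down to a single field, the pushed-down version ships a small fraction of the bytes, and that is before counting the per-record blob payload the client-side version also drags across.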
The team have five key criteria for a server-side user code model:
Near native performance. RAMCloud keeps all data in DRAM and uses kernel-bypass networking, with operations dispatched in 1.9 µs. Low latency is what it’s all about; sacrificing it makes no sense in a RAMCloud world.
Low invocation overhead. With potentially millions of stored procedure invocations per second, just a few cache misses on each could reduce server throughput by 10% or more.
The ability to add and remove procedures at runtime without restarting the system.
Inexpensive isolation. A large RAMCloud cluster is likely to support multiple tenants, so switching between protection domains must be low overhead.
Low installation overhead – no expensive compilation phase to add procedures (the design should support ‘one-shot’ procedures).
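To see where the “a few cache misses could cost 10% or more” claim in the second criterion comes from, a rough calculation helps. The 1.9 µs dispatch time is from the paper; the ~100 ns last-level cache miss cost is an assumed, machine-dependent figure:

```python
# Rough numbers: dispatch takes 1.9 us (from the paper); a last-level
# cache miss is on the order of 100 ns (assumption, varies by machine).
dispatch_ns = 1900
miss_ns = 100        # assumed LLC miss latency
extra_misses = 2     # "just a few" misses per invocation

overhead = extra_misses * miss_ns / dispatch_ns
print(f"{overhead:.1%}")  # 10.5%
```

At millions of invocations per second there is simply no slack: two stray cache misses per call already eats a double-digit slice of the dispatch budget.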
What would you do?
Overall, SQL’s main drawback is that it is declarative. For most workloads this is a benefit, since the database can use runtime information for query optimization; however, it also limits generality. For example, implementing new database functionality, new operators, or complex algorithms in SQL is difficult and inefficient.
With SQL ruled out, the default option is probably native code:
Since the high performance of DRAM exposes any overhead in query execution, we initially expected that native code execution with lightweight hardware protections would be essential to the design.
The procedure mechanism incurs three forms of invocation overhead:
The one-off cost to compile / install a procedure
The cost to invoke the procedure
The cost for the procedure to invoke database functionality.
The tricky part with native code turns out not to be the cost of running procedures, but the cost of providing the required isolation: “we considered several approaches including running procedures in separate processes, software fault isolation, and techniques that abuse hardware virtualization features. These techniques show little slow down while running code, but they greatly increase the cost of control transfer between user-supplied code and database code.”
Hosts of V8 can multiplex applications by switching between contexts, just as conventional protection is implemented in the OS as process context switches… A V8 context switch is just 8.7% of the cost of a conventional process context switch… V8 will allow more tenants, and it will allow more of them to be active at a time at a lower cost.
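To put the 8.7% figure in absolute terms: the ratio is from the paper, but the process context switch cost below is an assumption for illustration only (real numbers vary widely by kernel, CPU, and cache state):

```python
# The paper reports a V8 context switch at 8.7% of the cost of a
# conventional process context switch. The absolute process-switch
# cost here is assumed, purely to get a feel for the magnitude.
process_switch_us = 2.0                     # assumed, machine-dependent
v8_switch_us = 0.087 * process_switch_us

print(round(v8_switch_us * 1000))           # 174 (ns per tenant switch)
```

At that scale a protection-domain switch costs on the order of a single cache miss rather than microseconds, which is what makes fine-grained multi-tenancy plausible at all.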
The paper doesn’t discuss WebAssembly at all, nor does it mention anywhere (that I spotted) what version of V8 the tests were run with. I note that WebAssembly is even faster than asm.js, and can also support Non-Web Embeddings. If you’re using V8 though as per this paper, you may not even need to do anything special to take advantage of WebAssembly – as of the recently created v6.1 development branch (August 2017) V8 will transpile any valid asm.js code to WebAssembly. (Ok, strictly embedders do need to do one small thing – enable the
Five key design features help to achieve good performance:
Exploit procedure semantics for efficient garbage collection. Most procedures complete quickly and this can be exploited to minimise garbage collection.
Lightweight per tenant state for fast protection domain switching.
Our next goal is to embed V8 into the RAMCloud server; to develop a smart API for procedures that exposes rich, low-level database functionality; and to begin experimenting with realistic and large scale applications.
And what might those applications be? We have some specific applications in mind:
distributed concurrency control operations
relational algebra operators
materialized view maintenance
partitioned bulk data processing as in MapReduce
custom data models such as Facebook’s TAO.