October 08, 2018
This paper was published at OSDI’18; the authors are from CMU, Pennsylvania State University, and Microsoft Research. The paper studies request tail latency in web services and claims to present the first caching system that minimizes request tail latency. Through its novel approach of dynamically reallocating cache resources from the cache-rich to the cache-poor, RobinHood meets a 150ms P99 goal 99.7% of the time in the presence of load spikes on a testbed modeled on Microsoft’s OneRF production system.
Providers of large user-facing web services have long faced the challenge of achieving low request latency. In real-world systems, which are commonly multitier architectures, incoming requests are complex, consisting of multiple queries, and it is difficult to maintain low tail latencies. The queries generated by a single request are processed independently, in parallel, and spread over many backend services. The overall request latency is defined as the latency of the request’s slowest query. Moreover, it is common in multitier architectures that the particular backends causing high request latencies change over time.
Existing solutions include:
Load balancing between servers. This approach aims to equalize query tail latencies between servers. Unfortunately, load balancing is fundamentally limited in a multitier architecture, where a given backend typically cannot answer a query originally intended for another backend system. While limited load balancing can be done between replicas of a single backend system, it is impossible across different backends.
Dynamic auto-scaling. This approach temporarily allocates additional servers to the backends currently experiencing high latency. Unfortunately, dynamic auto-scaling is difficult because backends like OneRF’s are stateful. Many of the backends at Microsoft already do some form of auto-scaling, and the systems are still affected by latency spikes.
Real-world Example - OneRF
The OneRF page rendering framework at Microsoft serves a wide range of content including news and online retail software stores. It relies on more than 20 backend systems. Each OneRF application server has a local cache. Incoming requests are split into queries, which are first looked up in the cache. Cache misses are then forwarded, in parallel, to clusters of backend servers. Once every query has been answered, the application server can serve the user request. Thus, each request takes as long as its slowest query. A OneRF request can send any number of queries (including 0) to each backend system.
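This request model can be made concrete with a minimal sketch (the backend names and latency values are hypothetical, chosen only to illustrate the structure):

```python
# Hypothetical query latencies (ms) for one request fanned out to backends.
query_latencies = {"news": 12.0, "store": 85.0, "images": 30.0}

# Queries are processed in parallel, so the request completes only when
# its slowest query has been answered.
request_latency = max(query_latencies.values())
print(request_latency)  # -> 85.0
```

Even though "news" and "images" answer quickly, the request is blocked on "store" — which is exactly why improving a single backend’s cache hit ratio may or may not help the request tail.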
Time-varying latency imbalance. Latencies of different backends vary widely, and backend latencies also change over time. Most existing caching systems implicitly assume that latency is balanced, as exemplified by fair-sharing policies such as FairRide. If latencies are imbalanced between backends, two cache misses to different backends should not be treated equally.
Latency is correlated neither with specific queries nor with query rate. Many caching schemes share cache space among the backends and use a common eviction policy (such as LRU). Shared caching systems inherently favor backends with higher query rates. Another common approach is to partition cache space to maximize the overall cache hit ratio, allocating cache space in proportion to query rate; this leads to suboptimal allocations when latency is uncorrelated with query rate.
Latency depends on request structure, which varies greatly. The manner in which an incoming request is split into parallel backend queries by the application server varies between requests. Few caching systems incorporate latency into their decisions, and those that do consider the average query latency as opposed to the tail request latency.
In light of these challenges, the authors propose RobinHood, a novel caching system that dynamically allocates cache space to cache-poor backends (those responsible for high request tail latency) while stealing space from cache-rich backends (those that do not affect the request tail latency). This contrasts with traditional caching systems, which are often designed only to improve the average (not tail) latency of individual queries (not requests).
RobinHood Caching Algorithm
To reallocate cache space, RobinHood repeatedly taxes every backend, reclaiming 1% of its cache space, identifies which backends are cache-poor, and redistributes the pooled space to those backends. RobinHood operates over time windows of ∆ seconds. Within each window, for every request, RobinHood tracks the ID of the backend corresponding to the request’s slowest query. It then counts the number of times each backend produced the slowest query in a request; each backend’s total count is called its request blocking count (RBC). RobinHood treats a backend’s RBC as a measure of how cache-poor it is and distributes the pooled tax to each backend in proportion to its RBC.
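One window of this tax-and-redistribute loop can be sketched as follows. This is a simplified illustration, not the paper’s implementation: the function name, data shapes, and example values are my own, requests are modeled as dicts mapping backend IDs to query latencies, and cache sizes are plain numbers rather than real cache partitions.

```python
from collections import Counter

def robinhood_reallocate(requests, caches, tax=0.01):
    """One RobinHood window (illustrative sketch).

    requests: list of dicts, each mapping backend id -> query latency
              for one request's fan-out.
    caches:   dict mapping backend id -> current cache allocation.
    """
    # RBC: how often each backend produced the slowest query in a request.
    rbc = Counter(max(r, key=r.get) for r in requests if r)

    # Tax every backend 1% of its cache space into a common pool.
    pool = 0.0
    for backend in caches:
        taxed = caches[backend] * tax
        caches[backend] -= taxed
        pool += taxed

    # Redistribute the pooled tax in proportion to each backend's RBC,
    # so cache-poor backends (high RBC) receive the most space.
    total_rbc = sum(rbc.values())
    for backend, count in rbc.items():
        caches[backend] += pool * count / total_rbc
    return caches

# Hypothetical window: backend "a" blocks two requests, "b" blocks one.
caches = robinhood_reallocate(
    [{"a": 50, "b": 10}, {"a": 40, "c": 5}, {"b": 30, "a": 10}],
    {"a": 100.0, "b": 100.0, "c": 100.0},
)
print(caches)  # "a" gains space, "c" only pays the tax
```

In the example, each backend pays 1.0 into a pool of 3.0; backend "a" (RBC 2) receives 2.0 back and ends up at 101.0, "b" (RBC 1) stays at 100.0, and "c" (RBC 0) shrinks to 99.0.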
Q. Pu, H. Li, M. Zaharia, A. Ghodsi, and I. Stoica. FairRide: Near-optimal, fair cache sharing. In USENIX NSDI, pages 393–406, 2016.