Leases: An efficient fault-tolerant mechanism for distributed file cache consistency – Gray & Cheriton 1989
This paper introduced the leasing model for distributed systems. Leases are conceptually very straightforward and bring a surprising number of benefits for such a simple mechanism. Also in this paper you’ll find the simple formulas that can help you figure out the optimum lease term for your workloads.
Caching introduces the problem of ensuring consistency between the cached data and its primary location of storage. By consistent, we mean that the behaviour is equivalent to their being only a single (uncached) copy of the data except for the performance benefit of the cache.
Now that’s one of the most straightforward definitions of consistent you’re likely to find!
A distributed system… can experience partial failures: a host may crash or messages may be lost. Existing approaches to consistency for file caches fall into two categories: those that assume reliable broadcast, and so do not tolerate communication failures, and those that require a consistency check for every read, and so fail to deliver good performance.
Those were simpler times 😉 Leases are a very elegant solution to these problems.
A lease is a contract that gives its holder specified rights over property for a limited period of time. In the context of caching, a lease grants to its holder control over writes to the covered datum during the term of the lease, such that the server must obtain the approval of the leaseholder before the datum may be written…. A cache using leases requires a valid lease on the datum (in addition to holding the datum) before it returns the datum in response to a read, or modifies the datum in response to a write.
From a performance perspective, one big advantage of leases is that once a lease is held, subsequent reads can return straightaway. Leases are also very easy to implement on the server-side: using short leases, you can always just wait for the lease time to expire to get back to a known state.
Short lease terms have several advantages. One is that they minimize the delay resulting from client and server failures (and partitioning communication failures). When the server cannot communicate with a client, the server must delay writes to a file for which the failed client holds a lease until that lease has expired.
Short lease terms also minimize the overhead due to false sharing – where a client wants to write to a file that is covered by a lease held by another client, when in fact that client is no longer using it.
Key to the overall performance of the system is choosing an appropriate lease term:
The choice of lease term is based on the trade-off between minimizing the lease extension overhead versus minimizing false sharing.
Given a read rate of R, write rate of W, and a file shared between S caches then the lease benefit factor B is given by B = 2R/SW. And you’ll be getting benefit from a leasing scheme so long as the amount of time a client holds a lease (without renewing) is > 1/(R(B-1)).
Leases have very nice fault-tolerant properties:
Leases ensure consistency provided that the hosts and network do not suffer certain Byzantine failures including clock failure. More specifically, consistency is maintained in spite of message loss (including partition), and client or server failures (assuming writes are persistent at the server across a crash). Moreover, availability is not reduced by the caches because an unreachable client at most briefly delays write access by other clients.
In a forward-looking statement in the conclusion, the authors state:
Lease appear well-suited to large-scale distributed systems. The improvement in response time that they offer is more significant for the faster processors and higher delay networks.
In their 2013 “Scaling memcache at Facebook” paper, Nishtala et al. discuss the introduction of a lease mechanism to cope with ‘thundering herds.’ Without leases, all of the cache misses resulted in a peak database query rate of 17K/s. With leases,the peak database query rate was 1.3K/s.