Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter

Panopticon: An Omniscient Lock Broker for Efficient Distributed Transactions in the Datacenter – Tasci & Demirbas, 2015

Today we return to the theme of distributed transactions, and a paper that won a best paper award from IEEE Big Data in 2015. Panopticon is a centralized lock broker (like Chubby and ZooKeeper) that manages distributed (decentralized) locks (unlike Chubby and ZooKeeper). Full distributed transaction support is built on top of this lock broker. A central idea is that the lock for a data item does not have to be held in the same place as the data item itself. This allows locks to migrate (even if the corresponding data items don’t), and improves the efficiency of the transaction commit protocol. The paper is an easy read, but I found myself wishing it was a little more formal at points – these things are hard to reason about!

(Distributed) transactions have gained a reputation for being unscalable…

The authors identify two main reasons for this:

  1. The coordination required for acquiring multiple locks in a distributed setting (two-phase locking, two-phase commit) becomes increasingly costly as the number of servers and data items involved increases.
  2. There is a big latency difference between local and remote access. When the lock for a data item is tied to the same location as the data item itself (and techniques such as consistent hashing are used to place data without locality considerations), this penalises transaction latencies significantly, especially during the lock acquisition phase.

Remember that we’re taking as a baseline the fact that an application may need to read and write multiple data items as part of some business use case. The overhead we’re concerned with here is doing so transactionally.

Traditional distributed transactions employ two-phase locking, and to prevent deadlocks the server initiating the transaction must contact the other servers for locks in increasing lock order. Instead of contacting the other servers one at a time, acquiring locks serially in increasing order, it is more efficient to go to the broker and test/set all the locks at once. In Panopticon, the lock request is sent at once to the broker, and the broker takes care of deadlock prevention.
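
To make that concrete, here’s a minimal client-side sketch (my own, not from the paper) of what a batched lock request might look like. The `LockBroker` interface and its `acquireAll` method are hypothetical stand-ins for whatever wire protocol Panopticon actually uses:

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: instead of contacting each data server serially in
// increasing lock order, the transaction initiator bundles every lock it
// needs into a single request to the broker.
public class BatchLockClient {

    // Hypothetical broker interface: test-and-set all locks in one round trip.
    interface LockBroker {
        boolean acquireAll(String requestingServer, SortedSet<String> lockIds);
    }

    private final LockBroker broker;
    private final String serverId;

    BatchLockClient(LockBroker broker, String serverId) {
        this.broker = broker;
        this.serverId = serverId;
    }

    boolean beginTransaction(List<String> dataItemIds) {
        // Sorting by data item id means every transaction requests locks in
        // the same global order, so the broker can grant them deadlock-free.
        SortedSet<String> lockIds = new TreeSet<>(dataItemIds);
        // One message to the broker replaces a chain of server-to-server
        // lock requests in traditional distributed two-phase locking.
        return broker.acquireAll(serverId, lockIds);
    }
}
```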

Panopticon divides locks into three categories based on their access patterns:

  1. Locks that are accessed from multiple different servers
  2. Locks that receive repetitive access from the same server
  3. Locks that aren’t accessed very much

Observe that:

It is best to host type 1 locks (locks that keep receiving across-server accesses) in the lock broker. And it is best to assign the type 2 locks to the requesting server to avoid the overheads of repetitive requests from that server to the broker.

Panopticon uses heuristics to migrate locks so as to position them in the best place possible. All locks start out (as type 3 locks) on the same server as the data item they protect. A server contacts the lock broker only if it requires a lock that is not kept locally.

If a server s requests some lock l from the broker (i.e., a lock that is not local to s), then the lock is migrated to the broker if it is not already there. The broker then grants it to s. When s releases the lock, l stays resident centrally at the broker. l is now treated as a type 1 lock.

If a lock held at the broker is not accessed for a long time, and the broker needs to save space, an LRU policy can be used to migrate type 1 locks back to the server owning the corresponding data item (becoming type 3).

We use the following rule of thumb for declaring a lock to be of type 2 and migrating that lock to a server: if two consecutive requests for a given lock l (held at the broker) come from the same server w, then the broker migrates lock l to server w. From that point on, w treats l as its local lock. This improves the lock locality of w, since w does not need to contact the broker for l again.

A type 2 lock can migrate back to the broker again if a request for it is received from some other server than the one it currently resides on. In this manner, lock placement adapts at runtime as access patterns change.
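
Pulling those migration rules together, here is a rough sketch of the placement heuristic as I read it. The class, enum, and method names are mine, not the paper’s, and the real bookkeeping (grant queues, LRU eviction) is elided:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of Panopticon's lock placement heuristic as I read it.
public class LockPlacement {

    enum Location { LOCAL_TO_DATA, AT_BROKER, AT_SERVER }

    static class LockState {
        Location location = Location.LOCAL_TO_DATA; // type 3: starts with its data item
        String host;          // server hosting the lock when it is not at the broker
        String lastRequester; // used to spot two consecutive requests from one server
    }

    private final Map<String, LockState> locks = new HashMap<>();

    // Called at the broker when server 'requester' asks for 'lockId'.
    void onRequest(String lockId, String requester) {
        LockState s = locks.computeIfAbsent(lockId, id -> new LockState());
        switch (s.location) {
            case LOCAL_TO_DATA:
                // First remote access: pull the lock up to the broker (type 3 -> type 1).
                s.location = Location.AT_BROKER;
                break;
            case AT_BROKER:
                if (requester.equals(s.lastRequester)) {
                    // Two consecutive requests from the same server:
                    // migrate the lock to that server (type 1 -> type 2).
                    s.location = Location.AT_SERVER;
                    s.host = requester;
                }
                break;
            case AT_SERVER:
                if (!requester.equals(s.host)) {
                    // A different server wants a type 2 lock: bring it back to the broker.
                    s.location = Location.AT_BROKER;
                    s.host = null;
                }
                break;
        }
        s.lastRequester = requester;
    }

    // LRU eviction (not shown): a cold type 1 lock migrates back to the server
    // owning its data item, becoming type 3 again.
}
```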

As the centralized authority for mediating access to data, the broker learns about the access patterns of transactions at runtime and manages the migration of locks to servers in a way that improves lock access locality. The lock broker, however, is oblivious to the state of the transactions, and the servers are the ultimate transaction managers. Transactions are initiated and executed by the servers in a distributed fashion, after checking that all the locks are available at the server.

Panopticon uses a form of two-phase locking to manage transactions. When a server initiates a distributed transaction, it requests all the locks it needs in a batch request sent to the broker. The broker orders the locks by data item ID so that all transactions always attempt to acquire locks in the same order (preventing deadlock). The broker works through the requested locks in order.

  • If the broker already holds a lock, it forwards it to the requesting server.
  • If the broker does not have the lock, it adds the requesting server’s name to a request queue that the broker maintains for that lock, and forwards a request for the lock to the current owning server. When the lock becomes available, the broker forwards it to the server at the head of the queue (a sketch of this grant loop follows below).
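
Sketching the broker side of this (again my own reconstruction, with hypothetical names and the actual network calls elided), the grant loop might look roughly like this:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.SortedSet;

// Hypothetical reconstruction of the broker's grant loop for a batch request.
public class BrokerGrantLoop {

    private final Map<String, Boolean> heldAtBroker = new HashMap<>();  // lockId -> broker currently holds it
    private final Map<String, String> remoteOwner = new HashMap<>();    // lockId -> server currently hosting it
    private final Map<String, Queue<String>> waiters = new HashMap<>(); // lockId -> servers queued for it

    // Locks arrive already sorted by data item id, so all transactions
    // acquire them in the same global order (deadlock prevention).
    void handleBatchRequest(String requester, SortedSet<String> lockIds) {
        for (String lockId : lockIds) {
            if (heldAtBroker.getOrDefault(lockId, false)) {
                grant(lockId, requester);
            } else {
                // Queue the requester and ask the current owner to return the lock.
                waiters.computeIfAbsent(lockId, id -> new ArrayDeque<>()).add(requester);
                requestReturn(lockId, remoteOwner.get(lockId));
            }
        }
    }

    // Called when a server returns a lock to the broker.
    void onLockReturned(String lockId) {
        heldAtBroker.put(lockId, true);
        Queue<String> q = waiters.get(lockId);
        if (q != null && !q.isEmpty()) {
            grant(lockId, q.poll());   // the head of the queue gets the lock
        }
    }

    private void grant(String lockId, String server) {
        heldAtBroker.put(lockId, false);
        remoteOwner.put(lockId, server);
        // ... send the lock to 'server' over the network (elided) ...
    }

    private void requestReturn(String lockId, String owner) {
        // ... ask 'owner' to ship the lock back to the broker (elided) ...
    }
}
```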

If the server initiating the transaction ultimately requires, say, four locks – l1, l2, l3, and l4 in that order – it may be that at some point it holds locks l1, l2, and l4. However, lock l4 can be taken away from it at any point until the server has also acquired lock l3. Locks l1 and l2 are considered to be ‘authoritatively owned’ by the server and will never be stolen from it.
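
The ‘authoritatively owned’ rule can be read as a simple predicate: a held lock is safe from being recalled only if the server also holds every lower-ordered lock in its requested set. A hedged sketch of that check (my formulation, not the paper’s):

```java
import java.util.SortedSet;

// My reading of the 'authoritative ownership' rule: a held lock cannot be
// revoked once the server also holds every lower-ordered lock it requested.
public class AuthoritativeOwnership {

    // 'requested' is the transaction's lock set in global order; 'held' is the set granted so far.
    static boolean isAuthoritative(SortedSet<String> requested, SortedSet<String> held, String lock) {
        if (!held.contains(lock)) return false;
        // Every requested lock that orders before 'lock' must also be held.
        return held.containsAll(requested.headSet(lock));
    }
}
```

With the example above (requested = {l1, l2, l3, l4}, held = {l1, l2, l4}), l1 and l2 pass the check but l4 does not, so l4 can still be taken away until l3 arrives.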

After a transaction is finished, the server needs to unlock the data items, which means returning the locks back to the broker. In this phase we propose an optimization where the server lazily unlocks the locks. Lazy unlocking means that the locks are released locally, but not transmitted back to the broker until δ time elapses, where δ is empirically determined. Lazy unlocking provides efficiency benefits for cases when the server needs to access the same data items used in the terminated transaction immediately in the next transaction.

If a lazy-unlocked data item lock is requested in the meantime, it is immediately returned to the broker.
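
Here’s a small sketch of how lazy unlocking might be implemented on a server (my own reconstruction; the DELTA_MS constant and method names are hypothetical, and δ is empirically tuned in the paper):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of lazy unlocking: released locks linger locally for DELTA_MS before
// being shipped back to the broker, so an immediately-following transaction
// can reuse them without a broker round trip.
public class LazyUnlocker {

    private static final long DELTA_MS = 50;                         // hypothetical delta
    private final Map<String, Long> releasedAt = new HashMap<>();    // lockId -> local release time

    // Transaction finished: release locally and start the delta clock.
    synchronized void lazyUnlock(String lockId) {
        releasedAt.put(lockId, System.currentTimeMillis());
    }

    // The next transaction on this server wants the lock before delta expires: reuse it.
    synchronized boolean tryReuse(String lockId) {
        return releasedAt.remove(lockId) != null;
    }

    // The broker asks for a lazy-unlocked lock: give it back immediately.
    synchronized boolean surrenderIfLazy(String lockId) {
        return releasedAt.remove(lockId) != null;
    }

    // Periodic sweep: locks whose delta has elapsed are returned to the broker.
    synchronized void sweep() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = releasedAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() >= DELTA_MS) {
                it.remove();
                returnToBroker(e.getKey());
            }
        }
    }

    private void returnToBroker(String lockId) {
        // ... ship the lock back to the broker (elided) ...
    }
}
```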

In order to give more scalability and avoid bottlenecks when using an extremely large number of servers and locks, we can employ a hierarchical composition of the brokers. For this we have k level-0 lock brokers, each overseeing a cluster of servers, and a top-level lock broker overseeing these k lock brokers… Moreover, using hierarchical composition of brokers at different datacenters, the Panopticon system can provide a partial answer to the across-datacenter/WAN transactions problem. Providing an efficient and complete system for across-datacenter transactions remains part of our future work.
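
As a rough illustration of that hierarchy (my own sketch, with hypothetical names), a level-0 broker might route requests like this:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the hierarchical composition as I understand the proposal: each
// level-0 broker serves its own cluster and escalates to the top-level broker
// only for locks it does not manage itself.
public class Level0Broker {

    // Hypothetical interface to the single top-level broker overseeing the k level-0 brokers.
    interface TopLevelBroker {
        void acquire(String lockId, String requester);
    }

    private final Set<String> locallyManagedLocks = new HashSet<>();
    private final TopLevelBroker parent;

    Level0Broker(TopLevelBroker parent) {
        this.parent = parent;
    }

    void onRequest(String lockId, String requester) {
        if (locallyManagedLocks.contains(lockId)) {
            // Handled entirely within this cluster: no cross-cluster traffic.
            grantLocally(lockId, requester);
        } else {
            // Escalate to the top-level broker, which mediates across clusters.
            parent.acquire(lockId, requester);
        }
    }

    private void grantLocally(String lockId, String requester) {
        // ... same grant path as the single-broker case (elided) ...
    }
}
```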

Partition detection follows very simple rules, given that there is a central broker. If you can see the broker, you’re in the main partition, and if you can’t, you’re not! Providing continued operation for data items locked by locks outside of the main partition is future work.

Panopticon is built on top of Hazelcast, and compares favourably to the default distributed transaction protocol in Hazelcast. The tests use artificially generated workloads. It would be interesting to see how Panopticon performs with real workloads (in particular, the effectiveness of Panopticon is sensitive to the parameter Phist that determines how likely it is that a server uses the same objects in consecutive transactions – just how likely is this in real workloads with load balancing etc.? (genuine question on my part)).

A few other things I’m left wondering about as well: the prepare and commit phases still need to happen at every RM, even if the locks move around…. what gets logged and where in the WAL? (Records normally hold information about locks held). How does recovery work? How do you prevent the broker from becoming a single point of failure? And if it isn’t, do you have to run some kind of consensus protocol across multiple brokers…?