May 10, 2013
This is the last part of a 4-part series on configuring Riak’s oft-subtle behavioral characteristics
For my final post, I’ll tackle two very different objectives and show how to choose configuration values to support them.
Fast and sloppy
Let’s say you’re reading and writing statistics that don’t need to be perfectly accurate. For example, if a website has been “liked” 300 times, it’s not the end of the world if that site reports 298 or 299 likes, but it absolutely is a problem if the page loads slowly because it takes too long to read that value.
We want to maximize performance by responding to the client as soon as we have read a value or confirmed a write.
Before Riak 1.3.1, this would have been implied by
W=1, but that is no longer true. In any event, better to be explicit: we only need one vnode to send the data to its backend before the client receives a response.
We don’t need to spend time pulling and updating vector clocks; just write the latest value as quickly as possible.
Not so obvious
One problem with
R=1 is that
notfound_ok=true by default. This means that if the first vnode to respond doesn’t have a copy, the failure to find a value will be treated as authoritative and thus the client will receive a
If a webpage has been liked 5000 times, but one of the primary nodes has fallen over and a failover node is in place without a complete data set, you don’t want to report 0 likes.
notfound_ok=false instructs the coordinating node to wait for something other than a
notfound error before reporting a value.
However, waiting for all 3 vnodes to report that there is no such data point may be overkill (and too much of a hit on performance), so you can use
basic_quorum=true to short circuit the process and report back a
notfound status after only 2 replies.
Playing with fire
This is not for the faint of heart, but if you can accept the fact that that you will occasionally lose access to data if two nodes are unavailable, this will reduce the cluster traffic and thus potentially improve overall system performance. This is definitely not a recommended configuration.
Strict, slow, and failure-prone
At the other end of the spectrum, if you want to favor consistency at the expense of availability, you can certainly do so.
Read Your Own Writes (RYOW), as we discussed in an earlier post. Be prepared for more frequent failures, however, since the cluster will not be at liberty to distribute reads and writes as widely as possible.
If a conflict somehow does occur, give all of the values to the application to resolve.
Not so obvious
By retaining the vector clocks for deleted objects, we enhance the overall data integrity of the database over time and sidestep problems with resurrected deleted objects as we saw in the previous blog post.
This comes at a price: more disk space will be used to retain the old objects, and there will be many more tombstones for clients to recognize and ignore. (You have written your code to cope with tombstones and even tombstone siblings, haven’t you?)
I hope this series of blog posts has helped answer some of the mysteries, and will help you avoid some stumbling blocks, involved with running Riak.
If you occasionally wonder why relational databases have been around for so many decades and still have a hard time scaling horizontally, I suggest you think back to the tradeoffs and hard questions posed herein and consider this: all of this is to keep a simple key/value store running fast and answering questions as accurately as possible given the constraints imposed by inevitable hardware and network failure.
Imagine how much harder this is for a relational database with its dramatically more complex storage and retrieval model.
Distributed systems are not your father’s von Neumann machine.
If you’d like to simulate production load or experiment with various failure modes, here are tools which may be of assistance.
Basho Bench is a benchmarking tool that can generate repeatable performance tests.
- GitHub repository
Quick summary of configuration parameters
The effective default value is listed for each value. Most of these are also
covered in our online documentation.
As mentioned in part 2, the capitalized parameters I’ve been using are for aesthetic purposes, and the real values (e.g.,
n_val in place of
dw in place of
DW) are shown below.
Reading and writing
- All of the remaining values under this heading can be no larger than
- Value is in milliseconds, hence 3 seconds by default
- Alternative values are
- Brewer’s CAP theorem (Wikipedia)
- Brewer’s article on the CAP theorem, 12 years later
- Basho’s CRDT project
- Original CRDT research paper (PDF)
Eventual consistency (Riak docs) Vector clocks (Riak docs) Dynamo paper with annotations for Riak’s architecture Overview of Riak’s concepts Riak Glossary
- Failure scenarios in Riak, from Ryan Zezeski’s Try-Try-Try project