Skizze: Second Alpha Release
A few months ago, my good friend Seif Lotfy released the first alpha version of Skizze - a sketch datastore that dealt with problems around counting and sketching at scale.
The initial release of Skizze got some amazing feedback and we've spent the past few months working on Skizze when we can, implementing those good ideas, adding more structure to the project, adding client bindings, and thinking more deeply about the problems we'd like to solve.
While we still wanted to do more cleanups, refactoring, and features, our main goal was to build something that we could ask people to test as soon as possible, both for early feedback and to avoid getting lost in an endless cycle of tweaking!
We're hoping more people can test it out now, giving us a new wave of feedback on what we've done so far and ideas for going forward!
You can get the latest Skizze Server & CLI from the GitHub repo. There are also Go and Node.js clients.
I should reiterate that this is an alpha release - it's not ready for production data yet!
The rest of this post gives a bit of background on why Skizze exists, and some basics around its data structures & usage. More details around the technical aspects & benchmarks will be forthcoming in Seif's blog.
The Problem
Seif's research into the problem was born from the constant battle we faced while scaling Xamarin Insights - we had so many features that we wanted to feel 'live' on the dashboard, and the thought of calculating them in background jobs felt like the wrong UX for our customers. Many of these problems involved needing to answer the following kinds of questions immediately for an app, any user of an app, or any other unique piece of data generated by the app:
- Have we seen this user before this hour/day?
- Is this the first time we've seen this device this month? If so, increment its make/model in the hashset so the dashboard can quickly pull the current top 10 for the overview page
- How many unique users have used the app, sent this event, had a crash, had a warning, etc, etc in the timeframe?
- How often have I seen this user today?
- What are the trending crashes for the hour, the day, the week, etc?
These kinds of questions are easily solvable locally, mostly solvable in a small service, but get quickly out of hand when the numbers you start dealing with are millions of devices & users, which are producing hundreds of millions of crashes and billions of events.
Seif researched how to calculate this data real-time enough that the dashboard would feel like it was updating instantly. Our main concerns were around the speed, storage, and various calculation error-rates of this data. You could easily do this with sets, or hashsets, etc, but you'd quickly need an army of Redises (Redii?) to have any chance of keeping this in memory.
Instead of that, we explored various tricks that allowed us to store some of this data in memory with a much smaller footprint, while staying within limits for error-rates. During this time, we encountered and toyed with lots of different data structures, many of which you've heard of before: Bloom filters, HyperLogLogs, topK, Count Min Log, and so on.
We ended up implementing some of these in Redis and other memory stores (our data was rebuildable, and the dashboard could fall back to slower paths, so we were only interested in memory stores). However, we quickly realised that we could get much more performance out of a dedicated database that specialised in this problem. So, with some focus on the problem at hand in our spare time, Skizze was born!
Meet the Sketches
The core of Skizze is its Sketches. Each sketch solves a specific problem with the data you pass into it. They all work inside error thresholds and use minimal memory. To use Skizze effectively is to learn its Sketches and how best to name, deploy, and query them (more blog posts on this soon!).
The examples below assume you're running the Skizze server locally and have a skizze-cli session active.
Frequency
A Frequency Sketch will calculate frequencies of its member values. This helps solve the "How many times has X happened?" problems:
skizze> CREATE FREQ users:20160301 10000
skizze> ADD FREQ users:20160301 natalie paul robert robert natalie natalie
skizze> GET FREQ users:20160301 paul natalie robert
Value: paul Hits: 1
Value: natalie Hits: 3
Value: robert Hits: 2
skizze> ADD FREQ users:20160301 natalie paul
skizze> GET FREQ users:20160301 paul natalie robert
Value: paul Hits: 2
Value: natalie Hits: 4
Value: robert Hits: 2
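To give a feel for how a frequency sketch can keep counts in fixed memory, here is a minimal Count-Min sketch in Python. It is an illustration only and not Skizze's implementation: the class name, the width and depth defaults, and the choice of hash are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: estimates can over-count, never under-count."""

    def __init__(self, width=1024, depth=4):
        # width and depth are arbitrary example values, not Skizze defaults.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, value):
        # One hash per row, derived by salting the hash with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(value.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, count=1):
        for row, col in self._cells(value):
            self.rows[row][col] += count

    def get(self, value):
        # Collisions can only inflate a cell, so the smallest cell is the best estimate.
        return min(self.rows[row][col] for row, col in self._cells(value))


sketch = CountMinSketch()
for name in "natalie paul robert robert natalie natalie".split():
    sketch.add(name)
print(sketch.get("natalie"))  # an estimate that is at least the true count of 3
```

Memory use depends only on width and depth, not on how many distinct values arrive. The trade-off is that colliding values can push an estimate above the true count.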
Membership
A Membership Sketch will keep a record of the membership of a given value in the set, answering the "Have I seen this X before?" question:
skizze> CREATE MEMB users:20160301 10000
skizze> ADD MEMB users:20160301 natalie paul robert robert natalie natalie
skizze> GET MEMB users:20160301 paul natalie robert
Value: paul Member: true
Value: natalie Member: true
Value: robert Member: true
skizze> ADD MEMB users:20160301 natalie paul
skizze> GET MEMB users:20160301 paul natalie robert jono
Value: paul Member: true
Value: natalie Member: true
Value: robert Member: true
Value: jono Member: false
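The classic structure behind this kind of membership question is the Bloom filter mentioned earlier. The following is a minimal sketch of one in Python, written for illustration and not taken from Skizze: the class name, bit-array size, and number of hashes are assumptions.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=8192, hashes=4):
        # size_bits and hashes are arbitrary example values.
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, value):
        # Derive several bit positions per value by salting the hash.
        for i in range(self.hashes):
            digest = hashlib.blake2b(value.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # A value is "probably present" only if every one of its bits is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


seen = BloomFilter()
for name in "natalie paul robert robert natalie natalie".split():
    seen.add(name)
print("paul" in seen)  # True: values that were added are always reported as members
```

A value that was never added is usually reported as absent, but it can be reported as present if other values happen to have set all of its bits. That false-positive rate is the error threshold you tune with the filter's size.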
Rankings
A Rankings Sketch keeps a chosen 'top K' list of rankings, where K can be the value that makes sense for you (top 10, top 100, top 1000). Divided up by a timeframe, this can help you answer questions like "What was the most popular make & model of device today?":
skizze> CREATE RANK users:20160301 10
skizze> ADD RANK users:20160301 natalie paul robert robert natalie natalie
skizze> GET RANK users:20160301
Rank: 1 Value: natalie Hits: 3
Rank: 2 Value: robert Hits: 2
Rank: 3 Value: paul Hits: 1
skizze> ADD RANK users:20160301 natalie paul jono
skizze> GET RANK users:20160301
Rank: 1 Value: natalie Hits: 4
Rank: 2 Value: robert Hits: 2
Rank: 3 Value: paul Hits: 2
Rank: 4 Value: jono Hits: 1
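One well-known way to keep a top-K list in bounded memory is the Space-Saving algorithm. The sketch below shows the idea in Python. It is an assumption for illustration, since this post does not say which algorithm Skizze's Rankings Sketch uses, and the class and method names are made up.

```python
class SpaceSaving:
    """Illustrative Space-Saving counter that tracks at most k values."""

    def __init__(self, k):
        self.k = k
        self.counts = {}

    def add(self, value):
        if value in self.counts:
            self.counts[value] += 1
        elif len(self.counts) < self.k:
            self.counts[value] = 1
        else:
            # Table is full: evict the smallest entry and let the newcomer
            # inherit its count plus one, so frequent values are never lost.
            victim = min(self.counts, key=self.counts.get)
            self.counts[value] = self.counts.pop(victim) + 1

    def top(self):
        # Highest counts first.
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)


ranks = SpaceSaving(k=10)
for name in "natalie paul robert robert natalie natalie".split():
    ranks.add(name)
print(ranks.top())  # [('natalie', 3), ('robert', 2), ('paul', 1)]
```

While fewer than K distinct values have been seen, the counts are exact. Once the table is full, counts for newly admitted values become upper bounds.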
Cardinality
Cardinality is probably the best-known problem that needs solving at some time or another at scale. You have a bunch of values streaming in and it's important to quickly know the number of unique values in that set (sets can be overall, time-based, or segment-based). With Skizze, answering "How many unique values are in this set?" is as easy as:
skizze> CREATE CARD users:20160301
skizze> ADD CARD users:20160301 natalie paul robert robert natalie natalie
skizze> GET CARD users:20160301
Cardinality: 3
skizze> ADD CARD users:20160301 natalie paul jono
skizze> GET CARD users:20160301
Cardinality: 4
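HyperLogLog, mentioned earlier, is the usual structure for this kind of estimate. Below is a compact Python sketch of the textbook algorithm with the standard small-range correction. It is not Skizze's code: the class name, the precision of 10 bits (1024 registers), and the hash function are assumptions for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with 2**precision registers."""

    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.blake2b(value.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")             # 64-bit hash
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting on empty registers.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


users = HyperLogLog()
for name in "natalie paul robert robert natalie natalie".split():
    users.add(name)
print(users.count())  # an estimate; re-adding the same names never changes it
```

Each register only ever stores a maximum, so duplicates are absorbed for free and the memory footprint stays fixed at 1024 small integers, however many values stream in.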
Domains: Same Data, Many Questions
One of the first pieces of feedback we got with the initial release was: "My data needs all four Sketches and I'm sending the values four times to the server!" That's a fair point: for most streams of values, you want all four kinds of questions answered. Domains help solve this problem.
When you create a Domain, behind the scenes, Skizze creates all four sketches of the same name. From then on, as you add values to that domain, they automatically get multiplexed into each Sketch that belongs to it.
Querying works the same as for other Sketches. You can query each kind of sketch directly, using the domain's name:
skizze> CREATE DOM users:20160301 100 100000
skizze> ADD DOM users:20160301 natalie paul robert robert natalie natalie
skizze> GET FREQ users:20160301 paul natalie robert
Value: paul Hits: 1
Value: natalie Hits: 3
Value: robert Hits: 2
skizze> GET MEMB users:20160301 paul natalie robert
Value: paul Member: true
Value: natalie Member: true
Value: robert Member: true
skizze> GET RANK users:20160301
Rank: 1 Value: natalie Hits: 3
Rank: 2 Value: robert Hits: 2
Rank: 3 Value: paul Hits: 1
skizze> GET CARD users:20160301
Cardinality: 3
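Conceptually, a Domain is a fan-out: one add call is forwarded to every sketch registered under the same name. The Python sketch below shows that shape. The class names are hypothetical, and the two "sketches" are exact stand-ins (a Counter and a set) used only to keep the example short. Skizze's real sketches are approximate.

```python
from collections import Counter


class ExactFrequency:
    """Stand-in for a Frequency sketch (exact, for illustration only)."""

    def __init__(self):
        self.counts = Counter()

    def add(self, value):
        self.counts[value] += 1


class ExactMembership:
    """Stand-in for a Membership sketch (exact, for illustration only)."""

    def __init__(self):
        self.members = set()

    def add(self, value):
        self.members.add(value)


class Domain:
    """Forwards every added value to each sketch that belongs to the domain."""

    def __init__(self, name, sketches):
        self.name = name
        self.sketches = sketches  # maps a kind such as "FREQ" to a sketch object

    def add(self, *values):
        for value in values:
            for sketch in self.sketches.values():
                sketch.add(value)


dom = Domain("users:20160301", {"FREQ": ExactFrequency(), "MEMB": ExactMembership()})
dom.add("natalie", "paul", "robert", "robert", "natalie", "natalie")
print(dom.sketches["FREQ"].counts["natalie"])     # 3
print("paul" in dom.sketches["MEMB"].members)     # True
```

The client sends each value once, and every sketch in the domain still sees it.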
Looking Forward
We have many things we want to fix, change, and add to Skizze. Some of the main things we'd like to do going forward are:
- Support merging Sketches for combined values (e.g. you might have a cardinality sketch per day for website users, but want the unique users across the last 30 days)
- More per-sketch configuration for controlling thresholds & error-rates to match your requirements
- Snapshotting - currently Skizze will replay its AOF at startup, but having snapshots will make restarts faster in the long run
- Provide release builds
- Split the core of Skizze into its own library for people to easily create other servers/tools that require similar functionality
- Support streaming for both additions as well as querying, using the gRPC features to the max
- Refactored CLI with both REPL (currently supported) and one-shot (like redis-cli) usage
- More clients! (Python is on its way!)
However, we're also excited to hear people's feedback and ideas. We'd rather build something that solves real-world problems than something that works great but is never used 😀. So please don't hesitate to open an issue, or join us for a chat in our #skizze channel on the Gophers Slack.