Cassandra Day London 2015

DataStax has done it again!!

So far I’ve attended Cassandra Day London 2014, Cassandra Summit 2014, several Cassandra meetups and today’s Cassandra Day London 2015, all organised by DataStax, and I can only admire them, both for the organisation itself (food, merchandising, sponsors, etc…) and, most importantly, for the quality of the contents they deliver. It’s fair to say that Cassandra would not be the same without DataStax.

But now let’s focus on what’s really important to us: the contents. I usually take notes on the important things I hear at conferences and then transcribe them here for further reading and sharing.

Cassandra resilience through a catastrophe’s post-mortem.

by @cjrolo

They lost slightly more than 50% of their data center, and their experience was that, after some tweaks and sleepless nights, Cassandra could still ingest all the data.

Their setup:

  • 1TB of writes per day
  • Three-node cluster
  • Write consistency: ONE
  • Read consistency: QUORUM

Their recommendations:

  • Five-node cluster (RF=3)
  • >1 Gbit links
  • SSDs
  • Avoid Batches and Counters.

They claim to have been using Cassandra since its pre-release days, and that particular catastrophe happened before DataStax had released OpsCenter at all, so I was curious to know how they were monitoring their cluster. They were using the bundled Graphite Reporter along with StatsD.

Using Cassandra in a microservices environment.

by @mck_sw

Mick’s talk was mostly about tools, particularly highlighting two:

  • Zipkin: A distributed tracing system developed by Twitter.
    • Useful for debugging, tracing and profiling distributed services.
  • Grafana: Open-source graphing dashboard.
    • Very useful because it integrates easily with tools like Graphite or Cyanite.

One of the most interesting parts was, once again, the emphasis on monitoring.

Lessons learnt building a data platform.

by @jcasals & @jimanning, from British Gas Connected Homes

They are building the Connected Homes product at British Gas, which is basically an IoT system that monitors temperature and boilers, with several benefits for the customers.

They receive data back from users every two minutes.

And the lessons are:

  • Spark has overhead, so make sure it’s worth using.
    • Basically, Spark takes advantage of parallelism and distribution across nodes, so if all computations are to be done on a single node then maybe you don’t need Spark.
  • Upsert data from different sources

Given this structure:

CREATE TABLE users (
  id int,
  name text,
  surname text,
  birthdate timestamp,
  PRIMARY KEY (id)
);

We can UPSERT like this:

INSERT INTO users (id, name, surname) VALUES (1, 'Carlos', 'Alonso');
INSERT INTO users (id, birthdate) VALUES (1, 1368438171000);

Resulting in a completed record.
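
If we then read the row back, both inserts appear merged into one record; a quick check in cqlsh:

SELECT * FROM users WHERE id = 1;
-- Returns a single row with id, name, surname AND birthdate all
-- populated: the two separate inserts were merged at the storage level.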

  • Tweak Spark jobs to avoid them killing Cassandra. Bear in mind that Spark is much more powerful than Cassandra and can kill its memory. Check the little comic below for more info 😉

Spark vs Cassandra Comic

  • Gain velocity by breaking the barrier between Data Scientists and Developers in your team.

Amaze yourself with this visualisation of London’s energy consumption that they showed!

London's Energy consumption

Cassandra at Hailo Cabs

by Chris Hoolihan, infrastructure engineer at Hailo

At Hailo Cabs they run Cassandra on Amazon AWS. In particular, they use:

  • m1.xlarge instances in development systems
  • c3.2xlarge instances in production systems
  • striped ephemeral disks
  • 3 availability zones per DC

Again, one of the most interesting parts was the monitoring. They showed several really interesting tools, some of them developed by themselves!

  • Grafana
  • ctop (top for Cassandra).
  • The Cassandra metrics graphite plugin.

And gocassa, a wrapper around the Go Cassandra driver that they developed themselves, basically to encourage best practices.

Finally, he gave one last piece of advice: don’t store too much data!!

Antipatterns

By @CHBATEY, Apache Cassandra evangelist at DataStax

This talk was simply awesome. It’s been a really long time since I last had to take notes so fast and concentrate so hard to avoid missing a word, and here they are!

Make sure every operation hits ONLY ONE SINGLE NODE.

Easy to explain, right? The more nodes involved, the more connections, and therefore the more time spent resolving your query.

Use Cassandra Cluster Manager.

This is a development tool for creating local Cassandra clusters. It can be found on GitHub.

Use query TRACING.

It’s the best way to profile how your queries perform.

  • Good queries trace short.
  • Bad queries trace long.
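
In cqlsh, tracing is a one-liner to switch on; a minimal sketch using the users table from above:

TRACING ON;
SELECT * FROM users WHERE id = 1;
-- cqlsh now prints the full trace: each internal step, the node
-- that executed it, and the elapsed microseconds.
TRACING OFF;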

Cassandra cannot join or aggregate, so denormalise.

You have to find the balance between denormalisation and too much duplication. Also bear in mind that User Defined Types are very useful when denormalising, as sketched below.
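
As a sketch of what that can look like (the address type and the users_by_city table are made up for illustration, not from the talk), a User Defined Type lets you embed the denormalised details directly in each row instead of joining:

CREATE TYPE address (
  street text,
  city text,
  post_code text
);

CREATE TABLE users_by_city (
  city text,
  id int,
  name text,
  home frozen<address>,  -- denormalised copy; no join needed at read time
  PRIMARY KEY (city, id)
);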

‘Bucketing’ is good for time series.

It can help you distribute load among the different nodes and also achieve the first principle here: “Make sure every operation hits only one single node”.

It is better to have several asynchronous ‘gets’ hitting only one node each than a single ‘get’ query that hits several nodes.
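
A minimal sketch of a bucketed time series table (the sensor_readings name and columns are hypothetical). The day column is the bucket and is part of the partition key, so each day’s readings for a sensor sit on a single partition:

CREATE TABLE sensor_readings (
  sensor_id int,
  day text,               -- the bucket, e.g. '2015-04-22'
  reading_time timestamp,
  value double,
  PRIMARY KEY ((sensor_id, day), reading_time)
);

-- Hits exactly one partition, and therefore one node (plus replicas):
SELECT reading_time, value FROM sensor_readings
  WHERE sensor_id = 1 AND day = '2015-04-22';

Fetching a week of data is then seven single-partition queries, one per bucket, which the driver can run asynchronously.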

Unlogged batches

Beware that these batches do not guarantee completion.

Unlogged batches can save on network hops, but the coordinator will be very busy processing the batch while the other nodes sit mostly idle. It’s better to run individual queries and let the driver load-balance them and manage the responses. Only if all parts of the batch are to be executed on the same partition is a batch a good choice.
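
A sketch of that one good case, reusing the sensor_readings table from above: every statement targets the same partition, so the coordinator resolves the whole batch as a single write:

BEGIN UNLOGGED BATCH
  INSERT INTO sensor_readings (sensor_id, day, reading_time, value)
    VALUES (1, '2015-04-22', '2015-04-22 10:00:00', 21.5);
  INSERT INTO sensor_readings (sensor_id, day, reading_time, value)
    VALUES (1, '2015-04-22', '2015-04-22 10:02:00', 21.7);
APPLY BATCH;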

Logged batches

These ones guarantee completion by first saving the batch to a dedicated batch log.

Logged batches are much slower (~30%) than their unlogged counterparts, so only use them if consistency is ABSOLUTELY mandatory.
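
Syntactically, a logged batch just drops the UNLOGGED keyword; a sketch reusing the users table:

BEGIN BATCH
  INSERT INTO users (id, name, surname) VALUES (2, 'Jane', 'Doe');
  INSERT INTO users (id, name, surname) VALUES (3, 'John', 'Doe');
APPLY BATCH;
-- The batch is written to the batch log first, so even if the
-- coordinator dies mid-way, every statement is eventually applied.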

Shared mutable data is dangerous in Cassandra too.

This always reminds me of this tweet with a very descriptive explanation of how dangerous it is 😉

There are two main ways to avoid it:

  • Upserting (explained above)
  • Event sourcing: basically just appending new data as it comes (see the sketch after this list).
    • As this doesn’t scale indefinitely, it’s good to combine it with some snapshotting technique (e.g. taking a snapshot every night in a batch job).
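
Here’s the promised sketch of event sourcing in CQL (the account_events table is hypothetical): every change is appended under a new timeuuid, so no row is ever mutated in place:

CREATE TABLE account_events (
  account_id int,
  event_id timeuuid,
  delta decimal,   -- the change itself, appended as it happens
  PRIMARY KEY (account_id, event_id)
);

-- Append-only writes: no shared mutable row to fight over.
INSERT INTO account_events (account_id, event_id, delta)
  VALUES (42, now(), 100.00);
INSERT INTO account_events (account_id, event_id, delta)
  VALUES (42, now(), -25.50);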

Cassandra does not rollback

So it’s pointless to retry failed inserts unless the write never reached the coordinator, because once it has reached the coordinator, the coordinator will store a hint and retry it later itself.

Don’t use Cassandra as a queue!!

Cassandra doesn’t actually delete data; instead it marks it as deleted with tombstones, and those records stay around for a while, which will affect reads.

TTLs also generate tombstones, so beware!! (unless you use DateTieredCompactionStrategy)
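
A TTL is set per write; a minimal sketch on the sensor_readings table from above:

-- Expire this reading after one day (86400 seconds). On expiry it
-- becomes a tombstone that lingers until compaction removes it.
INSERT INTO sensor_readings (sensor_id, day, reading_time, value)
  VALUES (1, '2015-04-22', '2015-04-22 10:04:00', 21.9)
  USING TTL 86400;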

Secondary Indexes

As Cassandra doesn’t know the cardinality of the indexed column, it stores the index in local tables.

These local tables exist on every node and only contain references to the data held on that same node.

Therefore, a query that uses them will have to run on all the nodes.
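
A sketch on the users table: the index lives locally on each node, so the SELECT below has to fan out to the whole cluster:

CREATE INDEX users_by_surname ON users (surname);

-- No partition key in the WHERE clause, so every node must consult
-- its local index: acceptable for rare queries, an antipattern on
-- any hot path.
SELECT * FROM users WHERE surname = 'Alonso';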

You can see slides of this last talk here: http://www.slideshare.net/chbatey/webinar-cassandra-antipatterns-45996021

And that was it!! Amazing, huh?
