04 - Cassandra

Introduction

In this lab we are going to have a closer look at Cassandra, a database engine which was designed for high availability and scalability. To achieve these goals the databases on the cassandra engine are distributed on any desired nodes. The Cassandra engine focuses upon a decentralized architecture. We build a simple cluster, create a table and modify data by using the Cassandra Query Language (CQL).

Further on we will have a quick look at the cqlsh, establish a connection to our Cassandra cluster and run simple read and write queries against the Cassandra database. We then examine an application architecture with whose we are able to handle request storms, in our case from an MQTT broker and write down data to a replicated database like Cassandra.

Prerequisites

  1. Hardware with following key specifications

    • >= 8 GB of RAM recommended

    • 4 Cores / 8 CPUs (Hyperthreading) recommended

    • Virtualization supported by the CPU (VT-x/AMD-V) would be a big plus.

    • At least 32 GiB of free disk space to place the virtual disk image of the guest machine.

  2. Virtualization environment (VMWare Player / VirtualBox)
    (Lab was developed using VirtualBox https://www.virtualbox.org/)

  3. Virtual harddisk image (hostname: iot-lab-vsrvr) provided by this lab.

    • Get it from inside of the lab subnet (net: 172.16.32.0/24)

  4. At least 1 Raspberry Pi to submit data to a MQTT broker

    • raspbian operating system

Getting started

Environment overview

infrastructure_cassandra_nodes.png

Alternate environment setup

If you are outside of the IoT subnet (172.16.32.0/24), and dont’t have access to the lab broker, you can start the mosquitto broker on the iot-lab-vsrvr by invoking the command: sudo service mosquitto start. Remember that in this case you have to change the virtual machine’s network adapter to a “Bridged Adapter”, so that <your_own_laptop>, the iot-lab-vsrvr and your Raspberry Pi’s reside in the same subnet.

Setting up the guest machine “iot-lab-vsrvr”

Get the virtual hard disk image from inside of the lab subnet (172.16.32.0/24) via:

ftp://jupiter.lab.iot/shared/IoT/p4

FTP-User: iotro

FTP-Password: iotro

You can download the image from outside the ZHAW LAN as well. When connected to the ZHAW VPN use the following FTP download site with the same credentials as above

ftp://jupiter-win.zhaw.ch/shared/IoT/p4

Uncompress the virtual harddisk image … this may take a little, time for a coffee break.

Set up a virtual machine on your virtualization environment and use the virtual disk image, you just downloaded (it is a 32 bit debian system). Recommended are the following settings:

  • 4 GiB of RAM

  • 2 or 4 CPUs (Just one single core will not work)

  • Enable PAE/NX (on VirtualBox)

  • Enable VT-x/AMD-V

  • Network adapter in NAT mode
    (or a bridged adpter if you use the alternate environment setup)

Start the virtual machine. On the virtual machine with the hostname iot-lab-vsrvr you can operate with these credentials:

Username

Password

is sudoer?

iotusr

iotlab

yes

root

iotlab

yes (root user)


Apache Cassandra

In this section we are going to set up a Cassandra cluster with a very trivial configuration, on two virtual nodes which run on the iot-lab-vsrvr, cassandra-node-0 and cassandra-node-1. Later on we use the cqlsh to get in touch with the CQL language and run simple queries against the distributed database.

Introduction

Architecture Overview

Some key features of a Cassandra database system are listed below:

  • Cassandra was designed with the understanding that system/hardware failures can and do occur

  • Peer-to-peer, distributed system

  • All nodes the same (No master-slave architecture or anything similar)

  • Data partitioned among all nodes in the cluster

  • Custom data replication to ensure fault tolerance

  • Read-/Write-anywhere design

  • Each node communicates with each other through the Gossip protocol, which exchanges information across the cluster every second

  • A commit log is used on each node to capture write activity. Data durability is assured.

  • Data also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SStable)

  • A keyspace is akin to a database in the RDBMS table but is more flexible/dynamic

  • 6A row in a column family is indexed by its key. Other columns may be indexed as well.

Source: http://www.slideshare.net/DataStax/an-overview-of-apache-cassandra

In our installation we are going to use a basic setup and only two nodes to replicate our databases:

Which means that we can run queries against arbitrary nodes in the Cassandra cluster and all the changes are replicated all over the cluster. As you can see, Cassandra builds a ring structure including all its available nodes.

Snitches

A snitch determines which data centers and racks are written to and read from. Snitches inform Cassandra about the network topology so that requests are routed efficiently and allows Cassandra to distribute replicas by grouping machines into data centers and racks. All nodes must have exactly the same snitch configuration. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical location).

Note: If you change the snitch after data is inserted into the cluster, you must run a full repair, since the snitch affects where replicas are placed.

Snitch types Cassandra supports:

  • Dynamic snitching
    Monitors the performance of reads from the various replicas and chooses the best replica based on this history.

  • SimpleSnitch
    The SimpleSnitch is used only for single-data center deployments.

  • RackInferringSnitch
    The RackInferringSnitch determines the location of nodes by rack and data center.

  • PropertyFileSnitch
    Determines the location of nodes by rack and data center.

  • GossipingPropertyFileSnitch
    Automatically updates all nodes using gossip when adding new nodes and is recommended for production.

  • Ec2Snitch
    Use with Amazon EC2 in a single region.

  • EC2MultiRegionSnitch
    Use this snitch for deployments on Amazon EC2 where the cluster spans multiple regions.

Source: http://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architectureSnitchesAbout_c.html

In our simple architecture with only 2 nodes in 1 “datacenter” we use a SimpleSnitch as endpoint snitch, which is sufficient for this purpose. In the configuration file this is the default and doesn’t have to be changed.

If you want to go in deep an explore all the configuration possibilites and setup concepts of a Cassandra cluster e.g. a distributed cluster over many datacenters, etc. you can do so by consulting the Datastax Documentations: http://docs.datastax.com/en/index.html

Installation

To install the cassandra cluster we operate as the user iotusr on the “iot-lab-vsrvr”.

As you have started your iot-lab-vsrvr in a previous section, one can open a terminal and cd in the labs directory: cd /home/iotusr/IoT/p4

We first have to create a virtual switch in NAT mode to build our own subnet which has also an uplink to the outside world.

To get the environment on the iot-lab-vsrvr like pictured above invoke the bash script:

sudo ./start_cassandra_env.bash

To control the virtual cassandra nodes, you can use the virsh tool and the following commands:

Command

Descripütion

sudo virsh list

list machines and their state

sudo virsh create <machine_xml_file>

create and start a virtual machine specified in <machine_xml_file>

sudo virsh reboot

reboot specified machine by its machine id

sudo virsh shutdown

shutdown specified machine by its machine id

visrsh provides many more options. For a verbose description of all of them, consult the manual pages: man virsh

Now that the Cassandra nodes are running try to login from the iot-lab-vsrvr:

Credentials:

Hostname

IP address

Username

Password

cassandra-node-0

192.168.122.10

root

iotlab

cassandra-node-1

192.168.122.20

root

iotlab

ssh <Username>@<IP address>

Due to simplicity we always act as root user on the Cassandra database nodes in this lab. Never do this in a productive environment. For example you can create a specific application user.

On each Cassandra node the Cassandra application package is already downloaded. To build a simple cluster with our two nodes, we have to define a seed node and another normal node.

Let’s start with the configuration of the seed node:

Login via ssh to the Cassandra node with the IP address 192.168.122.10, this is our seed node:

cd /usr/local/bin
tar xvf apache-cassandra-2.1.11-bin.tar.gz
cd apache-cassandra-2.1.11/conf
vim cassandra.yaml

Now we can edit the configuration file:

starting characters of line

actual value

new value

cluster_name

‘Test Cluster’

‘iotLabCassandraCluster’

- seeds

“127.0.0.1”

“192.168.122.10”

listen_address

localhost

192.168.122.10

rpc_address

localhost

192.168.122.10

Where:

  • rpc_address is the address on which Cassandra listens to the client calls like CQL queries.

  • listen_address is the address on which Cassandra listens to the other Cassandra nodes in the cluster.

In productive environments it is an optimization to have different ineterfaces/subnets for syncronizing the cassandra nodes and process CQL queries from applications or db admins.

Now let things roll:

cd ..
bin/cassandra

The node should now start up. This may take a while. Wait until you see a message on your console like:

INFO  19:01:40 No gossip backlog; proceeding

The seed node is now started. Now open another terminal on iot-lab-vsrv and repeat the whole configuration procedure for the other Cassandra node with the IP address: 192.168.122.20. Keep in mind to use the correct IP addresses and specify the right seed node (192.168.122.10).

If everything worked flawlessly, open two other terminals on “iot-lab-vsrv” and connect to the two cassandra nodes as root via ssh and type:

/usr/local/bin/apache-cassandra-2.1.11/bin/nodetool status

A sample output would be:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.122.10  51.65 KB   256     100.0%            52ae0bff-9b6a-4020-b78a-e6175f6aa8f8  rack1
UN  192.168.122.20  66.06 KB   256     100.0%            f78bafe4-558e-473e-8137-2165337dda39  rack1

UN stands for Up Normal which means your node runs as expected and no serious errors were reported during the startup.

Get in touch with the cqlsh and CQL

To run database queries we use the CQL shell (cqlsh), which comes with the Cassandra installation package. The sample queries we run from the host iot-lab-vsrv.

Open a terminal on the iot-lab-vsrv and make sure that you are in your ~ directory (/home/iotusr).

Start the cqlsh by typing: ./cqlsh -C 192.168.122.[12]0 9042

It doesn’t matter if you connect to node 0 or node 1.

(To get more information about the cqlsh, type ./cqlsh -help)

As you may have noticed, we connected to the database without any authentication. Again, NEVER EVER do this in productive environments.

If you want to harden your Cassandra cluster by enabling and configuring security settings / adding users / alter passwords / … you may find a good entry point to this topic at: http://www.datastax.com/dev/blog/a-quick-tour-of-internal-authentication-and-authorization-security-in-datastax-enterprise-and-apache-cassandra

Now lets run a few simple queries against our Cassandra cluster from the cqlsh. In most cases we call pre-written scripts. These are located in the ~ directory of the user iotusr on the iot-lab-vsrvr.

If you want to use further CQL statements e.g. in your application later on you can always consult an up to date CQL documentation at: http://docs.datastax.com/en/cql/3.1/cql/cql_intro_c.html

Keyspaces

Keyspaces are analogous to databases as you know it from classic relational database engines.

Lets create one, named firstiot. In the cqlsh type: SOURCE 'create_keyspace.cql' and hit enter.

This will create a keyspace with the replication factor 2. The replication factor you define for a keyspace should never exceed the amount of nodes you run in a cluster. We use a SimpleStrategy for replication which is sufficient for a single datacenter. Study the script ‘create_keyspace.cql’.

Table creation

To create our first table in the keyspace “firstiot” you can run the script by typing the following in the cqlsh: SOURCE 'create_table.cql'

Study the script ‘create_table.cql’.

To work on the keyspace firstiot type: USE firstiot;

Let’s see if our table has been created as expected:

DESCRIBE TABLES;
DESCRIBE TABLE sen_logging;

Inserting data

To create our first dataset in our table sen_logging you can run the script by typing the following in the cqlsh: SOURCE 'insert_data.cql'

Study the script ‘insert_data.cql’.

Unfortunately our version of the cqlsh does not support UTF-8 characters with a width of more than one byte, when passing queries via a cql file. As you can see, we are able to smash this bug by the use of the function blobAsText() and pass a binary representation of the UTF-8 text string we want to insert into the specific varchar column.

To check if our data record was inserted correctly, we can perform a select statement on our table:

USE firstiot;
SELECT * from sen_logging;

Deleting data

To delete the previously inserted data record you can invoke the following commands in the cqlsh:

USE firstiot;
DELETE from sen_logging WHERE client_id = '<client_id>';

Check if the table is blank again:

SELECT * from sen_logging;

Store the data provided by the Raspberry Pi’s

In this section we are going to modify the the well known application from the previous lab (MQTT) to publish data via a broker to an apache storm application, where the data is being written down to our cassandra cluster.

Provide data from the Raspberry Pi

To generate some data that can be written down to our Cassandra cluster, let’s add a RandomSpammer class which does excatly this task.

Download the MQTT_thing-rg-0.2-dbg.zip and study the code. The thread of this class should generate any data and publish it to a broker, just the way the TimeResource did in lab 3.

The data should be published as message in SenML. To check if the random generator works correctly launch the application on your Raspberry Pi and subscribe to the given topic with a console tool, e.g.:

mosquitto_sub -h 172.16.32.5 -t iot/labs/04/randomspammer -v

You may have noticed that all Raspberry Pis publish to the same topic. To distinguish which device published a message the hardware address is published within the payload too (key: bn). Be sure, the hardware address of your Raspberry Pi appears in the subscribed messages.

Sensor Markup Language – SenML

In short, the sensor markup language specifies, how recorded data from a sensor can be structured. The whole specification is available at: https://tools.ietf.org/html/draft-jennings-core-senml-02

To format the messages we use the JSON representation of SenML. The base object variables have the following structure:

+------------+------+--------+
|      SenML | JSON | Type   |
+------------+------+--------+
|  Base Name | bn   | String |
|  Base Time | bt   | Number |
| Base Units | bu   | Number |
|    Version | ver  | Number |
+------------+------+--------+

Further on there are measurement or parameter entries which represent as follows:

  +---------------+------+----------------+
  |         SenML | JSON | Notes          |
  +---------------+------+----------------+
  |          Name | n    | String         |
  |         Units | u    | String         |
  |         Value | v    | Floating point |
  |  String Value | sv   | String         |
  | Boolean Value | bv   | Boolean        |
  |     Value Sum | s    | Floating point |
  |          Time | t    | Number         |
  |   Update Time | ut   | Number         |
  +---------------+------+----------------+

An example about how the random data could be formatted:

{
   "bn": "b8:27:eb:f5:bf:53",
   "e": [{"t": "1477672722000", "n": "iot/labs/04/randomspammer", "u": "C",
          "sv": "520"}]
}

To deal with JSON, we use the json module which comes with the defalut libraries of a python interpreter installation.

A basic Code snippet from the json module, which creates a json structure like the pictured above:

    '''
    Mapping Python/JSON "Types"
    
    JSON              Python
    ============================
    object            dict
    array             list
    string            str
    number (int)      int
    number (real)     float
    true              True
    false             False
    null              None
    '''
    
    random_data = { }
    sensor_list = [ ]
    measurement = { }
    
    random_data[ 'sv' ] = value
    random_data[ 't' ]  = timestamp
    random_data[ 'u' ]  = unit
    random_data[ 'n' ]  = self.pub_topic

    sensor_list.append( random_data )

    measurement[ 'e' ]  = sensor_list
    measurement[ 'bn' ] = self.client_id

    return json.dumps( measurement, sort_keys = False )

Process received data with Apache Storm

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Source: http://storm.apache.org/

In this last section of the lab, we do have a quick look at the Apache Storm framework. We use it to write the data published by the Raspberry Pi’s to our previously set up Cassandra cluster.

Architecture Overview

On the iot-lab-vsrvr there is a version of the JDK and eclipse installed (icon on the desktop). Open it and choose the pre-selected workspace. Study the application iot_lab4_mqtt-storm. It’s implemented like pictured above. We assume, our Randomspammer delivers temperature data in different units (°C, °F, °K).

Therefore we set up 1 temperature spout, which has 4 bolts available to process data. How the bolts are chained to achieve the workflow pictured above is visible in the class TemperaturSpout. You have also the possibility to define multiple worker threads for each bolt, e.g. if you expect much more temperature data whose temperature unit is °K.

builder.setBolt( "temperatureKelvinBolt", 
                 new TemperatureKelvinBolt(), 1 )
               .shuffleGrouping( "temperature" );

In our sample application we use 1 worker thread for each bolt, which is sufficient for this topic. You can now start the application iot_lab4_mqtt-storm: Right click on the MQTTTopology class > Run As > 2 Java Application. If you want just your own random messages change the topic name to an individual name, other than “iot/labs/04/randomspammer”.

Deploy the sample RandomSpammer (MQTT_thing-rg-0.2-dbg.zip) project to your Raspberry Pi and start it. If you want just your own random messages change the topic name to an individual name, other than iot/labs/04/randomspammer.

We have now a running environment with the following components:

  • Cassandra database cluster on cassandra-node-0 and cassandra-node-1

  • At least 1 Raspberry Pi which acts as random data publisher

  • MQTT broker

  • Apache Storm powered application which acts as subscriber and writes the published data to our keyspace firstiot into the table sen_logging.

To check if the random sensor data from the Raspberry Pi RandomSpammer are written correctly to the Cassandra cluster open a cqlsh on the machine iot-lab-vsrvr and run some select statements against the database, e.g.:

./cqlsh -C 192.168.122.10 9042
SELECT * FROM firstiot.sen_logging;

If you still have time and energy, customize the applications and try to log data from components of your choice that you have attached to your grovePi board.

Grading

4 points maximum will be awarded for completion of the exercises.