Cassandra Database Literature review Example | Topics and Well Written Essays

Cassandra Database

ABSTRACT

Cassandra is an open source database system designed to store and manage large quantities of data across many commodity servers. It serves both as a real-time data store for online applications and as a read-intensive database for business intelligence systems. The database was originally developed at Facebook and is built around symmetric peer-to-peer nodes. It automatically partitions data across the nodes of a database cluster. After Facebook open-sourced the code, Cassandra became an Apache Incubator project. This paper presents general information about the Cassandra database and then discusses its data storage, query format, and query processing.

Categories and Subject Descriptors
The paper begins with an introduction, followed by general information about the Cassandra database, Cassandra data storage, query format, and query processing.

General Terms
Query Format, Query Processing, Cassandra Database, Cassandra Query Language.

Keywords
Cassandra Database, Cassandra Query Language.

1. INTRODUCTION

Cassandra is a widely used open source NoSQL database. It is best suited to managing large quantities of data across multiple data centers and the cloud. It offers continuous availability, operational simplicity, and linear scalability across many servers, with no single point of failure. Additionally, its data model offers flexibility and fast response times.
Cassandra has a distinctive design and architecture in which all nodes are the same. The database automatically distributes data across the nodes that participate in a database cluster.

2. GENERAL INFORMATION ABOUT CASSANDRA DATABASE

Because data is partitioned transparently across the nodes, administrators and developers do not need to write code to distribute it across the cluster. The database also provides customizable replication, storing redundant copies of data across the nodes that participate in a Cassandra ring. If a node goes down, one or more copies of its data remain available on other machines in the cluster. Replication can be configured to work across the zones of a single data center, across multiple data centers, and across multiple cloud zones. Cassandra scales linearly, meaning that capacity can be added simply by adding new nodes. For instance, if two nodes can handle 100,000 operations per second, four nodes can handle 200,000 operations per second, and eight nodes can handle 400,000 operations per second.

2.1 Latest Version

At the time of writing, the latest version of the database is Cassandra 2.1. Its new features include user-defined types, collection indexes, and improved metrics through the metrics-core library. Performance improvements include an improved row cache, faster reads and writes, reduced heap usage, and a new counters implementation. Repair and compaction improvements include incremental node repair and better post-compaction read performance. Other notable changes include unique table IDs, bundled JNA, improved logging, and new configuration options.

2.2 Cassandra Query Language

The primary and default interface to the database is the Cassandra Query Language (CQL). Using CQL is similar to using SQL.
The two languages share the same abstract idea of a table made up of rows. Unlike SQL, CQL does not support joins or subqueries, except for batch analysis through Hive. Instead, Cassandra emphasizes denormalization through CQL features such as collections and clustering specified at the schema level. CQL is the preferred way of interacting with the database.

3. FEATURES OF THE DATABASE SYSTEM

3.1 Data Storage

3.1.1 Storage Engine

Cassandra uses a storage structure similar to a Log-Structured Merge Tree. Unlike a conventional relational database, it avoids reading before writing. In a large distributed system, read-before-write can cause stalls in read performance and other problems. To avoid it, the storage engine groups the inserts and updates to be made and sequentially writes only the updated parts of a row in append mode. Cassandra never rereads or rewrites existing data, and never overwrites rows in place.

3.1.2 Separate table directories

Cassandra offers fine-grained control over table storage on disk by writing each table to its own directory. On tarball installations, data files are stored under the installation directory using the following directory and file naming format:

    /data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/ks1-cf1-hc-1-Data.db

On packaged installations, data files are stored in the same format under the default location /var/lib/cassandra/data. Cassandra creates a subdirectory for each table, which makes it possible to symlink a table to a chosen physical drive or data volume. This allows an active table to be moved to faster media, such as an SSD.
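The directory layout in Section 3.1.2 can be illustrated with a minimal Python sketch that splits the example path into its keyspace, table name, and table ID. The function name describe_sstable_path is hypothetical, and the sketch assumes only what the example path shows: a keyspace directory that contains a table directory named with the table name, a hyphen, and an ID. It does not interpret the parts of the data file name.

```python
from pathlib import PurePosixPath


def describe_sstable_path(path: str) -> dict:
    """Split a data file path into keyspace, table, and table ID.

    Assumes the layout <data dir>/<keyspace>/<table>-<id>/<file>,
    as in the example path from Section 3.1.2.
    """
    p = PurePosixPath(path)
    table_dir = p.parent.name        # e.g. "cf1-5be3..."
    keyspace = p.parent.parent.name  # e.g. "ks1"
    # The table name and the ID are separated by the first hyphen.
    table, _, table_id = table_dir.partition("-")
    return {
        "keyspace": keyspace,
        "table": table,
        "table_id": table_id,
        "table_dir": str(p.parent),  # the directory one would symlink
        "file": p.name,
    }


info = describe_sstable_path(
    "/data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/ks1-cf1-hc-1-Data.db"
)
print(info["keyspace"], info["table"], info["table_dir"])
```

The table_dir entry is the per-table directory that could be symlinked to a faster volume.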
3.1.3 Compaction write path

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with compaction. The stages are logging data in the commit log, writing data to the memtable, flushing data from the memtable, storing data on disk in SSTables, and compaction.

When a write occurs, Cassandra stores the data in the memtable. It also appends the write to the commit log on disk, which provides configurable durability. The commit log receives every write made to a Cassandra node. The memtable stores writes until a limit is reached and is then flushed. When the contents of a memtable exceed a configurable threshold, the memtable data is placed in a queue to be flushed to disk. The queue length can be configured by changing the memtable_flush_queue_size setting in cassandra.yaml. If the data to be flushed exceeds the queue size, Cassandra blocks writes until the next flush succeeds. Flushing sorts the memtable by token and then writes the data to disk sequentially.

Data in the commit log is purged after the corresponding data in the memtable has been flushed to an SSTable. Memtables and SSTables are maintained per table. SSTables are immutable, so a partition is typically stored across multiple SSTable files. For each SSTable, Cassandra creates a partition index, a partition summary, and a Bloom filter. In the memtable, data is organized and stored in sorted order. For efficiency, Cassandra does not repeat column names in memory.
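The write path described above can be sketched as a toy in-memory model. This is an illustration under stated assumptions, not Cassandra's implementation: the class ToyTable and its attribute names are hypothetical, the flush threshold counts cells rather than bytes, the flush queue is omitted, and rows are sorted by key rather than by token.

```python
class ToyTable:
    """Toy model of the write path: commit log, memtable, SSTables."""

    def __init__(self, flush_threshold=100):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # partition key -> {column: value}
        self.sstables = []     # flushed, immutable snapshots
        self.flush_threshold = flush_threshold  # assumption: cell count

    def write(self, key, columns):
        # 1. Log the write, in arrival order.
        self.commit_log.append((key, dict(columns)))
        # 2. Merge the write into the memtable; newer values win.
        self.memtable.setdefault(key, {}).update(columns)
        # 3. Flush once the memtable grows past the threshold.
        cells = sum(len(cols) for cols in self.memtable.values())
        if cells >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out in sorted order as an immutable tuple.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        # The memtable is emptied and the matching log entries are purged.
        self.memtable.clear()
        self.commit_log.clear()


table = ToyTable()
table.write("k1", {"c1": "v1"})
table.write("k2", {"c1": "v1", "c2": "v2"})
table.write("k1", {"c1": "v4", "c3": "v3", "c2": "v2"})
print(len(table.commit_log), table.memtable["k1"])
table.flush()
print(table.sstables[0])
```

The commit log keeps all three writes in arrival order, while the memtable keeps only the latest value for each column of a key.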
As an example, suppose the following writes arrive in order:

    Write (k1, c1:v1)
    Write (k2, c1:v1 c2:v2)
    Write (k1, c1:v4 c3:v3 c2:v2)

After these writes are received, Cassandra holds the following in the memtable:

    k1 c1:v4 c2:v2 c3:v3
    k2 c1:v1 c2:v2

In the commit log on disk, Cassandra stores the writes in the order received:

    k1, c1:v1
    k2, c1:v1 c2:v2
    k1, c1:v4 c3:v3 c2:v2

In the SSTable on disk, after the memtable is flushed, Cassandra stores:

    k1 c1:v4 c2:v2 c3:v3
    k2 c1:v1 c2:v2

4. QUERY FORMAT

The Cassandra data model is a partitioned row store with tunable consistency. Rows are organized into tables. The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key. Tables can be created, dropped, and altered at runtime without blocking updates and queries.

A compound primary key includes the partition key, which determines which node stores the data, and one or more additional columns that determine per-partition clustering. Cassandra uses the first column named in the primary key definition as the partition key. For instance, in a music playlists table, the id column is the partition key and the remaining key columns are the clustering columns. The data in each partition is clustered by those columns.

On a physical node, when the rows for a partition key are stored in order of the clustering columns, retrieving them is very efficient. For instance, because the id in the playlists table is the partition key, all the songs of a playlist are stored together in clustering-column order. Insert, update, and delete operations on rows that share the same partition key in a table are performed atomically and in isolation. A query can read a single sequential set of data on disk to find the songs for a given playlist.
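The roles of the partition key and the clustering columns can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the node list, the song_order clustering column, and the use of an MD5 hash with a simple modulo to pick a node. Cassandra's own partitioner and token ring work differently.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key: str) -> str:
    """Map a partition key to a node (toy placement: hash modulo node count)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


class ToyPlaylists:
    """Rows grouped by partition key and kept sorted by a clustering column."""

    def __init__(self):
        self.partitions = {}  # playlist id -> sorted list of (song_order, title)

    def insert(self, playlist_id, song_order, title):
        rows = self.partitions.setdefault(playlist_id, [])
        bisect.insort(rows, (song_order, title))  # keep clustering order

    def songs(self, playlist_id):
        # One partition holds the whole playlist, already in order.
        return [title for _, title in self.partitions.get(playlist_id, [])]


playlists = ToyPlaylists()
playlists.insert("p1", 3, "third song")
playlists.insert("p1", 1, "first song")
playlists.insert("p2", 1, "other playlist")
playlists.insert("p1", 2, "second song")
print(node_for("p1"), playlists.songs("p1"))
```

Because all rows of a playlist live in one sorted partition, reading a playlist is a single sequential scan in this model.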
In Cassandra, one can also find the rows whose collections contain a given value. For instance, to find songs that carry a given tag, the tags set and the venue map need to be indexed. The indexes used to query for values in the tags set and the venue map are created as follows:

    CREATE INDEX ON playlists (tags);
    CREATE INDEX mymapvalues ON playlists (venue);

Specifying an index name, such as mymapvalues, is optional. Selecting the tags set returns the whole set of tags for each row:

    SELECT album, tags FROM playlists;

Internally, CQL does not change the row and column mapping used by the Thrift API. Thrift and CQL use the same storage engine, and CQL supports the same query-driven, denormalized data modeling principles as Thrift. Existing applications do not need to be upgraded to CQL. For new applications, the CQL abstraction layer makes querying easier.

5. QUERY PROCESSING

Using CQL, one can easily query a legacy table. A legacy table managed in CQL carries an implicit WITH COMPACT STORAGE directive. When CQL is used to query a legacy table that has no column names defined for the data within a partition, Cassandra generates default names (column1 and value1) for the data. The RENAME clause can be used to change a default column name to something more meaningful, for example:

    ALTER TABLE users RENAME userid to user_id;

CQL supports dynamic tables created in the CLI, the Thrift API, and earlier versions of CQL, and such tables can be queried directly. For newer applications, CQL offers an API to Cassandra that is simpler than Thrift. It adds an abstraction layer that hides the implementation details of the underlying structure and offers native syntax for collections and other common encodings.

5.1 Accessing Query

There are several common ways to access CQL. The first is to start cqlsh, the Python-based command-line client, from the command line of a Cassandra node.
Another option is DataStax DevCenter, which provides a graphical user interface. To develop an application, one can use the Python, Java, or other open source drivers. Finally, the set_cql_version Thrift method can be used for programmatic access.

5.2 Updating and creating a keyspace

Creating a keyspace defines how data is replicated on nodes. Normally, a cluster has one keyspace per application. Replication is controlled on a per-keyspace basis, so data with different replication requirements resides in different keyspaces. Keyspaces are not designed to serve as a significant mapping layer within the data model; they are designed to control data replication for a set of tables.

When a keyspace is created, a strategy class must be specified for replicating it. The SimpleStrategy class is suitable for evaluating Cassandra. To use NetworkTopologyStrategy for evaluation, for example on a single-node cluster, specify the name of the default data center, which can be found with the nodetool status command.

5.3 Replication Factor

Increasing the replication factor increases the total number of copies of keyspace data stored in a Cassandra cluster. When security features are in use, it is particularly important to raise the replication factor of the system_auth keyspace above the default, because if the node holding the only replica goes down, nobody can log in to the cluster. The procedure is to update the keyspace in the cluster and change its replication strategy options:

    ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};

Then, on each affected node, run the nodetool repair command. Wait until the repair completes on one node before moving to the next.

To create a table with a single-column primary key, use the PRIMARY KEY keywords followed by the name of the key, enclosed in parentheses.
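The effect of the replication factor discussed in Section 5.3 can be illustrated with a short sketch. The helper names are hypothetical. The placement rule shown (the first replica on a primary node, further replicas on the next nodes clockwise around the ring) mirrors how SimpleStrategy is usually described and ignores data centers and racks, so it should be read as a simplified model.

```python
def simple_strategy_replicas(ring, primary_index, replication_factor):
    """Pick replicas by walking clockwise around the ring from the primary node."""
    count = min(replication_factor, len(ring))
    return [ring[(primary_index + i) % len(ring)] for i in range(count)]


def total_copies(replication_options):
    """Sum the per-data-center replica counts of a replication map."""
    return sum(n for name, n in replication_options.items() if name != "class")


def still_readable(replicas, down_nodes):
    """Data stays reachable while at least one replica node is up."""
    return any(node not in down_nodes for node in replicas)


ring = ["n1", "n2", "n3", "n4"]  # hypothetical four-node ring
one_copy = simple_strategy_replicas(ring, 3, 1)
three_copies = simple_strategy_replicas(ring, 3, 3)
print(one_copy, three_copies)
print(still_readable(one_copy, {"n4"}), still_readable(three_copies, {"n4"}))
print(total_copies({"class": "NetworkTopologyStrategy", "dc1": 3, "dc2": 2}))
```

With a single replica, losing that one node makes the data unreachable, which is the situation Section 5.3 warns about for system_auth.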
