NoSQL databases: Overview

The article provides a brief overview of NoSQL databases.

Relational databases – Success factors and drawbacks

Relational databases aka RDBMS have been popular for 20+ years for the following reasons

  • Most popular choice for persistence
  • Great support for Concurrency through solid transactional support
  • Standard relational model (SQL standards)
  • Served as Integration database for various disparate applications

Relational databases have the following drawbacks

Impedance mismatch – Application in-memory data structures did not match relational model causing difficulty for developer in spite of the availability of Object Relational Mapping (ORM) tools like Hibernate
Tight enforcement of data model through schemas reducing agility of developers
Large growth in data demanded running the applications in clusters, which RDBMS are not good at.

NoSQL movement

The following factors influenced the NoSQL movement

21st century websites like Amazon/Google had a huge demand for processing large data in clusters in commodity hardware
Application integration shifted from databases to services which decoupled the databases from various application integration needs opening up various options
Running commercial RDBMS in clusters were prohibitively expensive, and open source RDBMS did not mature to support the clustering and high volume data processing.

Characteristics of NoSQL databases

  • Running well in clusters
  • Open source
  • Not using relational model
  • Used by 21st century web estates
  • Schemaless

The NoSQL movement created a phenomenon of Polyglot persistence meaning organizations will choose multiple databases to meet different persistence needs.

Categories of NoSQL databases

There are two major categories of NoSQL databases

Aggregate oriented databases: An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries for ACID operations with the database. There are 3 categories of aggregate oriented databases
Key-value: Data (aggregate) is stored as a value with a key. The data is opaque to the database and be queried using the key only. Some key/value databases provide some kind of metadata added to aggregate to enable indexing.
Document: The aggregate is stored as a Document typically with a key inside the document. The document is visible to the database so that the queries can be based on the contents of document.
Column family: Using a row-key, a group of columns are associated. Each row key can have different sets of columns known as column families. It can be considered to be a two-level aggregate structure with with row-key as first identifier.
Graph databases: Graph database is one that uses nodes, edges and properties to represent and store data. Every element has a direct pointer to adjacent element and no indexing is necessary.

Distribution Models in NoSQL databases

Sharding

Sharding is a technique where you put different parts of data in different servers. Sharding can improve both read and write performances. Sharding can be done from application side or by suing database provider specific auto-sharding techniques.

Replication

By replication, the whole data is replicated across multiple servers. Replication can be done in two ways.

Master-Slave

Replicate data across multiple servers, but only one server is master or primary. Master is the authoritative source for data and is responsible for all writes. Slaves can serve read requests and improve read performance greatly. Also when master fails, a slave can become master automatically or manually.

Peer-Peer

I Peer-Peer replication, all nodes are equal and accept writes. Loss to anyone of them does not prevent access to datastore.

Key points to be considered while employing Distribution techniques

Master slave replication reduces update conflicts, peer-peer avoids single point of a write failure. Not all the database providers support peer-peer.
Sharding alone reduces availability, it is typically used along with replication
Consistency issues must be addressed when using any of the distribution techniques.
To resolve update conflicts, use Version stamps. In Peer-Peer replication, Vectpr of version stamps will be needed when different nodes have conflicting updates.

CAP Theorem or Brewer’s conjecture and NoSQL databases

 CAP theorem states that a distributed system cannot guarantee all of the following 3 characteristics.

  • (Atomic)Consistency
  • Availability
  • Partition tolerance

 Let us go over what these characteristics mean in a distributed system

 Consistency

 There must exist a total order such that each operation looks as if it were computed at a single instant. Any read operation that begins after a write operation myst return that value. This consistency is a combination of Atomicity and Consistency in the context of ACID properties in the traditional RDBMS context.

 Availability

 Every request received by a non-failing node must return a response. This means that a failed node does not mean not Available. A running node which cannot return response (for e.g. because it could not write because a write concern could not be met) is not considered Available in this context.

 Partition Tolerance

 When a network is partitioned, messages sent from one node to another can be lost. Availability and Consistency are qualified by the need to tolerate partitions. It means

  • Every response must be consistent (atomic consistent) even though messages could be lost
  • Every non-failed node must return a valid response even though other nodes may not be reachable during network partitioning. Nothing less than the total network failure should cause the system not to be able to provide a valid response.

 Examples

 CA system

 A single server system is an example of CA system. It can’t partition, hence no need to worry about partition tolerance. It is available just being one node. Consistency can be satisfied without much difficulty. Most relational databases fall into this category

 CP system

 A CP system compromises availability to achieve partition tolerance and consistency. For example MongoDB when safe=true is set, Consistency is guaranteed. But availability is sacrificed meaning if network partition happens, response may not be guaranteed,

 AP system

 A CA system like Cassandra still achieves high availability during network partition. It uses eventual consistency as a trade-off.

The below diagram provide visual representation of where NoSQL databases fit in the CAP landscape.

NoSQL databases: Comparison, Use-cases and Key players

This section provides comparison of NoSQL database categories along with use-cases and key players and customers.

Type

How it works

Suitable

Use-cases

Not-suitable

Use-cases

Key players and

Customers

Key Value

Data (aggregate) is stored as a value with a key. The data is opaque to the database and be queried using the key only. Some key/value databases provide some kind of metadata added to aggregate to enable indexing.

  • Store user session information
  • User profiles, preferences
  • Shopping cart data
  • When there is a relationship between different sets of data
  • Transactional operations like saving many keys in one operation
  • When needing to query the values
  • Need to operate on multiple keys

Redis (twitter/github/stackoverflow/craigslist)

Memcached (Wikipedia/Flickr/Youtube)

Riak (AT&T, Weather channel, Yahoo, Boeing)

DynamoDB (Amazon/AWS)

Ehcache (Wikimedia)

Hazelcast (Amex/Apple/HP/Ticketmaster)

Document Store

The aggregate is stored as a Document typically with a key inside the document. The document is visible to the database so that the queries can be based on the contents of document

  • Event logging
  • Content management systems
  • Blogging platforms
  • Realtime analytics
  • Ec-commerce applications needing flexible schema
  • Complex transactions spanning different operations
  • If aggregate keeps varying a lot, then querying will not be optimal

MongoDB (expedia/ebay/Linkedin/Metlife)

CouchDB (Horoscope/payroll)

Couchbase (Adobe/AOL/Adidas/BMW)

MarkLogic( Warner Bros,Citi/Dow Jones)

RavenDB (MSNBC, easyfly)

Gemfire (US Defense, Wallstreet)

Column Family

Using a row-key, a group of columns are associated. Each row key can have different sets of columns known as column families. It can be considered to be a two-level aggregate structure with with row-key as first identifier.

  • Event logging
  • Content management systems
  • Blogging platforms
  • Counters
  • Expiring usage
  • Systems that need ACID transactions for writes and reads
  • Need to aggregate queries using functions like SUM/AVG
  • Not good for initial prototypes since column family design could change frequently

Cassandra (Netflix.Instagram,Reddit, Facebook)

HBase (Twitter,meetup, Facebook)

Accumulo (NSA)

Hypertable (EBay, Rediff.com)

Graph

Graph database is one that uses nodes, edges and properties to represent and store data. Every element has a direct pointer to adjacent element and no indexing is necessary

  • Connected data like social networks
  • Routing, Dispatch and Location based services
  • Recommendation engines
  • When data needs to be updated in all nodes

Neo4j – (Accenture, Careerbuilder, Cisco)

Titan

OrientDB

Sparksee

Giraph

References

Standard