Relational databases – Success factors and drawbacks
Relational databases aka RDBMS have been popular for 20+ years for the following reasons
- Most popular choice for persistence
- Great support for Concurrency through solid transactional support
- Standard relational model (SQL standards)
- Served as Integration database for various disparate applications
Relational databases have the following drawbacks
Impedance mismatch – Application in-memory data structures did not match relational model causing difficulty for developer in spite of the availability of Object Relational Mapping (ORM) tools like Hibernate
Tight enforcement of data model through schemas reducing agility of developers
Large growth in data demanded running the applications in clusters, which RDBMS are not good at.
The following factors influenced the NoSQL movement
21st century websites like Amazon/Google had a huge demand for processing large data in clusters in commodity hardware
Application integration shifted from databases to services which decoupled the databases from various application integration needs opening up various options
Running commercial RDBMS in clusters were prohibitively expensive, and open source RDBMS did not mature to support the clustering and high volume data processing.
Characteristics of NoSQL databases
- Running well in clusters
- Open source
- Not using relational model
- Used by 21st century web estates
The NoSQL movement created a phenomenon of Polyglot persistence meaning organizations will choose multiple databases to meet different persistence needs.
Categories of NoSQL databases
There are two major categories of NoSQL databases
Aggregate oriented databases: An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries for ACID operations with the database. There are 3 categories of aggregate oriented databases
Key-value: Data (aggregate) is stored as a value with a key. The data is opaque to the database and be queried using the key only. Some key/value databases provide some kind of metadata added to aggregate to enable indexing.
Document: The aggregate is stored as a Document typically with a key inside the document. The document is visible to the database so that the queries can be based on the contents of document.
Column family: Using a row-key, a group of columns are associated. Each row key can have different sets of columns known as column families. It can be considered to be a two-level aggregate structure with with row-key as first identifier.
Graph databases: Graph database is one that uses nodes, edges and properties to represent and store data. Every element has a direct pointer to adjacent element and no indexing is necessary.
Distribution Models in NoSQL databases
Sharding is a technique where you put different parts of data in different servers. Sharding can improve both read and write performances. Sharding can be done from application side or by suing database provider specific auto-sharding techniques.
By replication, the whole data is replicated across multiple servers. Replication can be done in two ways.
Replicate data across multiple servers, but only one server is master or primary. Master is the authoritative source for data and is responsible for all writes. Slaves can serve read requests and improve read performance greatly. Also when master fails, a slave can become master automatically or manually.
I Peer-Peer replication, all nodes are equal and accept writes. Loss to anyone of them does not prevent access to datastore.
Key points to be considered while employing Distribution techniques
Master slave replication reduces update conflicts, peer-peer avoids single point of a write failure. Not all the database providers support peer-peer.
Sharding alone reduces availability, it is typically used along with replication
Consistency issues must be addressed when using any of the distribution techniques.
To resolve update conflicts, use Version stamps. In Peer-Peer replication, Vectpr of version stamps will be needed when different nodes have conflicting updates.
CAP Theorem or Brewer’s conjecture and NoSQL databases
CAP theorem states that a distributed system cannot guarantee all of the following 3 characteristics.
- Partition tolerance
Let us go over what these characteristics mean in a distributed system
There must exist a total order such that each operation looks as if it were computed at a single instant. Any read operation that begins after a write operation myst return that value. This consistency is a combination of Atomicity and Consistency in the context of ACID properties in the traditional RDBMS context.
Every request received by a non-failing node must return a response. This means that a failed node does not mean not Available. A running node which cannot return response (for e.g. because it could not write because a write concern could not be met) is not considered Available in this context.
When a network is partitioned, messages sent from one node to another can be lost. Availability and Consistency are qualified by the need to tolerate partitions. It means
- Every response must be consistent (atomic consistent) even though messages could be lost
- Every non-failed node must return a valid response even though other nodes may not be reachable during network partitioning. Nothing less than the total network failure should cause the system not to be able to provide a valid response.
A single server system is an example of CA system. It can’t partition, hence no need to worry about partition tolerance. It is available just being one node. Consistency can be satisfied without much difficulty. Most relational databases fall into this category
A CP system compromises availability to achieve partition tolerance and consistency. For example MongoDB when safe=true is set, Consistency is guaranteed. But availability is sacrificed meaning if network partition happens, response may not be guaranteed,
A CA system like Cassandra still achieves high availability during network partition. It uses eventual consistency as a trade-off.
The below diagram provide visual representation of where NoSQL databases fit in the CAP landscape.
NoSQL databases: Comparison, Use-cases and Key players
This section provides comparison of NoSQL database categories along with use-cases and key players and customers.
How it works
Key players and
Data (aggregate) is stored as a value with a key. The data is opaque to the database and be queried using the key only. Some key/value databases provide some kind of metadata added to aggregate to enable indexing.
Riak (AT&T, Weather channel, Yahoo, Boeing)
The aggregate is stored as a Document typically with a key inside the document. The document is visible to the database so that the queries can be based on the contents of document
MarkLogic( Warner Bros,Citi/Dow Jones)
RavenDB (MSNBC, easyfly)
Gemfire (US Defense, Wallstreet)
Using a row-key, a group of columns are associated. Each row key can have different sets of columns known as column families. It can be considered to be a two-level aggregate structure with with row-key as first identifier.
Cassandra (Netflix.Instagram,Reddit, Facebook)
HBase (Twitter,meetup, Facebook)
Hypertable (EBay, Rediff.com)
Graph database is one that uses nodes, edges and properties to represent and store data. Every element has a direct pointer to adjacent element and no indexing is necessary
Neo4j – (Accenture, Careerbuilder, Cisco)
- Brewer’s Conjecture
- CAP Theorem