When it comes to storing data, there are tons of options. Depending on functional requirements, type and the size of the stored data, one might gravitate towards either relational or distributed storage – but how to decide what is the best option for your use-case?
Relational Database Management System (RDBSM/SQL)
Concept of Relational Database Management System (RDBMS/RDS) dates back to the 1980s. It supports the paradigm of table-like data structure with fixed schema and column types, very intuituve transition for MS Excel users. Tables relate to each other through common id columns (i.e. column customer_id in tables: payments, orders and customers) – common id columns allow us to establish the relation between these tables when performing searches, filers and joins.
Examples:
- MySQL, PostgreSQL, Oracle, MS SQL Server, MariaDB, SQLite
The language used to query the RDBSMs is called Structured Query Language (SQL) – that is why the RDBMS are often called SQL Databases, which is a simplification. RDBMS were and still are immensely popular due to many benefits.
Pros:
- Strong consistency – with traditional Relational tables, there is a single source of truth and thus many RDBSM solutions benefit from ACID properties.
- Easy to scale – scaling requires acquiring larger machine – often we can also scale reads by using read relicas.
- Less prone to issues – RDBSM is a solution that has been there for years – databases are well-maintained and field-tested – what’s more there are less moving elements compared to distributed systems and thus any debugging and testing is significantly simpler.
Cons:
- Vertical scaling has a limit – theoretically we can increase the size of our machine indefinitelly but practically cost will start to increase exponentially after a certain threshold.
- Performance – because processing by nature does not happen in parallel, queries and searches on large datasets can take long to finish.
- Can be prone to resiliency issues – traditionally, a single server has been a vulnerable point of failure, however many solutions now offer read replicas that can be promoted to a master node in case of a failure of the main server.
In the past, the choice was straightforward – your data size would rarely exceed the size and processing power of a single machine. In fact, this is an excellent choice for applications that do not serve large number of requests – think smaller companies, apps for internal customers or fixed amount of upfront users, etc..
But what to do if your data can’t fit on a single computer?
Distributed Storage (NoSQL)
Modern, large-scale data ingestion pipelines often operate at the speed of GBs/sec. Big companies often use clusters of thousands of machines to run their applications and to split the load of simultaneously serving millions of users.
Examples:
- MongoDB, Cassandra, HBase, InfluxDB, Couchbase, CouchDB, CockroachDB.
So distributed databases seem like an excellent choice, right?
The fastest way to become RDBMS advocate is to maintain a distributed data storage.
Well, not really. As always in Software Engineering, everything is a tradeoff. As Uncle Ben once said:
Why we love distributed storage (the power):
- Almost unlimited (horizontal) scaling – done by simply adding more machines and data volumes to the cluster.
- Performance – data is accessed in parallel which provides millisecond operation time at scale.
- Resiliency and continuous uptime – data is usually replicated 3 times across multiple machines – chances of data loss or downtime significantly reduces with each replica.
- Lower costs of storage – distributed systems are designed to use large fleets of budget machines.
Why we would avoid it, if possible (the responsibility):
- Eventual consistency – with the need to replicate records, we lack the ACID properties – it can create data access inconsistencies, insert fails or stale data – as such, this storage option is not always suitable for cases, where consistency is crucial, i.e. financial sector.
- High complexity – testing, debugging and maintenance often becomes exponentially more convoluted.
- Observability issues – because we need to monitor availability, performance and utilisation of many machines, monitoring becomes challenging.
Summary
The choice of your database is paramount for your project – right choice will serve you well, while suboptimal one might qickly turn out to be a burden. The choice of a relational database is a good default choice due to the ease of use and consistency properties. However, not every use-case would fit into the application so reaserch up-front can save a lot of sweat later on.
Chers!
Filip
Hello to all, the contents existing at this website are in fact amazing
for people knowledge, well, keep up the nice work fellows.
Thank you for your good blog as always.Have a great day today
Hey! This is my first visit to your blog! We are a team of volunteers and starting a new project in a community in the same niche. Your blog provided us useful information to work on. You have done a extraordinary job!