Sunday, October 29, 2023

A Guide to Database Design: How to design and optimize databases for different applications

  • Databases serve as the backbone of most applications, from small-scale mobile apps to large enterprise systems. A well-designed and optimized database is crucial for ensuring data integrity, performance, and scalability. 
  • In this post, we'll explore the principles and best practices for designing and optimizing databases to suit various applications.

Understand Your Application's Requirements & Choose the Right Database Model

  • Before diving into database design, it's essential to thoroughly understand the requirements of your application. Consider factors such as data volume, access patterns, security, and the expected growth rate. 
  • Databases are a foundational component of modern computing and information management. They come in various types, each designed to handle specific data requirements and use cases.
  • Selecting the appropriate type of database for your application is crucial, as it directly impacts data management, performance, and scalability. The choice should be based on your application's specific requirements, data structure, and expected growth. 
  • Each database type has its strengths and limitations, and the decision should align with your long-term goals and project constraints.

1. Relational Databases (RDBMS):

  • Relational databases are structured databases that use a tabular format for data storage. Data is organized into tables with rows and columns, and relationships between tables are defined using keys.
  • Use Cases: Relational databases are suitable for applications with structured data, well-defined relationships, and complex querying needs. Commonly used for transactional systems, accounting, and complex business applications.
  • Examples: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server.

2. NoSQL Databases:

  • NoSQL databases, or "Not Only SQL," are designed to handle semi-structured or unstructured data. They provide flexibility in data modeling, allowing for schema-less data storage.
  • Use Cases: NoSQL databases are well-suited for applications that require high scalability, rapid development, and flexibility. They are often used in web applications, content management systems, and real-time analytics.
  • Examples: MongoDB, Cassandra, Redis, Couchbase.
3. NewSQL Databases:

  • NewSQL databases aim to combine the scalability and flexibility of NoSQL with the ACID (Atomicity, Consistency, Isolation, Durability) properties of traditional relational databases.
  • Use Cases: NewSQL databases are suitable for applications that require the benefits of both relational and NoSQL databases. They are used in systems that need strong consistency and horizontal scalability.
  • Examples: Google Spanner, CockroachDB, VoltDB.

4. Document Stores:

  • Document databases store data in a format similar to JSON or XML documents, allowing for nested structures and key-value pairs. Each document can have its own structure.
  • Use Cases: Document stores are excellent for content management systems, catalogs, and applications with semi-structured or hierarchical data.
  • Examples: MongoDB, CouchDB, RavenDB.

5. Key-Value Stores:

  • Key-value stores are the simplest NoSQL databases, where data is stored as key-value pairs. They offer high performance for read and write operations.
  • Use Cases: Key-value stores are often used for caching, session management, and real-time analytics where speed is critical.
  • Examples: Redis, Amazon DynamoDB.

6. Column-Family Stores:

  • Column-family stores are optimized for storing and retrieving vast amounts of data, typically in columnar format. They are well-suited for big data and analytical workloads.
  • Use Cases: Column-family stores are used in applications requiring efficient analysis of large datasets, like time-series data, log files, and data warehousing.
  • Examples: Apache Cassandra, HBase.

7. Graph Databases:

  • Graph databases are designed to represent and store data in a graph structure, where entities are nodes, and relationships between them are edges. They excel at traversing complex relationships.
  • Use Cases: Graph databases are suitable for applications with highly interconnected data, such as social networks, recommendation systems, and fraud detection.
  • Examples: Neo4j, Amazon Neptune.

8. In-Memory Databases:

  • In-memory databases store data primarily in RAM, which results in extremely fast data access but limited storage capacity. They are often used for real-time data processing.
  • Use Cases: In-memory databases are employed in applications requiring rapid data retrieval, such as caching, high-frequency trading, and real-time analytics.
  • Examples: Redis, Memcached, VoltDB.

9. Time-Series Databases:

  • Time-series databases are optimized for handling data points with timestamps. They offer efficient storage and retrieval of time-stamped data.
  • Use Cases: Time-series databases are used in applications where data changes over time and needs to be analyzed, like IoT (Internet of Things), monitoring, and financial systems.
  • Examples: InfluxDB, Prometheus, TimescaleDB.
10. Multimodel Databases:

  • Multimodel databases support multiple data models within a single database system. They provide versatility by allowing users to choose the best model for their data.
  • Use Cases: Multimodel databases are used in applications where different data models are required for different types of data or services.
  • Examples: ArangoDB, OrientDB.

Scaling databases

  • Scaling databases is a crucial aspect of managing the growth and performance of applications. Two common approaches to scaling are horizontal scaling and vertical scaling. In this section, we will explore these concepts in detail.

1. Horizontal Database Scaling:

  • Horizontal scaling, often referred to as scaling out, involves adding more servers or instances to distribute the data and processing load. It aims to enhance the database's capacity by adding more identical units (nodes) to a system.
    Key Points:
  • Distribution of Data: In horizontal scaling, data is divided across multiple servers or instances. Each server stores a portion of the data, reducing the amount of data each server has to manage.
  • Improved Scalability: Horizontal scaling offers excellent scalability as new servers can be easily added to accommodate a growing dataset or increased workload. It's a cost-effective way to handle increased traffic.
  • Load Balancing: To ensure even distribution of data and queries, load balancers are used to route requests to the appropriate server. Load balancing mechanisms can be based on various algorithms, such as round-robin or least connections.
  • Data Sharding: Horizontal scaling often involves data sharding, where data is partitioned based on certain criteria, such as user IDs, geographical location, or a specific category. This ensures that related data is stored together, reducing the need for complex joins.
  • Challenges: Maintaining data consistency across distributed servers and handling complex queries that span multiple shards can be challenging. It may require sophisticated strategies for data synchronization and management.
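The hash-based routing described in the Data Sharding point can be sketched in a few lines. This is an illustrative sketch, not a specific product's API; the node names and the choice of MD5 are assumptions:

```python
# Hypothetical sketch: routing requests to one of several identical nodes by
# hashing the user ID, so all of a user's data lands on the same server.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # assumed cluster of three servers

def route(user_id: str) -> str:
    """Pick a node deterministically from a hash of the user ID."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same user always routes to the same node, keeping related data together.
assert route("user-42") == route("user-42")
```

Note that this simple modulo scheme reshuffles almost every key when a node is added, which is the problem consistent hashing (discussed later) addresses.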

   Use Cases:

  • Web applications that experience rapid user growth and need to handle increased traffic efficiently.
  • Databases with datasets that exceed the capacity of a single server.
  • Systems where redundancy and high availability are essential, as failures in one server do not lead to data loss or service interruption.
2. Vertical Database Scaling:

  • Vertical scaling, often referred to as scaling up, involves adding more resources (such as CPU, memory, or storage) to a single server to improve the performance and capacity of the database. It focuses on enhancing the capabilities of an existing server.
    Key Points:
  • Resource Enhancement: In vertical scaling, you invest in upgrading the server's hardware resources, allowing it to handle more data, users, and complex queries.
  • Simplified Management: Managing a single server is typically simpler than managing a distributed environment with multiple nodes.
  • Data Coherence: Vertical scaling maintains data integrity and consistency because all data remains on a single server.
  • Limitations: Vertical scaling has limitations as it relies on the capacity of the underlying hardware. There is a point at which further scaling becomes impractical or prohibitively expensive.

    Use Cases:

  • Applications that require a straightforward and less complex solution for handling increased loads or data volume.
  • Scenarios where the existing hardware resources can be upgraded, and the budget allows for investing in more powerful servers.
  • Systems that need to maintain strong data consistency, where complex data distribution is not necessary.
Horizontal vs. Vertical Scaling in Databases:

  • The choice between horizontal and vertical scaling depends on various factors, including:
  • Application Requirements: Consider the specific needs of your application. Horizontal scaling is often favored for applications with unpredictable growth, while vertical scaling is suitable for those with a steady increase in resource demands.
  • Budget: Horizontal scaling is often more cost-effective, as it involves adding commodity servers. Vertical scaling can become expensive when high-end server hardware is required.
  • Data Distribution: If your application requires distributing data across geographical locations or handling massive data volumes, horizontal scaling may be more appropriate.
  • Data Consistency: Vertical scaling is generally simpler to manage in terms of data consistency. If maintaining strong consistency is a top priority, vertical scaling may be preferred.

Conclusion:

  • In practice, many applications combine both horizontal and vertical scaling strategies to meet their specific needs and maintain a balance between scalability, performance, and cost-efficiency. 
  • The choice should be based on a careful assessment of the application's requirements and a long-term scalability plan.

The CAP theorem

  • The CAP theorem, also known as Brewer's theorem, is a fundamental concept in distributed systems that helps us understand the trade-offs involved when designing and operating distributed databases. 
  • It was formulated by computer scientist Eric Brewer in 2000 and has since become a guiding principle for architects and engineers building distributed systems. 
  • The CAP theorem states that in a distributed database system, you can prioritize two out of three of the following attributes: Consistency (C), Availability (A), and Partition Tolerance (P).

Consistency (C):

  • In the context of the CAP theorem, consistency refers to the idea that all nodes in a distributed system see the same data at the same time. It ensures that, after a successful write operation on one node, all subsequent read operations will reflect that write.
  • Use Cases: Consistency is crucial for systems where data accuracy and correctness are of the utmost importance, such as financial applications and inventory management systems.
  • Trade-Offs: Achieving strong consistency might require blocking or delaying read requests until the data is fully replicated across all nodes. This can impact system availability and performance, especially during network partitions or high loads.

Availability (A):

  • Availability means that every request, whether it's a read or write operation, gets a response (success or failure) without any guarantees regarding the data's consistency. In an available system, there is no downtime, and all nodes are responsive.
  • Use Cases: High availability is essential for applications where uninterrupted service is a priority, such as e-commerce websites and real-time communication platforms.
  • Trade-Offs: Sacrificing consistency for availability can result in temporary data inconsistencies or the need for conflict resolution when multiple nodes receive conflicting updates.
Partition Tolerance (P):

  • Partition tolerance refers to a system's ability to continue operating, even in the presence of network partitions, where nodes can't communicate with one another due to network failures or congestion.
  • Use Cases: Partition tolerance is a critical requirement for distributed systems that must remain operational under less-than-ideal network conditions, which is common in large-scale, geographically distributed systems.
  • Trade-Offs: Achieving partition tolerance often leads to a trade-off between consistency and availability, as the system may need to make decisions independently on different sides of a network partition.

Choosing two of the three CAP attributes

  • The CAP theorem asserts that it's impossible for a distributed database system to simultaneously provide all three attributes—Consistency, Availability, and Partition Tolerance—at the highest levels. In a networked environment, where network failures can occur, you must prioritize two of these attributes while compromising on the third. This is often represented as:
  • CA: Systems that prioritize Consistency and Availability are not partition-tolerant. In the face of network partitions, they might become unavailable or exhibit inconsistent behavior.
  • CP: Systems that prioritize Consistency and Partition Tolerance may sacrifice availability during network partitions to ensure that data remains consistent.
  • AP: Systems that prioritize Availability and Partition Tolerance may not guarantee strong consistency, allowing nodes to continue serving requests independently, even if it results in temporary data inconsistencies.

Conclusion

  • It's important to note that the CAP theorem doesn't dictate a strict choice between these trade-offs; rather, it provides a framework for understanding the fundamental challenges of distributed systems. 
  • Depending on your specific application and requirements, you can make informed design decisions to balance these attributes and optimize your system accordingly.
  • Many real-world distributed databases, like Cassandra, Couchbase, and Riak, are designed with a focus on partition tolerance (AP) while providing tunable consistency and availability settings to meet various use cases. 
  • Understanding the CAP theorem is essential for architects and engineers working on distributed systems to make informed decisions about their design, ensuring they align with the desired trade-offs for their specific application.

How to select the right database for the service?

  • Choosing a database is a crucial step when designing systems. In order to pick the right database for our data, we first need to consider five factors:
    1. Structure of the data
    2. Query patterns
    3. Amount of scalability needed
    4. Cost
    5. Maturity of the database
When should we use a relational vs. a non-relational database?
  • When we have structured (tabular) data, using a relational database is optimal, because the data is in the form of rows and columns and can easily be stored. Popular examples: MySQL, Oracle, PostgreSQL, SQL Server.
  • But if we do not need ACID properties, then the choice between the two comes down to the application's requirements.
  • The process of choosing a database is often depicted as a flowchart.

What are the challenges to databases while scaling?

  • No matter the type of database, query operations become more expensive as data grows. This is because the CPU performs the query processing, while the data is stored on the hard disk (secondary memory).
  • The CPU executes millions of instructions per second (MIPS), whereas a hard disk performs on the order of 100 I/O operations per second, no matter how fast it is. Since the two cannot interact directly, primary memory (RAM) is brought into play: it operates much faster and speeds things up via caching, but it cannot hold everything.
  • Data is stored across sectors (blocks) on secondary memory and cannot all be transferred into RAM at once, so, as studied in operating systems, we pay an I/O cost every time data has to be fetched from disk.
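The RAM-as-cache idea above can be sketched with a bounded in-memory cache in front of a simulated slow disk read. This is an illustrative toy, not a real storage engine; the function names and latency figure are assumptions:

```python
# Illustrative sketch: a bounded RAM cache in front of a slow disk read,
# showing why repeated accesses avoid the secondary-memory I/O cost.
import functools
import time

def read_from_disk(block_id: int) -> str:
    """Stand-in for a slow secondary-storage read."""
    time.sleep(0.01)  # simulated disk latency
    return f"data-for-block-{block_id}"

@functools.lru_cache(maxsize=1024)  # RAM holds only a bounded working set
def read_cached(block_id: int) -> str:
    return read_from_disk(block_id)

read_cached(7)  # first access pays the disk cost (cache miss)
read_cached(7)  # repeat access is served from RAM (cache hit)
```

`read_cached.cache_info()` reports one hit and one miss after the two calls, which is exactly the benefit the caching layer provides.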

How to overcome challenges to databases while scaling

  • Now let us discuss below concepts that help us in scaling our databases and overcoming these challenges that are as follows: 
    1. Indexing
    2. Data Partitioning
    3. Database Sharding

1. Indexing

  • Indexing is a database optimization technique that involves creating a data structure (an index) to enhance the speed of data retrieval operations on a database table. 
  • These indexes are similar to the index in a book, where you can quickly find specific information without reading the entire book. In the context of databases, indexes help locate and access data more efficiently, especially in large datasets.
How does indexing help in reducing costs?
  • Indexing contributes to cost reduction in the following ways:
  • Improved Query Performance: By reducing the time it takes to retrieve data, indexing reduces the need for additional hardware resources and optimizes query execution, saving both time and money.
  • Lower Maintenance Costs: Well-designed indexes can decrease the need for fine-tuning and query optimization, saving on database administrator (DBA) and developer time.
  • Reduced I/O Operations: Efficient indexes minimize the number of disk I/O operations, which can lead to cost savings on storage and server hardware.
Advantages of Indexing:
  • Faster Data Retrieval: Indexes speed up data retrieval, making queries run more efficiently.
  • Optimized Read Operations: Select queries benefit significantly from indexing, resulting in quicker response times.
  • Improved Data Integrity: Unique indexes enforce data integrity constraints, preventing duplicate or invalid data.
  • Reduced Locking: Properly indexed tables can reduce contention and locking issues, allowing for better concurrency.
  • Support for Joins: Indexes facilitate efficient joins, which are essential for relational databases.
Disadvantages of Indexing:
  • Increased Storage: Indexes consume additional storage space, which can become a concern for very large databases.
  • Slower Write Operations: While reads benefit, write operations (inserts, updates, and deletes) can be slower due to index maintenance.
  • Maintenance Overhead: Indexes need to be maintained and periodically rebuilt, which may require additional resources.
  • Complexity: Over-indexing or indexing the wrong columns can lead to query performance issues and increased complexity.
  • Performance Trade-offs: Indexes are not a one-size-fits-all solution; choosing the right indexes and strategies is crucial.

Use Cases of Indexing:

  • Consider a CRM database used by a sales team. The team frequently queries it to retrieve customer information, check purchase history, and log recent interactions. As the database grows, these queries become slower, and the sales team experiences delays in serving customers, which can lead to missed opportunities and customer dissatisfaction.
Solution
  • To address this performance issue, the CRM system's database administrators implement indexing. Here's how it works:
  • Customer Name Index: An index is created on the "Customer Name" field in the customer database table. This index allows the CRM system to quickly locate and retrieve customer records based on their names.
  • Purchase History Index: An index is also created on the "Purchase Date" field in the purchase history table. This index accelerates queries related to a customer's purchase history and allows the sales team to retrieve data about past transactions more efficiently.
  • Communication Log Index: Similarly, an index is applied to the "Interaction Date" field in the communication log, making it easier to retrieve information about recent interactions with customers.
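At its core, the "Customer Name" index above is just a mapping from a column value to row locations, consulted instead of scanning the whole table. A minimal sketch, with illustrative table and field names:

```python
# Toy table: a list of customer rows, as in the CRM example.
customers = [
    {"name": "Alice", "country": "US"},
    {"name": "Bob",   "country": "DE"},
    {"name": "Alice", "country": "IN"},
]

# Build a "Customer Name" index once: value -> list of row positions.
name_index: dict[str, list[int]] = {}
for pos, row in enumerate(customers):
    name_index.setdefault(row["name"], []).append(pos)

def find_by_name(name: str) -> list[dict]:
    """Index lookup: jump straight to matching rows instead of a full scan."""
    return [customers[pos] for pos in name_index.get(name, [])]

assert find_by_name("Alice") == [customers[0], customers[2]]
```

This also makes the write-path cost visible: every insert or update must now maintain `name_index` too, which is exactly the "slower write operations" trade-off listed above.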

2. What is Data partitioning?

  • Data partitioning, also known as database partitioning, is a database design technique that involves splitting a large database into smaller, more manageable pieces called partitions. Each partition contains a subset of the data and is stored separately. This division of data can significantly improve database performance, manageability, and scalability. It's particularly valuable for large and growing datasets.

1. Horizontal Partitioning:

  • Horizontal partitioning involves dividing a table into smaller segments based on rows. Each partition holds a specific range of rows or a set of rows that meet certain criteria. For example, you might partition a customer database horizontally by storing customers from different countries in separate partitions.
  • Advantages: It can improve query performance by narrowing down the data needed for a particular operation. It also simplifies backup and maintenance tasks related to specific data subsets.
  • Disadvantages: Managing a large number of partitions can become complex, and some queries may still require scanning multiple partitions.

2. Vertical Partitioning:

  • Vertical partitioning involves dividing a table into smaller segments based on columns. Each partition holds a subset of the table's columns. This technique is used to reduce the amount of data read from disk, which can be beneficial when many columns in a table are rarely used together.
  • Advantages: It reduces the I/O overhead for queries that access only a subset of the columns. Vertical partitioning can improve query performance, reduce storage costs, and make it easier to manage less frequently accessed data.
  • Disadvantages: Queries that need all the columns may require joining multiple partitions, which can add complexity to query execution.

3. Directory-Based Partitioning:

  • Directory-based partitioning involves using a directory or metadata system to manage the location and organization of partitions. This approach helps the database system efficiently locate and access the appropriate partition for a given query.
  • Advantages: It simplifies partition management and allows for dynamic reconfiguration of partitions, making it easier to scale and adapt to changing data requirements.
  • Disadvantages: The directory itself can become a point of contention or a potential bottleneck in high-throughput scenarios.
4. Partition Criteria:

  • Key or Hash-Based Partitioning: To determine the partition number, we apply a hash function to the entry's key attribute.
  • List Partitioning: Each partition is assigned a set of discrete values, and the value of a chosen column determines which partition a row belongs to.
  • Round Robin Partitioning: If there are n partitions, the ith tuple is assigned to partition number i % n. Data is assigned sequentially, and this criterion guarantees an even distribution of data.
  • Consistent Hashing: A newer form of partitioning. Plain hash-based partitioning has the drawback that adding new servers requires changing the hash function, which would mean a server outage and a full redistribution of data; consistent hashing remaps only a small fraction of keys instead.
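The first three criteria can each be expressed as a one-line function. This is a hedged sketch with illustrative names; the partition count and the list-partition value sets are assumptions:

```python
# Minimal sketches of the partition criteria above.
N_PARTITIONS = 4

def hash_partition(key: str) -> int:
    """Key/hash-based: hash the key attribute to pick a partition.
    (Python's built-in hash() is stable within one process, which is
    enough for this illustration.)"""
    return hash(key) % N_PARTITIONS

def list_partition(country: str) -> int:
    """List-based: each partition owns a set of discrete values."""
    mapping = {"US": 0, "CA": 0, "DE": 1, "FR": 1}
    return mapping[country]

def round_robin_partition(i: int) -> int:
    """Round robin: the i-th tuple goes to partition i % n."""
    return i % N_PARTITIONS

assert round_robin_partition(5) == 1  # 5 % 4
```

Consistent hashing is deliberately omitted here; it needs a ring structure and is sketched under key-based sharding later in the post.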
Advantages of Partitioning:
  • Improved Performance: Queries can run faster because they access only the relevant partitions, reducing the amount of data scanned.
  • Scalability: Partitioning supports horizontal scalability by distributing data across multiple storage devices or servers.
  • Easier Maintenance: Smaller partitions are easier to back up, restore, and manage. Maintenance operations can be more targeted.
  • Cost Savings: You can optimize storage costs by storing more frequently accessed data on high-performance storage and less frequently accessed data on cheaper storage.
Disadvantages of Partitioning:
  • Complexity: Managing a large number of partitions can be complex and require careful planning.
  • Overhead: Directory-based partitioning can introduce overhead in the form of metadata management and lookup operations.
  • Query Complexity: Some queries may require accessing multiple partitions, leading to more complex queries and potential performance bottlenecks.

3. What is Sharding?

  • Sharding is a database architecture strategy used to horizontally partition a large database into smaller, more manageable pieces called shards. 
  • Each shard is a self-contained database that stores a subset of the data. Sharding is primarily employed in distributed database systems to improve performance, scalability, and data management.

Need for Sharding:

  • Consider a very large database that has not been sharded. For example, take the database of a college in which all the student records (present and past) are maintained in a single database. It would contain a very large number of records, say 100,000. Now, when we need to find a student in this database, each lookup may have to examine up to 100,000 records, which is extremely costly.

  • Now consider the same students' records divided into smaller data shards based on year. Each data shard will hold only around 1,000-5,000 students' records. Not only does the database become much more manageable, but the cost of each lookup is also reduced by a huge factor. This is what sharding achieves, and why it is needed.
  • Scalability: Large databases may struggle to handle increasing amounts of data and concurrent users. Sharding allows for horizontal scaling by distributing data across multiple servers or clusters.
  • Performance: Queries on large, unsharded databases can become slow, as they involve scanning extensive data. Sharding can significantly improve query performance by limiting the data involved in each query.
  • Data Localization: Sharding can help localize data, making it geographically closer to end-users or specific regions, which reduces latency and improves user experience.

Features of Sharding:

  • Sharding makes the Database smaller
  • Sharding makes the Database faster
  • Sharding makes the Database much more easily manageable
  • Sharding can be a complex operation sometimes
  • Sharding reduces the transaction cost of the Database
  • Each shard reads and writes its own data.
  • Many NoSQL databases offer auto-sharding.
  • Failure of one shard doesn’t affect the data processing of other shards.
Sharding Architectures:

1. Key-Based Sharding:
  • Data is partitioned based on specific keys or attributes, such as user IDs or geographical regions. Each key corresponds to a particular shard.
  • Key Selection: This key is a field or a set of fields in the database schema that is used to determine which shard a particular piece of data belongs to. Common keys include user IDs, customer names, geographical locations, or any attribute that suits the data distribution requirements.
  • Data Distribution: The sharding algorithm typically involves applying a hash function or range-based calculation to the sharding key to determine the target shard. This ensures that data with the same key values are stored on the same shard, promoting data locality and efficient query routing.
  • Shard Independence: Each shard operates independently and contains a portion of the data. This isolation means that the failure of one shard does not affect the entire database, improving system fault tolerance.
  • Query Routing: When a query is issued, the sharding system routes the query to the appropriate shard. This process ensures that only relevant data is accessed, reducing the volume of data scanned and improving query performance.
  • Scalability: Key-Based Sharding supports horizontal scalability, allowing databases to grow with increased data and traffic. New shards can be added as the database expands, accommodating additional data and users.
  • Load Balancing: Properly designed Key-Based Sharding should aim for even data distribution across shards to prevent performance bottlenecks. Load balancing mechanisms may be employed to redistribute data as the database evolves.
  • Considerations: The selection of the sharding key is a critical decision. It should consider the access patterns of queries, distribution of data, and the expected growth of the database.
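One way to implement key-based sharding so that adding a shard is cheap is the consistent-hash ring mentioned under the partition criteria. A sketch, assuming illustrative shard names and using virtual nodes for smoother distribution:

```python
# Possible key-based sharding via a consistent-hash ring: each shard owns many
# points ("virtual nodes") on a ring, and a key maps to the next point clockwise.
import bisect
import hashlib

class ConsistentRing:
    def __init__(self, shards: list[str], vnodes: int = 100):
        # Place vnodes points per shard on the ring, sorted by hash.
        self.ring = sorted(
            (self._h(f"{s}#{i}"), s) for s in shards for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Walk clockwise to the first virtual node at or after the key's hash."""
        idx = bisect.bisect(self.keys, self._h(key)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentRing(["shard-1", "shard-2", "shard-3"])
assert ring.shard_for("user-42") == ring.shard_for("user-42")  # stable routing
```

When a fourth shard joins, only the keys falling between its new ring points and their predecessors move, instead of nearly all keys as with plain `hash(key) % n`.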

2. Horizontal or Range-Based Sharding:

  • Data is divided into ranges based on a specific attribute, like time or numerical values. This approach is often used for time-series data.
  • Each shard is responsible for storing data records falling within its designated range. Horizontal or Range-Based Sharding is particularly useful for datasets where the distribution of data is sequential, such as time-series data or data with numerical identifiers.
  • Attribute Selection: In Range-Based Sharding, you start by selecting an attribute in your database schema that has a range of values suitable for division. Common examples include date/time stamps, numerical IDs, or alphabetical ranges, such as product names or geographical regions.
  • Data Range Partitioning: The selected attribute is used to partition the data into distinct ranges or intervals. Each range corresponds to a particular shard. For instance, if you are sharding based on time, you might create daily, weekly, or monthly intervals.
  • Data Distribution: Data records are assigned to the shard whose range corresponds to the value of the chosen attribute in each record. This assignment ensures that data within a given range is stored together on the same shard, promoting data locality and efficient querying.
  • Query Routing: When a query is executed, the sharding system routes the query to the appropriate shard based on the range criteria in the query. This ensures that the query operates only on the relevant shard and a subset of the data, enhancing query performance.
  • Scalability: Horizontal or Range-Based Sharding supports horizontal scalability by allowing new shards to be added as data grows. Each new shard covers a new range of values within the chosen attribute.
  • Load Balancing: Properly designed Range-Based Sharding should aim for even distribution of data across shards. Load balancing mechanisms may be employed to ensure that each shard remains within its operational capacity.
  • Considerations: The choice of the attribute for range-based sharding should consider the data distribution and access patterns of queries. For example, if you're dealing with time-series data, the time attribute is a natural choice for range-based sharding.
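Range-based routing reduces to a sorted list of range boundaries and a binary search. A sketch for the monthly time-series case, with illustrative shard names and boundaries:

```python
# Sketch of range-based shard routing: each shard owns one month of data,
# and a binary search over the sorted lower bounds finds the owning shard.
import bisect
from datetime import date

# Sorted lower bound of each shard's range, parallel to the shard list.
BOUNDS = [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)]
SHARDS = ["shard-jan", "shard-feb", "shard-mar"]

def shard_for(ts: date) -> str:
    """Route a timestamp to the shard whose range contains it."""
    idx = bisect.bisect_right(BOUNDS, ts) - 1
    if idx < 0:
        raise ValueError("timestamp predates the first shard range")
    return SHARDS[idx]

assert shard_for(date(2023, 2, 14)) == "shard-feb"
```

Adding a new month is just appending one boundary and one shard name, which is why range-based sharding grows so naturally with sequential data.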

3. Vertical Sharding: 

  • In vertical sharding, data is divided by columns rather than rows. Different shards store different sets of columns, allowing for a reduction in I/O operations.
  • This technique allows for more efficient storage and improved query performance by reducing the amount of data that needs to be read and loaded into memory. πŸ“ŠπŸ’
  • Attribute Selection: In Vertical Sharding, you start by selecting the columns or attributes of a table that should be divided among the shards. The selection is based on usage patterns and access requirements. For example, you might have a user profile table with many columns, but you may choose to shard it based on frequently accessed attributes like user ID, name, and email. πŸ”
  • Data Partitioning: The selected columns are partitioned across different shards. Each shard stores a subset of the columns. For instance, Shard A might contain user IDs, names, and email addresses, while Shard B might store other attributes like addresses and phone numbers. 🧩
  • Data Distribution: Data records are assigned to the shard that contains the relevant columns for that record. Each record's columns determine which shard stores that record's data. πŸ”€
  • Query Routing: When a query is executed, the sharding system routes the query to the appropriate shard. This ensures that only the relevant columns are accessed, reducing I/O overhead and improving query performance. πŸš€πŸ“‘
  • Scalability: Vertical Sharding supports horizontal scalability. As data grows, new shards can be added to accommodate additional columns or attributes, and each new shard is responsible for storing a subset of the columns. πŸ“ˆπŸ”Ό
  • Load Balancing: Load balancing in Vertical Sharding is often related to the distribution of column data. It's essential to maintain even column distribution across shards to prevent performance bottlenecks. ⚖️πŸ‹️‍♂️
  • Considerations: The choice of columns for vertical sharding should be based on access patterns. Frequently accessed columns should be grouped together to maximize query performance. πŸ€”πŸ“‰
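The column-splitting and routing steps above can be sketched with in-memory dictionaries standing in for the shards. The table, column names, and shard layout here are hypothetical; the point is that a read touches only the shards holding the requested columns.

```python
# Hypothetical vertical split of a user-profile table: frequently
# accessed columns live on shard A, the rest on shard B. The key
# (user_id) is kept on both shards so the halves can be rejoined.
SHARD_A_COLUMNS = {"user_id", "name", "email"}
SHARD_B_COLUMNS = {"user_id", "address", "phone"}

shard_a: dict = {}  # user_id -> frequently accessed columns
shard_b: dict = {}  # user_id -> remaining columns

def insert_user(record: dict) -> None:
    """Partition one record's columns across the two shards."""
    uid = record["user_id"]
    shard_a[uid] = {k: v for k, v in record.items() if k in SHARD_A_COLUMNS}
    shard_b[uid] = {k: v for k, v in record.items() if k in SHARD_B_COLUMNS}

def query_user(uid: int, columns: list) -> dict:
    """Route the read only to shards that store the requested columns."""
    wanted, result = set(columns), {}
    if wanted & SHARD_A_COLUMNS:  # shard B is never touched for these reads
        result.update({c: shard_a[uid][c] for c in wanted & SHARD_A_COLUMNS})
    if wanted & SHARD_B_COLUMNS:
        result.update({c: shard_b[uid][c] for c in wanted & SHARD_B_COLUMNS})
    return result
```

A query for just `name` reads only shard A, which is the I/O reduction the section describes.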

4. Directory-Based Sharding: 

  • A metadata or directory system is used to map data to specific shards. This approach simplifies the management of shard locations and provides flexibility to dynamically reconfigure partitions. πŸ—„️🌐
  • Directory Creation: In Directory-Based Sharding, a directory is created to store information about which shard is responsible for each piece of data. The directory can be a separate database, a distributed data store, or even a dedicated service that maintains this mapping. πŸ“πŸ“Š
  • Data Insertion: When data is inserted into the database, it does not go directly to a shard. Instead, it is sent to the directory first. The directory, based on a predefined set of rules or attributes, determines which shard should store the data and records this information. πŸ“₯πŸ“©
  • Query Routing: When a query is executed, the directory is queried first to identify the shard that holds the relevant data. The query is then routed to the appropriate shard for processing. This ensures that the data access is directed to the correct location, even if data redistribution is necessary. πŸš€πŸ—Ί️
  • Shard Management: The directory system manages the metadata about shard locations, such as which servers or nodes host which shard. This makes it easier to scale the database by adding new shards or redistributing data as needed. πŸ’πŸ”
  • Flexibility and Adaptability: Directory-Based Sharding provides flexibility to adapt to changing data distribution and access patterns. If data characteristics change or new shards need to be added, the directory can be updated without changing the application logic. πŸ”„πŸ“ˆ
  • Considerations: Directory-Based Sharding may introduce additional overhead in terms of managing the directory system. However, this overhead is often outweighed by the advantages it offers in terms of flexibility and dynamic reconfiguration. ⚖️🧠
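The insert, lookup, and rebalancing flow above can be sketched with plain dictionaries standing in for the directory and the shards. The placement rule and shard names are hypothetical; the key property shown is that moving data only updates the directory, not the application logic.

```python
import hashlib

directory: dict = {}                      # key -> shard name (the metadata layer)
shards = {"shard_a": {}, "shard_b": {}}   # stand-ins for physical shards

def assign_shard(key: str) -> str:
    """Placement rule, kept in one place; here, a simple hash split."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return "shard_a" if digest % 2 == 0 else "shard_b"

def insert(key: str, value) -> None:
    shard = assign_shard(key)
    directory[key] = shard                # record the mapping first
    shards[shard][key] = value

def lookup(key: str):
    shard = directory[key]                # consult the directory, then route
    return shards[shard][key]

def move(key: str, new_shard: str) -> None:
    """Rebalancing: move the record and update only the directory."""
    old = directory[key]
    shards[new_shard][key] = shards[old].pop(key)
    directory[key] = new_shard
```

Callers always go through `lookup`, so data can be redistributed between shards without any change to the code that reads it.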


Advantages of Sharding:

  • Scalability: Sharding supports horizontal scalability, allowing databases to grow with increased data and traffic.
  • Improved Performance: Sharding reduces the amount of data that needs to be processed for each query, resulting in faster response times.
  • Data Localization: Shards can be geographically distributed, reducing data access latency for users in different regions.
  • High Availability: Sharding enhances fault tolerance, as the failure of one shard doesn't affect the entire system.

Disadvantages of Sharding:

  • Complexity: Sharding introduces complexity in terms of data distribution, query routing, and shard management.
  • Query Routing Overhead: Routing queries to the correct shard adds an overhead, and this process must be efficient.
  • Data Balancing: Uneven data distribution among shards can lead to performance issues. Balancing data across shards can be challenging.
  • Backup and Recovery: Managing backups and recovery processes becomes more complex when dealing with a large number of shards.

What are the differences between sharding and partitioning? πŸš€

  • While sharding and partitioning share the common goal of dividing a large database into smaller pieces, they take different approaches. When a database is sharded, the data is distributed across multiple servers, with each server holding its own copy of the table structure and a subset of the rows. 
  • Partitioning, on the other hand, splits tables within a single database instance. Sharding is a form of horizontal scaling: as user traffic grows, you can add machines to handle the increased load. 
  • Partitioning splits rows based on column value(s). Every partition retains all of the table's columns; only the rows differ between partitions. Partitioned data is also easier to manage, since all partitions live in one database instance.
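The contrast above can be sketched with in-memory stand-ins (the server, table, and routing rules are hypothetical): a sharded read first picks a server, while a partitioned read picks a table within one instance.

```python
# Sharding: the same logical "orders" table lives on different servers,
# split here by a simple id-parity rule.
server_1 = {"orders": {1: "alice"}}   # rows with odd ids
server_2 = {"orders": {2: "bob"}}     # rows with even ids

def sharded_read(order_id: int) -> str:
    server = server_1 if order_id % 2 == 1 else server_2
    return server["orders"][order_id]

# Partitioning: one database instance, several partitions of the same
# table keyed by a column value (the order year).
database = {
    "orders_2023": {1: "alice"},
    "orders_2024": {2: "bob"},
}

def partitioned_read(order_id: int, year: int) -> str:
    return database[f"orders_{year}"][order_id]
```

In the sharded case, scaling means adding servers and extending the routing rule; in the partitioned case, all partitions remain inside one instance.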

Conclusion

  • Database design and optimization are critical steps in building reliable, high-performance applications. By understanding your application's specific requirements, choosing the right database model, and following best practices in data organization, security, and performance optimization, you can ensure that your database supports your application's needs and scales as your user base grows.
  • It's an ongoing process that demands attention, but the benefits in terms of application stability and performance are well worth the effort.
