Distributed Databases




Section 0: Module Objectives or Competencies
Course Objective or Competency: Students will be introduced to some of the theory behind the development of database engines, and will understand the complexity and challenges of developing distributed databases.

Module Objectives or Competencies:

The student will be able to explain the definition of a distributed database management system, as well as what features are required in order for a DBMS to be considered distributed.
The student will be able to list and explain the components of a DDBMS.
The student will be able to explain that database systems can be classified on the basis of the extent to which process distribution and data distribution are supported.
The student will be able to explain that a DDBMS has various degrees of transparency, which refers to whether the complexities of distribution are noticeable to the user.
The student will be able to explain how database distribution makes concurrent transactions even more complex.
The student will be able to explain how a DDBMS handles transactions that involve data at distributed locations.
The student will be able to explain the implications of database distribution on database design.
The student will be able to explain the CAP Theorem and its implications on distributed databases.
The student will be able to list and explain Date's Commandments for a fully distributed database.
Section 1: Overview

A Distributed Database Management System (DDBMS) consists of a single logical database that is split into a number of fragments, each stored on one or more computers under the control of a separate DBMS, with the computers connected by a communications network.


Multi-site database:

Requirements:

Justification:


Distributed DBMS Advantages and Disadvantages.
Section 2: DDBMS

A fully distributed database management system governs both the storage and processing of logically related data over interconnected computer systems in which both data and processing functions are distributed over several sites.

A DDBMS has the following characteristics:

A DBMS must have at least the following functions to be classified as distributed:


A fully distributed DBMS must perform all of the functions of a centralized DBMS, and it must also handle all of the additional functions imposed by the distribution of data and processing. These additional functions must be performed transparently to the end user.

The figure below shows a single logical database that consists of two database fragments A1 and A2 located at sites 1 and 2. The end users see only one logical database. It is irrelevant to the users that the database is divided into two separate fragments or where those fragments reside.

A fully distributed database management system.
Section 3: DDBMS Components

DDBMS components include the computer workstations that form the network, the network hardware and software that connect them, and two software components residing at each site: the transaction processor (TP), which receives and processes end-user requests, and the data processor (DP), which stores and manages the data located at its site.

Communication among DPs and TPs is made possible through a set of rules, or protocols, used by the DDBMS.

The protocols determine how the distributed database will:

Data processors (DPs) and transaction processors (TPs) can be added to the system without affecting the operation of the other components.
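
To make the component roles concrete, here is a minimal sketch, assuming a toy in-memory catalog and fragments (all names and structures below are illustrative, not from any real DDBMS): the TP receives a user request, uses the catalog to locate the DP holding each fragment, and merges the partial results into one logical answer.

```python
# Hypothetical TP/DP sketch; names are illustrative only.

# Distributed data catalog: which site's DP stores each fragment.
CATALOG = {"A1": "site1", "A2": "site2"}

class DataProcessor:
    """Stores and manages the fragment located at one site."""
    def __init__(self, rows):
        self.rows = rows
    def execute(self, predicate):
        return [r for r in self.rows if predicate(r)]

class TransactionProcessor:
    """Receives end-user requests and routes them to the right DPs."""
    def __init__(self, dps):
        self.dps = dps  # site name -> DataProcessor
    def request(self, fragments, predicate):
        results = []
        for frag in fragments:
            site = CATALOG[frag]                  # locate the fragment
            results += self.dps[site].execute(predicate)
        return results                            # merged logical answer

dps = {"site1": DataProcessor([{"id": 1}]), "site2": DataProcessor([{"id": 2}])}
tp = TransactionProcessor(dps)
print(tp.request(["A1", "A2"], lambda r: True))   # user sees one database
```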

Section 4: Levels of Data and Process Distribution

Database systems can be classified on the basis of the extent to which process distribution and data distribution are supported.

A DBMS may store data in a single site (centralized DB) or in multiple sites (distributed DB) and may support data processing at a single site or at multiple sites.

Levels of Data and Process Distribution.

SPSD (Single-Site Processing, Single-Site Data)

MPSD (Multiple-Site Processing, Single-Site Data)

MPMD (Multiple-Site Processing, Multiple-Site Data)

Section 5: Distributed Database Transparency Features

A DDBMS requires functional characteristics that can be grouped and described as a set of transparency features, which have the common property of hiding, or making transparent to the user, the complexities of a distributed database.

Section 6: Distribution Transparency

Distribution transparency allows the management of a physically dispersed database as though it were a centralized database.

Fragment locations.

Three levels of distribution transparency are recognized:

Fragmentation transparency is the highest level: the user sees a single logical database and needs no knowledge of fragment names or locations.

Location transparency exists when the user must specify fragment names but not their locations.

Local mapping transparency exists when the user must specify both the fragment names and their locations.
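
The difference between the levels can be illustrated by the query a user would have to write at each level to list Tennessee customers. The fragment name CUST_TN and the NODE clause below are pseudo-SQL for illustration only, not the syntax of any real dialect:

```python
# Illustrative only: how much a user must know at each level.

# Fragmentation transparency: one query against the logical table;
# no knowledge of fragments or sites is needed.
q_fragmentation = "SELECT * FROM CUSTOMER WHERE CUS_STATE = 'TN';"

# Location transparency: the user must name the fragment,
# but not the site where it resides.
q_location = "SELECT * FROM CUST_TN;"

# Local mapping transparency: the user must name both the
# fragment and the site that stores it (pseudo-syntax).
q_local_mapping = "SELECT * FROM CUST_TN NODE NODE_TN;"
```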

Section 7: Transaction Transparency

Transaction transparency ensures that database transactions will maintain the distributed database's integrity and consistency. It ensures that a transaction will be completed only if all database sites involved in the transaction complete their part of the transaction.

A distributed request allows the user to reference data from several remote DP sites. Because each request can access data from more than one DP site, a single transaction can access several sites. The ability to execute a distributed request provides fully distributed database processing capabilities because it allows:

The location and partition of the data should be transparent to the end user.

The figure below shows a distributed request that includes a transaction that uses a single SELECT statement to reference two tables, which are located at different sites.

A distributed request.

The next figure demonstrates that the distributed request feature allows a single request to reference a physically partitioned table. The CUSTOMER table is divided into two fragments, C1 and C2, located at sites B and C. The user wants a list of all customers whose balances exceed $250.00.

Another distributed request.
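
The rewriting that makes this possible can be sketched as follows. This is a simplified illustration of what a DDBMS query processor does, not a real optimizer; the fragment names C1 and C2 and sites B and C come from the figure above:

```python
# What the user writes: one statement against the logical table.
user_query = "SELECT * FROM CUSTOMER WHERE CUS_BALANCE > 250;"

# What the DDBMS generates after consulting its catalog: one
# subquery per fragment, executed at the fragment's site.
subqueries = {
    "site B": "SELECT * FROM C1 WHERE CUS_BALANCE > 250;",
    "site C": "SELECT * FROM C2 WHERE CUS_BALANCE > 250;",
}
# The partial results are then unioned and returned as a single
# answer, so the partitioning stays invisible to the user.
```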

Transaction transparency ensures that distributed transactions are treated like centralized transactions, ensuring the serializability of transactions. That is, the execution of concurrent transactions, whether or not they are distributed, takes the database from one consistent state to another.

Section 8: Failure Transparency

In a distributed system, failure transparency refers to the extent to which errors and subsequent recoveries of hosts and services within the system are invisible to users and applications.

In the distributed environment, the DDBMS must account for:

The DDBMS must ensure the atomicity of each transaction, which means ensuring that subtransactions either all commit or all abort.

Section 9: Performance Transparency and Query Optimization

Performance transparency mandates that the DDBMS should have a comparable level of performance to a centralized DBMS.

In a distributed environment, the system should not suffer any performance degradation attributable to the distributed architecture, such as the overhead of the network.


Performance transparency also requires the DDBMS to determine the most cost-effective strategy to execute a request.

Most current databases require the user to specify what data is required by a query, but not how to retrieve it.

The DDBMS must also consider the additional processing and communication costs associated with the distribution of data.

In a distributed environment, the DDBMS must take account of the communication cost, which may be the dominant factor in WANs with a bandwidth of only a few kilobytes per second.

One approach to query optimization minimizes the total time cost that will be incurred in executing the query, summed over all sites.

An alternative approach minimizes the response time of the query, in which case the query processor (QP) attempts to maximize the parallel execution of operations.
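
The contrast between the two goals can be made concrete with a toy cost model. All of the timing numbers below are assumptions for illustration:

```python
# A query decomposes into two subqueries that can run in parallel
# at different sites, followed by a merge step at the result site.
subquery_times = {"site1": 4.0, "site2": 6.0}   # seconds, assumed
merge_time = 1.0                                 # seconds, assumed

# Minimizing total cost charges every second of work, wherever it runs:
total_cost = sum(subquery_times.values()) + merge_time      # 11.0

# Minimizing response time exploits parallelism: the user waits only
# for the slowest parallel branch plus the merge:
response_time = max(subquery_times.values()) + merge_time   # 7.0

print(total_cost, response_time)
```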

Section 10: Distributed Concurrency Control

Concurrency control is especially important in a distributed database environment because multi-site, multiple-process operations are much more likely to create data inconsistencies and deadlocked transactions than are single-site systems. All parts of the transaction, at all sites, must be completed before a final COMMIT is issued to record the transaction.

The following figure shows a situation in which each transaction operation is supposed to be committed by each local DP, but one of the DPs cannot commit the transaction's results. Such a transaction would yield an inconsistent database, with its inevitable integrity problems, because it is not possible to uncommit committed data. The solution is described next.

Effect of a premature COMMIT.
Section 11: Two-Phase Commit Protocol

Because distributed databases make it possible for a transaction to access data at several sites, a final COMMIT must not be issued until all sites have committed their parts of the transaction. The two-phase commit protocol guarantees that, if a portion of a transaction operation cannot be committed, all changes made at the other sites participating in the transaction will be undone to maintain a consistent database state.

Each DP maintains its own transaction log. The two-phase commit protocol requires that each individual DP's transaction log entry be written to permanent storage before the database fragment is actually updated. This is referred to as the write-ahead protocol.

The two-phase commit protocol defines the operations between two types of nodes: the coordinator and one or more subordinates. The protocol is implemented in two phases:

Phase 1: Preparation

The coordinator sends a PREPARE TO COMMIT message to all subordinates. Each subordinate writes its transaction log entry (using the write-ahead protocol) and replies with a PREPARED TO COMMIT or NOT PREPARED message.

Phase 2: The Final COMMIT

If all subordinates are prepared, the coordinator broadcasts a COMMIT message, and each subordinate commits its part of the transaction and acknowledges. If any of the subordinates does not commit, or fails to reply, the coordinator sends an ABORT message, thereby forcing all changes to be rolled back.

The objective of the two-phase commit is to ensure that all nodes commit their part of the transaction, or the transaction is aborted. If one of the nodes fails to commit, then the information necessary to recover the database is in the transaction log, and the database can be recovered.
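
The protocol's control flow can be sketched in a few lines of Python. Everything below (class names, vote strings, the log format) is illustrative rather than taken from any particular DDBMS:

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.log = []                        # local transaction log

    def prepare(self, txn):
        # Write-ahead protocol: log before touching the fragment.
        self.log.append(("PREPARE", txn))
        return "PREPARED" if self.can_commit else "NOT PREPARED"

    def commit(self, txn):
        self.log.append(("COMMIT", txn))

    def abort(self, txn):
        self.log.append(("ABORT", txn))      # undo using the log

def two_phase_commit(coordinator_log, subordinates, txn):
    # Phase 1: preparation -- collect votes from every subordinate.
    votes = [s.prepare(txn) for s in subordinates]
    # Phase 2: final COMMIT only if every subordinate is prepared.
    if all(v == "PREPARED" for v in votes):
        coordinator_log.append(("COMMIT", txn))
        for s in subordinates:
            s.commit(txn)
        return "COMMITTED"
    coordinator_log.append(("ABORT", txn))
    for s in subordinates:
        s.abort(txn)
    return "ABORTED"

# One failing subordinate forces the whole transaction to roll back:
log = []
subs = [Subordinate("DP1"), Subordinate("DP2", can_commit=False)]
print(two_phase_commit(log, subs, "T1"))     # ABORTED
```

Because each log entry is written before any data change, a failed site can use its transaction log to undo or redo its part of the transaction during recovery.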

Section 12: Distributed Database Design

Although the design principles discussed earlier apply equally to distributed databases, designers also need to consider two additional issues: how the database should be fragmented and replicated, and where the resulting fragments should be allocated.

Data fragmentation and data replication address the first issue, and data allocation deals with the second.




Data Fragmentation

Data fragmentation allows a single object to be broken into two or more segments or fragments.

The object can be a user's database, a system database, or a table.

The fragments can be stored at any node on the network.

Data fragmentation information is stored in the distributed data catalog (DDC), where it is accessed by the transaction processor (TP) to process user requests.

Data fragmentation strategies operate at the table level and consist of dividing a table into logical fragments in one of three ways: horizontal, vertical, or mixed fragmentation.


Illustrations

An illustration.


Horizontal Fragmentation

Corporate management requires information about its customers in all three states, but company locations in each state (TN, FL, and GA) require data regarding local customers only.

Sample table.

The resulting fragments yield the three tables shown below.

Sample tables.
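
As a minimal sketch, horizontal fragmentation amounts to selecting, for each site, the rows that satisfy that site's condition. The toy rows below are illustrative, not the textbook data:

```python
CUSTOMER = [
    {"cus_num": 10, "cus_state": "TN", "cus_balance": 300.00},
    {"cus_num": 11, "cus_state": "FL", "cus_balance": 120.50},
    {"cus_num": 12, "cus_state": "GA", "cus_balance": 450.00},
]

# One fragment per state, defined by a row-selection predicate:
fragments = {
    state: [row for row in CUSTOMER if row["cus_state"] == state]
    for state in ("TN", "FL", "GA")
}
# Every fragment keeps the full set of columns; the union of the
# fragments reconstructs the original table.
```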


Vertical Fragmentation

The CUSTOMER relation can also be divided into vertical fragments that are composed of a collection of attributes.

Vertical fragments.

The vertical fragmentation results are shown below.

Vertical fragmentation results.
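
Vertical fragmentation can be sketched the same way, as column projection. Note that the key column must be repeated in every fragment so the original table can be rebuilt by joining on it; the particular column split below is an assumption:

```python
CUSTOMER = [
    {"cus_num": 10, "cus_state": "TN", "cus_balance": 300.00},
    {"cus_num": 11, "cus_state": "FL", "cus_balance": 120.50},
]

# Each fragment keeps a subset of the columns, plus the key cus_num:
frag_v1 = [{k: r[k] for k in ("cus_num", "cus_state")} for r in CUSTOMER]
frag_v2 = [{k: r[k] for k in ("cus_num", "cus_balance")} for r in CUSTOMER]
# Joining frag_v1 and frag_v2 on cus_num reconstructs the table.
```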


Mixed Fragmentation

Mixed fragmentation requires both horizontal and vertical table partitioning.

Mixed fragmentation requires a two-step procedure: the table is first divided into horizontal fragments (the rows for each state), and each horizontal fragment is then divided into vertical fragments (the attributes each department needs).

The mixed fragmentation yields the results shown below.

Mixed fragmentation results.

Each of the fragments shown above contains customer data by state and, within each state, by department location, to fit each department's requirements. The tables corresponding to the fragments are shown below.

Mixed fragmentation tables.
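
The two-step procedure can be sketched by composing the previous two operations. The state list and column groups below are assumptions for illustration:

```python
CUSTOMER = [
    {"cus_num": 10, "cus_state": "TN", "cus_balance": 300.00},
    {"cus_num": 11, "cus_state": "FL", "cus_balance": 120.50},
]

def mixed_fragments(rows, states, column_groups):
    frags = {}
    for state in states:                              # step 1: horizontal
        horizontal = [r for r in rows if r["cus_state"] == state]
        for dept, cols in column_groups.items():      # step 2: vertical
            frags[(state, dept)] = [{c: r[c] for c in cols}
                                    for r in horizontal]
    return frags

frags = mixed_fragments(
    CUSTOMER, ("TN", "FL"),
    {"service": ("cus_num", "cus_state"),
     "collection": ("cus_num", "cus_balance")},
)
```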



Data Replication

Data replication refers to the storage of data copies at multiple sites.

Fragment copies may be stored at several sites to serve specific information requirements (and to provide some degree of redundancy in the event that a node goes down).

Because fragment copies can enhance data availability and response time, they help to reduce both communication costs and total query costs.

Suppose database A is divided into fragments A1 and A2. Within a replicated distributed database the scenario depicted below is possible. Fragment A1 is stored at sites S1 and S2, while fragment A2 is stored at sites S2 and S3.

Data replication.

Replicated data are subject to the mutual consistency rule, which requires that all copies of a data fragment be identical. To maintain data consistency among the replicas, the DDBMS must ensure that a database update is performed at all sites where replicas exist.
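
Here is a minimal sketch of what "update at all sites" means, assuming the fragment-to-site mapping from the figure above (A1 at S1 and S2, A2 at S2 and S3). A production DDBMS would wrap this loop in two-phase commit so a failed site could not leave the copies inconsistent:

```python
# Site names and data layout are illustrative assumptions.
replica_sites = {"A1": ["S1", "S2"], "A2": ["S2", "S3"]}
stores = {s: {} for s in ("S1", "S2", "S3")}   # site -> fragment data

def replicated_update(fragment, key, value):
    # Mutual consistency rule: push the update to every site that
    # holds a copy of the fragment.
    for site in replica_sites[fragment]:
        stores[site][(fragment, key)] = value

replicated_update("A1", 10, {"cus_balance": 300.00})
```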

Despite the advantages associated with replication, it also requires additional processing overhead, because each of the data copies must be maintained by the system. The DDBMS must perform the following processes on the system:

Three replication conditions exist:

  1. A fully replicated database stores multiple copies of each database fragment at multiple sites. Because every fragment is replicated, this approach tends to be impractical due to the overhead it imposes on the system.
  2. A partially replicated database stores multiple copies of some database fragments at multiple sites. Partial replication does not demand excessive overhead and so is practical.
  3. A database that is not replicated stores each fragment at a single site only, so there are no duplicate fragments.

Two factors influence data replication: usage frequency and database size. When the usage frequency of remotely located data is high and the database is large, replication is likely to reduce the cost of data requests.




Data Allocation

Data allocation refers to the process of deciding where to locate data. Data allocation is closely related to the way a database is divided or fragmented.

Data allocation algorithms take into consideration a variety of factors:

Section 13: CAP Theorem

In 2000, Eric Brewer presented the CAP Theorem, which identifies three essential requirements for the successful design, implementation, and deployment of applications in distributed computing systems (Consistency, Availability, and Partition tolerance) and states that a distributed system can guarantee only two of the three at any one time.

The correct way to think about CAP is that, in the case of a network partition, a system must choose between availability and consistency.
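
The trade-off can be sketched as the decision a replica faces when it cannot reach its peers. The mode names CP and AP below are conventional shorthand, and the function is purely illustrative:

```python
def handle_write(partitioned, mode):
    if not partitioned:
        return "write applied everywhere"    # no trade-off needed
    if mode == "CP":
        # Choose consistency: refuse writes that cannot be replicated.
        return "write rejected: peers unreachable"
    # Choose availability: accept the write and reconcile later.
    return "write accepted locally; reconcile after the partition heals"

print(handle_write(partitioned=True, mode="CP"))
print(handle_write(partitioned=True, mode="AP"))
```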



Section 14: Client/Server Architecture

Many database vendors refer to their distributed databases as client/server systems, but the two terms are not synonymous.

Client/server architecture actually refers to the way in which computers interact to form a system.

Section 15: Date's Commandments

C.J. Date published a list of twelve commandments for a fully distributed database in 1987. While no current DDBMS conforms to all of them, they describe the ideal or eventual target system.

Section 16: Distributed MySQL?

MySQL Cluster

MySQL Cluster: Two webserver setup