Big Data




Section 0: Module Objectives or Competencies
Course Objective or Competency: The student will learn alternatives to relational databases for storing Big Data, the type of data that includes unstructured and semi-structured data that is not well suited to traditional databases.

Module Objectives or Competencies:

The student will be able to list and explain various types of Big Data.
The student will be able to list and explain the problems associated with Big Data.
The student will be able to list and explain the characteristics of Big Data.
The student will be able to list and explain emerging Big Data technologies.
Section 1: Overview

The amount of data being collected has been growing exponentially in size and complexity.

Web data in the form of browsing patterns, purchasing histories, customer preferences, behavior patterns, and social media data from sources such as Facebook, Twitter, and LinkedIn have inundated organizations with combinations of structured and unstructured data.

What is Big Data and how does it work?

What is Big Data?

Section 2: Issues

The problem is that the relational approach does not always match the needs of organizations with Big Data challenges.

Section 3: Big Data Characteristics

Big Data generally refers to a set of data that displays the characteristics of volume, velocity, and variety (the "3 Vs") to an extent that makes the data unsuitable for management by a relational database management system.

Note that the "3 Vs" have been expanded to "5 Vs" (or more).

There are no specific values associated with these characteristics. The issue is that the characteristics are present to such an extent that the current relational database technology struggles with managing the data.


Volume

Volume, the quantity of data to be stored, is a key characteristic of Big Data because the storage capacities associated with Big Data are extremely large.

As the quantity of data needing to be stored increases, the need for larger storage devices increases as well. When this occurs, systems can either scale up (upgrade to a larger, more powerful server) or scale out (add more servers and distribute the data and workload across them).


Velocity

Velocity refers to the rate at which new data enters the system as well as the rate at which the data must be processed.

Rate at which new data enters the system

The issues of velocity mirror those of volume.

Newer technologies, such as RFID, GPS, and NFC, add further layers of data-gathering opportunities that often generate large amounts of data that must be stored in real time.

Rate at which the data must be processed

The velocity of processing can be broken down into two categories.


Variety

Variety refers to the vast array of formats and structures in which the data may be captured.

Data can be considered to be structured, unstructured, or semistructured.

Although much of the transactional data that organizations use works well in a structured environment, most of the data in the world is semistructured or unstructured.

Big Data requires that the data be captured in whatever format it naturally exists, without any attempt to impose a data model or structure on the data.
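As a concrete illustration, here is a minimal Python sketch (the records and field names are hypothetical) of semistructured social-media data: each record is self-describing, and no fixed schema is imposed on it.

```python
import json

# Hypothetical social-media records, captured in their natural form.
# Each record carries its own keys, but the two do not share a schema:
# the second has fields ("hashtags", "location") that the first lacks.
records = [
    {"user": "alice", "text": "Loving this course!", "likes": 12},
    {"user": "bob", "text": "Check this out",
     "hashtags": ["bigdata"], "location": {"lat": 40.7, "lon": -74.0}},
]

# A relational table would force both records into one rigid column
# layout; a Big Data store keeps each record as-is.
for record in records:
    print(json.dumps(record))
```

Note how neither record had to be trimmed or padded to fit a predefined set of columns, which is exactly the flexibility the variety characteristic demands.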


Veracity

Veracity is the authenticity and credibility of the data.


Value

Value refers to the ability to turn data into useful insight; collecting Big Data is worthwhile only if it can be analyzed to produce actionable results.


But wait! There's more!

5 V’s of Big Data

Big Data: The 6 Vs You Need to Look at for Important Insights

The 7 V’s of Big Data

The 10 Vs of Big Data

The 17 V’s Of Big Data

The 42 V’s of Big Data and Data Science

The 51 V's Of Big Data: Survey, Technologies, Characteristics, Opportunities, Issues and Challenges

Fifty-Six Big Data V’s Characteristics and Proposed Strategies to Overcome Security and Privacy Challenges

99 Big Data Vs on the Wall

Section 4: Tools

Emerging Big Data technologies allow organizations to process massive data stores of multiple formats in cost-effective ways.

Some of the most frequently used Big Data technologies are Hadoop, MapReduce, and NoSQL databases.

Hadoop technologies provide a framework for Big Data analytics in which data (structured or unstructured) is distributed, replicated, and processed in parallel using a network of low-cost commodity hardware.
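The map/reduce pattern behind Hadoop can be illustrated with the classic word-count example. The sketch below runs both phases in a single Python process purely for illustration; a real Hadoop job distributes the map tasks, the shuffle, and the reduce tasks across the cluster's nodes.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map task touches only its own slice of the input and each reduce task only its own keys, the phases parallelize naturally across low-cost commodity hardware.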

NoSQL databases provide distributed, fault-tolerant databases for processing unstructured data.

Section 5: Summary

Big Data represents a new wave in data management challenges, but it does not mean that relational database technology is going away.

What has changed is that now, for the first time in decades, relational databases are not necessarily the best way to store and manage all of an organization's data.

Section 6: Resources

Intro to Big Data

Distributed Database

A distributed database consists of a single logical database that is split into a number of fragments, each stored on one or more computers under the control of a separate DBMS, with the sites communicating via a computer network.
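A toy Python sketch of how rows might be fragmented (sharded) across such computers; the node names and customer keys are hypothetical, and hash-based assignment is just one common placement strategy.

```python
import zlib

# Hypothetical nodes in the distributed database.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key):
    """Deterministically assign a row key to one node's fragment."""
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

# Distribute some hypothetical customer rows across the fragments.
fragments = {node: [] for node in NODES}
for key in ["cust-1001", "cust-1002", "cust-1003", "cust-1004"]:
    fragments[node_for(key)].append(key)

for node, keys in fragments.items():
    print(node, keys)
```

Because the assignment is a pure function of the key, any site can compute which fragment holds a row without consulting a central directory.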

Scalability

Database scalability is a database's ability to handle growth in data and users, i.e., to increase capacity or add computing nodes based on the workload it is subjected to.

Fault Tolerance

Fault tolerance means that if one of the nodes in the distributed database fails, the system keeps operating as normal.

Polyglot

Polyglot, as an adjective, can mean involving many diverse means of expression.

Terabyte

1 TB is equal to 1,099,511,627,776 bytes or 1,024 GB.

Petabyte

A petabyte is a measure of memory or data storage capacity equal to 2^50 bytes. There are 1,024 terabytes (TB) in a petabyte – or roughly 1 million gigabytes (GB) – and approximately 1,024 PB make up one exabyte. See also Megabytes, Gigabytes, Terabytes… What Are They?

2.5 Quintillion Bytes?

2.5 × 10^18 bytes ÷ 10^12 bytes/TB = 2,500,000 TB.
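The same conversion checked in Python, using decimal (SI) units where 1 TB = 10^12 bytes:

```python
# 2.5 quintillion bytes = 2.5 x 10**18 bytes.
bytes_total = 2_500_000_000_000_000_000

# Decimal (SI) terabytes: 1 TB = 10**12 bytes.
terabytes = bytes_total // 10**12
print(terabytes)  # 2500000
```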