Section 0: Module Objectives or Competencies
| Course Objective or Competency | Module Objective or Competency |
|---|---|
| The student will be introduced to related specializations and techniques in the areas of data analytics and databases. | The student will be able to explain specializations in the field of data analytics, such as data integration, data migration, and data visualization. |
| | The student will be able to explain database-related specializations, such as database testing, database deployment, and database performance tuning. |
Section 1: Overview
There are additional topics related to the content of this course with which students should become familiar.
Those related to data analytics will be addressed first, followed by those associated with databases.
Section 2: The Realm of Data Analytics
Data Science
Data science involves the application of statistics, machine learning, and analytical approaches to solve critical business problems.
- The primary function of data science is to help organizations turn their volumes of big data into valuable and actionable insights.
- Data science is driven and automated by machine learning and computer science.
- Data scientists also interpret and deliver the results of their findings by visualization techniques, building data science apps, or narrating stories about the solutions to their data problems.
- In order to analyze the data, a data scientist should have very broad knowledge of different techniques in machine learning, data mining, statistics and big data infrastructures.
- While data analysts examine large data sets to identify trends, develop charts, and create visual presentations to help businesses make more strategic decisions, data scientists design and construct new processes for data modeling and production using prototypes, algorithms, predictive models, and custom analysis.
Data Engineering
Data engineering requires the knowledge and skills to prepare the data infrastructure that data scientists analyze; data engineers are software engineers who design, build, and integrate data from various sources and manage big data.
Then, with the goal of optimizing the performance of their company’s big data ecosystem, they write complex queries to ensure that the data is easily accessible.
- Data engineering may also involve running some ETL (Extract, Transform and Load) on top of big datasets and creating big data warehouses that can be used for reporting or analysis by data scientists.
- Typical tasks may include creating and maintaining optimal data pipeline architectures, assembling large, complex data sets that meet functional / non-functional business requirements, and creating data analytics tools that help optimize the analytics process to provide actionable insights.
Data Analytics Infrastructure
An optimal data analytics infrastructure makes it possible to analyze vast amounts of data in parallel across multiple clusters.
- For example, the data analytics infrastructure typically allows for massive scalability of unstructured data across hundreds or thousands of cluster servers.
- The speed at which an organization is able to transform the volume and variety of data available into insight-driven actions is the key differentiator in creating value from data and analytics today.
- Data analytics concepts include knowledge of tools like ZooKeeper, a Hadoop administration tool used for managing jobs in the cluster.
- Familiarity with computing clusters is essential; ZooKeeper provides a coordination service for distributed applications.
- Hadoop provides a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models, and is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
- MapReduce is a programming paradigm that allows for massive scalability of unstructured data across hundreds or thousands of cluster servers in an Apache Hadoop cluster (a minimal single-machine illustration of the paradigm follows this list).
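To make the MapReduce paradigm concrete, here is a minimal, single-machine word-count sketch written in plain Python. It only illustrates the map and reduce phases conceptually; the sample documents are made up, and a real Hadoop job would distribute these phases across the nodes of a cluster.

```python
from collections import defaultdict

# A toy word count illustrating the MapReduce paradigm in plain Python.
# Real Hadoop MapReduce distributes these phases across cluster nodes;
# here both phases run locally, for illustration only.

def map_phase(documents):
    """Emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_and_reduce(pairs):
    """Group pairs by key and sum the counts, as a reducer would."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

if __name__ == "__main__":
    docs = ["big data needs big clusters", "data drives insight"]
    print(shuffle_and_reduce(map_phase(docs)))
    # {'big': 2, 'data': 2, 'needs': 1, 'clusters': 1, 'drives': 1, 'insight': 1}
```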
Data Analytics Administration
The primary responsibility of a data analytics administrator is the configuration, security, and maintenance of the enterprise data analytics/business intelligence platform, which includes administering the underlying infrastructure as well as configuring, managing, and maintaining the platform itself.
- A data analytics administrator also monitors and ensures that the environment is running, and troubleshoots system-wide issues and outages.
- A data analytics administrator maintains the integrity of the environment by implementing standards and security.
Hadoop Administrator
A Hadoop administrator administers and manages Hadoop clusters and all other resources in the entire Hadoop ecosystem.
- The role of a Hadoop administrator is mainly associated with tasks that involve installing and monitoring Hadoop clusters.
- A Hadoop administrator is responsible for the implementation and ongoing administration of Hadoop infrastructure.
- The role requires coordinating with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments.
The typical responsibilities of a Hadoop admin include deploying and maintaining a Hadoop cluster, adding and removing nodes using cluster monitoring tools like Ganglia, Nagios, or Cloudera Manager, configuring NameNode high availability, and keeping track of all running Hadoop jobs. They also include implementing, managing, and administering the overall Hadoop infrastructure, taking care of the day-to-day running of Hadoop clusters, and working closely with the database, network, BI, and application teams to make sure that all big data applications are highly available and performing as expected.
- The role requires working with data delivery teams to set up new Hadoop users, including setting up Linux users, setting up Kerberos principals, and testing Hadoop Distributed File System (HDFS), Apache Hive, Pig, and MapReduce access for the new users.
- A Hadoop administrator is responsible for capacity planning (estimating the requirements for lowering or increasing the capacity of the cluster and deciding its size based on the data to be stored in HDFS), for resource and security management, and for performance tuning of Hadoop clusters and MapReduce routines.
Data Ingestion
Data ingestion gathers data and brings it into a data processing system where it can be stored, analyzed, and accessed.
- It often refers to integrating data from several different sources into a system for analytics.
- A data ingestion pipeline moves streaming data and batched data from pre-existing databases and data warehouses to a data lake.
- When data is ingested in real time, each data item is imported as it is emitted by the source.
- When data is ingested in batches, data items are imported in discrete chunks at periodic intervals of time.
- An effective data ingestion process begins by prioritizing data sources, validating individual files and routing data items to the correct destination.
- The process of data ingestion ‒ preparing data for analysis ‒ usually includes ETL.
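The contrast between real-time and batch ingestion described above can be sketched in a few lines of Python. This is a conceptual illustration only; the generator standing in for a source, the in-memory list standing in for a data lake, and the batch size of 4 are all assumptions made for the example.

```python
from itertools import islice

def event_source():
    """Simulate a source that emits one data item at a time."""
    for i in range(10):
        yield {"id": i, "value": i * 2}

def ingest_streaming(source, sink):
    """Real-time ingestion: import each item as the source emits it."""
    for item in source:
        sink.append(item)

def ingest_batch(source, sink, batch_size=4):
    """Batch ingestion: import items in discrete chunks at intervals."""
    source = iter(source)
    while chunk := list(islice(source, batch_size)):
        sink.extend(chunk)  # one load per chunk

if __name__ == "__main__":
    streamed, batched = [], []
    ingest_streaming(event_source(), streamed)
    ingest_batch(event_source(), batched)
    print(len(streamed), "items streamed;", len(batched), "items loaded in batches")
```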
Data Integration
Data integration is the process of combining data residing in different sources to provide a unified view; that is, it is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information.
- Data integration becomes increasingly important in cases of merging systems of two companies or consolidating applications within one company to provide a unified view of the company's data assets.
- Data integration is often a prerequisite to other processes including analysis, reporting, and forecasting.
- ETL is a type of data integration.
Data Migration
Data migration consists of transferring data from a source system to a destination system without affecting operations.
- The migration process includes all tasks related to extracting data from the source system, cleansing, transforming, and validating it, and finally loading it into the destination system.
Data migration is typically the most overlooked component of a project.
- It is commonly, but incorrectly, assumed to be simple and small in scale, and those responsible for the task frequently fail to initiate it early enough in the project, leading to missed deadlines and budget overruns.
- Even worse, the new system could begin operation with invalid data from the old system or no data at all.
- How data is collected and stored in the old system and how it is collected and stored in the new system determines whether the data migration process will be straightforward or difficult.
- Data that eventually exceeds the required time horizon may be archived, purged, or summarized.
Data Uploading
Uploading data involves transmitting data from one computer system to another, typically to make it available to the receiving system after a system upgrade, integration with another system, or migration from another system.
- Common methods of uploading include operating system utilities, File Transfer Protocol, or ETL utilities.
- Uploading data often requires data conversion to transform data from its original format to the format required by the new system.
- How data is collected and stored in the old system and how it is collected and stored in the new system determines the complexity of the data upload process.
ETL
ETL is a type of data integration that refers to the three steps (extract, transform, load) used to blend data from multiple sources into structured, organized datasets. This flow is also known as a data pipeline and serves as the foundation of many Business Intelligence solutions.
- During this process, data is taken (extracted) from a source system, converted (transformed) into a format that can be analyzed, and stored (loaded) into a data warehouse or other system.
The extract stage determines different data sources, the refresh rate (velocity) of each source, and the priorities (extract order) between them – all of which heavily impact time-to-insights.
- After extracting the data into an ETL environment, transformations bring clarity and order to the initial data swamp.
- For example, dates are consolidated into specified time buckets, strings are parsed to their business meanings, transactions are modeled into events, location data is translated to coordinates, zip codes, or cities/countries, measures are summed up, averaged, or rounded, and useless data and errors are set aside for later inspection.
- The 'transform' part of ETL may require the development of a specialized ETL script to transform the data, but before the script can be run the data must be analyzed and cleaned to ensure that the ETL can execute without failure.
Controlling the ETL process flow can be accomplished by selecting a platform that automates and provides visibility into the data flow end to end.
- One such tool is a distributed job scheduler, which offers both the reach and control necessary to manage the ETL process, as well as the input, output, and notification functions associated with a complete data flow.
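A minimal end-to-end ETL sketch in Python is shown below. The CSV source file, its column names, and the SQLite database standing in for a data warehouse are assumptions chosen to keep the example self-contained; a production pipeline would typically run under an ETL platform or scheduler, as described above.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from the source file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: parse types, normalize strings, drop bad records."""
    for row in rows:
        try:
            yield (row["order_id"].strip(),
                   row["city"].strip().title(),
                   float(row["amount"]))
        except (KeyError, ValueError):
            continue  # in a real pipeline, set errors aside for later inspection

def load(records, conn):
    """Load: write the transformed records into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

if __name__ == "__main__":
    # "orders.csv" and "warehouse.db" are illustrative file names.
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("orders.csv")), conn)
```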
Data Extraction
Data extraction is the process of obtaining data from a source for further data processing, storage, or analysis elsewhere.
- Data can be replicated to a destination ‒ such as a data warehouse ‒ designed to support online analytical processing (OLAP).
- Virtually all data extraction is performed to archive the data for secure long-term storage, for use within a new context, or in order to prepare it for later stage analysis.
- Many organizations extract data related to routine tasks and processes, such as payroll and benefits, to better understand outcomes and improve operational efficiency.
- Data extraction is the first step in ETL.
Data Transformation
Data transformation is the process of translating data from one format to another.
The goal of data transformation is to prevent data loss or corruption by maintaining the integrity of the data and embedded structures.
- It involves extracting data from the source, transforming it and loading the data to the target system.
- Data transformation is a critical step in the process of data integration that enables the data to be read, altered, and executed in an application or database other than that in which it was created.
- The transformation may require processing with specialized conversion software, or it may involve transitioning through intermediary stages or complex export and import procedures that convert from one file format to another.
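As a small illustration of format translation, the following Python sketch converts rows in a CSV file to typed JSON records. The input file and field names are assumptions made for the example.

```python
import csv
import json

def csv_to_json(csv_path, json_path):
    """Translate CSV rows into typed JSON records."""
    with open(csv_path, newline="") as src:
        records = [
            {"id": int(row["id"]),
             "name": row["name"].strip(),
             "price": float(row["price"])}
            for row in csv.DictReader(src)
        ]
    with open(json_path, "w") as dst:
        json.dump(records, dst, indent=2)

if __name__ == "__main__":
    csv_to_json("products.csv", "products.json")  # illustrative file names
```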
Data Loading
Data loading refers to the "load" component of ETL.
- After data is retrieved and combined from multiple sources (extracted), cleaned and formatted (transformed), it is then loaded into a storage system, such as a cloud data warehouse.
- ETL aids in the data integration process that standardizes diverse and disparate data types to make it available for querying, manipulation, or reporting for many different individuals and teams.
- To schedule data loads, data load definitions are created and then run with an external scheduler designed to automate many tasks.
Data Cleansing
Data cleansing is the process of ensuring that data is correct, consistent, and usable by identifying any errors or corruptions in the data and then correcting them, deleting them, or manually processing them as needed to prevent the errors from recurring.
- Data cleansing occupies a vital role in the ETL process, helping to ensure that information is consistent, accurate, and high-quality.
- After cleansing, a data set should be consistent with other similar data sets in the system.
- The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.
The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities.
- The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).
- Some data cleansing solutions will clean data by cross-checking with a validated data set.
- A common data cleansing practice is data enhancement, where data is made more complete by adding related information.
- Cleansing may involve removing irrelevant data, i.e., data that is not actually needed and does not fit the context of the problem being solved, as well as duplicated data, i.e., data points that are repeated in a dataset.
- Cleansing may also require type conversions to ensure that numbers are stored as numerical data types, dates as date objects or Unix timestamps, and so on.
- In addition, syntax errors must be dealt with, including removing extraneous white space at the beginning or end of a string, padding strings with spaces or other characters to a certain width, putting data in a standardized format, and transforming data so that it fits within a specific scale (several of these steps are illustrated in the sketch after this list).
- Data cleansing is the most important aspect of data quality management.
- Data hygiene is a collective set of processes conducted to ensure the cleanliness of data.
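The following sketch shows a few of the cleansing steps listed above (deduplication, whitespace trimming, standardization, and type conversion) using the pandas library. The column names, sample data, and cleansing rules are assumptions for illustration; real rules come from the data dictionary and business requirements.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                         # remove repeated data points
    df["name"] = df["name"].str.strip().str.title()   # trim whitespace, standardize case
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")       # type conversion
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df = df.dropna(subset=["amount", "order_date"])   # set aside unusable rows
    return df

if __name__ == "__main__":
    raw = pd.DataFrame({
        "name": ["  alice ", "BOB", "BOB"],
        "amount": ["10.5", "oops", "7"],
        "order_date": ["2023-01-05", "2023-01-06", "2023-01-06"],
    })
    print(cleanse(raw))
```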
Data Validation
Data validation is a method for checking the accuracy and quality of data, typically performed prior to importing and processing.
- Data validation ensures that data is complete (no blank or null values), unique (contains distinct values that are not duplicated), and the range of values is consistent with what is expected.
- Often, data validation is used as a part of ETL, where data is moved from a source database to a target data warehouse so that it can be integrated with other data for analysis.
- Data validation helps ensure that when analysis is performed, the results are accurate.
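A small sketch of the completeness, uniqueness, and range checks described above, written in plain Python, appears below. The records, field names, and allowed range are assumptions made for the example.

```python
def validate(records, key="id", field="age", lo=0, hi=120):
    """Return a list of validation errors for completeness, uniqueness, and range."""
    errors = []
    seen = set()
    for i, rec in enumerate(records):
        if rec.get(field) in (None, ""):              # completeness
            errors.append(f"row {i}: missing {field}")
            continue
        if rec[key] in seen:                          # uniqueness
            errors.append(f"row {i}: duplicate {key} {rec[key]}")
        seen.add(rec[key])
        if not (lo <= rec[field] <= hi):              # expected range
            errors.append(f"row {i}: {field} out of range")
    return errors

if __name__ == "__main__":
    rows = [{"id": 1, "age": 34}, {"id": 1, "age": 200}, {"id": 2, "age": None}]
    print(validate(rows))
```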
Data Integrity
Data integrity refers to the maintenance and assurance of the accuracy and consistency of data over its entire life cycle, which means protecting data against improper maintenance, modification, or alteration as well as ensuring data authenticity.
- Data integrity involves the accuracy of information, including its authenticity and trustworthiness.
- There are two types of data integrity – physical integrity and logical integrity – and both are a collection of processes and methods that enforce data integrity in databases and data warehouses.
- The overall intent of any data integrity technique is to ensure that data is recorded exactly as intended and upon later retrieval, ensure the data is the same as it was when it was originally recorded.
- In short, data integrity aims to prevent unintentional changes to information.
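One common integrity technique, storing a checksum when data is recorded and verifying it on later retrieval, can be sketched as follows. The in-memory store and sample payload are assumptions for illustration.

```python
import hashlib

def record(store, key, payload: bytes):
    """Record the data together with a checksum of its contents."""
    store[key] = (payload, hashlib.sha256(payload).hexdigest())

def retrieve(store, key) -> bytes:
    """Verify on retrieval that the data is the same as when it was recorded."""
    payload, checksum = store[key]
    if hashlib.sha256(payload).hexdigest() != checksum:
        raise ValueError(f"integrity check failed for {key}")
    return payload

if __name__ == "__main__":
    store = {}
    record(store, "invoice-42", b"amount=199.00")
    print(retrieve(store, "invoice-42"))  # passes the integrity check
```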
Data Profiling
Data profiling involves examining, analyzing and reviewing data to gather statistics surrounding the quality and hygiene of the dataset.
- The first step of data profiling is gathering one or more data sources and their metadata for analysis.
- The data is then cleaned to unify structure, eliminate duplications, identify interrelationships and find any anomalies.
- Once the data is clean, different data profiling tools will return various statistics to describe the dataset.
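A minimal profiling sketch using pandas is shown below. The sample data is an assumption; the point is the kind of statistics (null counts, distinct counts, duplicates, summary statistics) that a profiling pass produces.

```python
import pandas as pd

# Profile a small dataset: completeness, cardinality, duplicates, and
# summary statistics. The sample data is made up for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "country": ["US", "US", "CA", None, "CA"],
})

profile = pd.DataFrame({
    "nulls": df.isna().sum(),     # completeness per column
    "distinct": df.nunique(),     # cardinality per column
})
profile["duplicate_rows"] = len(df) - len(df.drop_duplicates())

print(profile)
print(df.describe(include="all"))  # summary statistics per column
```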
Data Analytics Deployment
Data analytics deployment ensures that the final solution is ready to be used within the operational environment and that end users have all the required tools to act upon the analytical insights.
- The goal is to deliver actionable insights to end users and integrate this intelligence into internal organizational processes.
- Typically, deployment of data analytics systems includes all operations to generate reports and recommendations for end users, visualization of key findings, self-service and data discovery functionalities for business users, and implementation of workflows that integrate analytical outputs with custom, operational and core systems.
Data Visualization
Data visualization is the presentation of data in a pictorial or graphical format to help people understand the significance of the data.
- Data visualization has become the de facto standard for modern data analytics and business intelligence.
- Many design tools can extract data from different data sources and allow it to be analyzed and visualized without additional software, and such tools typically let developers build insightful visualizations.
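As a simple illustration, the following sketch uses matplotlib to produce a bar chart. The sales figures are made-up sample values used only to show the mechanics of building a basic visualization.

```python
import matplotlib.pyplot as plt

# Summarize made-up sales figures by region as a bar chart.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 143, 87]

plt.figure(figsize=(6, 4))
plt.bar(regions, sales)
plt.title("Quarterly Sales by Region")
plt.xlabel("Region")
plt.ylabel("Sales (units)")
plt.tight_layout()
plt.savefig("sales_by_region.png")  # or plt.show() in an interactive session
```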
Section 3: The Realm of Database
Database Testing
Database testing is the process of validating that the metadata (structure) and data stored in the database meet the requirements and design, and it is a critical component of quality assurance.
- Database testing is important because it helps identify data quality and application performance issues that might otherwise get detected only after the application has been live for some time.
- Database testing requires scrutinizing the following areas: testing database objects, including tables, views, triggers, and stored procedures; validating data being entered and stored in the database; ensuring the system adheres to its constraints and that data modifications (insert/delete/update) reflect those constraints correctly; ensuring the system can execute end-to-end database transactions and support concurrency; verifying the performance of database indices, triggers, and procedures; and performing security compliance testing to rule out any unauthorized access or threats to data.
- Database testing also involves testing the database schema, tables, transactions, triggers, stored procedures, data validity, data integrity, consistency, and field constraints.
- It may involve creating complex queries to load/stress test the database and check its responsiveness.
- Often, web software accesses the database from different backends, i.e., it accesses heterogeneous databases.
- Database testing requires tester expertise in checking tables, writing queries, and writing procedures.
- Testing can be performed on web or desktop applications and will involve database management systems like SQL Server or Oracle.
- Database systems typically consist of multiple layers: the user interface (UI) layer, the business layer, the data access layer and the database itself, which requires testing at these different layers to ensure a consistent database system.
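A small database testing sketch is shown below, using Python's unittest module and an in-memory SQLite database. The schema and constraints are assumptions chosen to illustrate constraint and data-validity checks; real test suites target the project's own DBMS and schema.

```python
import sqlite3
import unittest

class OrderTableTests(unittest.TestCase):
    def setUp(self):
        # Fresh in-memory database with an illustrative schema for every test.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("PRAGMA foreign_keys = ON")
        self.conn.executescript("""
            CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
            CREATE TABLE orders (
                id INTEGER PRIMARY KEY,
                customer_id INTEGER NOT NULL REFERENCES customers(id),
                amount REAL CHECK (amount > 0)
            );
            INSERT INTO customers VALUES (1, 'Alice');
        """)

    def test_check_constraint_rejects_negative_amount(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO orders VALUES (1, 1, -5.0)")

    def test_foreign_key_rejects_unknown_customer(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO orders VALUES (2, 99, 10.0)")

if __name__ == "__main__":
    unittest.main()
```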
Database Deployment
Database deployment includes all of the steps, processes, and activities that are required to make a database or update available to its intended users.
System deployment is typically not complete unless the database objects that support the application are also deployed.
- These objects can be deployed by creating scripts that create both the database objects and any necessary data, such as seed data for lookup tables.
- The database objects include tables, views, functions, packages, and others created to implement the application logic.
- Database deployment needs a test environment, for the initial deployment, for thorough testing of the application before it is deployed in any other environment, and perhaps also for training of application users.
- The production environment contains the actual data and database objects for the normal operation of the organization.
- Any objects in the test environment should be tested before moving them into the production environment.
- Data must be persisted before, during and after the deployment.
- Automated deployment tools provide control over database development by making the deployment process repeatable, reliable, and consistent; they remove or reduce human intervention and increase the speed of response to change.
Database deployment requires advanced knowledge of database concepts since the dependencies between the database objects must be determined and thoroughly understood in advance.
- Database objects must be created in the correct order, so that whenever an object depends on other objects, the objects it depends on already exist (the deployment sketch after this list illustrates this ordering).
- If a dependent object is missing, serious errors or problems will occur.
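The dependency ordering described above can be sketched with a small deployment script. The script file names, their ordering, and the SQLite target are assumptions for illustration; real deployments typically rely on dedicated migration or deployment tooling.

```python
import sqlite3
from pathlib import Path

# Apply database object scripts in an explicit dependency order so that
# every object a later script relies on already exists.
DEPLOY_ORDER = [
    "01_tables.sql",     # base tables first
    "02_views.sql",      # views depend on tables
    "03_seed_data.sql",  # seed data for lookup tables last
]

def deploy(script_dir: str, db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        for name in DEPLOY_ORDER:
            sql = Path(script_dir, name).read_text()
            conn.executescript(sql)  # fails fast if a required object is missing
            print(f"applied {name}")
    finally:
        conn.close()

if __name__ == "__main__":
    deploy("scripts", "app.db")  # illustrative directory and database names
```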
Database Monitoring
Database monitoring involves tracking database performance and resources in order to create and maintain high performance and a highly available application infrastructure.
- Database monitoring may observe query details, session details like current user connections and locks, scheduled jobs, replication details, and database performance including buffer, cache, connection, lock, and latch.
- Data from each of these categories is collected and analyzed in order to minimize, or ideally prevent, database outages or slowdowns.
- The selection of the data points and how they are analyzed will vary based on the type of database.
- Database monitoring is a critical part of any application’s maintenance.
- Finding database issues in time can help the application remain healthy and accessible.
- Database monitoring requires advanced knowledge of database concepts in order to select which data points to monitor and how to analyze and assess their behavior. This is a critical skill, since monitoring that provides fast, accurate problem resolution helps IT troubleshoot problems before they affect end users.
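A very small monitoring sketch is shown below: it periodically runs a lightweight probe query, records its latency, and flags slow responses. The database file, probe query, threshold, and polling interval are assumptions for illustration; production monitoring relies on the DBMS's own statistics and dedicated monitoring tools.

```python
import sqlite3
import time

def probe(db_path: str, threshold_ms: float = 100.0) -> None:
    """Run a lightweight probe query and report its latency."""
    conn = sqlite3.connect(db_path)
    start = time.perf_counter()
    conn.execute("SELECT COUNT(*) FROM orders").fetchone()  # assumes an orders table exists
    elapsed_ms = (time.perf_counter() - start) * 1000
    conn.close()
    status = "OK" if elapsed_ms < threshold_ms else "SLOW"
    print(f"{time.strftime('%H:%M:%S')} probe={elapsed_ms:.1f}ms status={status}")

if __name__ == "__main__":
    for _ in range(3):      # a real monitor would run continuously
        probe("app.db")     # illustrative database file
        time.sleep(5)
```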
Database Performance Tuning
Database performance tuning encompasses steps to optimize performance with the goal of maximizing the use of system resources for greater efficiency.
- Fine-tuning certain database elements such as index use, query structure, data models, system configuration (e.g., hardware and OS settings), and application design can significantly impact the overall performance of applications.
- Database tuning is essential for organizing database data and accessing it efficiently.
- Database performance tuning may involve database optimization, indexing to tune the database, and taking other steps to make data retrieval easier and query responses timelier and more comprehensive.
- It may also involve query optimization, which is the process of determining the most efficient means of executing SQL statements.
- Database tuning requires a significant amount of expertise in order to develop execution plans and re-write improved SQL.
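The effect of indexing can be sketched by comparing query plans before and after an index is added. The example below uses SQLite's EXPLAIN QUERY PLAN facility and made-up data; other database systems expose similar EXPLAIN features.

```python
import sqlite3

# Compare the query plan for the same query before and after adding an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (city, amount) VALUES (?, ?)",
                 [(f"City{i % 50}", float(i)) for i in range(1000)])

query = "SELECT SUM(amount) FROM orders WHERE city = 'City7'"

print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
# Without an index, the plan typically reports a full scan of the orders table.

conn.execute("CREATE INDEX idx_orders_city ON orders (city)")

print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
# With the index, the plan typically reports a search using idx_orders_city.
```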
Section 4: Closing
The most enchanting aspect of a career in Information Systems or Computer Science is the opportunity to continue learning throughout your life, as no field changes more quickly or inexorably.
The most daunting aspect of a career in Information Systems or Computer Science is the responsibility and the requirement to learn and adapt as these changes take place.