The Data Revolution
is in progress...
History
Not so long ago, most businesses ran on mainframe
computers. These computers were expensive to purchase and were typically
stored in corporate headquarters.
Internal staff had access to applications via a mainframe
terminal. Data was typically stored in VSAM files. Individual fields
were determined by the character position in the line of data.
COBOL programmers wrote code to pull the data by indicating the start and
stop positions for the fields requested. Reports were then sent to
a printer after hours through batch jobs. Business managers sifted
through reams of paper to find information. It was a slow, tedious
process that required skilled programmers with domain knowledge. The
information was not shared freely for the most part.
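To picture how that worked, here is a minimal Python sketch of pulling fields by character position; the record layout and field positions are made up for illustration.

# Hypothetical fixed-width record, similar to a line from a VSAM extract
record = "SMITH     ANN       0000012500"

# Fields are defined purely by start and stop positions in the line
last_name  = record[0:10].strip()      # columns 1-10
first_name = record[10:20].strip()     # columns 11-20
amount     = int(record[20:30]) / 100  # columns 21-30, implied two decimals

print(last_name, first_name, amount)   # SMITH ANN 125.0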
Relational Databases
The Relational Database was introduced to store volumes of data
in related tables connected through common keys, known as
Primary and Foreign Key relationships.
This allowed 4th generation programming languages
and reporting applications to query the database using a language called SQL.
The resulting data appeared on the screen in WYSIWYG (What
You See Is What You Get) format, and could be exported to a spreadsheet or PDF,
sent to a printer, or emailed to another user. This removed the need for specialized
programmers between the business users and the data.
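As a rough sketch of the idea, the example below uses Python's built-in sqlite3 module; the table and column names are hypothetical.

import sqlite3

# Two related tables connected through a Primary/Foreign Key relationship
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT)")
cur.execute("CREATE TABLE Sale (SaleId INTEGER PRIMARY KEY, CustomerId INTEGER REFERENCES Customer(CustomerId), Amount REAL)")
cur.execute("INSERT INTO Customer VALUES (1, 'Smith', 'Ann')")
cur.execute("INSERT INTO Sale VALUES (10, 1, 250.0)")

# SQL joins the tables on the common key and returns the result set
for row in cur.execute("SELECT c.LastName, s.Amount FROM Sale s JOIN Customer c ON c.CustomerId = s.CustomerId"):
    print(row)  # ('Smith', 250.0)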
Data Warehousing
Traditional reports pulled data from the live systems, locking
the data and causing performance issues in the underlying applications. The Data
Warehouse was introduced to solve this issue by implementing a
standard methodology for storing data.
A developer created a model, either a Star
Schema or a Snowflake
Schema, of the data through the use of Fact and Dimension
tables. Fact tables stored measured values
such as sums, averages, minimums, and maximums, while Dimension tables
contained descriptive attributes such as Customer, Location, Time, or Product.
So you could quickly slice the data to determine how many
sales occurred in a specific time frame, in a specific region, by a particular
salesperson, at a specific store, for a particular product. This process
pulled data from the source system, loaded it into a Staging Database, and finally moved it to the Data
Warehouse. As the data flowed through each phase of the Extract, Transform and Load (ETL)
process, the developer applied business logic, such as creating a “CustomerName”
field that concatenates LastName + “, ” + FirstName, to comply with corporate
standards and make data manipulation easier.
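A minimal sketch of that kind of transform step, assuming the pandas library and made-up staging columns, might look like this:

import pandas as pd

# Hypothetical data extracted from the source system into staging
staging = pd.DataFrame({
    "FirstName": ["Ann", "Bob"],
    "LastName": ["Smith", "Jones"],
    "SaleAmount": [250.0, 125.0],
})

# Transform step: apply the corporate naming standard before loading
staging["CustomerName"] = staging["LastName"] + ", " + staging["FirstName"]

print(staging[["CustomerName", "SaleAmount"]])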
Although the business now had a Single Version of the Truth, the costs of
building and maintaining the Data Warehouse were high. This resulted in storing
limited sets of data. Finding and retaining qualified programmers to
build and maintain it was a challenge, and adding new data sources was not easy
either.
Self Service Reporting
As technology matured, each company formed its own Information
Technology (IT) Department. But IT was still unable to satisfy the demands
of the business, which was not receiving accurate data in a timely
manner. As a result, business units enlisted internal staff or hired
consultants to build reports in silos, without letting the IT department know
about it. Piecing together bits of data from different locations, many
of these report writers did not follow best practices or adhere to company policies regarding
how data was stored and accessed. Vendors soon saw the demand
and created applications that allowed business users to pull their own reports
without programming skills, a practice
known as self-service reporting. Now any department with a corporate
credit card could access the company's data without assistance from the IT
department.
Big Data
Business users now had the ability to run real-time reports
against Data Warehouses and traditional relational databases through
self-service tools. However, they could only access data stored in database
format. There were still mountains of idle data scattered throughout each
organization, and these datasets went unused
primarily because they were "unstructured" or "semi-structured"
data.
Semi-structured data does not comply with
standard relational database formats, yet it has a certain degree of
predictability. In contrast, unstructured
data, such as email archives, has no predefined data model
at all. However, structure can be added to such data by using a
framework called Apache Hadoop.
Hadoop is
an open-source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from
commodity hardware.
Originally, developers wrote intricate Java code using a programming model called
MapReduce to parse the data in the Hadoop Distributed File System (HDFS) and conform it
to relational database formats.
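The core map/reduce idea itself is simple; here is a toy illustration in plain Python (not actual Hadoop MapReduce code), counting log lines by severity:

from collections import defaultdict

# Map phase: each input line is mapped to a (key, value) pair
lines = ["error disk full", "info job done", "error timeout"]
mapped = [(line.split()[0], 1) for line in lines]

# Shuffle/reduce phase: values are grouped by key and summed
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))  # {'error': 2, 'info': 1}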
Hadoop has evolved over time, allowing access through a SQL-like
query language called HiveQL, provided by Apache Hive. However, it has some
speed limitations because queries are translated into MapReduce code. Another
language, Apache Pig, allows developers to manipulate the data, perform
calculations and aggregations, and format data for reporting.
Over time, Hadoop has transformed into a suite of
mini-applications and is no longer dependent on skilled Java developers. Hadoop
is licensed as open-source software and is available to download for
free.
There are many advantages to Hadoop. It allows developers
to query large data sets, read structured and unstructured data, and mash
together different types of data. More recently, a new product
called Apache
Spark allows similar data manipulation and can run within
the Hadoop ecosystem or stand alone.
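As a rough sketch, assuming PySpark is installed and using a made-up file name and columns, a Spark job can read semi-structured data and aggregate it in just a few lines:

from pyspark.sql import SparkSession

# Start a local Spark session (Spark can also run on a Hadoop/YARN cluster)
spark = SparkSession.builder.appName("sales_demo").master("local[*]").getOrCreate()

# Read semi-structured JSON data; Spark infers a schema on the fly
sales = spark.read.json("sales.json")  # hypothetical file

# SQL-style aggregation, similar to what Hive provides on top of Hadoop
sales.groupBy("region").sum("amount").show()

spark.stop()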
Analytics
The industry soon realized that although data was
once "nice to have," it rose in status to "have to
have" because it became apparent that data could be used to drive business decisions.
Increase sales. Reduce costs. Streamline
processes. Find patterns in the data by converting it to
information. And then analyze the insights, and take action on the new
information. With the rise of open data sets, social media, reduced software
and hardware costs, and the availability of programmers, organizations are
leveraging this new technology to gain a competitive advantage.
Data Scientist
The new role of Data Scientist was labeled the sexiest job of the 21st century.
This is probably because a Data Scientist combines the skill sets of
Programmer, Statistician, and Business Analyst. Someone at the intersection of these
three fields can prepare data, apply algorithms, and translate insights into a
common language for consumption. Data Scientists understand the data, the
business, and statistics, and can crunch data in traditional
relational databases as well as in unstructured and semi-structured data sets.
Some of the algorithms are used to identify patterns and predict future
behavior. This is a highly sought-after position in almost
any industry.
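For a feel of what "predict future behavior" can mean in practice, here is a minimal sketch using scikit-learn; the features and labels are invented for illustration:

from sklearn.linear_model import LogisticRegression

# Hypothetical features: [purchases_last_month, days_since_last_visit]
X = [[5, 2], [0, 40], [3, 10], [1, 30], [6, 1], [0, 55]]
y = [1, 0, 1, 0, 1, 0]  # 1 = customer returned, 0 = customer churned

# Fit a simple model to the historical pattern
model = LogisticRegression()
model.fit(X, y)

# Predict the likely behavior of a new customer
print(model.predict([[2, 20]]))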
Artificial Intelligence
Artificial Intelligence first came about
in the 1950s, when people saw the need for computers to think. Although
the hardware of the time lacked the processing power, these early researchers
laid the foundation for future work. As the price of hardware and software
decreased over time, advancements in pattern recognition, data mining, and predictive
analytics started to gain pace. One of the main theoretical concepts used in
these areas is the Artificial Neural Network: basically a series of connected
nodes joined by weighted links, where nodes activate according to certain
criteria and in turn activate other nodes further downstream. Once trained,
a neural network can learn over time, remember things and events, and run
simulations projected into the future in order to better
predict what different outcomes will look like. These algorithms are
growing in both the public and private space because they can automate
many repetitive processes. Many organizations are investing in AI to streamline
processes and reduce overhead costs.
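A minimal sketch of that idea, using NumPy and arbitrary (untrained) weights, shows how activations flow from one layer of nodes to the next:

import numpy as np

def sigmoid(x):
    # Activation function: decides how strongly a node "fires"
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

x = np.array([0.5, 0.2, 0.9])   # input features
W1 = rng.normal(size=(3, 4))    # weighted links: input -> hidden nodes
W2 = rng.normal(size=(4, 1))    # weighted links: hidden -> output node

hidden = sigmoid(x @ W1)        # hidden nodes activate
output = sigmoid(hidden @ W2)   # downstream node activates
print(output)                   # a real network would learn W1 and W2 from data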
Summary
Each business or organization is now in the software business.
This is because every company runs software, which accumulates data
that can be harnessed and mashed with other data to provide useful
insights.
This has created an explosion
of data, fueling the current Data Revolution. Data
Scientists are able to extract knowledge from large volumes of
data, in both structured and unstructured data sets. Companies are now
able to extract personal insights about customers for marketing campaigns, via
sentiment analysis and algorithms that predict customer behavior. Being a
data-driven enterprise is becoming the standard.
About the author
Jonathan Bloom has been in the Data Science
space since 1995. Based in Safety Harbor, Florida, he specializes in the
Financial Services, Education, Hi-Tech, and Insurance industries.