Learn Data Science
A Guide to Building the Technology Stack for Turning Data Lakes into
Business Assets...
About the Author:
Zain is a consulting manager for
decision science, data science, data engineering, machine
learning, robotics, artificial intelligence, computational
analytics and business intelligence.
Acknowledgments:
To Denise: I am fortunate enough to have created a way of life I love . . . But you have
given me the courage and determination to live it! Thanks for the time and patience to
complete the book and numerous other mad projects.
To Laurence: Thank you for all the knowledge shared on accounting and finance.
To Chris: Thank you. Your wisdom and insight made this great! Best of luck with your future.
To the staff at Apress: your skills transformed an idea into a book. Well done!
Introduction:
People are talking about data lakes daily now. I consult on a regular basis with
organizations on how to develop their data lake and data science strategy to serve
their evolving and ever-changing business strategies. This requirement for agile and
cost-effective information management is high on the priority list of senior managers
worldwide.
Many still-undiscovered insights are captured and stored in a massive pool of
unprocessed data in the enterprise. These data lakes have major implications for
the future of the business world. It is projected that data scientists worldwide
will together have to handle 40 zettabytes of data by 2020, an increase of 300 times since 2005.
There are numerous data sources that still must be converted into actionable business
knowledge. Making this conversion will safeguard the future of any business that achieves it.
The world’s data producers are generating two-and-a-half quintillion bytes of
new data every day. The addition of the Internet of Things will cause this volume to be
substantially higher. Data scientists and engineers are falling behind on an immense
responsibility.
By reading this introduction, you have already shown that you are an innovative person
who wants to understand this advanced data structure that everyone now wants to tame.
To tame your data lake, you will require practical data science.
I propose to teach you how to tame this beast. I am familiar with the skills it takes to
achieve this goal. My sole purpose is to help you learn and grow as you master the
practical guidance in this blog. You will understand what is in your business’s data lake
and how to apply data science to it.
Think of the process as comparable to a natural lake. It is vital to establish a sequence
of proficient techniques with the lake, to obtain pure water in your glass.
Do not stress, as by the end of this blog, you will have shared in more than 9 years
of working experience with data and extracting actionable business knowledge. I will
share with you the experience I gained in working with data on an international scale.
Data Science:
In 1960, Peter Naur started using the term data science as a substitute for computer
science. He stated that to work with data, you require more than just computer science. I
agree with his declaration.
Data science is an interdisciplinary science that incorporates practices and
methods for extracting actionable knowledge and insights from data in heterogeneous schemas
(structured, semi-structured, or unstructured). It amalgamates the scientific fields
of data exploration with thought-provoking research fields such as data engineering,
information science, computer science, statistics, artificial intelligence, machine
learning, data mining, and predictive analytics.
For my part, in enthusiastically researching the future use of data science and
translating multiple data lakes into actionable knowledge, I have gained several valuable
insights. I will explain these with end-to-end examples and share my insights on data lakes.
This book explains
vital elements from these sciences that you will use to process your data lake into
actionable knowledge. I will guide you through a series of recognized science procedures
for data lakes. These core skills are a key set of assets to perfect as you begin your
encounters with data science.
Data Analytics:
Data analytics is the science of fact-finding analysis of raw data, with the goal of drawing
conclusions from the data lake. Data analytics is driven by certified algorithms to
statistically define associations between data that produce insights.
When you are asked to substantiate a result, the best answer is to point to a certified
and recognized algorithm that you have used. Associate the algorithm with your business
terminology to achieve success with your projects.
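As a minimal sketch of what pointing to a recognized algorithm can look like, the following computes the Pearson correlation coefficient, a standard statistical measure of linear association between two columns. The function name, column names, and values are made up for illustration and are not taken from any particular data lake.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: weekly advertising spend and units sold.
ad_spend = [10, 20, 30, 40, 50]
units_sold = [12, 25, 31, 48, 55]

r = pearson_correlation(ad_spend, units_sold)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 or -1 suggests a strong linear association, while a value near 0 suggests little linear association. In business terms, the same number could be reported as "sales move closely in step with advertising spend".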
Machine Learning:
The business world is buzzing with activities and ideas about machine learning and its
application to numerous business environments. Machine learning is the capability
of systems to learn without explicit software development. It evolved from the study of
pattern recognition and computational learning theory.
The impact is that, with the appropriate processing and skills, you can augment your
own data capabilities. Training enables a processing environment to complete orders of
magnitude more discoveries in the time it takes to have a cup of coffee.
Note: Work smarter, not harder! Offload your data science to machines. They are
faster and more consistent in processing your data lakes.
This skill is an essential part of achieving major gains in shortening the data-to-knowledge cycle. This blog will cover the essential practical ground rules later.
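To make the idea of learning without explicit software development concrete, here is a minimal sketch that assumes a simple linear relationship and made-up training examples. The rule itself (slope and intercept) is never written into the program. It is estimated from the examples by ordinary least squares, one of the simplest learning methods. The names fit_line and predict are illustrative only.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept from examples by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; a line cannot be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training examples that happen to follow y = 2x + 1.
train_x = [0, 1, 2, 3, 4]
train_y = [1, 3, 5, 7, 9]

slope, intercept = fit_line(train_x, train_y)


def predict(x):
    """Apply the learned rule to a value that was not in the training data."""
    return slope * x + intercept


print(slope, intercept, predict(10))
```

If the training examples change, the learned rule changes with them and no code has to be rewritten, which is the point of offloading the work to the machine.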
Data Mining:
Data mining is processing data to isolate patterns and establish relationships between
data entities within the data lake. For data mining to be successful, there is a small
number of critical data-mining theories that you must know about data patterns.
In later chapters, I will expand on how you can mine your data for insights. This will
help you to discover new actionable knowledge.
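As a small illustration of isolating a pattern and a relationship between data entities, the sketch below counts which items occur together across records. The purchase records, the function name frequent_pairs, and the min_support threshold are all hypothetical. This is only a simple co-occurrence count, not a full data-mining method.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support records."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) drops duplicates and gives every pair a stable order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical purchase records pulled from a data lake.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "jam"],
]

pairs = frequent_pairs(transactions, min_support=3)
print(pairs)
```

Here only bread and milk appear together in at least three records, so that pair is the relationship this sketch surfaces.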
Statistics:
Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data. Statistics deals with all aspects of data, including the planning of
data collection, in terms of the design of surveys and experiments.
Data science and statistics are closely related. I will show you how to run through a
series of statistical models covering data collection, population, and samples to enhance
your data science deliveries.
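The relationship between a population and a sample can be sketched as follows. The population here is a made-up list of 10,000 values, and the seed is fixed only so the run is repeatable. The sketch draws a simple random sample and compares its mean with the population mean.

```python
import math
import random
import statistics

# Hypothetical population: 10,000 order values (simply 1..10000 here).
population = list(range(1, 10001))
population_mean = statistics.mean(population)

# Simple random sample without replacement; fixed seed for repeatability.
rng = random.Random(42)
sample = rng.sample(population, 100)

sample_mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(sample size).
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"population mean = {population_mean}")
print(f"sample mean     = {sample_mean:.1f} (standard error {standard_error:.1f})")
```

The sample mean will usually land within a few standard errors of the population mean, which is why a well-drawn sample can stand in for a data set that is too large to process in full.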
Algorithms:
An algorithm is a self-contained step-by-step set of processes to achieve a specific
outcome. Algorithms execute calculations, data processing, or automated reasoning
tasks with repeatable outcomes.
Algorithms are the backbone of the data science process. You should assemble a
series of methods and procedures that will ease the complexity and processing of your
specific data lake.
I will discuss numerous algorithms and good practices for performing practical data
science throughout the blog.
Data Visualization:
Data visualization is your key communication channel with the business. It consists of
the creation and study of the visual representation of business insights. Data science’s
principal deliverable is visualization. You will have to take your highly technical results
and transform them into a format that you can show to non-specialists.
Storytelling:
Data storytelling is the process of translating data analyses into layperson’s terms, in
order to influence a business decision or action. You can have the finest data science, but
without the business story to translate your findings into business-relevant actions, you
will not succeed.
I will provide details and practical insights into what to check for to ensure that you
have the proper story and actions.
What Next?
I will demonstrate, using the core knowledge of the underlying science, how you can
make a competent start on transforming your data lake into
actionable knowledge. The sole requirement is to understand the data science of your
own data lake. Start rapidly to discover what data science reveals about your business.
You are the master of your own data lake.
You will have to build familiarity with the data lake and what is flowing into the
structure. My advice is to start by applying data science to smaller-scale activities to
gain insights from the data lake.
CHAPTER 1:
Data Science Technology Stack
The Data Science Technology Stack covers the data processing requirements in the
Rapid Information Factory ecosystem. Throughout the book, I will discuss the stack as
the guiding pattern.
In this chapter, I will help you to recognize the basics of data science tools and
their influence on modern data lake development. You will discover the techniques
for transforming a data vault into a data warehouse bus matrix. I will explain the use of
Spark, Mesos, Akka, Cassandra, and Kafka, to tame your data science requirements.
I will guide you in the use of Elasticsearch and MQTT (MQ Telemetry Transport), to
enhance your data science solutions. I will help you to recognize the influence of R as a
creative visualization solution. I will also introduce the impact and influence on the data
science ecosystem of such programming languages as R, Python, and Scala.
Rapid Information Factory Ecosystem:
The Rapid Information Factory ecosystem is a convention of techniques I use for
my individual processing developments. The processing route of the blog will be
formulated on this basis, but you are not bound to use it exclusively. The tools I discuss
in this chapter are available to you without constraint. The tools can be used in any
configuration or permutation that is suitable to your specific ecosystem.
I recommend that you begin to formulate an ecosystem of your own or simply adopt
mine. As a prerequisite, you must become accustomed to a set of tools you know well
and can deploy proficiently.
Note: Remember that your data lake will have its own properties and features, so
adapt your tools to those particular characteristics.
Data Science Storage Tools:
This data science ecosystem has a series of tools that you use to build your solutions.
This environment is undergoing a rapid advancement in capabilities, and new
developments are occurring every day.
I will explain the tools I use in my daily work to perform practical data science. Next,
I will discuss the following basic data methodologies.
Schema-on-Write and Schema-on-Read:
There are two basic methodologies that are supported by the data processing tools.
Following is a brief outline of each methodology and its advantages and drawbacks.
Schema-on-Write Ecosystems:
A traditional relational database management system (RDBMS) requires a schema
before you can load the data. To retrieve data from your structured data schemas, you may
have been running standard SQL queries for a number of years.
Benefits include the following:
• In traditional data ecosystems, tools assume schemas and can only
work once the schema is described, so there is only one view on the
data.
• The approach is extremely valuable in articulating relationships
between data points, so there are already relationships configured.
• It is an efficient way to store “dense” data.
• All the data is in the same data store.
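The schema-on-write behavior described above can be sketched with Python's built-in sqlite3 module and a throwaway in-memory database. The table name sales and its columns are hypothetical. The sketch shows that a load fails before the schema exists, that conforming rows load once it is described, and that a row breaking the schema is rejected at write time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Loading before a schema exists fails, because the table is unknown.
try:
    conn.execute("INSERT INTO sales VALUES (1, 'London', 100.0)")
    loaded_without_schema = True
except sqlite3.OperationalError:
    loaded_without_schema = False

# Describe the schema first (hypothetical table and column names).
conn.execute(
    """
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        city    TEXT NOT NULL,
        amount  REAL NOT NULL
    )
    """
)

# Rows that match the schema load without problems.
conn.executemany(
    "INSERT INTO sales (sale_id, city, amount) VALUES (?, ?, ?)",
    [(1, "London", 100.0), (2, "Paris", 250.5), (3, "London", 75.0)],
)

# A row that breaks the schema (missing city) is rejected at write time.
try:
    conn.execute(
        "INSERT INTO sales (sale_id, city, amount) VALUES (?, ?, ?)",
        (4, None, 10.0),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Standard SQL then works against the single, agreed view of the data.
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(loaded_without_schema, rejected, totals)
```

Because every write is checked against the schema, queries can rely on the structure being there, which is the single view of the data that the first benefit refers to.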