Wednesday, July 31, 2019

Learn Data Science


A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets...

About the Author:
Zain is a consulting manager for decision science, data science, data engineering, machine learning, robotics, artificial intelligence, computational analytics and business intelligence.


Acknowledgments:
To Denise: I am fortunate enough to have created a way of life I love . . . But you have given me the courage and determination to live it! Thanks for the time and patience to complete the book and numerous other mad projects. To Laurence: Thank you for all the knowledge shared on accounting and finance. To Chris: thank you. Your wisdom and insight made this great! Best of luck with your future. To the staff at Apress: your skills transformed an idea into a book. Well done!


Introduction:
People now talk about data lakes daily. I consult regularly with organizations on how to develop data lake and data science strategies that serve their ever-changing business strategies. This requirement for agile and cost-effective information management is high on the priority list of senior managers worldwide. Many as-yet-unknown insights are captured and stored in the massive pool of unprocessed data held by the enterprise. These data lakes have major implications for the future of the business world.
It is projected that data scientists worldwide will together have to handle 40 zettabytes of data by 2020, a 300-fold increase since 2005. Numerous data sources must still be converted into actionable business knowledge, and doing so will safeguard the future of any business that achieves it. The world’s data producers are generating two-and-a-half quintillion bytes of new data every day, and the addition of the internet of things will push this volume substantially higher. Data scientists and engineers are falling behind on an immense responsibility.
By reading this introduction, you have already shown that you are an innovative person who wants to understand this advanced data structure that everyone now wants to tame. To tame your data lake, you will require practical data science, and I propose to teach you how. I am familiar with the skills it takes to achieve this goal, and my sole purpose is to guide you as you learn, expand, and master the practical guidance in this blog. You will understand what is in your business’s data lake and how to apply data science to it. Think of the process as comparable to a natural lake: it is vital to apply a sequence of proficient techniques to the lake to obtain pure water in your glass.
Do not stress: by the end of this blog, you will have shared in more than 9 years of working experience with data and with extracting actionable business knowledge. I will share with you the experience I gained in working with data on an international scale.

Data Science:
In 1960, Peter Naur started using the term data science as a substitute for computer science. He stated that to work with data, you require more than just computer science. I agree with his declaration. Data science is an interdisciplinary science that incorporates practices and methods for extracting actionable knowledge and insights from data in heterogeneous schemas (structured, semi-structured, or unstructured). It amalgamates the scientific fields of data exploration with thought-provoking research fields such as data engineering, information science, computer science, statistics, artificial intelligence, machine learning, data mining, and predictive analytics. For my part, I have gained several valuable insights while enthusiastically researching the future use of data science and translating multiple data lakes. I will explain these with end-to-end examples and share my insights on data lakes. This book explains vital elements from these sciences that you will use to process your data lake into actionable knowledge. I will guide you through a series of recognized science procedures for data lakes. These core skills are a key set of assets to perfect as you begin your encounters with data science.

Data Analytics:

Data analytics is the science of fact-finding analysis of raw data, with the goal of drawing conclusions from the data lake. Data analytics is driven by certified algorithms to statistically define associations between data that produce insights.

When you are asked to justify an insight, the best answer is to point to a certified and recognized algorithm that you have used. Associate the algorithm with your business terminology to achieve success with your projects.
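As an illustration of statistically defining an association between two data series, here is a minimal sketch of the Pearson correlation coefficient, a widely recognized measure. The function name and the spend and sales figures are hypothetical and chosen only for this example; the text above does not prescribe a particular algorithm.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical figures, for illustration only.
marketing_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]
r = pearson_correlation(marketing_spend, sales)
print(f"correlation between spend and sales: {r:.3f}")
```

A value close to 1 or -1 indicates a strong linear association, and a value near 0 indicates little linear association. Correlation alone does not establish that one series causes the other.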


Machine Learning: 

The business world is buzzing with activities and ideas about machine learning and its application to numerous business environments. Machine learning is the capability of systems to learn without being explicitly programmed. It evolved from the study of pattern recognition and computational learning theory. The impact is that, with the appropriate processing and skills, you can augment your own data capabilities. Training enables a processing environment to complete several magnitudes of discoveries in the time it takes to have a cup of coffee.

Note: Work smarter, not harder! Offload your data science to machines. They are faster and more consistent in processing your data lakes.

This skill is an essential part of achieving major gains in shortening the data-to-knowledge cycle. This blog will cover the essential practical ground rules later.
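To make "learning without being explicitly programmed" concrete, here is a minimal sketch in which the relationship between inputs and outputs is estimated from examples rather than written by hand. It uses ordinary least squares for a straight line, one of the simplest learning methods; the function names and the training data are hypothetical.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example pairs using ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) examples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training examples. The rule behind them (y = 2x + 1)
# is never coded anywhere; the model recovers it from the data.
hours = [1, 2, 3, 4, 5]
units_processed = [3, 5, 7, 9, 11]
model = fit_line(hours, units_processed)
forecast = predict(model, 6)
```

The same fit-then-predict pattern carries over to richer models; only the fitting step changes.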

Data Mining:

Data mining is the processing of data to isolate patterns and establish relationships between data entities within the data lake. For data mining to be successful, you must know a small number of critical data-mining theories about data patterns. In later chapters, I will expand on how you can mine your data for insights. This will help you to discover new actionable knowledge.
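One simple way to isolate a pattern is to count how often pairs of items appear together, which is the first step of classic frequent-itemset mining. The sketch below is an illustration only: the function name, the `min_support` threshold, and the basket data are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    pair_counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair is counted once per transaction
        # and always appears in the same order.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items() if count >= min_support}


# Hypothetical shopping baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Pairs that clear the threshold are candidate relationships between entities; whether they matter is a business question that still has to be checked.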

Statistics:

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. Statistics deals with all aspects of data, including the planning of data collection, in terms of the design of surveys and experiments. Data science and statistics are closely related. I will show you how to run through a series of statistical models covering data collection, population, and samples to enhance your data science deliveries.
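The relationship between a population and a sample can be sketched with Python's standard library. The population below is synthetic (generated from a normal distribution with an assumed mean of 100 and standard deviation of 15), and the seed and sample size are arbitrary choices for the illustration.

```python
import random
import statistics

# A seeded generator keeps the illustration repeatable.
rng = random.Random(42)

# Synthetic population: 10,000 values around an assumed mean of 100.
population = [rng.gauss(100, 15) for _ in range(10_000)]

# Draw a simple random sample without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
sample_stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

# The standard error estimates how far the sample mean typically
# falls from the population mean.
standard_error = sample_stdev / len(sample) ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})")
```

In practice you rarely have the whole population, which is why the sample mean and its standard error are used to reason about it.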

Algorithms:

An algorithm is a self-contained step-by-step set of processes to achieve a specific outcome. Algorithms execute calculations, data processing, or automated reasoning tasks with repeatable outcomes. Algorithms are the backbone of the data science process. You should assemble a series of methods and procedures that will ease the complexity and processing of your specific data lake. I will discuss numerous algorithms and good practices for performing practical data science throughout the blog.
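As a small example of a self-contained, step-by-step procedure with a repeatable outcome, here is a sketch that computes a median. It is chosen only because its steps are easy to follow; it is not one of the specific algorithms discussed later.

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    if not values:
        raise ValueError("median of an empty collection is undefined")
    # Step 1: put the values in order.
    ordered = sorted(values)
    # Step 2: locate the middle position.
    middle = len(ordered) // 2
    # Step 3: odd count -> the middle value; even count -> mean of the two middle values.
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# The same input always produces the same output, regardless of input order.
first_run = median([5, 1, 3, 4])
second_run = median([4, 3, 1, 5])
```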

Data Visualization:

Data visualization is your key communication channel with the business. It consists of the creation and study of the visual representation of business insights. Data science’s principal deliverable is visualization. You will have to take your highly technical results and transform them into a format that you can show to non-specialists.
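Even without a plotting library, the idea of turning numbers into a form non-specialists can read can be sketched as a text bar chart. The function name, the bar width, and the regional revenue figures are hypothetical; a real deliverable would normally use a dedicated charting tool.

```python
def text_bar_chart(values, width=20):
    """Render a mapping of label -> positive number as a horizontal text bar chart."""
    largest = max(values.values())
    if largest <= 0:
        raise ValueError("at least one value must be positive")
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        # Scale each bar relative to the largest value.
        bar = "#" * round(value / largest * width)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical figures, for illustration only.
revenue_by_region = {"North": 120, "South": 80, "East": 40}
chart = text_bar_chart(revenue_by_region)
print(chart)
```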

Storytelling:

 Data storytelling is the process of translating data analyses into layperson’s terms, in order to influence a business decision or action. You can have the finest data science, but without the business story to translate your findings into business-relevant actions, you will not succeed. I will provide details and practical insights into what to check for to ensure that you have the proper story and actions.

What Next?

I will demonstrate, using the core knowledge of the underlying science, how you can make a competent start on transforming your data lake into actionable knowledge. The sole requirement is to understand the data science of your own data lake. Start rapidly to discover what data science reveals about your business. You are the master of your own data lake. You will have to build familiarity with the data lake and what is flowing into the structure. My advice is to apply data science to smaller-scale activities first, to gain insights from the data lake.

CHAPTER 1:

Data Science Technology Stack

The Data Science Technology Stack covers the data processing requirements in the Rapid Information Factory ecosystem. Throughout the book, I will discuss the stack as the guiding pattern. In this chapter, I will help you to recognize the basics of data science tools and their influence on modern data lake development. You will discover the techniques for transforming a data vault into a data warehouse bus matrix. I will explain the use of Spark, Mesos, Akka, Cassandra, and Kafka to tame your data science requirements. I will guide you in the use of Elasticsearch and MQTT (MQ Telemetry Transport) to enhance your data science solutions. I will help you to recognize the influence of R as a creative visualization solution. I will also introduce the impact and influence on the data science ecosystem of programming languages such as R, Python, and Scala.



Rapid Information Factory Ecosystem:

 The Rapid Information Factory ecosystem is a convention of techniques I use for my individual processing developments. The processing route of the blog will be formulated on this basis, but you are not bound to use it exclusively. The tools I discuss in this chapter are available to you without constraint. The tools can be used in any configuration or permutation that is suitable to your specific ecosystem. I recommend that you begin to formulate an ecosystem of your own or simply adopt mine. As a prerequisite, you must become accustomed to a set of tools you know well and can deploy proficiently.

Note: Remember that your data lake will have its own properties and features, so adapt your tools to those particular characteristics.

Data Science Storage Tools:

 This data science ecosystem has a series of tools that you use to build your solutions. This environment is undergoing a rapid advancement in capabilities, and new developments are occurring every day. I will explain the tools I use in my daily work to perform practical data science. Next, I will discuss the following basic data methodologies.

Schema-on-Write and Schema-on-Read:

There are two basic methodologies that are supported by the data processing tools. Following is a brief outline of each methodology and its advantages and drawbacks.


Schema-on-Write Ecosystems:

A traditional relational database management system (RDBMS) requires a schema before you can load the data. To retrieve data from your structured data schemas, you may have been running standard SQL queries for a number of years. Benefits include the following:
• In traditional data ecosystems, tools assume schemas and can only work once the schema is described, so there is only one view on the data.
• The approach is extremely valuable in articulating relationships between data points, so there are already relationships configured.
• It is an efficient way to store “dense” data.
• All the data is in the same data store.
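Schema-on-write can be sketched with SQLite from Python's standard library: the schema is declared first, relationships are configured up front, and data that does not fit is rejected at load time. The table names, column names, and rows below are hypothetical and serve only as an illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# The schema must exist before any data can be loaded.
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE purchase ("
    " purchase_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"
    " amount REAL NOT NULL CHECK (amount >= 0))"
)

# Rows that match the schema load normally.
conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO purchase (purchase_id, customer_id, amount) VALUES (1, 1, 19.99)")

# A row that violates the schema (missing amount) is rejected on write.
try:
    conn.execute("INSERT INTO purchase (purchase_id, customer_id, amount) VALUES (2, 1, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Because relationships are part of the schema, a standard SQL join retrieves them.
rows = conn.execute(
    "SELECT c.name, p.amount"
    " FROM customer AS c JOIN purchase AS p ON p.customer_id = c.customer_id"
).fetchall()
conn.close()
```

The rejection at insert time is the defining trait: validation happens when data is written, so every reader sees the same single, consistent view.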












