International Series in
Operations Research & Management Science
Bhimasankaram Pochiraju
Sridhar Seshadri Editors
Essentials
of Business
Analytics
An Introduction to the Methodology
and its Applications
International Series in Operations Research
& Management Science
Volume 264
Series Editor
Camille C. Price
Stephen F. Austin State University, TX, USA
Associate Series Editor
Joe Zhu
Worcester Polytechnic Institute, MA, USA
Founding Series Editor
Frederick S. Hillier, Stanford University, CA, USA
More information about this series at http://www.springer.com/series/6161
Bhimasankaram Pochiraju • Sridhar Seshadri
Editors
Essentials of Business
Analytics
An Introduction to the Methodology
and its Applications
Editors
Bhimasankaram Pochiraju
Applied Statistics and Computing Lab
Indian School of Business
Hyderabad, Telangana, India
Sridhar Seshadri
Gies College of Business
University of Illinois at Urbana Champaign
Champaign, IL, USA
ISSN 0884-8289
ISSN 2214-7934 (electronic)
International Series in Operations Research & Management Science
ISBN 978-3-319-68836-7
ISBN 978-3-319-68837-4 (eBook)
https://doi.org/10.1007/978-3-319-68837-4
© Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Professor Bhimasankaram: With the divine
blessings of Bhagawan Sri Sri Sri Satya Sai
Baba, I dedicate this book to my parents—Sri
Pochiraju Rama Rao and Smt. Venkata
Ratnamma.
Sridhar Seshadri: I dedicate this book to
the memory of my parents, Smt. Ranganayaki
and Sri Desikachari Seshadri, my
father-in-law, Sri Kalyana Srinivasan
Ayodhyanath, and my dear friend,
collaborator and advisor, Professor
Bhimasankaram.
Contents

1 Introduction (Sridhar Seshadri)

Part I Tools

2 Data Collection (Sudhir Voleti)
3 Data Management—Relational Database Systems (RDBMS) (Hemanth Kumar Dasararaju and Peeyush Taori)
4 Big Data Management (Peeyush Taori and Hemanth Kumar Dasararaju)
5 Data Visualization (John F. Tripp)
6 Statistical Methods: Basic Inferences (Vishnuprasad Nagadevara)
7 Statistical Methods: Regression Analysis (Bhimasankaram Pochiraju and Hema Sri Sai Kollipara)
8 Advanced Regression Analysis (Vishnuprasad Nagadevara)
9 Text Analytics (Sudhir Voleti)

Part II Modeling Methods

10 Simulation (Sumit Kunnumkal)
11 Introduction to Optimization (Milind G. Sohoni)
12 Forecasting Analytics (Konstantinos I. Nikolopoulos and Dimitrios D. Thomakos)
13 Count Data Regression (Thriyambakam Krishnan)
14 Survival Analysis (Thriyambakam Krishnan)
15 Machine Learning (Unsupervised) (Shailesh Kumar)
16 Machine Learning (Supervised) (Shailesh Kumar)
17 Deep Learning (Manish Gupta)

Part III Applications

18 Retail Analytics (Ramandeep S. Randhawa)
19 Marketing Analytics (S. Arunachalam and Amalesh Sharma)
20 Financial Analytics (Krishnamurthy Vaidyanathan)
21 Social Media and Web Analytics (Vishnuprasad Nagadevara)
22 Healthcare Analytics (Maqbool (Mac) Dada and Chester Chambers)
23 Pricing Analytics (Kalyan Talluri and Sridhar Seshadri)
24 Supply Chain Analytics (Yao Zhao)
25 Case Study: Ideal Insurance (Deepak Agrawal and Soumithri Mamidipudi)
26 Case Study: AAA Airline (Deepak Agrawal, Hema Sri Sai Kollipara, and Soumithri Mamidipudi)
27 Case Study: InfoMedia Solutions (Deepak Agrawal, Soumithri Mamidipudi, and Sriram Padmanabhan)
28 Introduction to R (Peeyush Taori and Hemanth Kumar Dasararaju)
29 Introduction to Python (Peeyush Taori and Hemanth Kumar Dasararaju)
30 Probability and Statistics (Peeyush Taori, Soumithri Mamidipudi, and Deepak Agrawal)

Index
Disclaimer
This book contains information obtained from authentic and highly regarded
sources. Reasonable efforts have been made to publish reliable data and information,
but the author and publisher cannot assume responsibility for the validity of
all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been
obtained. If any copyright material has not been acknowledged, please write and let
us know so that we may rectify it in any future reprint.
Acknowledgements
This book is the outcome of a truly collaborative effort amongst many people who
have contributed in different ways. We are deeply thankful to all the contributing
authors for their ideas and support. The book belongs to them. This book would not
have been possible without the help of Deepak Agrawal. Deepak helped in every
way, from editorial work, solution support, and programming help to coordination with
authors and researchers, and much more. Soumithri Mamidipudi provided
editorial support, helped with writing summaries of every chapter, and proof-edited
the probability and statistics appendix and cases. Padmavati Sridhar provided editorial support for many chapters. Two associate alumni—Ramakrishna Vempati and
Suryanarayana Ambatipudi—of the Certificate Programme in Business Analytics
(CBA) at Indian School of Business (ISB) helped with locating contemporary
examples and references. They suggested examples for the Retail Analytics and
Supply Chain Analytics chapters. Ramakrishna also contributed to the draft of the
Big Data chapter. Several researchers in the Applied Statistics and Computing
Lab (ASC Lab) at ISB helped in many ways. Hema Sri Sai Kollipara provided
support for the cases, exercises, and technical and statistics support for various
chapters. Aditya Taori helped with examples for the machine learning chapters
and exercises. Saurabh Jugalkishor contributed examples for the machine learning
chapters. The ASC Lab’s researchers and Hemanth Kumar provided technical
support in preparing solutions for the various examples referred to in the chapters. Ashish
Khandelwal, a Fellow Program student at ISB, helped with the chapter on Linear
Regression. Dr. Kumar Eswaran and Joy Mustafi provided additional thoughts for
the Unsupervised Learning chapter. The editorial team, comprising Faith Su, Mathew
Amboy, and series editor Camille Price, gave immense support during the book
proposal stage and guidance during editing and production. The ASC Lab provided
the research support for this project.
We thank our families for their constant support during the two-year-long project.
We thank each and every person associated with us during the beautiful journey of
writing this book.
Contributors
Deepak Agrawal Indian School of Business, Hyderabad, Telangana, India
S. Arunachalam Indian School of Business, Hyderabad, Telangana, India
Chester Chambers Carey Business School, Johns Hopkins University, Baltimore,
MD, USA
Maqbool (Mac) Dada Carey Business School, Johns Hopkins University, Baltimore, MD, USA
Manish Gupta Microsoft Corporation, Hyderabad, India
Hema Sri Sai Kollipara Indian School of Business, Hyderabad, Telangana, India
Thriyambakam Krishnan Chennai Mathematical Institute, Chennai, India
Shailesh Kumar Reliance Jio, Navi Mumbai, Maharashtra, India
Hemanth Kumar Dasararaju Indian School of Business, Hyderabad, Telangana,
India
Sumit Kunnumkal Indian School of Business, Hyderabad, Telangana, India
Soumithri Mamidipudi Indian School of Business, Hyderabad, Telangana, India
Vishnuprasad Nagadevara IIM-Bangalore, Bengaluru, Karnataka, India
Konstantinos I. Nikolopoulos Bangor Business School, Bangor, Gwynedd, UK
Sriram Padmanabhan New York, NY, USA
Bhimasankaram Pochiraju Applied Statistics and Computing Lab, Indian School
of Business, Hyderabad, Telangana, India
Ramandeep S. Randhawa Marshall School of Business, University of Southern
California, Los Angeles, CA, USA
Sridhar Seshadri Gies College of Business, University of Illinois at Urbana
Champaign, Champaign, IL, USA
Amalesh Sharma Texas A&M University, College Station, TX, USA
Milind G. Sohoni Indian School of Business, Hyderabad, Telangana, India
Kalyan Talluri Imperial College Business School, South Kensington, London, UK
Peeyush Taori London Business School, London, UK
Dimitrios D. Thomakos University of Peloponnese, Tripoli, Greece
John F. Tripp Clemson University, Clemson, SC, USA
Krishnamurthy Vaidyanathan Indian School of Business, Hyderabad, Telangana, India
Sudhir Voleti Indian School of Business, Hyderabad, Telangana, India
Yao Zhao Rutgers University, Newark, NJ, USA
Chapter 1
Introduction
Sridhar Seshadri
Business analytics is the science of posing and answering data questions related to
business. Business analytics has rapidly expanded in the last few years to include
tools drawn from statistics, data management, data visualization, and machine learning. There is increasing emphasis on big data handling to assimilate the advances
made in data sciences. As is often the case with applied methodologies, business
analytics has to be soundly grounded in applications in various disciplines and
business verticals to be valuable. The bridge between the tools and the applications
is formed by the modeling methods used by managers and researchers in disciplines such as
finance, marketing, and operations. This book provides coverage of all three aspects:
tools, modeling methods, and applications.
The purpose of the book is threefold: to fill the void in graduate-level study material
for addressing business problems, to show how to pose data questions and obtain sound
business solutions via analytics theory, and to ground those solutions in practice.
In order to make the material self-contained, we have endeavored to provide ample
use of cases and data sets for practice and testing of tools. Each chapter comes
with data, examples, and exercises showing students what questions to ask, how to
apply the techniques using open source software, and how to interpret the results. In
our approach, simple examples are followed with medium to large applications and
solutions. The book can also serve as a self-study guide to professionals who wish
to enhance their knowledge about the field.
The distinctive features of the book are as follows:
• The chapters are written by experts from universities and industry.
• The major software packages used are R, Python, MS Excel, and MySQL. These are all
topical and widely used in industry.
• Extreme care has been taken to ensure continuity from one chapter to the next.
The editors have attempted to make sure that the content and flow are similar in
every chapter.
• In Part A of the book, the tools and modeling methodology are developed in
detail. Then this methodology is applied to solve business problems in various
verticals in Part B. Part C contains larger case studies.
• The Appendices cover required material on Probability theory, R, and Python, as
these serve as prerequisites for the main text.
The structure of each chapter is as follows:
• Each chapter has a business orientation. It starts with business problems, which
are transformed into technological problems. Methodology is developed to solve
the technological problems. Data analysis is done using suitable software and the
output and results are clearly explained at each stage of development. Finally, the
technological solution is transformed back to a business solution. The chapters
conclude with suggestions for further reading and a list of references.
• Exercises (with real data sets when applicable) are included at the end of each chapter and
on the Web to test and enhance the understanding of the concepts and their application.
• Caselets are used to illustrate the concepts in several chapters.
1 Detailed Description of Chapters
Data Collection: This chapter introduces the concepts of data collection and
problem formulation. Firstly, it establishes the foundation upon which the fields
of data sciences and analytics are based, and defines core concepts that will be used
throughout the rest of the book. The chapter starts by discussing the types of data
that can be gathered, and the common pitfalls that can occur when data analytics
does not take into account the nature of the data being used. It distinguishes between
primary and secondary data sources using examples, and provides a detailed
explanation of the advantages and constraints of each type of data. Following this,
the chapter details the types of data that can be collected and sorted. It discusses the
difference between nominal-, ordinal-, interval-, and ratio-based data and the ways
in which they can be used to obtain insights into the subject being studied.
The chapter then discusses problem formulation and its importance. It explains
how and why formulating a problem will impact the data that is gathered, and
thus affect the conclusions at which a research project may arrive. It describes
a framework by which a messy real-world situation can be clarified so that a
mathematical toolkit can be used to identify solutions. The chapter explains the
idea of decision-problems, which can be used to understand the real world, and
research-objectives, which can be used to analyze decision-problems.
The chapter also details the challenges faced when collecting and collating data.
It discusses the importance of understanding what data to collect, how to collect it,
how to assess its quality, and finally the most appropriate way of collating it so that
it does not lose its value.
The chapter ends with an illustrative example of how the retailing industry might
use various sources of data to serve its customers better and understand
their preferences.
Data Management—Relational Database Management Systems: This chapter
introduces the idea of data management and storage. The focus of the chapter
is on relational database management systems (RDBMS), the most commonly used
data organization systems in enterprises. The chapter introduces and
explains the ideas using MySQL, an open-source relational database system based on the
Structured Query Language (SQL) and used by many of the largest data management systems in the world.
The chapter describes the basic functions of a MySQL server, such as creating
databases, examining data tables, and performing functions and various operations
on data sets. The first set of instructions the chapter discusses is about the rules,
definition, and creation of relational databases. Then, the chapter describes how to
create tables and add data to them using MySQL server commands. It explains how
to examine the data present in the tables using the SELECT command.
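To make the flavor of these commands concrete, the sketch below creates a small table, inserts a few rows, and queries them with SELECT. It uses Python's built-in sqlite3 module as a lightweight stand-in for a MySQL server, and the table and column names are invented for illustration; the SQL statements themselves are of the same kind the chapter develops in MySQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and load a few rows of hypothetical sales data.
cur.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "A", 120.0), ("North", "B", 80.0), ("South", "A", 95.0)],
)

# Examine the data with SELECT, including a simple aggregation.
cur.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```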
Data Management—Big Data: This chapter builds on some of the concepts
introduced in the previous chapter but focuses on big data tools. It describes what
really constitutes big data and focuses on some of the big data tools. In this chapter,
the basics of big data tools such as Hadoop, Spark, and the surrounding ecosystem are
presented.
The chapter begins by describing Hadoop’s uses and key features, as well as the
programs in its ecosystem that can also be used in conjunction with it. It also briefly
visits the concepts of distributed and parallel computing and big data cloud.
The chapter describes the architecture of the Hadoop runtime environment. It
starts by describing the cluster, which is the set of host machines, or nodes, that
facilitates data access. It then moves on to the YARN infrastructure, which is
responsible for providing computational resources to the application. It describes
two main elements of the YARN infrastructure—the Resource Manager and the
Node Manager. It then details the HDFS Federation, which provides storage,
and also discusses other storage solutions. Lastly, it discusses the MapReduce
framework, which is the software layer.
The chapter then describes the functions of MapReduce in detail. MapReduce
divides tasks into subtasks, which it runs in parallel in order to increase efficiency. It
discusses the manner in which MapReduce takes lists of input data and transforms
them into lists of output data by implementing a “map” process and a “reduce”
process that aggregates the intermediate results. It describes in detail the process steps that MapReduce
takes in order to produce the output, and describes how Python can be used to create
a MapReduce process for a word count program.
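As a minimal illustration of the map and reduce steps, the sketch below counts words in plain Python, outside Hadoop and the Hadoop Streaming setup used in the chapter: each document is mapped to (word, 1) pairs, which are then reduced by summing counts per word. The sample documents are invented.

```python
from itertools import chain
from collections import defaultdict

documents = ["big data tools", "big data management", "data tools"]

# Map step: emit a (word, 1) pair for every word in every document.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle/reduce step: group pairs by key and sum the counts.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # e.g. {'big': 2, 'data': 3, 'tools': 2, 'management': 1}
```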
The chapter briefly describes Spark and an application using Spark. It concludes
with a discussion about cloud storage. The chapter makes use of the Cloudera virtual
machine (VM) distribution to demonstrate hands-on exercises.
Data Visualization: This chapter discusses how data is visualized and the way
that visualization can be used to aid in analysis. It starts by explaining that humans
use visuals to understand information, and that using visualizations incorrectly can
lead to mistaken conclusions. It discusses the importance of visualization as a
cognitive aid and the importance of working memory in the brain. It emphasizes
the role of data visualization in reducing the load on the reader.
The chapter details the six meta-rules of data visualization, which are as follows:
use the most appropriate chart, directly represent relationships between data, refrain
from asking the viewer to compare differences in area, never use color on top of
color, keep within the primal perceptions of the viewer, and chart with integrity.
Each rule is expanded upon in the chapter. The chapter discusses the kinds of
graphs and tables available to a visualizer, the advantages and disadvantages of 3D
visualization, and the best practices of color schemes.
Statistical Methods—Basic Inferences: This chapter introduces the fundamental
concepts of statistical inferences, such as population and sample parameters,
hypothesis testing, and analysis of variance. It begins by describing the differences
between population and sample means and variance and the methods to calculate
them. It explains the central limit theorem and its use in estimating the mean of a
population.
Confidence intervals are explained both for the case in which the variance is
known and for the case in which it is unknown. The concept of standard errors and the t- and chi-squared
distributions are introduced. The chapter introduces hypothesis testing and the use
of statistical parameters to reject or fail to reject hypotheses. Type I and type II errors
are discussed.
Methods to compare two different samples are explained. Analysis of variance between two samples and within samples is also covered. The use of the
F-distribution in analyzing variance is explained. The chapter concludes with
a discussion of comparing the means of several populations. It
explains how to use a technique called analysis of variance (ANOVA) instead
of carrying out pairwise comparisons.
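For instance, a two-sample t-test and a one-way ANOVA of the kind described above can be run with scipy on simulated data; the group means, spreads, and sample sizes below are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Three simulated samples with slightly different population means.
a = rng.normal(loc=10.0, scale=2.0, size=50)
b = rng.normal(loc=11.0, scale=2.0, size=50)
c = rng.normal(loc=10.5, scale=2.0, size=50)

# Two-sample t-test: are the means of groups a and b different?
t_stat, p_two_sample = stats.ttest_ind(a, b)

# One-way ANOVA: do the three group means differ, without pairwise testing?
f_stat, p_anova = stats.f_oneway(a, b, c)

print(f"t = {t_stat:.2f}, p = {p_two_sample:.4f}")
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```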
Statistical Methods—Linear Regression Analysis: This chapter explains the idea
of linear regression in detail. It begins with some examples, such as predicting
newspaper circulation. It uses the examples to discuss the methods by which
linear regression obtains results. It describes a linear regression as a functional
form that can be used to understand relationships between outcomes and input
variables and perform statistical inference. It discusses the importance of linear
regression and its popularity, and explains the basic assumptions underlying linear
regression.
The modeling section begins by discussing a model in which there is only a
single regressor. It explains why a scatter plot can be useful in understanding single-regressor models, and the importance of visual representation in statistical inference.
It explains the ordinary least squares method of estimating a parameter, and the use
of the sum of squares of residuals as a measure of the fit of a model. The chapter then
discusses the use of confidence intervals and hypothesis testing in a linear regression
model. These concepts are used to describe a linear regression model in which there
are multiple regressors, and the changes that are necessary to adjust a single linear
regression model to a multiple linear regression model.
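A short single-regressor fit using statsmodels shows the pieces discussed here: the ordinary least squares estimates, their confidence intervals, and the residual sum of squares as a measure of fit. The data are simulated, not the newspaper-circulation example from the chapter.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=100)  # true intercept 2, slope 1.5

X = sm.add_constant(x)              # add the intercept column
model = sm.OLS(y, X).fit()          # ordinary least squares fit

print(model.params)                 # estimated intercept and slope
print(model.conf_int(alpha=0.05))   # 95% confidence intervals
print(model.ssr)                    # sum of squared residuals (fit measure)
```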
The chapter then describes the ways in which the basic assumptions of the linear
regression model may be violated, and the need for further analysis and diagnostic
tools. It uses the famous Anscombe data sets in order to demonstrate the existence
of phenomena such as outliers and collinearity that necessitate further analysis. The
methods needed to deal with such problems are explained. The chapter considers
the ways in which the necessity for the use of such methods may be determined,
such as tools to determine whether some data points should be deleted or excluded
from the data set. The possible advantages and disadvantages of adding additional
regressors to a model are described. Dummy variables and their use are explained.
Examples are given for the case where there is only one category of dummy, and
then multiple categories.
The chapter then discusses assumptions regarding the error term. The effect of
the assumption that the error term is normally distributed is discussed, and the Q-Q
plot method of examining the truth of this assumption for the data set is explained.
The Box–Cox method of transforming the response variable in order to normalize
the error term is discussed. The chapter then discusses the idea that the error terms
may not have equal variance, that is, may not be homoscedastic. It explains possible reasons
for heteroscedasticity, and the ways to adapt the analysis to those situations.
The chapter considers the methods in which the regression model can be
validated. The root mean square error is introduced. Segmenting the data into
training and validation sets is explained. Finally, some frequently asked questions
are presented, along with exercises.
Statistical Methods—Advanced Regression: Three topics are covered in this
chapter. In the main body of the chapter the tools for estimating the parameters of
regression models when the response variable is binary or categorical are presented.
The appendices to the chapter cover two other important techniques, namely,
maximum likelihood estimate (MLE) and how to deal with missing data.
The chapter begins with a description of logistic regression models. It continues
with diagnostics of logistic regression, including the likelihood ratio, Wald,
and Hosmer–Lemeshow tests. It then discusses different R-squared measures, such
as Cox and Snell, Nagelkerke, and McFadden. Then, it discusses how to choose
the cutoff probability for classification, including discussion of discordant and
concordant pairs, the ROC curve, and Youden’s index. It concludes with a similar
discussion of the multinomial logistic function and multinomial logistic regression. The chapter contains
a self-contained introduction to the maximum likelihood method and methods for
treating missing data. The ideas introduced in this chapter are used in several
following chapters in the book.
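A compact logistic regression sketch with scikit-learn, on simulated data, shows the classification workflow covered here: fitting the model, choosing a cutoff probability, and summarizing discrimination with the area under the ROC curve. The feature effects and the 0.5 cutoff are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
# Simulated binary outcome whose log-odds depend linearly on the two features.
logits = 0.8 * X[:, 0] - 1.2 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]        # estimated P(y = 1)

cutoff = 0.5                              # classification cutoff probability
predictions = (probs >= cutoff).astype(int)

print("AUC:", roc_auc_score(y, probs))    # area under the ROC curve
print("Share classified as 1:", predictions.mean())
```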
Text Analytics: This is the first of several chapters that introduce specialized
analytics methods depending on the type of data and analysis. This chapter begins
by considering various motivating examples for text analysis. It explains the need
for a process by which unstructured text data can be analyzed, and the ways that
it can be used to improve business outcomes. It describes in detail the manner in
which Google used its text analytics software and its database of searches to identify
vectors of H1N1 flu. It lists the most common sources of text data, with social
media platforms and blogs producing the vast majority.
The second section of the chapter concerns the ways in which text can be
analyzed. It describes two approaches: a “bag-of-words” approach, in which the
structure of the language is not considered important, and a “natural-language”
approach, in which structure and phrases are also considered.
The example of a retail chain surveying responses to a potential ice-cream
product is used to introduce some terminology. It uses this example to describe
the problems of analyzing sentences due to the existence of grammatical rules, such
as the abundance of articles or the different tense forms of verbs. Various methods
of dealing with these problems are introduced. The term-document matrix (TDM)
is introduced along with its uses, such as generation of wordclouds.
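The sketch below builds a small term-document matrix with scikit-learn's CountVectorizer on three invented survey responses. In this representation the rows are documents and the columns are terms; the counts are the quantities a wordcloud or a clustering step would consume.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the ice cream was delicious",
    "delicious flavor, great ice cream",
    "the flavor was too sweet",
]

vectorizer = CountVectorizer(stop_words="english")   # drop articles such as "the"
tdm = vectorizer.fit_transform(docs)                  # sparse document-term matrix

print(vectorizer.get_feature_names_out())             # the terms (columns)
print(tdm.toarray())                                   # counts per document (rows)
```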
The third and fourth sections of the chapter describe how to run text analysis
and some elementary applications. The text walks through a basic use of the
program R to analyze text. It looks at two ways that the TDM can be used to run
text analysis—using a text-base to cluster or segment documents, and elementary
sentiment analysis.
Clustering documents is a method by which similar customers are sorted into
the same group by analyzing their responses. Sentiment analysis is a method that
attempts to make value judgments and extract qualitative information from the responses.
The chapter describes the models for both processes in detail with regard to an
example.
The fifth section of the chapter then describes the more advanced technique
of latent topic mining. Latent topic mining aims to identify themes present in a
corpus, or a collection of documents. The chapter uses the example of the mission
statements of Fortune-1000 firms in order to identify some latent topics.
The sixth section of the chapter concerns natural-language processing (NLP).
NLP is a set of techniques that enables computers to understand nuances in human
languages. The methods by which NLP programs process text data are discussed. The ideas
of this chapter are further explored in the chapter on Deep Learning. The chapter
ends with exercises for the student.
Simulation: This chapter introduces the uses of simulation as a tool for analytics,
focusing on the example of a fashion retailer. It explains the use of Monte Carlo
simulation in the presence of uncertainty as an aid to making decisions that have
various trade-offs.
First, the chapter explains the purposes of simulation, and the ways it can be used
to design an optimal intervention. It differentiates between computer simulation,
which is the main aim of the chapter, and physical simulation. It discusses the
advantages and disadvantages of simulations, and mentions various applications of
simulation in real-world contexts.
The second part of the chapter discusses the steps that are followed in making a
simulation model. It explains how to identify dependent and independent variables,
and the manner in which the relationships between those variables can be modeled.
It describes the method by which input variables can be randomly generated,
and the output of the simulation can be interpreted. It illustrates these steps
using the example of a fashion retailer that needs to make a decision about
production.
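A bare-bones Monte Carlo sketch of such a production decision is shown below: demand is drawn repeatedly from an assumed distribution, and the expected profit of a few candidate production quantities is compared. All the numbers (price, cost, demand distribution, candidate quantities) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

price, cost = 60.0, 25.0     # hypothetical selling price and unit production cost
# Simulated demand scenarios: normal demand truncated at zero.
demand = rng.normal(loc=1000, scale=250, size=10_000).clip(min=0)

for quantity in (800, 1000, 1200):
    sales = np.minimum(demand, quantity)          # cannot sell more than produced
    profit = price * sales - cost * quantity      # unsold units are a sunk cost
    print(quantity, round(profit.mean(), 1), round(profit.std(), 1))
```

Comparing the mean and spread of profit across candidate quantities is exactly the kind of trade-off analysis the chapter develops for decision-making under uncertainty.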
The third part of the chapter describes decision-making under uncertainty and
the ways that simulation can be used. It describes how to set out a range of possible
interventions and how they can be modeled using a simulation. It discusses how to
use simulation processes in order to optimize decision-making under constraints, by
using the fashion retailer example in various contexts.
The chapter also contains a case study of a painting business deciding how much
to bid for a contract to paint a factory, and describes the solution to making this
decision. The concepts explained in this chapter are applied in different settings in
the following chapters.
Optimization: Optimization techniques are used in almost every application
in this book. This chapter presents some of the core concepts of constrained
optimization. The basic ideas are illustrated using one broad class of optimization
problems called linear optimization. Linear optimization covers the most widely
used models in business. In addition, because linear models are easy to visualize in
two dimensions, it offers a visual introduction to the basic concepts in optimization.
Additionally, the chapter provides a brief introduction to other optimization models
and techniques such as integer/discrete optimization, nonlinear optimization, search
methods, and the use of optimization software.
The linear optimization part is conventionally developed by describing the decision variables, the objective function, constraints, and the assumptions underlying
the linear models. Using geometric arguments, it illustrates the concept of feasibility
and optimality. It then provides the basic theorems of linear programming. The
chapter then develops the idea of shadow prices, reduced costs, and sensitivity
analysis, which is the underpinning of any post-optimality business analysis. The
solver function in Excel is used for illustrating these ideas. Then, the chapter
explains how these ideas extend to integer programming and provides an outline
of the branch and bound method with examples. The ideas are further extended
to nonlinear optimization via examples of models for linear regression, maximum
likelihood estimation, and logistic regression.
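A two-variable linear program of the kind used for the geometric discussion can also be solved with scipy; the product-mix coefficients below are invented, and since scipy's linprog minimizes, the profit objective is negated.

```python
from scipy.optimize import linprog

# Maximize 3x + 5y subject to x + 2y <= 14, 3x - y >= 0, x - y <= 2, x, y >= 0.
c = [-3, -5]                         # negate because linprog minimizes
A_ub = [[1, 2], [-3, 1], [1, -1]]    # the ">=" row is multiplied by -1
b_ub = [14, 0, 2]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)         # optimal (x, y) and the maximized objective
```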
Forecasting Analytics: Forecasting is perhaps the most commonly used method
in business analytics. This chapter introduces the idea of using analytics to predict
the outcomes in the future, and focuses on applying analytics tools for business and
operations. The chapter begins by explaining the difficulty of predicting the future
with perfect accuracy, and the importance of accepting the uncertainty inherent in
any predictive analysis.
The chapter begins by defining forecasting as estimating in unknown situations.
It describes data that can be used to make forecasts, but focuses on time-series
forecasting. It introduces the concepts of point-forecasts and prediction intervals,
which are used in time-series analysis as part of predictions of future outcomes. It
suggests reasons for the intervention of human judgment in the forecasts provided
by computers. It describes the core method of time-series forecasting—identifying
a model that forecasts the best.
The second part of the chapter describes quantitative approaches to forecasting.
It begins by describing the various kinds of data that can be used to make forecasts,
such as spoken, written, and numerical data. It explains some methods of dealing
with outliers in the data set, which can affect the fit of the forecast, such as trimming
and winsorizing.
The chapter discusses the effects of seasonal fluctuations on time-series data and
how to adjust for them. It introduces the autocorrelation function and its use. It also
explains the partial autocorrelation function.
A number of methods used in predictive forecasting are explained, including
the naïve method, the average and moving average methods, Holt exponential
smoothing, and the ARIMA framework. The chapter also discusses ways to predict
stochastic intermittent demand, such as Croston’s approach, and the Syntetos and
Boylan approximation.
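As a small illustration of these ideas, the sketch below applies a moving average and simple exponential smoothing to a short invented demand series, and scores the one-step-ahead smoothing forecasts with the root mean squared error discussed later in the chapter. The smoothing constant and window length are arbitrary.

```python
import numpy as np

demand = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

# Moving average forecast: the mean of the last 3 observations.
window = 3
ma_forecast = demand[-window:].mean()

# Simple exponential smoothing: level_t = alpha * y_t + (1 - alpha) * level_{t-1}.
alpha = 0.3
level = demand[0]
one_step = []                       # one-step-ahead forecasts for periods 2..n
for y in demand[1:]:
    one_step.append(level)          # forecast made before observing y
    level = alpha * y + (1 - alpha) * level

rmse = np.sqrt(np.mean((demand[1:] - np.array(one_step)) ** 2))
print(round(ma_forecast, 1), round(level, 1), round(rmse, 2))
```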
The third section of the chapter describes the process of applied forecasting
analytics at the operational, tactical, and strategic levels. It propounds a seven-step
forecasting process for operational tasks, and explains each step in detail.
The fourth section of the chapter concerns evaluating the accuracy of forecasts.
It explains measures such as mean absolute error, mean squared error, and root
mean squared error, and how to calculate them. The use of both Excel and R software
is explained.
Advanced Statistical Methods: Count Data: The chapter begins by introducing
the idea of count variables and gives examples of where they are encountered, such
as insurance applications and the amount of time taken off by persons who fall sick.
It first introduces the idea of the Poisson regression model, and explains why
ordinary least squares are not suited to some situations for which the Poisson model
is more appropriate. It illustrates the differences between the normal and Poisson
distributions using conditional distribution graphs.
It defines the Poisson distribution model and its general use, and presents an
example involving insurance claims data. It walks through the interpretation of
the regression’s results, including the explanation of the regression coefficients,
deviance, dispersion, and so on.
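A minimal Poisson regression can be fit with statsmodels as below; the count outcome and the single predictor are simulated, not the insurance claims data used in the chapter.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=200)
mu = np.exp(0.5 + 0.8 * x)          # log link: log(mu) = 0.5 + 0.8 x
y = rng.poisson(mu)                  # simulated count outcome

X = sm.add_constant(x)
poisson_model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

print(poisson_model.params)     # estimated coefficients on the log scale
print(poisson_model.deviance)   # deviance, used to judge fit and overdispersion
```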
It discusses some of the problems with the Poisson regression, and how
overdispersion can cause issues for the analysis. It introduces the negative binomial
distribution as a method to counteract overdispersion. Zero-inflation models are
discussed. The chapter ends with a case study on Canadian insurance data.
Advanced Statistical Methods—Survival Analysis: Like the previous chapter, this
one deals with another specialized application. It involves techniques that analyze
time-to-event data. It defines time-to-event data and the contexts in which it can
be used, and provides a number of business situations in which survival analysis is
important.
The chapter explains the idea of censored data, which refers to survival times
in which the event in question has not yet occurred. It explains the differences
between survival models and other types of analysis, and the fields in which they can be
used. It defines the types of censoring: right-censoring, left-censoring, and interval-censoring, and the method to incorporate them into the data set.
The chapter then defines the survival analysis functions: the survival function and
the hazard function. It describes some simple types of hazard functions. It describes
some parametric and nonparametric methods of analysis, and defines the cases in
which nonparametric methods must be used. It explains the Kaplan–Meier method
in detail, along with an example. Semiparametric models are introduced for cases
in which several covariate variables are believed to contribute to survival. Cox’s
proportional hazards model and its interpretation are discussed.
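The Kaplan–Meier estimator itself is short enough to compute directly: at each observed event time the survival curve is multiplied by (1 - d/n), where d is the number of events at that time and n the number still at risk. The durations and censoring flags below are invented.

```python
import numpy as np

durations = np.array([5, 6, 6, 2, 4, 4, 9, 12, 3, 8])   # time to event or censoring
observed = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])     # 1 = event, 0 = right-censored

survival = 1.0
for t in np.unique(durations[observed == 1]):             # loop over event times
    at_risk = np.sum(durations >= t)                       # still under observation at t
    events = np.sum((durations == t) & (observed == 1))    # events exactly at t
    survival *= 1 - events / at_risk
    print(f"t = {t}: S(t) = {survival:.3f}")
```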
The chapter ends with a comparison between semiparametric and parametric
models, and a case study regarding churn data.
Unsupervised Learning: The first of the three machine learning chapters sets
out the philosophy of machine learning. This chapter explains why unsupervised
learning—an important paradigm in machine learning—is akin to uncovering the
proverbial needle in the haystack, discovering the grammar of the process that
generated the data, and exaggerating the “signal” while ignoring the “noise” in it.
The chapter covers methods of projection, clustering, and density estimation—three
core unsupervised learning frameworks that help us perceive the data in different
ways. In addition, the chapter describes collaborative filtering and applications of
network analysis.
The chapter begins by drawing the distinction between supervised and unsupervised learning. It then presents a common approach to solving unsupervised learning
problems by casting them into an optimization framework. In this framework, there
are four steps:
• Intuition: to develop an intuition about how to approach the problem as an
optimization problem
• Formulation: to write the precise mathematical objective function in terms of data
using intuition
• Modification: to modify the objective function into something simpler or “more
solvable”
• Optimization: to solve the final objective function using traditional optimization
approaches
The chapter discusses principal components analysis (PCA), self-organizing
maps (SOM), and multidimensional scaling (MDS) under projection algorithms.
In clustering, it describes partitional and hierarchical clustering. Under density
estimation, it describes nonparametric and parametric approaches. The chapter
concludes with illustrations of collaborative filtering and network analysis.
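The projection and clustering ideas can be tried in a few lines with scikit-learn: the sketch below projects simulated data onto its first two principal components and then partitions the projection with k-means. The data and the choice of three clusters are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Simulated data: three loose groups in five dimensions.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 5)) for c in (0, 4, 8)])

projected = PCA(n_components=2).fit_transform(X)        # projection step
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(projected)

print(projected[:3])          # first few points in the 2-D projection
print(np.bincount(labels))    # cluster sizes found by k-means
```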
Supervised Learning: In supervised learning, the aim is to learn from previously
identified examples. The chapter covers the philosophical, theoretical, and practical
aspects of one of the most common machine learning paradigms—supervised
learning—that essentially learns to map from an observation (e.g., symptoms and
test results of a patient) to a prediction (e.g., disease or medical condition), which
in turn is used to make decisions (e.g., prescription). The chapter then explores the
process, science, and art of building supervised learning models.
The first part explains the different paradigms in supervised learning: classification, regression, retrieval, recommendation, and how they differ by the nature
of their input and output. It then describes the process of learning, from features
description to feature engineering to models to algorithms that help make the
learning happen.
Among algorithms, the chapter describes rule-based classifiers, decision trees, k-nearest neighbor, Parzen window, and Bayesian and naïve Bayes classifiers. Among
discriminant functions that partition a region using an algorithm, linear (LDA) and
quadratic discriminant analysis (QDA) are discussed. A section describes recommendation engines. Neural networks are then introduced, followed by a succinct
introduction to a key algorithm called support vector machines (SVM). The chapter
concludes with a description of ensemble techniques, including bagging, random
forest, boosting, mixture of experts, and hierarchical classifiers. The specialized
neural networks for Deep Learning are explained in the next chapter.
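One of the simplest supervised learners mentioned above, k-nearest neighbor, can be exercised with scikit-learn as follows; the two-feature data set is simulated from a linear labeling rule, and the choice of k = 5 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # simulated labels from a linear rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```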
Deep Learning: This chapter introduces the idea of deep learning as a part of
machine learning. It aims to explain the idea of deep learning and various popular
deep learning architectures. It has four main parts:
• Understand what deep learning is.
• Understand various popular deep learning architectures, and know when to use
which architecture for solving a business problem.
• Understand how to perform image analysis using deep learning.
• Understand how to perform text analysis using deep learning.
The chapter explains the origins of learning, from a single perceptron designed to mimic
the functioning of a neuron to the multilayered perceptron (MLP). It briefly recaps
the backpropagation algorithm and introduces the learning rate and error functions.
It then discusses the deep learning architectures applied to supervised, unsupervised,
and reinforcement learning. An example of using an artificial neural network for
recognizing handwritten digits (based on the MNIST data set) is presented.
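A bare-bones version of such a network, written with the Keras API in TensorFlow, is sketched below; the layer sizes and the number of epochs are arbitrary choices for illustration, not those used in the chapter.

```python
import tensorflow as tf

# Load the MNIST digits and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small multilayered perceptron: flatten, one hidden layer, softmax output.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print(model.evaluate(x_test, y_test, verbose=0))   # [test loss, test accuracy]
```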
The next section of the chapter describes Convolutional Neural Networks (CNN),
which are aimed at solving vision-related problems. The ImageNet data set is
introduced. The use of CNNs in the ImageNet Large Scale Visual Recognition
Challenge is explained, along with a brief history of the challenge. The biological
inspiration for CNNs is presented. Four layers of a typical CNN are introduced—
the convolution layer, the rectified linear units layer, the pooling layers, and the fully
connected layer. Each layer is explained, with examples. A unifying example using
the same MNIST data set is presented.
The third section of the chapter discusses recurrent neural networks (RNNs).
It begins by describing the motivation for sequence learning models, and their
use in understanding language. Traditional language models and their functions in
predicting words are explained. The chapter describes a basic RNN model with
three units, aimed at predicting the next word in a sentence. It walks through a detailed
example of building an RNN for next-word prediction. It presents some
uses of RNNs, such as image captioning and machine translation.
The next seven chapters describe the use of analytics in different domains and contexts.
Retail Analytics: The chapter begins by introducing the background and definition of retail analytics. It focuses on advanced analytics. It explains the use of four
main categories of business decisions: consumer, product, human resources, and
advertising. Several examples of retail analytics are presented, such as increasing
book recommendations during periods of cold weather. Complications in retail
analytics are discussed.
The second part of the chapter focuses on data collection in the retail sector. It
describes the traditional sources of retail data, such as point-of-sale devices, and
how they have been used in decision-making processes. It also discusses advances
in technology and the way that new means of data collection have changed the field.
These include the use of radio frequency identification technology, the Internet of
things, and Bluetooth beacons.
The third section describes methodologies, focusing on inventory, assortment,
and pricing decisions. It begins with modeling product-based demand in order
to make predictions. The penalized L1 regression LASSO for retail demand
forecasting is introduced. The use of regression trees and artificial neural networks
is discussed in the same context. The chapter then discusses the use of such forecasts
in decision-making. It presents evidence that machine learning approaches benefit
revenue and profit in both price-setting and inventory-choice contexts.
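A toy version of the LASSO demand model can be fit with scikit-learn as below; the features (price, a promotion flag, and a few noise columns) and the regularization strength are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(21)
n = 500
price = rng.uniform(5, 15, size=n)
promo = rng.integers(0, 2, size=n)
noise_features = rng.normal(size=(n, 5))          # irrelevant columns

demand = 200 - 8 * price + 30 * promo + rng.normal(scale=10, size=n)
X = np.column_stack([price, promo, noise_features])

lasso = Lasso(alpha=1.0).fit(X, demand)
print(lasso.coef_)    # the L1 penalty shrinks the irrelevant coefficients toward zero
```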
Demand models into which consumer choice is incorporated are introduced.
The multinomial logit, mixed multinomial logit, and nested logit models are
described. Nonparametric choice models are also introduced as an alternative to
logit models. Optimal assortment decisions using these models are presented.
Attempts at learning customer preferences while optimizing assortment choices are
described.
The fourth section of the chapter discusses business challenges and opportunities.
The benefits of omnichannel retail are discussed, along with the need for retail
analytics to change in order to fit an omnichannel shop. It also discusses some recent
start-ups in the retail analytics space and their focuses.
Marketing Analytics: Marketing is one of the most important, historically the
earliest, and fascinating areas for applying analytics to solve business problems.
Due to the vast array of applications, only the most important ones are surveyed
in this chapter. The chapter begins by explaining the importance of using marketing
analytics for firms. It defines the various levels that marketing analytics can apply to:
the firm, the brand or product, and the customer. It introduces a number of processes
and models that can be used in analyzing and making marketing decisions, including
statistical analysis, nonparametric tools, and customer analysis. The processes
and tools discussed in this chapter will help in various aspects of marketing
such as target marketing and segmentation, price and promotion, customer valuation, resource allocation, response analysis, demand assessment, and new product
development.
The second section of the chapter explains the use of the interaction effect
in regression models. Building on earlier chapters on regression, it explains the
utility of a term that captures the effect of one or more interactions between other
variables. It explains how to interpret new variables and their significance. The use
of curvilinear terms to identify and model curvilinear effects is discussed.
Mediation analysis is introduced, along with an example.
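The interaction effect can be fit directly with statsmodels' formula interface: the `*` in the formula below expands into the two main effects plus their product term. The sales, price, and advertising columns are simulated, purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 300
df = pd.DataFrame({
    "price": rng.uniform(1, 10, size=n),
    "advertising": rng.uniform(0, 5, size=n),
})
# Simulated sales with a genuine price x advertising interaction.
df["sales"] = (100 - 4 * df.price + 6 * df.advertising
               + 1.5 * df.price * df.advertising + rng.normal(scale=5, size=n))

model = smf.ols("sales ~ price * advertising", data=df).fit()
print(model.params)     # includes the price:advertising interaction coefficient
print(model.pvalues)    # significance of each term, including the interaction
```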
The third section describes data envelopment analysis (DEA), which is aimed at
improving the performance of organizations. It describes the manner in which DEA
works to present targets to managers and can be used to answer key operational
questions in Marketing: sales force productivity, performance of sales regions, and
effectiveness of geomarketing.
The next topic covered is conjoint analysis. It explains how knowing customers’
preference provides invaluable information about how customers think and make
their decisions before purchasing products. Thus, it helps firms devise their marketing strategies including advertising, promotion, and sales activities.
The fifth section of the chapter discusses customer analytics. Customer lifetime
value (CLV), a measure of the value provided to firms by customers, is introduced,
along with some other measures. A method to calculate CLV is presented, along
with its limitations. The chapter also discusses two more measures of customer
value: customer referral value and customer influence value, in detail. Additional
topics are covered in the chapters on retail analytics and social media analytics.
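One common textbook form of CLV, assumed here rather than taken from the chapter, discounts the margin earned from a customer in each period by the retention probability and the discount rate; the numbers below are illustrative.

```python
# CLV = sum over t of margin * retention**t / (1 + discount_rate)**t
margin, retention, discount_rate, horizon = 100.0, 0.85, 0.10, 10

clv = sum(
    margin * retention**t / (1 + discount_rate)**t
    for t in range(1, horizon + 1)
)
print(round(clv, 2))
```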
Financial Analytics: Financial analytics, like marketing, has been a big consumer
of data. The topics chosen in this chapter provide one unified way of thinking
about analytics in this domain—valuation. This chapter focuses on the two main
branches of quantitative finance: the risk-neutral or “Q” world and the risk-averse
or “P” world. It describes the constraints and aims of analysts in each world, along
with their primary methodologies. It explains Q-quant theories such as the work of
Black and Scholes, and Harrison and Pliska. P-quant theories such as net present
value, capital asset pricing models, arbitrage pricing theory, and the efficient market
hypothesis are presented.
The methodology of financial data analytics is explained via a three-stage
process: asset price estimation, risk management, and portfolio analysis.
Asset price estimation is explained as a five-step process. It describes the use
of the random walk in identifying the variable to be analyzed. Several methods of
transforming the variable into one that is identical and independently distributed
are presented. A maximum likelihood estimation method to model variance is
explained. Monte Carlo simulations of projecting variables into the future are
discussed, along with pricing projected variables.
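The Monte Carlo projection step can be illustrated with a geometric Brownian motion random walk, a standard assumption for asset prices; the drift, volatility, and horizon below are invented.

```python
import numpy as np

rng = np.random.default_rng(13)
s0, mu, sigma = 100.0, 0.06, 0.20     # initial price, annual drift, annual volatility
n_paths, n_steps, dt = 10_000, 252, 1 / 252

# Geometric Brownian motion:
# S_{t+dt} = S_t * exp((mu - 0.5*sigma^2) dt + sigma * sqrt(dt) * Z)
z = rng.standard_normal((n_paths, n_steps))
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
prices = s0 * np.exp(np.cumsum(log_returns, axis=1))

terminal = prices[:, -1]
print(round(terminal.mean(), 2))              # average simulated year-end price
print(round(np.percentile(terminal, 5), 2))   # 5th percentile, a simple risk measure
```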
Risk management is discussed as a three-step process. The first step is risk
aggregation. Copula functions and their uses are explained. The second step,
portfolio assessment, is explained by using metrics such as Value at Risk. The third
step, attribution, is explained. Various types of capital at risk are listed.
Portfolio analysis is described as a two-stage process. Allocating risk for the
entire portfolio is discussed. Executing trades in order to move the portfolio to a
new risk/return level is explained.
A detailed example explaining each of the ten steps is presented, along with data
and code in MATLAB. This example also serves as a stand-alone case study on
financial analytics.
Social Media Analytics: Social-media-based analytics has been growing in
importance and value to businesses. This chapter discusses the various tools
available to gather and analyze data from social media and Internet-based sources,
focusing on the use of advertisements. It begins by describing Web-based analytical
tools and the information they can provide, such as cookies, sentiment analysis, and
mobile analytics.
It introduces real-time advertising on online platforms, and the wealth of
data generated by browsers visiting target websites. It lists the various kinds of
advertising possible, including video and audio ads, map-based ads, and banner
ads. It explains the various avenues in which these ads can be displayed, and
details the reach of social media sites such as Facebook and Twitter. The various
methods in which ads can be purchased are discussed. Programmatic advertising
and its components are introduced. Real-time bidding on online advertising spaces
is explained.
A/B experiments are defined and explained. The completely randomized design
(CRD) experiment is discussed. The regression model for the CRD and an example
are presented. The need for randomized complete block design experiments is
introduced, and an example for such an experiment is shown. Analytics of multivariate experiments and their advantages are discussed. Orthogonal designs and
their meanings are explained.
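A minimal A/B comparison of two conversion rates can be run as a two-proportion z-test; the conversion counts below are made up, and the test statistic is computed directly from the pooled proportion.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions / visitors for variants A and B.
conv_a, n_a = 120, 2400
conv_b, n_b = 155, 2350

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-sided test
print(round(z, 2), round(p_value, 4))
```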
The chapter discusses the use of data-driven search engine advertising: how data
help companies better reach consumers and identify trends, and the power of
search engines in this regard. The
problem of attribution, or identifying the influence of various ads across various
platforms, is introduced, and a number of models that aim to solve this problem
are elucidated. Some models discussed are: the first click attribution model, the
last click attribution model, the linear attribution model, and algorithmic attribution
models.
Healthcare Analytics: Healthcare is once again an area where data, experiments, and research have coexisted within an analytical framework for hundreds
of years. This chapter discusses analytical approaches to healthcare. It begins
with an overview of the current field of healthcare analytics. It describes the
latest innovations in the use of data to refine healthcare, including telemedicine,
wearable technologies, and simulations of the human body. It describes some of the
challenges that data analysts can face when attempting to use analytics to understand
healthcare-related problems.
The main part of the chapter focuses on the use of analytics to improve
operations. The context is patient flow in outpatient clinics. It uses Academic
Medical Centers as an example to describe the processes that patients go through
when visiting clinics that are also teaching centers. It describes the effects of the
Affordable Care Act, an aging population, and changes in social healthcare systems
on the public health infrastructure in the USA.
A five-step process map of a representative clinic is presented, along with a
discrete event simulation of the clinic. The history of using operations research-based methods to improve healthcare processes is discussed. The chapter introduces
a six-step process aimed at understanding complex systems, identifying potential
improvements, and predicting the effects of changes, and describes each step in
detail.
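A stripped-down version of such a simulation, using a single server and the Lindley recursion for waiting times rather than a full multi-step clinic model, gives a feel for how patient flow can be studied; the arrival and service rates below are invented.

```python
import numpy as np

rng = np.random.default_rng(17)
n_patients = 10_000
interarrival = rng.exponential(scale=12.0, size=n_patients)   # minutes between arrivals
service = rng.exponential(scale=10.0, size=n_patients)        # minutes with the provider

# Lindley recursion for waiting time in queue at a single FIFO server:
# W_{i+1} = max(0, W_i + S_i - A_{i+1})
wait = np.zeros(n_patients)
for i in range(n_patients - 1):
    wait[i + 1] = max(0.0, wait[i] + service[i] - interarrival[i + 1])

print(round(wait.mean(), 1))              # average wait in queue (minutes)
print(round(np.mean(wait + service), 1))  # average time in system
```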
Lastly, the chapter discusses the various results of this process on some goals
of the clinic, such as arrivals, processing times, and impact on teaching. Data
regarding each goal and its change are presented and analyzed. The chapter contains
a hands-on exercise based on the simulation models discussed. The chapter is a fine
application of simulation concepts and modeling methodologies used in Operations
Management to improve healthcare systems.
Pricing Analytics: This chapter discusses the various mechanisms available
to companies in order to price their products. The topics pertain to revenue
management, which constitutes perhaps the most successful and visible area of
business analytics.
The chapter begins by defining two factors that affect pricing: the
nature of the product and its competition, and customers’ preferences and values.
It introduces the concept of a price optimization model, and the need to control
capacity constraints when estimating customer demand.
The first type of model introduced is the independent class model. The underlying
assumption behind the model is defined, as well as its implications for modeling
customer choice. The EMSR heuristic and its use are explained.
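The two-fare-class special case of this logic, Littlewood's rule (the building block of the EMSR heuristic), protects seats for the high fare until the chance of selling another protected seat, weighted by the high fare, drops below the low fare. The sketch below assumes normally distributed high-fare demand and invented fares.

```python
import numpy as np
from scipy.stats import norm

high_fare, low_fare = 400.0, 150.0
mu, sigma = 60.0, 20.0       # assumed demand distribution for high-fare passengers

# Littlewood's rule: protect y seats where P(D_high > y) = low_fare / high_fare.
critical_ratio = low_fare / high_fare
protection_level = norm.ppf(1 - critical_ratio, loc=mu, scale=sigma)

capacity = 150
booking_limit_low = capacity - protection_level    # seats available to the low fare
print(round(protection_level, 1), round(booking_limit_low, 1))
```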
The issue of overbooking in many service-related industries is introduced. The
trade-off between an underutilized inventory and the risk of denying service to
customers is discussed. A model for deciding an overbooking limit, given the
physical capacity at the disposal of the company, is presented. Dynamic pricing
is presented as a method to better utilize inventory.
Three main types of dynamic pricing are discussed: surge pricing, repricing,
and markup/markdown pricing. Each type is comprehensively explained. Three
models of forecasting and estimating customer demand are presented: additive,
multiplicative, and choice.
A number of processes for capacity control, such as nested allocations, are
presented. Network revenue management systems are introduced. A backward
induction method of control is explained. The chapter ends with an example of a
hotel that is planning allocation of rooms based on a demand forecast.
Supply Chain Analytics: This chapter discusses the use of data and analytical
tools to increase value in the supply chain. It begins by defining the processes
that constitute supply chains, and the goals of supply chain management. The
uncertainty inherent in supply chains is discussed. Four applications of supply chain
analytics are described: demand forecasting, inventory optimization, supply chain
disruption, and commodity procurement.
A case study of VASTA, one of the largest wireless services carriers in the USA,
is presented. The case study concerns the decision of whether the company should
change its current inventory strategy from a “push” strategy to a “pull” strategy.
The advantages and disadvantages of each strategy are discussed. A basic model
to evaluate both strategies is introduced. An analysis of the results is presented.
Following the analysis, a more advanced evaluation model is introduced. Customer
satisfaction and implementation costs are added to the model.
The last three chapters of the book contain case studies. Each of the cases comes
with a large data set upon which students can practice almost every technique and
modeling approach covered in the book. The Info Media case study explains the use
of viewership data to design promotional campaigns. The problem presented is to
determine an allocation of ad spots across multiple channels in order to maximize “reach” given
a budget and campaign guidelines. The approach uses simulation to compute the
viewership and then uses the simulated data to link promotional aspects to the total
reach of a campaign. Finally, the model can be used to optimize the allocation of
budgets across channels.
The AAA airline case study illustrates the use of choice models to design airline
offerings. The main task is to develop a demand forecasting model, which predicts
the passenger share for every origin–destination pair (O–D pair) given AAA’s as
well as competitors’ offerings. The students are asked to explore different models
including the MNL and machine learning algorithms. Once a demand model has
been developed it can be used to diagnose the current performance and suggest
various remedies, such as adding, dropping, or changing itineraries in specific city
pairs. The third case study, Ideal Insurance, is on fraud detection. The problem faced
by the firm is the growing cost of servicing and settling claims in their healthcare
practice. The students learn about the industry and its intricate relationships with
various stakeholders. They also get an introduction to rule-based decision support
systems. The students are asked to create a system for detecting fraud, which should
be superior to the current “rule-based” system.
2 The Intended Audience
This book is the first of its kind in both breadth and depth of coverage and serves as
a textbook for students of first-year graduate programs in analytics and long-duration
(1-year part-time) certificate programs in business analytics. It also serves as a
perfect guide to practitioners.
The content is based on the curriculum of the Certificate Programme in Business
Analytics (CBA), now renamed as Advanced Management Programme in Business
Analytics (AMPBA) of Indian School of Business (ISB). The original curriculum
was created by Galit Shmueli. The curriculum was further developed by the
coeditors, Bhimasankaram Pochiraju and Sridhar Seshadri, who were responsible
for starting and mentoring the CBA program in ISB. Bhimasankaram Pochiraju has
been the Faculty Director of CBA since its inception and was a member of the
Academic Board. Sridhar Seshadri managed the launch of the program and since
then has chaired the academic development efforts. Based on the industry needs,
the curriculum continues to be modified by the Academic Board of the Applied
Statistics and Computing Lab (ASC Lab) at ISB.
Part I
Tools
Chapter 2
Data Collection
Sudhir Voleti
1 Introduction
Collecting data is the first step towards analyzing it. In order to understand and solve
business problems, data scientists must have a strong grasp of the characteristics of
the data in question. How do we collect data? What kinds of data exist? Where
is it coming from? Before beginning to analyze data, analysts must know how to
answer these questions. In doing so, we build the base upon which the rest of our
examination follows. This chapter aims to introduce and explain the nuances of data
collection, so that we understand the methods we can use to analyze it.
2 The Value of Data: A Motivating Example
In 2017, video-streaming company Netflix Inc. was worth more than $80 billion,
more than 100 times its value when it listed in 2002. The company’s current position
as the market leader in the online-streaming sector is a far cry from its humble
beginning as a DVD rental-by-mail service founded in 1997. So, what had driven
Netflix’s incredible success? What helped its shares, priced at $15 each on their
initial public offering in May 2002, rise to nearly $190 in July 2017? It is well
known that a firm’s [market] valuation is the sum total in today’s money, or the net
present value (NPV) of all the profits the firm will earn over its lifetime. So investors
reckon that Netflix is worth tens of billions of dollars in profits over its lifetime.
Why might this be the case? After all, companies had been creating television and
cinematic content for decades before Netflix came along, and Netflix did not start
its own online business until 2007. Why is Netflix different from traditional cable
companies that offer shows on their own channels?
Moreover, the vast majority of Netflix’s content is actually owned by its
competitors. Though the streaming company invests in original programming, the
lion’s share of the material available on Netflix is produced by cable companies
across the world. Yet Netflix has access to one key asset that helps it to predict
where its audience will go and understand their every quirk: data.
Netflix can track every action that a customer makes on its website—what they
watch, how long they watch it for, when they tune out, and most importantly, what
they might be looking for next. This data is invaluable to its business—it allows the
company to target specific niches of the market with unerring accuracy.
On February 1, 2013, Netflix debuted House of Cards—a political thriller starring
Kevin Spacey. The show was a hit, propelling Netflix’s viewership and proving
that its online strategy could work. A few months later, Spacey applauded Netflix’s
approach and cited its use of data for its ability to take a risk on a project that every
other major television studio network had declined. Spacey said in Edinburgh, at the
Guardian Edinburgh International Television Festival1 on August 22: “Netflix was
the only company that said, ‘We believe in you. We have run our data, and it tells us
our audience would watch this series.’”
Netflix’s data-oriented approach is key not just to its ability to pick winning
television shows, but to its global reach and power. Though competitors are
springing up the world over, Netflix remains at the top of the pack, and so long
as it is able to exploit its knowledge of how its viewers behave and what they prefer
to watch, it will remain there.
Let us take another example. The technology “cab” company Uber has taken the
world by storm in the past 5 years. In 2014, Uber’s valuation was a mammoth 40
billion USD, which by 2015 jumped another 50% to reach 60 billion USD. This
fact begs the question: what makes Uber so special? What competitive advantage,
strategic asset, and/or enabling platform accounts for Uber’s valuation numbers?
The investors reckon that Uber is worth tens of billions of dollars in profits over
its lifetime. Why might this be the case? Uber is after all known as a ride-sharing
business—and there are other cab companies available in every city.
We know that Uber is “asset-light,” in the sense that it does not own the cab fleet
or have drivers of the cabs on its direct payroll as employees. It employs a franchise
model wherein drivers bring their own vehicles and sign up for Uber. Yet Uber
does have one key asset that it actually owns, one that lies at the heart of its profit
projections: data. Uber owns all rights to every bit of data from every passenger,
every driver, every ride and every route on its network. Curious as to how much
data are we talking about? Consider this. Uber took 6 years to reach one billion
1 Guardian Edinburgh International Television Festival, 2017 (https://www.ibtimes.com/kevin-spacey-speech-why-netflix-model-can-save-television-video-full-transcript-1401970), accessed on Sep 13, 2018.
rides (Dec 2015). Six months later, it had reached the two billion mark. That is one
billion rides in 180 days, or 5.5 million rides/day. How did having consumer data
play a factor in the exponential growth of a company such as Uber? Moreover, how
does data connect to analytics and, finally, to market value?
Data is a valuable asset that helps build sustainable competitive advantage. It
enables what economists would call “supernormal profits” and thereby plausibly
justifies some of those wonderful valuation numbers we saw earlier. Uber had help,
of course. The nature of demand for its product (contractual personal transportation), the ubiquity of its enabling platform (location-enabled mobile devices), and
the profile of its typical customers (the smartphone-owning, convenience-seeking
segment) have all contributed to its success. However, that does not take away from
the central point being motivated here—the value contained in data, and the need to
collect and corral this valuable resource into a strategic asset.
3 Data Collection Preliminaries
A well-known management adage goes, “We can only manage what we can measure.” But why is measurement considered so critical? Measurement is important
because it precedes analysis, which in turn precedes modeling. And more often than
not, it is modeling that enables prediction. Without prediction (determination of
the values an outcome or entity will take under specific conditions), there can be
no optimization. And without optimization, there is no management. The quantity
that gets measured is reflected in our records as “data.” The word data comes
from the Latin root datum, meaning “given.” Thus, data (the plural of datum) are facts
which are given or known to be true. In what follows, we will explore some
preliminary conceptions about data, types of data, basic measurement scales, and
the implications therein.
3.1 Primary Versus Secondary Dichotomy
Data collection for research and analytics can broadly be divided into two major
types: primary data and secondary data. Consider a project or a business task that
requires certain data. Primary data would be data that is collected “at source” (hence,
primary in form) and specifically for the research at hand. The data source could
be individuals, groups, organizations, etc. and data from them would be actively
elicited or passively observed and collected. Thus, surveys, interviews, and focus
groups all fall under the ambit of primary data. The main advantage of primary data
is that it is tailored specifically to the questions posed by the research project. The
disadvantages are cost and time.
On the other hand, secondary data is that which has been previously collected
for a purpose that is not specific to the research at hand. For example, sales records,
industry reports, and interview transcripts from past research are data that would
continue to exist whether or not the project at hand had come to fruition. A good
example of a means to obtain secondary data that is rapidly expanding is the API
(Application Programming Interface)—an interface that is used by developers to
securely query external systems and obtain a myriad of information.
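To make the API route concrete, here is a minimal Python sketch of what querying such an interface might look like. The endpoint URL, its query parameter, and the shape of the JSON response are all hypothetical, and real APIs typically also require authentication keys.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration only; real APIs differ in URL structure,
# parameters, and authentication requirements.
url = "https://api.example.com/v1/exchange-rates?base=USD"

with urllib.request.urlopen(url) as response:               # issue the HTTP GET request
    payload = json.loads(response.read().decode("utf-8"))   # parse the JSON body

print(payload)  # a Python dict/list ready for downstream analysis
```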
In this chapter, we concentrate on data available in published sources and
websites (often called secondary data sources) as these are the most commonly used
data sources in business today.
4 Data Collection Methods
In this section, we describe various methods of data collection based on sources,
structure, type, etc. There are basically two methods of data collection: (1) data
generation through a designed experiment and (2) collecting data that already exists.
A brief description of these methods is given below.
4.1 Designed Experiment
Suppose an agricultural scientist wants to compare the effects of five different
fertilizers, A, B, C, D, and E, on the yield of a crop. The yield depends not only
on the fertilizer but also on the fertility of the soil. The scientist considers a few
relevant types of soil, for example, clay, silt, and sandy soil. In order to compare
the fertilizer effect one has to control for the soil effect. For each soil type, the
experimenter may choose ten representative plots of equal size and assign the five
fertilizers to the ten plots at random in such a way that each fertilizer is assigned
to two plots. He then observes the yield in each plot. This is the design of the
experiment. Once the experiment is conducted as per this design, the yields in
different plots are observed. This is the data collection procedure. As we notice, the
data is not readily available to the scientist. He designs an experiment and generates
the data. This method of data collection is possible when we can control different
factors precisely while studying the effect of an important variable on the outcome.
This is quite common in the manufacturing industry (while studying the effect
of machines on output or various settings on the yield of a process), psychology,
agriculture, etc. For well-designed experiments, determination of the causal effects
is easy. However, in social sciences and business where human beings often are the
instruments or subjects, experimentation is not easy and in fact may not even be
feasible. Despite the limitations, there has been tremendous interest in behavioral
experiments in disciplines such as finance, economics, marketing, and operations
management. For a recent account on design of experiments, please refer to
Montgomery (2017).
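As an aside, the random allocation step described above is easy to script. The following Python sketch (not part of the chapter's own material) assigns the five fertilizers to ten plots within each soil type so that each fertilizer lands on exactly two plots; the plot labels and the seed are illustrative assumptions.

```python
import random

fertilizers = ["A", "B", "C", "D", "E"]
soil_types = ["clay", "silt", "sandy"]   # soil types act as blocks
random.seed(42)                          # fixed seed for a reproducible illustration

design = {}
for soil in soil_types:
    # Two copies of each fertilizer give the ten plots for this soil type;
    # shuffling randomizes which plot receives which fertilizer.
    allocation = fertilizers * 2
    random.shuffle(allocation)
    design[soil] = {plot: fert for plot, fert in enumerate(allocation, start=1)}

for soil, plots in design.items():
    print(soil, plots)
```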
4.2 Collection of Data That Already Exists
Household income, expenditure, wealth, and demographic information are examples
of data that already exists. Collection of such data is usually done in three possible
ways: (1) complete enumeration, (2) sample survey, and (3) through available
sources where the data was collected possibly for a different purpose and is available
in different published sources. Complete enumeration is collecting data on all
items/individuals/firms. Such data, say, on households, may be on consumption
of essential commodities, the family income, births and deaths, education of each
member of the household, etc. This data is already available with the households
but needs to be collected by the investigator. The census is an example of complete
enumeration. This method will give information on the whole population. It may
appear to be the best way but is expensive both in terms of time and money. Also,
it may involve several investigators and investigator bias can creep in (in ways that
may not be easy to account for). Such errors are known as non-sampling errors. So
often, a sample survey is employed. In a sample survey, the data is not collected on
the entire population, but on a representative sample. Based on the data collected
from the sample, inferences are drawn on the population. Since data is not collected
on the entire population, there is bound to be an error in the inferences drawn. This
error is known as the sampling error. The inferences through a sample survey can be
made precise with error bounds. It is commonly employed in market research, social
sciences, public administration, etc. A good account on sample surveys is available
in Blair and Blair (2015).
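To make the notion of sampling error concrete, the short Python sketch below draws a simple random sample from a synthetic population of household incomes and attaches an approximate 95% error bound to the sample mean; the population values are fabricated purely for illustration.

```python
import math
import random

random.seed(0)

# Synthetic "population" of 10,000 household incomes (fabricated values).
population = [random.gauss(50_000, 12_000) for _ in range(10_000)]

# A simple random sample of 400 households instead of complete enumeration.
sample = random.sample(population, 400)

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
margin = 1.96 * sd / math.sqrt(n)   # approximate 95% error bound on the mean

print(f"Sample estimate: {mean:,.0f} +/- {margin:,.0f}")
print(f"Population mean: {sum(population) / len(population):,.0f}")
```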
Secondary data can be collected from two sources: internal or external. Internal
data is collected by the company or its agents on behalf of the company. The
defining characteristic of the internal data is its proprietary nature; the company
has control over the data collection process and also has exclusive access to the
data and thus the insights drawn on it. Although it is costlier than external data, the
exclusivity of access to the data can offer competitive advantage to the company.
The external data, on the other hand, can be collected by either third-party data
providers (such as IRI, AC Nielsen) or government agencies. In addition, recently
another source of external secondary data has come into existence in the form of
social media/blogs/review websites/search engines where users themselves generate
a lot of data through C2B or C2C interactions. Secondary data can also be classified
on the nature of the data along the dimension of structure. Broadly, there are
three types of data: structured, semi-structured (hybrid), and unstructured data.
Some examples of structured data are sales records, financial reports, customer
records such as purchase history, etc. A typical example of unstructured data is
in the form of free-flow text, images, audio, and videos, which are difficult to
store in a traditional database. Usually, in reality, data is somewhere in between
structured and unstructured and thus is called semi-structured or hybrid data. For
example, a product web page will have product details (structured) and user reviews
(unstructured).
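A small illustration of such hybrid data, using Python's standard json module, is sketched below; the product record and the review texts are invented for the example.

```python
import json

# A hypothetical product-page record: structured fields alongside free-text reviews.
page = """
{
  "product_id": "SKU-1042",
  "price": 499.0,
  "in_stock": true,
  "reviews": [
    "Battery life is great, but the screen scratches easily.",
    "Arrived two days late; otherwise happy with the purchase."
  ]
}
"""

record = json.loads(page)
print(record["price"])         # structured part: directly usable in analysis
print(len(record["reviews"]))  # unstructured part: needs text processing first
```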
The data and its analysis can also be classified on the basis of whether a single
unit is observed over multiple time points (time-series data), many units observed
once (cross-sectional data), or many units are observed over multiple time periods
(panel data). The insights that can be drawn from the data depend on the nature
of data, with the richest insights available from panel data. The panel could be
balanced (all units are observed over all time periods) or unbalanced (observations
on a few units are missing for a few time points either by design or by accident).
If the data is not missing excessively, it can be accounted for using the methods
described in Chap. 8.
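The toy pandas sketch below illustrates the three structures with a fabricated two-firm, three-year data set; the firm labels and sales figures are assumptions made only for illustration.

```python
import pandas as pd

# A small balanced panel: two units (firms) observed over three time periods each.
panel = pd.DataFrame({
    "firm":  ["A", "A", "A", "B", "B", "B"],
    "year":  [2015, 2016, 2017, 2015, 2016, 2017],
    "sales": [10.2, 11.5, 12.1, 8.4, 8.9, 9.7],
})

print(panel[panel["year"] == 2016])        # cross-sectional slice: all firms, one year
print(panel[panel["firm"] == "A"])         # time series: one firm over all years
print(panel.set_index(["firm", "year"]))   # panel view: indexed by unit and time
```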
5 Data Types
In programming, we primarily classify the data into three types—numerals, alphabets, and special characters and the computer converts any data type into binary
code for further processing. However, the data collected through various sources
can be of types such as numbers, text, image, video, voice, and biometrics.
The data type helps the analyst to evaluate which operations can be performed to
analyze the data in a meaningful way. The data can limit or enhance the complexity
and quality of analysis.
Table 2.1 lists a few examples of data categorized by type, source, and uses. You
can read more about them by following the links (all accessed on Aug 10, 2017).
5.1 Four Data Types and Primary Scales
Generally, there are four types of data associated with four primary scales, namely,
nominal, ordinal, interval, and ratio. Nominal scale is used to describe categories in
which there is no specific order while the ordinal scale is used to describe categories
in which there is an inherent order. For example, green, yellow, and red are three
colors that in general are not bound by an inherent order. In such a case, a nominal
scale is appropriate. However, if we are using the same colors in connection with
the traffic light signals, there is a clear order. In this case, these categories carry an
ordinal scale. Typical examples of the ordinal scale are (1) sick, recovering, healthy;
(2) lower income, middle income, higher income; (3) illiterate, primary school pass,
higher school pass, graduate or higher, and so on. In the ordinal scale, the differences
in the categories are not of the same magnitude (or even of measurable magnitude).
Interval scale is used to convey relative magnitude information such as temperature.
The term “Interval” comes about because rulers (and rating scales) have intervals
of uniform lengths. Example: “I rate A as a 7 and B as a 4 on a scale of 10.”
In this case, we not only know that A is preferred to B, but we also have some
idea of how much more A is preferred to B. Ratio scales convey information on
an absolute scale. Example: “I paid $11 for A and $12 for B.” The 11 and 12
here are termed “absolute” measures because the corresponding zero point ($0) is
understood in the same way by different people (i.e., the measure is independent of
subject).
Another set of examples for the four data types, this time from the world of
sports, could be as follows. The numbers assigned to runners are of nominal data
type, whereas the rank order of winners is of the ordinal data type. Note in the latter
case that while we may know who came first and who came second, we would not
know by how much based on the rank order alone. A performance rating on a 0–10
scale would be an example of an interval scale. We see this used in certain sports
ratings (i.e., gymnastics) wherein judges assign points based on certain metrics.
Finally, in track and field events, the time to finish in seconds is an example of ratio
data. The reference point of zero seconds is well understood by all observers.

Table 2.1 A description of data and their types, sources, and examples

Internal data
• Transaction data. Examples: sales (POS/online) transactions, stock market orders and trades, customer IP and geolocation data. Type: numbers, text. Sources(a): http://times.cs.uiuc.edu/~wang296/Data/; https://www.quandl.com/; https://www.nyse.com/data/transactions-statistics-data-library; https://www.sec.gov/answers/shortsalevolume.htm
• Customer preference data. Examples: website click stream, cookies, shopping cart, wish list, preorder. Type: numbers, text. Sources(a): C:\Users\username\AppData\Roaming\Microsoft\Windows\Cookies; Nearbuy.com (advance coupon sold)
• Experimental data. Examples: simulation games, clinical trials, live experiments. Type: text, number, image, audio, video. Sources(a): https://www.clinicaltrialsregister.eu/; https://www.novctrd.com/; http://ctri.nic.in/
• Customer relationship data. Examples: demographics, purchase history, loyalty rewards data, phone book. Type: text, number, image, biometrics.

External data
• Survey data. Examples: census, national sample survey, annual survey of industries, geographical survey, land registry. Type: text, number, image, audio, video. Sources(a): http://www.census.gov/data.html; http://www.mospi.gov.in/; http://www.csoisw.gov.in/; https://www.gsi.gov.in/; http://landrecords.mp.gov.in/
• Biometric data (fingerprint, retina, pupil, palm, face). Examples: immigration data, social security identity, Aadhar card (UID). Type: number, text, image, biometric. Sources(a): http://www.migrationpolicy.org/programs/migration-data-hub; https://www.dhs.gov/immigration-statistics
• Third party data. Examples: RenTrak, A. C. Nielsen, IRI, MIDT (Market Information Data Tapes) in the airline industry, people finder, associations, NGOs, database vendors, Google Trends, Google Public Data. Type: all possible data types. Sources(a): http://aws.amazon.com/datasets; https://www.worldwildlife.org/pages/conservation-science-data-and-tools; http://www.whitepages.com/; https://pipl.com/; https://www.bloomberg.com/; https://in.reuters.com/; http://www.imdb.com/; http://datacatalogs.org/; http://www.google.com/trends/explore; https://www.google.com/publicdata/directory
• Govt and quasi govt agencies. Examples: federal governments, regulators (Telecom, BFSI, etc.), World Bank, IMF, credit reports, climate and weather reports, agriculture production, benchmark indicators (GDP, etc.), electoral roll, driver and vehicle licenses, health statistics, judicial records. Type: all possible data types. Sources(a): http://data.gov/; https://data.gov.in/; http://data.gov.uk/; http://open-data.europa.eu/en/data/; http://www.imf.org/en/Data; https://www.rbi.org.in/Scripts/Statistics.aspx; https://www.healthdata.gov/; https://www.cibil.com/; http://eci.nic.in/; http://data.worldbank.org/
• Social sites data, user-generated data. Examples: Twitter, Facebook, YouTube, Instagram, Pinterest, Wikipedia, YouTube videos, blogs, articles, reviews, comments. Type: all possible data types. Sources(a): https://dev.twitter.com/streaming/overview; https://developers.facebook.com/docs/graph-api; https://en.wikipedia.org/; https://www.youtube.com/; https://snap.stanford.edu/data/web-Amazon.html; http://www.cs.cornell.edu/people/pabo/movie-review-data/
(a) All the sources were last accessed on Aug 10, 2017.
5.2 Common Analysis Types with the Four Primary Scales
The reason why it matters what primary scale was used to collect data is that
downstream analysis is constrained by data type. For instance, with nominal data, all
we can compute are the mode, some frequencies and percentages. Nothing beyond
this is possible due to the nature of the data. With ordinal data, we can compute
the median and some rank order statistics in addition to whatever is possible with
nominal data. This is because ordinal data retains all the properties of the nominal
data type. When we proceed further to interval data and then on to ratio data,
we encounter a qualitative leap over what was possible before. Now, suddenly,
the arithmetic mean and the variance become meaningful. Hence, most statistical
analysis and parametric statistical tests (and associated inference procedures) all
become available. With ratio data, in addition to everything that is possible with
interval data, ratios of quantities also make sense.
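As a small illustration of how these constraints show up in practice, the pandas sketch below encodes one variable per scale and computes only the statistics that are meaningful for it; the values themselves are made up.

```python
import pandas as pd

colors = pd.Series(["green", "red", "yellow", "red"], dtype="category")   # nominal
health = pd.Series(pd.Categorical(
    ["sick", "healthy", "recovering", "healthy"],
    categories=["sick", "recovering", "healthy"], ordered=True))          # ordinal
ratings = pd.Series([7, 4, 9, 6])        # interval-style 0-10 ratings
prices = pd.Series([11.0, 12.0, 9.5])    # ratio (absolute zero point)

print(colors.mode())                  # nominal: mode and frequency counts only
print(health.min(), health.max())     # ordinal: order-based statistics make sense
print(ratings.mean(), ratings.var())  # interval: mean and variance become meaningful
print(prices[1] / prices[0])          # ratio: ratios of values are interpretable
```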
The multiple-choice examples that follow are meant to concretize the understanding of the four primary scales and corresponding data types.
6 Problem Formulation Preliminaries
Even before data collection can begin, the purpose for which the data collection
is being conducted must be clarified. Enter, problem formulation. The importance
of problem formulation cannot be overstated—it comes first in any research project,
ideally speaking. Moreover, even small deviations from the intended path at the very
beginning of a project’s trajectory can lead to a vastly different destination than was
intended. That said, problem formulation can often be a tricky issue to get right. To
see why, consider the musings of a decision-maker and country head for XYZ Inc.
Sales fell short last year. But sales would’ve approached target except for 6 territories in 2
regions where results were poor. Of course, we implemented a price increase across-the-board last year, so our profit margin goals were just met, even though sales revenue fell
short. Yet, 2 of our competitors saw above-trend sales increases last year. Still, another
competitor seems to be struggling, and word on the street is that they have been slashing
prices to close deals. Of course, the economy was pretty uneven across our geographies last
year and the 2 regions in question, weak anyway, were particularly so last year. Then there
was that mess with the new salesforce compensation policy coming into effect last year. 1
of the 2 weak regions saw much salesforce turnover last year . . .
These are everyday musings in the lives of business executives and are far from
unusual. Depending on the identification of the problem, data collection strategies,
resources, and approaches will differ. The difficulty in being able to readily pinpoint
any one cause or a combination of causes as the specific problem highlights the issues
that crop up in problem formulation. Four important points jump out from the above
example. First, that reality is messy. Unlike textbook examples of problems, wherein
irrelevant information is filtered out a priori and only that which is required to solve
“the” identified problem exactly is retained, life seldom simplifies issues in such a
clear-cut manner. Second, borrowing from a medical analogy, there are symptoms—
observable manifestations of an underlying problem or ailment—and then there is
the cause or ailment itself. Symptoms could be a fever or a cold and the causes
could be bacterial or viral agents. However, curing the symptoms may not cure
the ailment. Similarly, in the previous example from XYZ Inc., we see symptoms
(“sales are falling”) and hypothesize the existence of one or more underlying
problems or causes. Third, note the pattern of connections between symptom(s) and
potential causes. One symptom (falling sales) is assumed to be coming from one
or more potential causes (product line, salesforce compensation, weak economy,
competitors, etc.). This brings up the fourth point—How can we diagnose a problem
(or cause)? One strategy would be to narrow the field of “ailments” by ruling out
low-hanging fruits—ideally, as quickly and cheaply as feasible. It is not hard to see
that the data required for this problem depends on what potential ailments we have
shortlisted in the first place.
6.1 Towards a Problem Formulation Framework
For illustrative purposes, consider a list of three probable causes from the messy
reality of the problem statement given above, namely, (1) product line is obsolete;
(2) customer-connect is ineffective; and (3) product pricing is uncompetitive (say).
Then, from this messy reality we can formulate decision problems (D.P.s) that
correspond to the three identified probable causes:
• D.P. #1: “Should new product(s) be introduced?”
• D.P. #2: “Should advertising campaign be changed?”
• D.P. #3: “Should product prices be changed?”
Note what we are doing in mathematical terms—if messy reality is a large
multidimensional object, then these D.P.s are small-dimensional subsets of that
reality. This “reduces” a messy large-dimensional object to a relatively more
manageable small-dimensional one.
The D.P., even though it is of small dimension, may not contain sufficient detail
to map directly onto tools. Hence, another level of refinement called the research
objective (R.O.) may be needed. While the D.P. is a small-dimensional object,
the R.O. is (ideally) a one-dimensional object. Multiple R.O.s may be needed to
completely “cover” or address a single D.P. Furthermore, because each R.O. is
one-dimensional, it maps easily and directly onto one or more specific tools in
the analytics toolbox. A one-dimensional problem formulation component better be
well defined. The R.O. has three essential parts that together lend necessary clarity
to its definition: (a) an action verb, (b) an actionable object, and (c) brevity, in that an
R.O. should typically fit within one handwritten line. For instance, the active-voice
statement "Identify the real and perceived gaps in our product line vis-à-vis that of
our main competitors" is an R.O. because its components, the action verb ("identify"),
the actionable object ("real and perceived gaps"), and brevity, are all satisfied.
Figure 2.1 depicts the problem formulation framework we just described in
pictorial form. It is clear from the figure that as we impose preliminary structure, we
effectively reduce problem dimensionality from large (messy reality) to somewhat
small (D.P.) to the concise and the precise (R.O.).
Fig. 2.1 A framework for problem formulation: messy reality (a large-dimensional object) is
reduced to a decision problem (a relatively small-dimensional object) and then to a research
objective (a one-dimensional object), which maps onto the analytics toolbox
6.2 Problem Clarity and Research Type
A quotation attributed to former US defense secretary Donald Rumsfeld in the run-up to the Iraq war goes as follows: "There are known-knowns. These are things we
know that we know. There are known-unknowns. That is to say, there are things that
we know we don’t know. But there are also unknown-unknowns. There are things
we don’t know we don’t know.” This statement is useful in that it helps discern the
differing degrees of the awareness of our ignorance about the true state of affairs.
To understand why the above statement might be relevant for problem formulation, consider that there are broadly three types of research that correspond to three
levels of clarity in problem definition. The first is exploratory research wherein the
problem is at best ambiguous. For instance, “Our sales are falling . . . . Why?” or
“Our ad campaign isn’t working. Don’t know why.” When identifying the problem
is itself a problem, owing to unknown-unknowns, we take an exploratory approach
to trace and list potential problem sources and then define what the problems
may be. The second type is descriptive research wherein the problem’s identity is
somewhat clear. For instance, “What kind of people buy our products?” or “Who is
perceived as competition to us?” These are examples of known-unknowns. The third
type is causal research wherein the problem is clearly defined. For instance, “Will
changing this particular promotional campaign raise sales?” is a clearly identified
known-unknown. Causal research (the cause in causal comes from the cause in
because) tries to uncover the “why” behind phenomena of interest and its most
powerful and practical tool is the experimentation method. It is not hard to see that
the level of clarity in problem definition vastly affects the choices available in terms
of data collection and downstream analysis.
7 Challenges in Data Collection
Data collection is about data and about collection. We have seen the value inherent
in the right data in Sect. 2. In Sect. 6, we have seen the importance of clarity in
problem formulation while determining what data to collect. Now it is time to turn
to the “collection” piece of data collection. What challenges might a data scientist
typically face in collecting data? There are various ways to list the challenges that
arise. The approach taken here follows a logical sequence.
The first challenge is in knowing what data to collect. This often requires
some familiarity with or knowledge of the problem domain. Second, after the data
scientist knows what data to collect, the hunt for data sources can proceed apace.
Third, having identified data sources (the next section features a lengthy listing of
data sources in one domain as part of an illustrative example), the actual process
of mining of raw data can follow. Fourth, once the raw data is mined, data quality
assessment follows. This includes various data cleaning/wrangling, imputation, and
other data “janitorial” work that consumes a major part of the typical data science
project’s time. Fifth, after assessing data quality, the data scientist must now judge
the relevance of the data to the problem at hand. While considering the above, at
each stage one has to take into consideration the cost and time constraints.
Consider a retailing context. What kinds of data would or could a grocery retail
store collect? Of course, there would be point-of-sale data on items purchased,
promotions availed, payment modes and prices paid in each market basket, captured
by UPC scanner machines. Apart from that, retailers would likely be interested in
(and can easily collect) data on a varied set of parameters. For example, that may
include store traffic and footfalls by time of the day and day of the week, basic
segmentation (e.g., demographic) of the store’s clientele, past purchase history of
customers (provided customers can be uniquely identified, that is, through a loyalty
or bonus program), routes taken by the average customer when navigating the
store, or time spent on an average by a customer in different aisles and product
departments. Clearly, in the retail sector, the wide variety of data sources and capture
points means that the data are typically large in the following three areas:
• Volume
• Variety (ranges from structured metric data on sales, inventory, and geo location
to unstructured data types such as text, images, and audiovisual files)
• Velocity (the speed at which data comes in and gets updated, e.g., sales or
inventory data, social media monitoring data, clickstreams, RFIDs (radio-frequency identification), etc.)
These fulfill the three attribute criteria required for data to be labeled "Big
Data" (Diebold 2012). A later section (Sect. 9) dives into the retail sector as an
illustrative example of data collection possibilities, opportunities, and challenges.
8 Data Collation, Validation, and Presentation
Collecting data from multiple sources will not result in rich insights unless the data
is collated to retain its integrity. Data validity may be compromised if proper care is
not taken during collation. One may face various challenges while trying to collate
the data. Below, we describe a few challenges along with the approaches to handle
them in the light of business problems.
• No common identifier: A challenge while collating data from multiple sources
arises due to the absence of common identifiers across different sources. The
analyst may seek a third identifier that can serve as a link between two data
sources.
• Missing data, data entry error: Missing data can either be ignored, deleted, or
imputed with relevant statistics (see Chap. 8).
• Different levels of granularity: The data could be aggregated at different levels.
For example, primary data is collected at the individual level, while secondary
data is usually available at the aggregate level. One can either aggregate the
data in order to bring all the observations to the same level of granularity or
can apportion the data using business logic (a short sketch of both options follows this list).
• Change in data type over the period or across the samples: In financial and
economic data, many a time the base period or multipliers are changed, which
needs to be accounted for to achieve data consistency. Similarly, samples
collected from different populations such as India and the USA may suffer from
inconsistent definitions of time periods—the financial year in India is from April
to March and in the USA, it is from January to December. One may require
remapping of old versus new data types in order to bring the data to the same
level for analysis.
• Validation and reliability: As the secondary data is collected by another user, the
researcher may want to validate it to check its correctness and reliability for the
particular research question at hand.
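The pandas sketch below illustrates the two options mentioned in the granularity item above: aggregating detailed data up to a coarser level before joining, or apportioning (here, simply attaching) aggregate figures down to the detailed level. The customer and region figures are fabricated for the example.

```python
import pandas as pd

# Primary data at the individual (customer) level.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region":      ["North", "North", "South", "South"],
    "spend":       [120.0, 80.0, 200.0, 150.0],
})

# Secondary data available only at the aggregate (region) level.
region_stats = pd.DataFrame({
    "region":             ["North", "South"],
    "median_income_kusd": [52.0, 48.0],
})

# Option 1: aggregate the detailed data up to the coarser level, then join.
by_region = customers.groupby("region", as_index=False)["spend"].sum()
print(by_region.merge(region_stats, on="region"))

# Option 2: apportion the aggregate figures down to the detailed level
# (here simply attaching the regional figure to each customer).
print(customers.merge(region_stats, on="region"))
```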
Data presentation is also very important to understand the issues in the data. The
basic presentation may include relevant charts such as scatter plots, histograms, and
pie charts or summary statistics such as the number of observations, mean, median,
variance, minimum, and maximum. You will read more about data visualization in
Chap. 5 and about basic inferences in Chap. 6.
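A minimal sketch of such a first look at the data, using pandas (with matplotlib installed for the chart), might be as follows; the sales series is made up for illustration.

```python
import pandas as pd

sales = pd.Series([12.1, 9.8, 15.3, 11.0, 14.2, 10.5], name="monthly_sales")

# Summary statistics: count, mean, standard deviation, min, quartiles, max.
print(sales.describe())

# A quick histogram; pandas delegates plotting to matplotlib.
ax = sales.plot(kind="hist", bins=5, title="Distribution of monthly sales")
ax.figure.savefig("sales_hist.png")
```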
9 Data Collection in the Retailing Industry: An Illustrative
Example
Bradlow et al. (2017) provide a detailed framework to understand and classify the
various data sources becoming popular with retailers in the era of Big Data and
analytics. Figure 2.2, taken from Bradlow et al. (2017), “organizes (an admittedly
incomplete) set of eight broad retail data sources into three primary groups, namely,
(1) traditional enterprise data capture; (2) customer identity, characteristics, social
graph and profile data capture; and (3) location-based data capture.” The claim
is that insight and possibilities lie at the intersection of these groups of diverse,
contextual, and relevant data.
Fig. 2.2 Data sources in the modern retail sector. The figure lists the following sources: (1) sales
and inventory data capture from enterprise systems; (2) loyalty or bonus card data for household
identification; (3) customers' web-presence data from the retailer's site and/or syndicated sources;
(4) customers' social graph and profile information; (5) mobile and app based data (both the
retailer's own app and syndicated sources); (6) customers' subconscious, habit based or
subliminally influenced choices (RFID, eye-tracking, etc.); (7) relative product locations in the
store layout and on shop shelves within an aisle; (8) environmental data such as weather
conditions; and (9) store location used for third-party order fulfillment. These fall under data
capture from traditional enterprise systems (UPC scanners, ERP, etc.), customer- or
household-level data capture, and location-based data capture.
Traditional enterprise data capture (marked #1 in Fig. 2.2) from UPC scanners
combined with inventory data from ERP or SCM software and syndicated databases
(such as those from IRI or Nielsen) enable a host of analyses, including the
following:
• Cross-sectional analysis of market baskets—item co-occurrences, complements
and substitutes, cross-category dependence, etc. (e.g., Blattberg et al. 2008;
Russell and Petersen 2000); a small co-occurrence sketch follows this discussion
• Analysis of aggregate sales and inventory movement patterns by stock-keeping
unit
• Computation of price or shelf-space elasticities at different levels of aggregation
such as category, brand, and SKU (see Bijmolt et al. (2005) for a review of this
literature)
• Assessment of aggregate effects of prices, promotions, and product attributes on
sales
In other words, traditional enterprise data capture in a retailing context enables
an overview of the four P’s of Marketing (product, price, promotion, and place at
the level of store, aisle, shelf, etc.).
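As referenced in the first bullet above, a bare-bones Python sketch of item co-occurrence counting over market baskets could look like the following; the baskets are invented for illustration, and a real implementation would read UPC-scanner transaction logs instead.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale market baskets (one set of items per transaction).
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "jam"},
    {"beer", "chips", "bread"},
]

# Count how often each pair of items appears together across baskets.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))   # the most frequently co-purchased pairs
```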
Customer identity, characteristics, social graph, and profile data capture identify
consumers and thereby make available a slew of consumer- or household-specific
information such as demographics, purchase history, preferences and promotional
response history, product returns history, and basic contacts such as email for email
marketing campaigns and personalized flyers and promotions. Bradlow et al. (2017,
p. 12) write:
Such data capture adds not just a slew of columns (consumer characteristics) to the most
detailed datasets retailers would have from previous data sources, but also rows in that
household-purchase occasion becomes the new unit of analysis. A common data source for
customer identification is loyalty or bonus card data (marked #2 in Fig. 2.2) that customers
sign up for in return for discounts and promotional offers from retailers. The advent of
household specific “panel” data enabled the estimation of household specific parameters
in traditional choice models (e.g., Rossi and Allenby 1993; Rossi et al. 1996) and their use
thereafter to better design household specific promotions, catalogs, email campaigns, flyers,
etc. The use of household- or customer identity requires that a single customer ID be used
as primary key to link together all relevant information about a customer across multiple
data sources. Within th…