ITEC 621 Predictive Analytics
Predictive Analytics Project
Prof. Espinosa – Last updated 7/17/2021
Background
The main goal of this project is to help you prepare for your practicum projects by giving you an
opportunity to put into practice what you have learned in class. The predictive analytics project
will be done in teams of at most 4 students. It is expected that all team members will contribute
equally and that everyone will take the opportunity to learn from each other. Business analytics is
not just about analyzing data. It requires teamwork and a compelling upfront articulation of the
specific business problem or analytics question being addressed; and a clear and concise report of
the findings and conclusion. We will follow the Cross Industry Standard Process for Data Mining
(CRISP-DM) framework for this project, which maps closely to INFORMS’ Job Task Analysis (JTA)
(http://info.informs.org/analytics-body-of-knowledge; Amazon), and it is a popular framework for
analytics projects.
CRISP-DM Overview
In essence, the CRISP-DM framework (see lecture slides) includes the following activities, which
we will adopt for the project:
• Business Understanding (CRISP-DM 1)
o (JTA Domain I) Formulate the business question to be answered or problem to be solved.
All business analytics projects must be driven by business needs or business value
propositions. This requires the articulation of the respective business case, leading to the
articulation of the business question or problem.
o (JTA Domain II) Translate business question into the respective analytics question. Not all
business questions or problems are amenable to analytics solutions. The project report must
specify how or why analytics is the appropriate approach to address the business question or
problem.
• Data Understanding (CRISP-DM 2)
o Data acquisition and pre-processing can take as much as 80% of the analytics project effort.
o (JTA Domain III) Acquire and identify relationships in the data. This step involves acquiring
the data (e.g., ETL or Extract-Transform-Load) and then doing a substantial amount of
descriptive analytics, including things like (as appropriate): descriptive statistics, correlation
analysis, ANOVA, distribution curves, visual plots and other graphs, and other related
analysis (e.g., cluster analysis). Predictive analytic modeling should not start until you have
developed a thorough understanding of the data. In fact, this phase may uncover issues and
relationships in the data that you did not anticipate, thus leading to reformulation of the
analytics question.
• Data Preparation (CRISP-DM 3)
o (JTA Domain III) Harmonize, re-scale and clean data, as needed. Data sets often need to be
split, merged, sub-sampled (for large data sets), and cleansed. This step involves all data
pre-processing activities, such as: re-structuring the data (e.g., normalizing scales, centering,
aggregating, etc.); addressing issues of missing data; and acquiring and merging other
related data.
• Modeling (CRISP-DM 4)
o Select the appropriate analysis methodology and tools, exploring various model
specifications, and then building the respective models. In this course we use R as the
primary analytical tool.
o (JTA Domain IV) Methodology Selection. The vast majority of the course is focused on
method selection (e.g., OLS regression, Logistic regression, Ridge or LASSO, trees, etc.).
Candidate models should be identified based on the analytics goals (interpretation, inference
and/or prediction). For this project, students need to focus on models that are relatively
interpretable and then select the model that has better predictive accuracy, based on cross
validation test error or deviance.
o (JTA Domain V) Model Building. Another area of focus in this course is on model
specification (e.g., linear, polynomial, interactions, variable selection, etc.). The initial set of
predictors to be used in the model must be driven by business domain knowledge. But then
this set should be narrowed down or refined using statistical methods like cross-validation
testing.
• Evaluation (CRISP-DM 5)
o This phase is not about evaluating the models; that happens in the Modeling phase above.
This phase is about evaluating the extent to which the analysis has answered the business
and analytics questions framed in phase 1. For this project, we will focus on the following:
o Interpretation of Results: the final project reports must provide very focused interpretation
of results, in terms of effects observed, fit statistics, and predictive power of the final
model.
o An important part of this interpretation is providing a well-documented answer to the
business and analytics question.
o It is also important that you tell a compelling story in your report. Storytelling is one of the
most important skills in business analytics. Remember, this is not a statistics class, but a
business class. You must tell a compelling story for your audience. The story must be backed
up by your findings.
• Deployment (CRISP-DM 6)
o (JTA Domain VI) For this project, deployment will focus on turning in your written report,
with the necessary interpretation and stories articulated in step 5 above.
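The Data Understanding and Data Preparation steps above can be sketched in base R as follows. This is only a minimal illustration: the built-in mtcars data frame is a placeholder for your own project data, the missing value is introduced artificially, and listwise deletion is just one of several ways to handle missing data.

```r
# Illustrative only: mtcars is a placeholder for your own project data.
dat <- mtcars

# Data understanding: descriptive statistics and correlations
summary(dat[, c("mpg", "wt", "hp")])
round(cor(dat[, c("mpg", "wt", "hp")]), 2)

# Data preparation: introduce one missing value to show one way of handling it
dat$hp[3] <- NA
n_missing <- colSums(is.na(dat))          # count of missing values per column
dat_clean <- dat[complete.cases(dat), ]   # listwise deletion (one option of several)

# Center and re-scale two predictors (mean 0, standard deviation 1)
dat_clean$wt_z <- as.numeric(scale(dat_clean$wt))
dat_clean$hp_z <- as.numeric(scale(dat_clean$hp))
```

Whatever approach you take, document each pre-processing decision so it can be explained in the report.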
Important note: not all projects lead to amazing findings. A model that shows no effects can offer
very interesting insights. It all depends on how you rationalize the lack of effects from a business
point of view. Along the same lines, this project is not so much about what you analyzed and
found, but about how effectively you described to your readers the motivation for your study,
your method evaluation and selection process, and what the implications of your findings are from a
business perspective.
Data
Any dataset not used in class for lectures, exercises or homework can be used for this project.
Students are expected to identify an interesting external data set to work with. In the past, many
students have used Kaggle data sets used in competitions, but there are many sources of public
data. Proprietary data sets can only be used with permission of the owner of the data set. It is OK
to use data from your practicums, if you have it, and use this project as an opportunity to work
with your client’s data. Unless the data is proprietary, teams must submit the actual datasets with
their final projects so that the professor can replicate some of your work when grading.
Requirements
All projects must evaluate 3 different modeling methods (e.g., OLS, Ridge, Logistic, LDA, trees,
etc.) with 2 different model specifications for each (e.g., different predictor subsets; polynomial,
log or other transformations; interactions, etc.).
IMPORTANT: the 2 model specifications selected above should be used in each of the 3 modeling
methods above. The best approach is to fit the first model using OLS or Logistic regression, using
both model specifications. Then, depending on your results and assumption testing, fit the same 2
specifications using two other models.
IMPORTANT: all team members must contribute their fair share of the analysis. I expect each
member to take the lead on one particular modeling method or transformations. I will be
surveying the team during the semester to evaluate how each member contributed to the project.
IMPORTANT: while you will be evaluating and testing 6 different models (3 model methods x 2
specifications), you should only report on the final model methods and specification selected, but
you must close the loop and re-fit your final model with the full dataset. There is no need to
report on all alternative models. You only need to discuss your model selection process, including
any fit statistics and cross-validation test results that guided your final selection. However, if you
wish to include output from alternative models and specifications, you can do that in an appendix.
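One possible way to organize the comparison described above is sketched below in base R. It is an illustration under stated assumptions, not a required workflow: mtcars stands in for your project data, the two formulas are arbitrary example specifications, and cv_mse is a hypothetical helper written for this sketch. Only OLS is shown; other methods would plug in through the fit_fun argument, provided they accept a formula and a data argument and support predict().

```r
# Illustrative only: mtcars stands in for your project data.
dat <- mtcars

# Two candidate specifications, to be reused across every modeling method
specs <- list(
  linear    = mpg ~ wt + hp,
  quadratic = mpg ~ wt + I(wt^2) + hp
)

# Hypothetical helper: k-fold cross-validation test MSE for a given
# formula and fitting function
cv_mse <- function(formula, data, fit_fun = lm, k = 10, seed = 1) {
  set.seed(seed)                                   # same folds for every model
  folds  <- sample(rep(1:k, length.out = nrow(data)))
  y_name <- all.vars(formula)[1]                   # outcome variable name
  sq_err <- numeric(0)
  for (i in 1:k) {
    train  <- data[folds != i, ]
    test   <- data[folds == i, ]
    fit    <- fit_fun(formula, data = train)
    pred   <- predict(fit, newdata = test)
    sq_err <- c(sq_err, (test[[y_name]] - pred)^2)
  }
  mean(sq_err)
}

# Cross-validation test MSE for each specification (OLS shown here)
cv_results <- sapply(specs, cv_mse, data = dat)
cv_results

# Close the loop: re-fit the selected specification with the full dataset
best_spec <- names(which.min(cv_results))
final_fit <- lm(specs[[best_spec]], data = dat)
summary(final_fit)
```

Fixing the seed inside the helper keeps the folds identical across models, so differences in test error reflect the models rather than the random split.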
Project Deliverables
This project has 5 deliverables:
Deliverable 1 (5 pts): Project Proposal (1 page, single-spaced)
A project proposal is due around the mid-semester point, per the class schedule. The goal in this
deliverable is to get you started on your project early and provide me with an idea of the direction
you are planning to take in your project. It is also an opportunity for me to give you feedback on
your project ideas. The proposal should contain the following sections:
(1) The business case – a brief rationale about the importance of this question/problem from a
business perspective. What is the value proposition of your project? The business case is the
motivation for your study. Why is this study important? And why should your client or
company devote resources to carry out the study? A business case should provide a convincing
statement articulating things like: how/why is the study important to your client? What are the
benefits that your study will provide? Or, what are the opportunity costs if you don’t carry out
the study? Business cases are most effective when your narrative is: specific, based on facts or
data, concise and to the point. By specific, we mean that it should be specific to your project.
(2) The business question – the business case should lead to one or many interesting business
questions to pursue in your project (e.g., how can we control the spread of an epidemic most
effectively?). In a real project, there will probably be more than one business question to
address, but for this project we encourage you to focus on a single business question.
(3) The analytics question – Not all business questions can be answered with analytics.
Translating a business question into an analytics question is simply providing a more detailed
formulation of the business question, such that the question is answerable through analytics. If
answering the business question requires that you analyze data, then your question is
answerable through analytics. Otherwise, it is not. Think of the analytics question as the verbal
translation of your predictive model. For example, if your analytics model is likely to be
something like Y ~ Focal Predictors (of interest) + Other Predictors (controls), and your focal
predictors are X1 and X2, then your analytics question would read somewhat like this: “In this
study we are interested in understanding the effect that X1 and X2 have on Y.” That’s it. This will
guide your model specification. Notice that we don’t need to discuss all predictors, just the
focal predictors of interest to the study.
The analytics question should be more specifically tailored to the outcome variable you will be
using in your models (e.g., how do population density, sanitation conditions and general
population health affect the spread of an epidemic?). The analytics question will lead to either
a quantitative or classification method. Although you can change this later, at this point, you
should discuss whether your analytics question is about a quantitative or classification outcome.
This will lead you into the correct modeling approach in the next deliverable. The effective
articulation of the analytics question should set you in the right direction to start building your
model; and
(4) Dataset(s) – Identify one or more possible datasets for the project. The more specific the
datasets you are contemplating the better.
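As a rough illustration of item (3), the template Y ~ Focal Predictors + Other Predictors can be written as an R model formula. Everything in the sketch below is hypothetical: the variable names (y, x1, x2, c1, c2) and the simulated data exist only to show the shape of the model, for both a quantitative and a classification outcome.

```r
# Hypothetical names: y is the outcome, x1 and x2 are the focal predictors,
# c1 and c2 are controls. The data are simulated purely for illustration.
set.seed(621)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), c1 = rnorm(n), c2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + 0.2 * dat$c1 + rnorm(n)

# Quantitative outcome: Y ~ focal predictors + controls, fit with OLS
fit_ols <- lm(y ~ x1 + x2 + c1 + c2, data = dat)
coef(summary(fit_ols))[c("x1", "x2"), ]   # effects of the focal predictors

# Classification outcome: same specification, fit with logistic regression
dat$y_high <- as.integer(dat$y > median(dat$y))
fit_logit  <- glm(y_high ~ x1 + x2 + c1 + c2, data = dat, family = binomial)
coef(summary(fit_logit))[c("x1", "x2"), ]
```

The analytics question then discusses only the focal predictors, while the controls remain in the model.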
Deliverable 2 (10 pts): Preliminary Data Analysis Report (2 pages of text,
single-spaced, plus appendices with R output as needed)
This deliverable is intended to get you started early on your project model method and
specification exploration. It is also meant to get you familiarized with the project data. You should
think of this deliverable as an early draft of your final report. It is also one last opportunity to get
feedback on the direction of your project.
Because all model explorations begin with either an OLS regression (for quantitative predictions)
or a Logistic regression (for classification predictions), this preliminary data analysis report will
include the following:
(1) IMPORTANT: your main text should only contain narratives. Place all statistical output and
plots in appendices. All appendices must be appropriately referenced in the main text.
(2) Revise and refine your project proposal as needed. More specifically, refine your business
case, business question and analytics question, as needed. Your deliverable 2 report must
include these revised items.
(3) Brief description of your dataset. In Deliverable 1 you discussed possible datasets to use. For
this deliverable, you must have settled on the specific dataset you will use in your project. You
don’t need to provide a full description of the dataset yet, but you need to provide enough
information for your professor to understand what you are analyzing. No need to provide
extensive descriptions, just the data source and the main variables you included in your
preliminary analysis. For each variable, please describe its respective variable type, unit of
measurement, and a brief description of the variable.
(4) Descriptive analytics. You must provide a brief discussion of the respective descriptive
statistics, correlation analysis, ANOVA and/or any plots you may have rendered to understand
the data and how variables relate to each other. The text in this section should be limited to a
brief analysis of the most salient aspects of this analysis. Provide a brief narrative of what you
learned from your descriptive analytics.
(5) Define an initial set of predictors for your model. These predictors must be variables in your
dataset and must be selected using business domain rationale. The initial set of predictors
should NOT be selected statistically, but you must articulate your rationale for why you chose
your initial set of predictors.
(6) If your analytics question is quantitative, run an OLS regression. If your analytics question is a
classification, run a Logistic regression. In either case you must include the predictors
identified above. Later in the project you will refine this initial set of predictors through
variable selection, best subsets, or other methods.
(7) Inspect residual and other regression plots, as appropriate, and conduct the necessary tests to
evaluate adherence to the OLS or Logit regression assumptions (e.g., multicollinearity, serial
correlation if there is time data, heteroscedasticity, linearity, etc.).
(8) Provide a brief statement of your conclusion.
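Items (6) and (7) above could look roughly like the base R sketch below for a quantitative outcome. It assumes mtcars as a placeholder dataset and an arbitrary set of three predictors standing in for a set chosen on business grounds. The variance inflation factors are computed by hand here; packaged alternatives such as car::vif(), lmtest::bptest() and lmtest::dwtest() are options if those packages are installed.

```r
# Illustrative only: mtcars is a placeholder for your project data, and the
# predictors below stand in for a set chosen on business grounds.
dat <- mtcars
fit <- lm(mpg ~ wt + hp + disp, data = dat)
summary(fit)

# Residual diagnostics: residuals vs fitted, Q-Q, scale-location, leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Multicollinearity: variance inflation factor, VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on the other predictors
predictors <- c("wt", "hp", "disp")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = dat))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

For a classification outcome, the same idea applies with glm(..., family = binomial) in place of lm().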
Deliverable 3 (0 pts): Meet with Professor. This deliverable does not have any points
assigned but it is mandatory for the ENTIRE team. All teams must schedule a meeting with the
professor shortly after submitting Deliverable 2. This is an important step for the professor to ask
you questions about your project and for you to get additional feedback and guidance on your
project.
Deliverable 4 (65 pts): Final Report (4 to 5 pages of text, single-spaced, plus appendices
with R output as needed)
IMPORTANT: as it should be clear by now, one important learning objective in the MS Analytics
program is being able to interpret analytics results and articulate them clearly to a business
audience. The market calls this “storytelling” and it boils down to writing concisely, to the point
and clearly, what your results mean for the business of your client. This involves things like
interpretations of statistical output and telling a good story. Avoid grandiose statements and fluff.
Get to the point right away because the space is limited and business people like succinct but
informational writing.
The final project report will be submitted as an analytics report prepared in MS Word or knitted
with R Markdown as a Word or PDF document. Most of these sections should be an extension of
your Preliminary Data Analysis Report above. The final project report will contain the following
sections:
(1) (10 pts.) A brief but compelling articulation of the business case. Combine (1), (2) and (3) from your
proposal into one coherent statement (or in subsections) discussing:
a) What is the business problem and/or business question, which your study seeks to solve or
answer?
b) The rationale about the importance of this question/problem from a business perspective.
Why is the problem you are analyzing important? That is, what is the value proposition of
your project to your client or managers?
c) Articulate the analytics question in very specific terms. Note that the business question
and the analytics question are related, but they are not the same. The business question
does not need to discuss variables in detail, but articulate the general area of business
inquiry. In contrast, the analytics question should be very specific and needs to clearly
state: (1) the type of problem you are addressing (i.e., quantitative or classification); (2) the
outcome (variable) you are predicting; and (3) the focal predictors of interest (not all of
them).
(2) (5 pts.) A description of the dataset utilized for the analysis (if the data set is not available in
an R package or public web site, the data set must be attached). Your data description should
be sufficient for your reading audience to understand your data set, variables and the
interpretations you provide in your report, including variable types and units of measurement.
The data description should be accompanied by any necessary descriptive analytics artifacts
necessary for your predictive modeling (e.g., descriptive statistics, correlation matrix,
correlation plots, other plots, etc.).
(3) (10 pts.) Descriptive Analytics: Brief analysis of the study variables, from both, business and
statistical perspectives.
a) First, clearly identify and describe your outcome variable(s).
b) Then specify and briefly describe your main predictors. You don’t need to discuss all
predictors in this section, just the ones that are most central to your analytics question and
business problem. You will be selecting the final predictors later, but before you do that, it
is important to have a business rationale for including them.
c) Briefly discuss any important aspects uncovered by your descriptive analytics of the data
(i.e., visual plots, descriptive statistics, correlations, etc.)
d) Finally, provide a brief discussion of any pre-processing (e.g., grouping, combining
variables, etc.) and transformations done with the data (e.g., normality, logs,
standardization, non-linear, etc.) you employed for some of the variables, if any, along with
the rationale for the appropriateness of this transformation (e.g., normality, non-linearity,
non-continuous, etc.). Again, you will be selecting your model specifications later, but you
want to do some descriptive analytics early to spot any issues with the data that may
require transformations.
Please include all the necessary plots, descriptive statistics, correlation matrices, etc. in an
appendix. Do not include R output in the main text.
(4) (10 pts.) A discussion of the (a) analytics methods and (b) model specifications you evaluated
and selected. All methods used must be appropriate and relevant to the problem and you
need to provide a justification for the selected methods based on:
(a) Conformance with or departure from OLS and/or Logistic regression assumptions, based on visual
inspections and assumption tests.
(b) Predictive accuracy based on cross-validation test statistics. Similarly, the particular model
specifications utilized must have a rationale. For example, if you chose a quadratic
regression specification, you must have some rationale for the respective non-linear
relationship. All projects must be analyzed with a variety of appropriate models with
different model specifications. Please consult with me if in doubt, but these are the
minimum requirements.
(5) (10 pts.) Analysis and presentation of results. Your analysis and results need to contain some
narrative to allow your audience to understand what you did. A simple output and diagram
dump with no explanation will receive very little credit. Every procedure, output and diagram
needs to be briefly but appropriately introduced before and briefly commented on its meaning
after. Don’t leave it up to the reader to interpret what you did. Also, vague and general
discussions of results will receive little credit. Your narrative of results should be factual and
specific, so it needs to be backed up by fit statistics, coefficient values and significance, etc.
(6) (10 pts.) A short section with final thoughts, conclusions and lessons learned. Business
analytics is about gaining insights from business data for decision making. This is the section
for you to articulate what insights you gained from your analysis. These conclusions must
contain a discussion of:
a) The main conclusions of your analysis. These conclusions must answer/solve your analytics
question/problem stated in (1) above. Please be brief and concise and discuss the main
insights you obtained from your analysis.
b) A brief statement of the main issues and challenges you faced in this project and what you
learned from it, including things like: data issues, methodological challenges, do’s and
don’ts, what you learned from this experience. You don’t need to address all of this. But
please be thoughtful and make it interesting.
(7) (10 pts.) Writing Quality, Formatting and Presentation. Analytics projects, no matter how good
they are, are not useful unless the analytics report is well written and clearly articulated.
Nobody wants to see a bunch of statistical output without sound commentary about the
results and their implications for business. Consequently, heavy weight will be placed on the
attractiveness, presentation and writing clarity of the report, which must be free of grammatical errors and typos.
More importantly, the entire report needs to flow and be understandable to your audience.
Deliverable 5 (10 pts): Brief Presentation to the Class (5 to 6 slides of content)
Each team will have 10 to 12 minutes or so to share with the class your: business
question/problem; model selection; and conclusions. All presentations must follow this format
(approximately one slide per bullet):
• Title slide with project name and team members’ names
• Business problem or analytics question addressed in the study with a short statement of the
business case
• Brief description of the dataset (describe any relevant aspects of descriptive statistics,
correlations, visual plot inspections, and pre-processing or transformations, as appropriate)
• Brief explanation of your model selection process and alternatives, along with the respective
model specifications.
• Discussion of the most relevant results. No need to discuss all results, just important ones.
• Final conclusions about implications of your findings
• Brief articulation of the challenges you encountered in your project.
Teamwork (10 pts): The instructor will make an assessment of how well the team worked
together. This project is not only about carrying out an analytics exercise, but also about getting some
experience working as a team, as you will do in your professional work. Some of this grading will
be based on the team as a whole (i.e., how well the team collaborated, distributed assignments
and worked together professionally); and some of it will be individual, based on team evaluations
and the professor’s observations about the fair share and quality contributions of each member.