You may use any dataset of your choice as long as it has more than 1,000 observations with at least 2 numeric predictors. The dataset should be relevant to your interests or research (should you pursue a thesis).
This project should have the following core components:
1) A data lake or data warehouse – defend your choice. Why are you choosing to store your data in one over the other?
2) Connect your data lake or data warehouse to a distributed cloud service such as AWS, Azure, or GCP.
3) Run your Spark application on those distributed services.
4) Documentation – this is the most CRUCIAL STEP. The final product should be a report (NO LESS than 3 pages) detailing the steps you took to reach the results and what the final results are.
5) The report should also include an explanation of your goal, the datasets used, and the technical approach that led to the end result.
6) THINK OF IT THIS WAY: I should be able to follow the approach in your report and replicate whatever you did.
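As a rough illustration of how steps 2 and 3 fit together, the sketch below shows one common pattern: loading a CSV from cloud object storage (here, an S3 bucket via Spark's `s3a://` connector) into a Spark DataFrame. This is only a hedged example, not a required implementation – the bucket and file names are placeholders, and your cloud provider or file format may differ.

```python
# Sketch: reading a dataset from cloud object storage into Spark.
# "my-bucket" and "my-dataset.csv" are PLACEHOLDERS - substitute your own.

def s3_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI, the scheme Spark's Hadoop S3 connector expects."""
    return f"s3a://{bucket}/{key}"

def load_dataset(bucket: str, key: str):
    """Start (or reuse) a Spark session and load a CSV from S3 as a DataFrame."""
    # Imported inside the function so the sketch can be read/tested
    # without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("final-project")
             .getOrCreate())
    # header=True: first row holds column names;
    # inferSchema=True: let Spark guess numeric vs. string types.
    return spark.read.csv(s3_uri(bucket, key), header=True, inferSchema=True)
```

On a real cluster you would submit this with `spark-submit` after configuring S3 credentials (for example through an IAM role on AWS EMR); the same pattern applies to Azure or GCP with their respective storage URI schemes.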
Criteria:
1- Includes data storage
2- Includes distributed cloud service
3- Includes final report