module-6-1
October 4, 2023
Homework 3 is at the end of the notebook.
1
Module 6: Classification
The following tutorial contains Python examples for solving classification problems. You should
refer to Chapters 3 and 4 of the “Introduction to Data Mining” book to understand
some of the concepts introduced in this tutorial. The notebook can be downloaded from
http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial6/tutorial6.ipynb.
Classification is the task of predicting a nominal-valued attribute (known as the class label) based on
the values of other attributes (known as predictor variables). The goals for this tutorial are as
follows:
1. To provide examples of using different classification techniques from the scikit-learn library package.
2. To demonstrate the problem of model overfitting.
Read the step-by-step instructions below carefully. To execute the code, click on the corresponding
cell and press the SHIFT-ENTER keys simultaneously.
[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import random
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
%matplotlib inline
1.1
6.1 Vertebrate Dataset
We use a variation of the vertebrate data described in Example 3.1 of Chapter 3. Each vertebrate
is classified into one of 5 categories: mammals, reptiles, birds, fishes, and amphibians, based on a
set of explanatory attributes (predictor variables). Except for “name”, the rest of the attributes
have been converted into a one hot encoding binary representation. To illustrate this, we will first
load the data into a Pandas DataFrame object and display its content.
[ ]: data = pd.read_csv('/content/drive/MyDrive/datamining/vertebrate.csv', header='infer')
data.head()
[ ]:
     Name  Warm-blooded  Gives Birth  Aquatic Creature  Aerial Creature  \
0   human             1            1                 0                0
1  python             0            0                 0                0
2  salmon             0            0                 1                0
3   whale             1            1                 1                0
4    frog             0            0                 1                0

   Has Legs  Hibernates       Class
0         1           0     mammals
1         0           1    reptiles
2         0           0      fishes
3         0           0     mammals
4         1           1  amphibians
Given the limited number of training examples, suppose we convert the problem into a binary
classification task (mammals versus non-mammals). We can do so by replacing the class labels of
the instances to non-mammals except for those that belong to the mammals class.
[ ]: data['Class'] = data['Class'].replace(['fishes','birds','amphibians','reptiles'],'non-mammals')
data.head()
[ ]:
     Name  Warm-blooded  Gives Birth  Aquatic Creature  Aerial Creature  \
0   human             1            1                 0                0
1  python             0            0                 0                0
2  salmon             0            0                 1                0
3   whale             1            1                 1                0
4    frog             0            0                 1                0

   Has Legs  Hibernates        Class
0         1           0      mammals
1         0           1  non-mammals
2         0           0  non-mammals
3         0           0      mammals
4         1           1  non-mammals
We can apply Pandas cross-tabulation to examine the relationship between the Warm-blooded and
Gives Birth attributes with respect to the class.
[ ]: pd.crosstab([data['Warm-blooded'],data['Gives Birth']],data['Class'])
[ ]:
Class                     mammals  non-mammals
Warm-blooded Gives Birth
0            0                  0            7
             1                  0            1
1            0                  0            2
             1                  5            0
The results above show that it is possible to distinguish mammals from non-mammals using these
two attributes alone since each combination of their attribute values would yield only instances that
belong to the same class. For example, mammals can be identified as warm-blooded vertebrates
that give birth to their young. Such a relationship can also be derived using a decision tree classifier,
as shown by the example given in the next subsection.
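The purity argument above can be checked with a few lines of plain Python. This is only a sketch: the counts are copied from the cross-tabulation output, while the dictionary layout and the helper name `is_pure` are assumptions made for illustration.

```python
# Counts copied from the cross-tabulation above:
# (Warm-blooded, Gives Birth) -> (mammals, non-mammals)
counts = {
    (0, 0): (0, 7),
    (0, 1): (0, 1),
    (1, 0): (0, 2),
    (1, 1): (5, 0),
}

def is_pure(cell):
    """True if at most one class has a nonzero count (hypothetical helper)."""
    return sum(1 for c in cell if c > 0) in (0, 1)

# Every attribute combination maps to a single class, so the two
# attributes alone separate mammals from non-mammals in this sample.
all_pure = all(is_pure(cell) for cell in counts.values())

# Attribute combinations that contain mammals.
mammal_combos = [combo for combo, (m, _) in counts.items() if m > 0]
print(all_pure, mammal_combos)
```

The only combination containing mammals is (1, 1), which matches the rule "warm-blooded and gives birth".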
1.2
6.2 Decision Tree Classifier
In this section, we apply a decision tree classifier to the vertebrate dataset described in the previous
subsection.
[ ]: Y = data['Class']
X = data.drop(['Name','Class'],axis=1)
clf = tree.DecisionTreeClassifier(criterion='entropy',max_depth=3)
clf = clf.fit(X, Y)
The preceding commands extract the predictor (X) and target class (Y) attributes from the
vertebrate dataset and create a decision tree classifier object that uses entropy as the impurity
measure for its splitting criterion. The decision tree class in the Python sklearn library also supports
‘gini’ as an impurity measure. The classifier above is also constrained to generate trees with a maximum
depth equal to 3. Next, the classifier is trained on the labeled data using the fit() function.
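To make the two impurity measures concrete, here is a minimal sketch that computes entropy and the Gini index from class counts. The function names are hypothetical and this is not how sklearn implements them internally; the root-node counts (5 mammals, 10 non-mammals) are the column totals of the cross-tabulation in Section 6.1.

```python
from math import log2

def entropy(class_counts):
    """Entropy (in bits) of a node with the given class counts."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(p * log2(1.0 / p) for p in probs)

def gini(class_counts):
    """Gini index of a node with the given class counts."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# Root node of the vertebrate data: 5 mammals and 10 non-mammals.
root = [5, 10]
print('entropy = %.3f, gini = %.3f' % (entropy(root), gini(root)))

# A pure node has zero impurity under both measures.
print(entropy([5, 0]), gini([5, 0]))
```

Both measures are largest when the classes are evenly mixed and drop to zero for a pure node, which is why either one can guide the choice of split.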
We can plot the resulting decision tree obtained after training the classifier. To do this, you must
first install both graphviz (http://www.graphviz.org) and its Python interface called pydotplus
(http://pydotplus.readthedocs.io/).
[ ]: import pydotplus
from IPython.display import Image
dot_data = tree.export_graphviz(clf, feature_names=X.columns,
                                class_names=['mammals','non-mammals'], filled=True,
                                out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
[ ]: [Figure: the decision tree learned from the vertebrate training data]
Next, suppose we apply the decision tree to classify the following test examples.
[ ]: testData = [['gila monster',0,0,0,0,1,1,'non-mammals'],
            ['platypus',1,0,0,0,1,1,'mammals'],
            ['owl',1,0,0,1,1,0,'non-mammals'],
            ['dolphin',1,1,1,0,0,0,'mammals']]
testData = pd.DataFrame(testData, columns=data.columns)
testData
[ ]:
           Name  Warm-blooded  Gives Birth  Aquatic Creature  Aerial Creature  \
0  gila monster             0            0                 0                0
1      platypus             1            0                 0                0
2           owl             1            0                 0                1
3       dolphin             1            1                 1                0

   Has Legs  Hibernates        Class
0         1           1  non-mammals
1         1           1      mammals
2         1           0  non-mammals
3         0           0      mammals
We first extract the predictor and target class attributes from the test data and then apply the
decision tree classifier to predict their classes.
[ ]: testY = testData['Class']
testX = testData.drop(['Name','Class'],axis=1)
predY = clf.predict(testX)
predictions = pd.concat([testData['Name'],pd.Series(predY,name='Predicted Class')], axis=1)
predictions
[ ]:
           Name Predicted Class
0  gila monster     non-mammals
1      platypus     non-mammals
2           owl     non-mammals
3       dolphin         mammals
Except for platypus, which is an egg-laying mammal, the classifier correctly predicts the class label
of the test examples. We can calculate the accuracy of the classifier on the test data as shown by
the example given below.
[ ]: print('Accuracy on test data is %.2f' % (accuracy_score(testY, predY)))
Accuracy on test data is 0.75
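The reported value can be reproduced by hand. The sketch below recomputes accuracy in plain Python from the true and predicted labels shown in the two tables above; the helper name `accuracy` is an assumption and simply mirrors what `accuracy_score` computes for this case.

```python
# Labels copied from the test table and the prediction table above.
test_labels = ['non-mammals', 'mammals', 'non-mammals', 'mammals']
pred_labels = ['non-mammals', 'non-mammals', 'non-mammals', 'mammals']

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

acc = accuracy(test_labels, pred_labels)
print('Accuracy on test data is %.2f' % acc)  # 3 of 4 correct -> 0.75
```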
1.3
6.3 Model Overfitting
To illustrate the problem of model overfitting, we consider a two-dimensional dataset containing
1500 labeled instances, each of which is assigned to one of two classes, 0 or 1. Instances from each
class are generated as follows:
1. Instances from class 1 are generated from a mixture of 3 Gaussian distributions, centered at [6,14], [10,6], and [14,14], respectively.
2. Instances from class 0 are generated from a uniform distribution in a square region whose sides have a length equal to 20.
For simplicity, both classes have an equal number of labeled instances. The code for generating and
plotting the data is shown below. All instances from class 1 are shown in red while those from class
0 are shown in black.
[ ]: N = 1500
mean1 = [6, 14]
mean2 = [10, 6]
mean3 = [14, 14]
cov = [[3.5, 0], [0, 3.5]]  # diagonal covariance
np.random.seed(50)
X = np.random.multivariate_normal(mean1, cov, int(N/6))
X = np.concatenate((X, np.random.multivariate_normal(mean2, cov, int(N/6))))
X = np.concatenate((X, np.random.multivariate_normal(mean3, cov, int(N/6))))
X = np.concatenate((X, 20*np.random.rand(int(N/2),2)))
Y = np.concatenate((np.ones(int(N/2)),np.zeros(int(N/2))))
plt.plot(X[:int(N/2),0],X[:int(N/2),1],'r+',X[int(N/2):,0],X[int(N/2):,1],'k.',ms=4)
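[ ]: [Figure: scatter plot of the generated data, with class 1 instances as red plus signs and class 0 instances as black dots]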
In this example, the call to train_test_split uses test_size=0.8, so 20% of the labeled data is used
for training and the remaining 80% is held out for testing.
We then fit decision trees of different maximum depths (from 2 to 50) to the training set and plot
their respective accuracies when applied to the training and test sets.
[ ]: Y
[ ]: array([1., 1., 1., …, 0., 0., 0.])
[ ]: #########################################
# Training and Test set creation
#########################################
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.8, random_state=1)
from sklearn import tree
from sklearn.metrics import accuracy_score
#########################################
# Model fitting and evaluation
#########################################
maxdepths = [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45,50]
trainAcc = np.zeros(len(maxdepths))
testAcc = np.zeros(len(maxdepths))
index = 0
for depth in maxdepths:
    # Fit one tree per maximum depth and record both accuracies.
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc[index] = accuracy_score(Y_train, Y_predTrain)
    testAcc[index] = accuracy_score(Y_test, Y_predTest)
    index += 1
#########################################
# Plot of training and test accuracies
#########################################
plt.plot(maxdepths,trainAcc,'ro-',maxdepths,testAcc,'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('Max depth')
plt.ylabel('Accuracy')
[ ]: Text(0, 0.5, 'Accuracy')
[Figure: training and test accuracy plotted against the maximum tree depth]
The plot above shows that training accuracy will continue to improve as the maximum depth of
the tree increases (i.e., as the model becomes more complex). However, the test accuracy initially
improves up to a maximum depth of 5, before it gradually decreases due to model overfitting.
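The pattern described above can be summarized numerically by looking at the gap between training and test accuracy. The sketch below uses made-up accuracy values purely for illustration (they are not the numbers behind the plot), and the helper name `best_depth` is hypothetical. In practice the depth is usually chosen on a separate validation set or by cross-validation rather than on the test set.

```python
# Hypothetical accuracy values for illustration only; they are NOT the
# numbers behind the plot above.
maxdepths = [2, 3, 4, 5, 6, 8, 10, 20]
train_acc = [0.70, 0.74, 0.78, 0.82, 0.85, 0.90, 0.95, 1.00]
test_acc = [0.68, 0.71, 0.74, 0.76, 0.75, 0.73, 0.71, 0.69]

def best_depth(depths, test_scores):
    """Depth with the highest held-out accuracy (first one wins on ties)."""
    best_index = max(range(len(depths)), key=lambda i: test_scores[i])
    return depths[best_index]

# Gap between training and test accuracy; a widening gap at larger
# depths is the usual symptom of overfitting.
gaps = [round(tr - te, 2) for tr, te in zip(train_acc, test_acc)]

print('best depth:', best_depth(maxdepths, test_acc))
print('train-test gaps:', gaps)
```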
1.4
6.4 Alternative Classification Techniques
Besides the decision tree classifier, the Python sklearn library also supports other classification
techniques. In this section, we provide examples to illustrate how to apply the k-nearest neighbor
classifier, linear classifiers (logistic regression and support vector machine), as well as ensemble
methods (boosting, bagging, and random forest) to the 2-dimensional data given in the previous
section.
1.4.1
6.4.4 Ensemble Methods
An ensemble classifier constructs a set of base classifiers from the training data and performs
classification by taking a vote on the predictions made by each base classifier. We consider 3 types
of ensemble classifiers in this example: bagging, boosting, and random forest. Detailed explanation
about these classifiers can be found in Section 4.10 of the book.
In the example below, we fit 100 base classifiers (numBaseClassifiers = 100) with each ensemble
method, using the breast cancer data loaded in the next cell. For bagging and AdaBoost, the base
classifier is a decision tree with maximum depth equal to 10.
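The voting step described above can be sketched in a few lines of plain Python. The predictions below are hypothetical and the helper name `majority_vote` is an assumption; this shows a simple unweighted vote, whereas boosting methods such as AdaBoost weight the votes of their base classifiers.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions for one instance by plurality vote."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical predictions: one row per base classifier, one column per instance.
base_predictions = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]

# Transpose so that each tuple holds the votes for one test instance.
ensemble_prediction = [majority_vote(votes) for votes in zip(*base_predictions)]
print(ensemble_prediction)
```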
[2]: data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']
data = data.drop(['Sample code'],axis=1)
print('Number of instances = %d' % (data.shape[0]))
print('Number of attributes = %d' % (data.shape[1]))
data.head()
Number of instances = 699
Number of attributes = 10
[2]:
   Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  \
0                5                        1                         1
1                5                        4                         4
2                3                        1                         1
3                6                        8                         8
4                4                        1                         1

   Marginal Adhesion  Single Epithelial Cell Size Bare Nuclei  \
0                  1                            2           1
1                  5                            7          10
2                  1                            2           2
3                  1                            3           4
4                  3                            2           1

   Bland Chromatin  Normal Nucleoli  Mitoses  Class
0                3                1        1      2
1                3                2        1      2
2                3                1        1      2
3                3                7        1      2
4                3                1        1      2
[ ]: data = data.replace('?', np.nan)
# 'Bare Nuclei' is read as strings because of the '?' entries; convert it
# to numbers so that the column median is well defined.
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'])
data.fillna(data.median(), inplace=True)
Y = data['Class']
X = data.drop(['Class'],axis=1)
#########################################
# Training and Test set creation
#########################################
from sklearn.utils import shuffle,resample
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
[ ]: X_test.shape
[ ]: (140, 9)
[ ]: from sklearn import ensemble
from sklearn.tree import DecisionTreeClassifier
numBaseClassifiers = 100
maxdepth = 10
trainAcc = []
testAcc = []
# Random forest
clf = ensemble.RandomForestClassifier(n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
# Bagging with depth-limited decision trees as base classifiers
clf = ensemble.BaggingClassifier(DecisionTreeClassifier(max_depth=maxdepth),n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
# AdaBoost with depth-limited decision trees as base classifiers
clf = ensemble.AdaBoostClassifier(DecisionTreeClassifier(max_depth=maxdepth),n_estimators=numBaseClassifiers)
clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
trainAcc.append(accuracy_score(Y_train, Y_predTrain))
testAcc.append(accuracy_score(Y_test, Y_predTest))
print("Results\n")
print('Random Forest Train = %.2f; Test = %.2f' % (trainAcc[0],testAcc[0]))
print('Bagging Train = %.2f; Test = %.2f' % (trainAcc[1],testAcc[1]))
print('AdaBoost Train = %.2f; Test = %.2f\n' % (trainAcc[2],testAcc[2]))
# Bar charts of training (left) and test (right) accuracy per method
methods = ['Random Forest', 'Bagging', 'AdaBoost']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))
ax1.bar([1.5,2.5,3.5], trainAcc)
ax1.set_xticks([1.5,2.5,3.5])
ax1.set_xticklabels(methods)
ax2.bar([1.5,2.5,3.5], testAcc)
ax2.set_xticks([1.5,2.5,3.5])
ax2.set_xticklabels(methods)
Results
Random Forest Train = 1.00; Test = 0.97
Bagging Train = 1.00; Test = 0.95
AdaBoost Train = 1.00; Test = 0.91
[ ]: [Text(1.5, 0, 'Random Forest'),
Text(2.5, 0, 'Bagging'),
Text(3.5, 0, 'AdaBoost')]
1.4.2
6.4.1 K-Nearest neighbor classifier
In this approach, the class label of a test instance is predicted based on the majority class of its k
closest training instances. The number of nearest neighbors, k, is a hyperparameter that must be
provided by the user, along with the distance metric. By default, we can use Euclidean distance
(which is equivalent to Minkowski distance with exponent p = 2):
$$\text{Minkowski distance}(x, y) = \left[ \sum_{i=1}^{N} |x_i - y_i|^p \right]^{\frac{1}{p}}$$
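The distance formula and the majority-vote rule can be written out directly. The following is a minimal pure-Python sketch, not the sklearn implementation; the function names and the toy points are hypothetical.

```python
from collections import Counter

def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_predict(train_points, train_labels, query, k=3, p=2):
    """Majority class among the k training points closest to the query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data for illustration.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
labels = [0, 0, 0, 1, 1]

print(minkowski_distance((0, 0), (3, 4), p=2))  # Euclidean distance
print(minkowski_distance((0, 0), (3, 4), p=1))  # Manhattan distance
print(knn_predict(points, labels, query=(1, 1), k=3))
```

Setting p=1 gives the Manhattan distance and p=2 the Euclidean distance, so the same function covers both metrics.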
[ ]: from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib inline
numNeighbors = [1, 5, 10, 15, 20, 25, 30]
trainAcc = []
testAcc = []
for k in numNeighbors:
    # Fit a k-nearest neighbor classifier with Euclidean distance (p=2).
    clf = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc.append(accuracy_score(Y_train, Y_predTrain))
    testAcc.append(accuracy_score(Y_test, Y_predTest))
plt.plot(numNeighbors, trainAcc, 'ro-', numNeighbors, testAcc,'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
1.4.3
6.4.2 Linear Classifiers
Linear classifiers such as logistic regression and support vector machine (SVM) construct a linear
separating hyperplane to distinguish instances from different classes.
For logistic regression, the model can be described by the following equation:
$$P(y = 1|x) = \frac{1}{1 + \exp(-w^T x - b)} = \sigma(w^T x + b)$$
The model parameters (w,b) are estimated by optimizing the following regularized negative
log-likelihood function:
$$(w^*, b^*) = \arg\min_{w,b} \; -\sum_{i=1}^{N} \Big( y_i \log\left[\sigma(w^T x_i + b)\right] + (1 - y_i) \log\left[\sigma(-w^T x_i - b)\right] \Big) + \frac{1}{C} \Omega([w, b])$$
where $C$ is a hyperparameter that sets the inverse of the regularization strength (smaller values imply
stronger regularization) while $\Omega(\cdot)$ is the regularization term, which by default is assumed to be
an $l_2$-norm in sklearn.
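As a rough illustration of the objective above, the sketch below evaluates the logistic function and the regularized negative log-likelihood in plain Python. It is not how sklearn fits the model. The toy data, the parameter values, and the choice of the squared l2-norm of all parameters for the regularization term are assumptions made for this example.

```python
from math import exp, log

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + exp(-z))

def regularized_nll(w, b, X, y, C=1.0):
    """Negative log-likelihood plus (1/C) * Omega([w, b]).

    Assumption: Omega is the squared l2-norm of all parameters.
    """
    nll = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        nll -= y_i * log(sigmoid(z)) + (1 - y_i) * log(sigmoid(-z))
    penalty = sum(w_j ** 2 for w_j in w) + b ** 2
    return nll + penalty / C

# Hypothetical toy data with labels in {0, 1}.
X_toy = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
y_toy = [0, 0, 1]
w_toy, b_toy = [1.0, 1.0], -2.5

# The same parameters are penalized more heavily when C is small.
for C in (0.1, 1.0, 10.0):
    print(C, round(regularized_nll(w_toy, b_toy, X_toy, y_toy, C), 3))
```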
For support vector machine, the model parameters $(w^*, b^*)$ are estimated by solving the following
constrained optimization problem:
$$\min_{w^*, b^*, \{\xi_i\}} \; \frac{\|w\|^2}{2} + \frac{1}{C} \sum_i \xi_i$$
$$\text{s.t.} \quad \forall i: \; y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0$$
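To see how the slack variables enter the objective, the following sketch evaluates it for fixed parameters in plain Python. It follows the formula as written above (with the 1/C factor) and is not the sklearn solver. The labels in {-1, +1}, the identity feature map for phi, the toy data, and the function name `svm_objective` are all assumptions.

```python
def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin objective ||w||^2 / 2 + (1/C) * sum(xi), as written above.

    Assumptions: labels are in {-1, +1} and phi is the identity map.
    Each slack xi_i is the smallest value satisfying both constraints,
    i.e. max(0, 1 - y_i * (w . x_i + b)).
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(w_j ** 2 for w_j in w) + sum(slacks) / C
    return objective, slacks

# Hypothetical toy data: the third point lies on the decision boundary,
# so it violates the margin and needs a slack of 1.
X_toy = [[2.0, 2.0], [0.0, 0.0], [1.5, 1.5]]
y_toy = [1, -1, -1]
objective, slacks = svm_objective([1.0, 1.0], -3.0, X_toy, y_toy, C=1.0)
print(objective, slacks)
```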
[ ]: from sklearn import linear_model
from sklearn.svm import SVC
C = [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]
LRtrainAcc = []
LRtestAcc = []
SVMtrainAcc = []
SVMtestAcc = []
for param in C:
    # Logistic regression with regularization parameter C
    clf = linear_model.LogisticRegression(C=param)
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    LRtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    LRtestAcc.append(accuracy_score(Y_test, Y_predTest))
    # Linear support vector machine with the same C
    clf = SVC(C=param,kernel='linear')
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    SVMtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    SVMtestAcc.append(accuracy_score(Y_test, Y_predTest))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))
ax1.plot(C, LRtrainAcc, 'ro-', C, LRtestAcc,'bv--')
ax1.legend(['Training Accuracy','Test Accuracy'])
ax1.set_xlabel('C')
ax1.set_xscale('log')
ax1.set_ylabel('Accuracy')
ax2.plot(C, SVMtrainAcc, 'ro-', C, SVMtestAcc,'bv--')
ax2.legend(['Training Accuracy','Test Accuracy'])
ax2.set_xlabel('C')
ax2.set_xscale('log')
ax2.set_ylabel('Accuracy')
Note that linear classifiers perform poorly on the data since the true decision boundaries between
classes are nonlinear for the given 2-dimensional dataset.
1.4.4
6.4.3 Nonlinear Support Vector Machine
The code below shows an example of using nonlinear support vector machine with a Gaussian radial
basis function kernel to fit the 2-dimensional dataset.
[ ]: from sklearn.svm import SVC
C = [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]
SVMtrainAcc = []
SVMtestAcc = []
for param in C:
    # Support vector machine with a Gaussian RBF kernel
    clf = SVC(C=param,kernel='rbf',gamma='auto')
    clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    SVMtrainAcc.append(accuracy_score(Y_train, Y_predTrain))
    SVMtestAcc.append(accuracy_score(Y_test, Y_predTest))
plt.plot(C, SVMtrainAcc, 'ro-', C, SVMtestAcc,'bv--')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('C')
plt.xscale('log')
plt.ylabel('Accuracy')
Observe that the nonlinear SVM can achieve a higher test accuracy compared to linear SVM.
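The Gaussian radial basis function kernel used in this section can be sketched as follows. This is a plain-Python illustration with a hypothetical function name, not the sklearn implementation; the sample points are made up, and the value of gamma follows the gamma='auto' setting, which sklearn defines as 1 / n_features.

```python
from math import exp

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)

# With gamma='auto', scikit-learn sets gamma to 1 / n_features.
n_features = 2
gamma = 1.0 / n_features

print(rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma))  # identical points
print(rbf_kernel((0.0, 0.0), (3.0, 4.0), gamma))  # distant points
```

The kernel value is 1 for identical points and decays toward 0 as the points move apart, which is what allows the resulting decision boundary to be nonlinear in the original feature space.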
1.5
6.5 Summary
This section provides several examples of using the Python sklearn library to build classification models
from given input data. We also illustrate the problem of model overfitting and show how to apply
different classification methods to the given dataset.
[3]: !pip install ucimlrepo
Collecting ucimlrepo
Downloading ucimlrepo-0.0.1-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.1
[4]: from ucimlrepo import fetch_ucirepo
# fetch dataset
iris = fetch_ucirepo(id=53)
# data (as pandas dataframes)
X = iris.data.features
y = iris.data.targets
# metadata
print(iris.metadata)
# variable information
print(iris.variables)
{'uci_id': 53, 'name': 'Iris', 'repository_url':
'https://archive.ics.uci.edu/dataset/53/iris', 'data_url':
'https://archive.ics.uci.edu/static/public/53/data.csv', 'abstract': 'A small
classic dataset from Fisher, 1936. One of the earliest known datasets used for
evaluating classification methods.\n', 'area': 'Life Science', 'tasks':
['Classification'], 'characteristics': ['Tabular'], 'num_instances': 150,
'num_features': 4, 'feature_types': ['Real'], 'demographics': [], 'target_col':
['class'], 'index_col': None, 'has_missing_values': 'no',
'missing_values_symbol': None, 'year_of_dataset_creation': 1936, 'last_updated':
'Tue Sep 12 2023', 'dataset_doi': '10.24432/C56C76', 'creators': ['R. A.
Fisher'], 'intro_paper': {'title': 'The Iris data set: In search of the source
of virginica', 'authors': 'A. Unwin, K. Kleinman', 'published_in':
'Significance, 2021', 'year': 2021, 'url':
'https://www.semanticscholar.org/paper/4599862ea877863669a6a8e63a3c707a787d5d7e',
'doi': '1740-9713.01589'},
'additional_info': {'summary': 'This is one of the earliest datasets used in the
literature on classification methods and widely used in statistics and machine
learning. The data set contains 3 classes of 50 instances each, where each
class refers to a type of iris plant. One class is linearly separable from the
other 2; the latter are not linearly separable from each other.\n\nPredicted
attribute: class of iris plant.\n\nThis is an exceedingly simple domain.\n\nThis
data differs from the data presented in Fishers article (identified by Steve
Chadwick, spchadwick@espeedaz.net ). The 35th sample should be:
4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th
sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and
third features. ', 'purpose': 'N/A', 'funded_by': None, 'instances_represent':
'Each instance is a plant', 'recommended_data_splits': None, 'sensitive_data':
None, 'preprocessing_description': None, 'variable_info': None, 'citation':
None}}
           name     role         type demographic  \
0  sepal length  Feature   Continuous        None
1   sepal width  Feature   Continuous        None
2  petal length  Feature   Continuous        None
3   petal width  Feature   Continuous        None
4         class   Target  Categorical        None

                                         description units missing_values
0                                               None    cm             no
1                                               None    cm             no
2                                               None    cm             no
3                                               None    cm             no
4  class of iris plant: Iris Setosa, Iris Versico…  None             no
[5]: X.head()
[5]:
   sepal length  sepal width  petal length  petal width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
[6]: y.head()
[6]:
         class
0  Iris-setosa
1  Iris-setosa
2  Iris-setosa
3  Iris-setosa
4  Iris-setosa
2
Homework 3
The goal of this homework is to use a well-known machine learning package in Python called sklearn. The best
resource for learning sklearn is its documentation, found at https://scikit-learn.org/stable/.
We will use the Iris dataset. To download the Iris dataset: https://archive.ics.uci.edu/dataset/53/iris.
Your task is to build and compare the different classification algorithms introduced in this class.
2.0.1
Requirements and questions
Exploratory Analysis
Is there any correlation between features? Why or why not?
Are there any outliers in the dataset? Why or why not?
Is any normalization needed? Why or why not?
Are there any missing values? Why or why not?
Classification
Experiment with the following algorithms/models: decision tree, random forest, AdaBoost, KNN, SVM,
MLP, and Naive Bayes.
For each model, use a train-test split with 80% for training and 20% for testing.
For each algorithm, output the following performance measures: accuracy, precision, and recall.
Visualization
Pick at least one visualization for model performance comparison.
Reflection
What is the best K in KNN? Only consider k in the range of 1-15. Use odd numbers only.
What are the most important features found by decision tree and random forest? (Graduate students
only)
Which algorithm has the highest accuracy? Is there model overfitting for this algorithm? Why
or why not? Perform 10-fold cross-validation with this algorithm and report the accuracy, precision,
and recall. (Graduate students only)