In this article, we first introduce supervised learning and then take a short dive into its two most common types of algorithms: classification and regression. We finish with two coding examples, one for classification and one for regression. Each uses a different dataset and walks through the steps of the algorithm.
The Table of Contents is below. Skim it first to decide whether this article is relevant for you.
Table of Contents
- Introduction to Supervised Learning
- Introduction to Classification & Regression
2.1 Classification
2.2 Regression
- Prerequisites for the code examples
- Classification Example
4.1 Python Code
- Regression Example
5.1 Python Code
- Supervised Learning Applications
- Conclusion
Introduction to Supervised Learning
Supervised Learning is the most common type of learning in Machine Learning. During training, the algorithm is given a dataset together with the correct answers (labels), hence the name ‘supervised’. Then, during testing, the model tries to predict the correct output for similar new examples on the basis of what it has learnt from the previous data samples. To put this in a more relatable manner, let’s consider a student preparing for a Maths exam. The student first does practice questions for which they can see the answers. If they get a wrong answer, they work backwards to see which ‘step’ they messed up and try to correct it. In the first go, they might get only 2 out of 10 practice questions correct, in which case they would redo them. Once they start getting more than 90% of their practice questions right, they could consider themselves ready for the actual exam. In the exam, they will get questions they haven’t seen or solved before, but they will use the concepts learned during practice to solve them. That’s supervised learning in a nutshell!
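To make the train-then-predict cycle concrete, here is a minimal sketch. The data is made up for illustration (hours studied and hours slept, labelled pass or fail), and the choice of a k-nearest-neighbours classifier is an assumption for this sketch only; the examples later in the article use other models.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical practice data: [hours studied, hours slept]
# Labels are the "correct answers": 0 = failed, 1 = passed
X_train = [[1, 4], [2, 5], [3, 6], [7, 7], [8, 6], [9, 8]]
y_train = [0, 0, 0, 1, 1, 1]

# Training: the model sees both the inputs and their labels
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Testing: new, unseen inputs for which the model must predict the label
X_new = [[2, 4], [8, 7]]
predicted = model.predict(X_new)
print(predicted)
```

The first new example sits among the failed samples and the second among the passed ones, so the three nearest neighbours vote accordingly.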
Supervised Learning Algorithms: Classification & Regression
We are going to talk about the two most important and commonly used techniques in supervised learning:
Classification
The target variable consists of categories, i.e. the task is to identify which category an object belongs to. The output variable is discrete. Consider a dataset of cat and dog images. The classifier takes an image as input, and its output falls into one of two discrete categories: cat or dog. The digit classifier we are going to code is another example. The cat vs dog classifier has two classes, whereas the digit classifier has 10, Class 0 to Class 9, since there are 10 digits in total.
Input: Image containing either a cat or a dog
Output: Probability values for each class (Example: {‘Cat’: 0.80, ‘Dog’: 0.20})
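As a rough sketch of what "probability values for each class" looks like in code, the snippet below trains a classifier and asks it for per-class probabilities. To keep it small, it assumes two made-up numeric features (weight and ear length) instead of real images, and it assumes logistic regression as the model; both are illustrative choices, not part of the examples that follow.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical features per animal: [weight in kg, ear length in cm]
X = [[4.0, 6.5], [3.5, 6.0], [5.0, 7.0],
     [20.0, 10.0], [25.0, 12.0], [30.0, 11.0]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one probability per class, in the order of clf.classes_
sample = [[4.5, 6.8]]
probs = clf.predict_proba(sample)[0]
result = dict(zip(clf.classes_, probs))
print(result)

# predict returns the single most likely class
label = clf.predict(sample)[0]
print(label)
```

The probabilities always sum to 1, and the predicted label is simply the class with the highest probability.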
Regression
The target variable is continuous, i.e. the task is to predict a continuous-valued attribute associated with an object. The output variable is a real value. For example, consider a dataset of house prices in a certain area. The regression model takes as input features of the house, such as number of rooms, area, furnished (yes/no), etc., and based on those has to output the estimated worth of the house. That is a regression task because the price is a continuous output.
Input: CSV file containing columns like number of bedrooms, area of the house in sq. ft. etc.
Output: Predicted Price or worth of the house (Example: $2501)
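The house-price idea can be sketched in a few lines. The numbers below are invented and deliberately follow an exact rule (price = 50000 + 10000 per bedroom + 100 per sq. ft.), so that it is easy to check what the model learns; real data would be noisy and the fit would not be exact.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [number of bedrooms, area in sq. ft.]
X = [[2, 800], [3, 1200], [3, 1500], [4, 1800], [5, 2400]]
# Prices generated from the made-up rule described above
y = [150000, 200000, 230000, 270000, 340000]

model = LinearRegression()
model.fit(X, y)

# The output is a continuous value, not a category
predicted_price = model.predict([[4, 2000]])[0]
print(predicted_price)
print(model.coef_, model.intercept_)
```

Because the toy prices are exactly linear in the features, the learned coefficients recover the rule, and the prediction for a 4-bedroom, 2000 sq. ft. house is 50000 + 40000 + 200000.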
Prerequisites for the code examples
Before you go ahead, please note that there are a few prerequisites for understanding the code examples. The article is beginner-friendly, but you should have some basic prior knowledge of Machine Learning and of programming in general, in any language (preferably Python). You must also have Python 3.7 and the scikit-learn library installed, as we will be using its pre-built Digits dataset for our classification example. The regression example additionally imports pandas and seaborn. Other than that, the rest of the article is pretty easy to follow. We will also be using Jupyter Notebooks for writing the code. If you do not already have it installed, visit Jupyter Notebook before you begin the tutorial.
Coding Language: Python 3.7
IDE: Jupyter Notebook
Libraries: scikit-learn, Matplotlib, pandas, seaborn
Classification Example
We will be building an application to recognize handwritten digits using the Digits dataset, which is included in scikit-learn’s datasets module. Each sample in this dataset is an 8×8 image representing a handwritten digit. This is a multiclass image classification problem with 10 classes representing the digits 0 to 9. We wish to classify each handwritten digit into its class on the basis of the intensity values within the image, which depict the shape of the digit. For more on this dataset, visit Digits Dataset.
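Before training anything, it can help to check that the dataset matches the description above. The short sketch below only inspects shapes and labels; the variable name flattened is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# One 8x8 image per sample
print(digits.images.shape)

# The same samples as flat vectors of 64 intensity values
print(digits.data.shape)

# The 10 class labels, 0 to 9
print(np.unique(digits.target))

# Flattening the images by hand reproduces digits.data
flattened = digits.images.reshape((len(digits.images), -1))
print(np.array_equal(flattened, digits.data))
```

The classifier never sees a two-dimensional picture: each image is handed over as a row of 64 numbers, which is why the code below reshapes the images before splitting the data.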
Python Code
# Importing dataset, libraries, classifiers and performance metric
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Loading digits dataset
digits = load_digits()
# Create feature matrix
x = digits.data
# Create target vector
y = digits.target
# First 6 images stored in the images attribute of the dataset
# (the loop variable is i so that it does not overwrite the feature matrix x)
print("First 6 images of the dataset: ")
for i in range(6):
    plt.subplot(330 + 1 + i)
    plt.imshow(digits.images[i], cmap=plt.get_cmap('gray'))
plt.show()
# Flattening the image to apply classifier
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Splitting the data into training and testing
x_train, x_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)
# Creating a classifier. SVM is the default, but you can try the other two by commenting out SVM and uncommenting the one you wish to try
clf = svm.SVC(gamma=0.001)
# Decision Tree Classifier
#clf = tree.DecisionTreeClassifier()
# Random Forest Classifier
#clf = RandomForestClassifier()
# Printing the details of the Classifier used
print ("Using: ", clf)
Output:
Using: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
# Training
clf.fit(x_train, y_train)
# Predicting
predictions = clf.predict(x_test)
# print("\nPredictions:", predictions)
# Count the predictions that match the true labels
score = 0
for i in range(len(predictions)):
    if predictions[i] == y_test[i]:
        score += 1
print("Accuracy:", (score / len(predictions)) * 100, "%")
# The imported metric returns the same value as a fraction:
# print(accuracy_score(y_test, predictions))
Output:
Accuracy: 96.88542825361512 %
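A single accuracy number hides which digits get confused with each other. As a possible next step, the sketch below repeats the same split and SVM settings and prints a confusion matrix and a per-class report. The names cm and accuracy_from_cm are introduced here for illustration, and the exact counts you see may vary with your scikit-learn version.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Same data preparation and split as in the example above
digits = load_digits()
data = digits.images.reshape((len(digits.images), -1))
x_train, x_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=False
)

clf = svm.SVC(gamma=0.001)
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)

# Rows are the true digits, columns are the predicted digits
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision, recall and F1 score for each digit
print(classification_report(y_test, predictions))

# Correct predictions lie on the diagonal, so this equals the accuracy
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm, accuracy_score(y_test, predictions))
```

Off-diagonal entries in the matrix show which pairs of digits the classifier mixes up, which is usually more actionable than the overall percentage.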
Regression Example
We are going to build a regression model which predicts the rating of board games. Firstly, we will load the dataset and analyze it to filter out garbage features. We’ll be doing that through the correlation matrix (a strong correlation with the target/label suggests an important feature, as its value varies in a similar manner to the target value, which in our case is the rating). So let’s get to it.
Python Code
# Importing libraries, regression models and performance metric
import pandas
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
In the next few cells, we will load our dataset, analyze it, graph the correlation matrix, use the info in the correlation matrix to remove some features/columns from our dataset, and then in the end, proceed with applying our regression model on it.
# Load dataset
games = pandas.read_csv("games.csv") # download link:
# Names of all features/columns in the dataset
print(games.columns)
print(games.shape)
# Graph a histogram based on the average_rating column
plt.hist(games["average_rating"])
# Display the plot
plt.show()
# Data Cleaning
# Delete rows which do not contain user reviews
games = games[games["users_rated"] > 0]
# Drop rows which contain missing values
games = games.dropna(axis=0)
# Graphing the correlation matrix
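# Use only numeric columns; text columns such as "type" and "name" cannot be correlated
corr_mat = games.select_dtypes(include="number").corr()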
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corr_mat, vmax=.8, square=True)
plt.show()
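A heatmap shows every pairwise correlation at once, but for feature selection only the column for the target matters. The sketch below shows one way to rank features by the strength of their correlation with the target. It uses a tiny made-up DataFrame with the hypothetical columns feature_a and feature_b, so it runs without games.csv.

```python
import pandas as pd

# Hypothetical data: feature_a rises with the rating, feature_b barely relates to it
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [3, 1, 4, 1, 5],
    "average_rating": [5.1, 6.0, 6.8, 8.1, 9.0],
})

# Take the target's column of the correlation matrix and drop its self-correlation
target_corr = toy.corr()["average_rating"].drop("average_rating")

# Rank features by absolute correlation, strongest first
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear relationships, so a low value is a hint that a feature may be unhelpful for a linear model rather than proof that it carries no information.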
# Get the list of all columns from the dataframe
columns_list = games.columns.tolist()
# Filter the columns to remove ones we don't want (named "columns" because the training and prediction code below refers to it by that name)
columns = [col for col in columns_list if col not in ["bayes_average_rating", "average_rating", "type", "name", "id"]]
# the variable we'll be predicting through regression
target = "average_rating"
Splitting the dataset into training and testing set, followed by fitting the model on the training set.
train = games.sample(frac=0.8, random_state=1) # selecting 80% of the dataset as training set
# Select the rows not in the training set and put them in the testing set
test = games.loc[~games.index.isin(train.index)]
# Initialize the model class
model = LinearRegression()
# Fit the model on the training data
model.fit(train[columns], train[target])
Generating predictions and calculating the Mean Squared Error for the test set.
# Generate our predictions for the test set.
predictions = model.predict(test[columns])
print('Prediction on the first instance in Test Set: ', predictions[0])
# Compute error between our test predictions and the actual values.
print("Mean Square Error Value: ", mean_squared_error(test[target], predictions))
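To see what the Mean Squared Error measures, it helps to compute it by hand once. The sketch below uses four made-up true and predicted ratings and compares the manual result with scikit-learn's mean_squared_error.

```python
from sklearn.metrics import mean_squared_error

# Hypothetical true ratings and model predictions
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Square each error so that over- and under-predictions both count as positive
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
print(squared_errors)

# The MSE is the average of the squared errors
mse_manual = sum(squared_errors) / len(squared_errors)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```

Here the squared errors are 0.25, 0, 2.25 and 1, giving an MSE of 3.5 / 4 = 0.875. Because the errors are squared, the value is in squared rating units and large misses are penalised more heavily than small ones.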
Supervised Learning Applications
Some common applications of Supervised Learning:
- Optical Character Recognition
- Handwriting Recognition
- Object Recognition
- Speech Recognition
- Pattern Recognition
- Spam Classifier
- Face Recognition
- Predicting Stock Price
Conclusion
To sum it all up, we started off with an introduction to supervised learning and its two main types, regression and classification. We discussed how the two differ, and then we went on to build a multiclass classification application for handwritten digit recognition, followed by a regression model to predict the average rating of board games. Lastly, we saw a few other use cases of supervised learning. All in all, we learnt about the importance and use of supervised learning algorithms in the world of Machine Learning.