Junaid Khalid, Author At Junaid Khalid

Bucket Sort Algorithm & Time Complexity | Python Implementation

Introduction

In this tutorial, we are going to learn about a sorting algorithm named Bucket Sort. Bucket sort, or bin sort, is a comparison sort algorithm that works by assigning the elements of an array or list to a number of buckets. The elements within each bucket are then compared to each other and sorted individually by either recursively calling the Bucket sort function or by using a different sorting algorithm. In our example, we will be using the Insertion Sort as our intermediate algorithm to sort the elements within the individual buckets.

Bucket sort can be thought of as a scatter-order-gather approach towards sorting a list. Firstly, the elements of the input array or list are scattered into different buckets. Then, the elements are ordered or sorted within the containing bucket. Finally, the elements are gathered in order. This sorting approach is most suitable when the input is uniformly distributed over a range and also when you have data in decimals or floating-point values. For example, if you come across a problem that asks you to sort a large set of floating-point numbers, in the range from e.g. 0.5 to 1.5, uniformly distributed across the range, then Bucket sort is the best algorithm to do so.

What to expect?

So basically, in this tutorial, we will be implementing Bucket sort on a list using Python. We will give you a step by step demonstration of how Bucket sort actually works. We will also discuss the computational and time complexities of the worst, average and best cases of the algorithm and compare Bucket sort’s performance with other sorting algorithms. Also, we will talk about some of the applications of Bucket sort itself. So, let’s begin!

Prerequisites

To follow this tutorial, you should have some prior basic programming knowledge in any language (preferably in Python). Other than that, the rest of the article is pretty beginner friendly. We will be using Sublime Text for writing our implementation of the Bucket sort in Python 3. Feel free to use any other code editor that you like.

Bucket Sort Algorithm

Before jumping into its exact implementation, lets walk through the algorithm a little bit. Bucket sort works as follows:

Set up a list of initially empty buckets. Number of buckets is equal to the length of the input list.
Go over the original list, putting each element in its bucket. This is determined by the length of the input list and the largest element in the input list. This will be clearly demonstrated in the example that follows.
Sort each non-empty bucket. We have used Insertion sort for this. You can use any other sorting algorithm for this as well.
Visit the buckets in order. This will yield a list as the output in which the elements are arranged from smallest to largest.

Step by step demonstration of Bucket Sort Algorithm

Suppose the input list is:

Largest number = 1.2
Length of list = 6
Size = Largest number / Length of list = 1.2/6 = 0.19999 = 0.2
This information will be useful in determining in which bucket will each element of the list be placed.

Create empty buckets. The number of buckets is equal to the length of list which in our case is 6. So, we will have 6 empty buckets.

Insert elements into the buckets from the list. The question is, how do we know in which bucket to place each element? Remember we calculated ‘size’ in the first step. We will basically divide each value of the list by size which is 0.2. The output will tell us the index of the bucket where the value is to be placed. Let’s consider our input list.

1.2 / 0.2 = 6. We will subtract 1 from this as it is outside of your index range. So, 6 – 1 = 5
Thus, the value 1.2 will be placed inside the bucket at index 5.

Likewise,

0.22/ 0.2 = 1.1 = 1. Thus, the value 0.22 will be placed inside the bucket at index 1.

0.43/0.2 = 2.15 = 2. Thus, the value 0.43 will be placed inside the bucket at index 2.

0.36/0.2 = 1.8 = 1 (Remember that we always consider the floor of our value. Thus 1.8 will not be rounded to 2, rather it will be rounded to 1). The value 0.36 will be placed inside the bucket at index 1.

0.39/0.2 = 1.95 = 1. Thus, the value 0.39 will be placed inside the bucket at index 1.

Lastly,
0.27/0.2 = 1.35 = 1. Thus, the value 0.27 will be placed inside the bucket at index 1. So, what we finally get after this step looks something like this:

The elements within each bucket are now sorted using any of the stable sorting algorithms. Here, we have opted for Insertion sort. We will not be going into the details of this algorithm as our main focus is on Bucket sort algorithm. We will just talk briefly on how Insertion sort works and why we have chosen it as our intermediate sorting algorithm in one of the following sections. All you need to know for now is that after insertion sort, the elements within the individual buckets are sorted. They look something like this:

Lastly, the elements from each bucket are gathered or concatenated. It is done by iterating through each bucket and inserting individual elements into a list. This results in a final list with sorted elements which look like this:

Python Code

    def bucketSort (input_list):
        # Find maximum value and length of the input list to determine which value in the list goes into which bucket 
        largest = max (input_list)
        length = len (input_list)
        size = largest/length

         # Create n empty buckets where n is equal to the length of the input list
        buckets = [[] for _ in range (length)]

        # Put list elements into different buckets based on the index
        for i in range (length):
            j = int (input_list[i] / size)

            if j != length:
                buckets[j].append (input_list[i])
            else:
                buckets[length - 1].append (input_list[i])


       # Sort elements within the buckets using Insertion sort
        for i in range (length):
            insertionSort (buckets[i])



       # Concatenate buckets with sorted elements into a single list
        final_output = []
        for i in range (length):
            final_output = final_output + buckets[i]


        return final_output



    # Performs insertion sort algorithm on the individual bucket lists
    def insertionSort (bucket_list):
        for i in range (1, len (bucket_list)):
            var = bucket_list[i]
            j = i - 1
            while (j >= 0 and var < bucket_list[j]):
                bucket_list[j + 1] = bucket_list[j]
                j = j - 1
            bucket_list[j + 1] = var



    def main ():
        input_list = [1.20, 0.22, 0.43, 0.36,0.39,0.27]
        print ('Original list: ', end = '')
        print (input_list)
        sorted_list = bucketSort (input_list)
        print ('Sorted list: ', end = '')
        print (sorted_list)

    main ()

Output:

Original list: [1.2, 0.22, 0.43, 0.36, 0.39, 0.27]
Sorted list: [0.22, 0.27, 0.36, 0.39, 0.43, 1.2]

Intermediate Sorting Algorithm within Bucket Sort

As discussed above, for sorting the elements in each of the individual buckets, we used the insertion sort algorithm. Why? because insertion sort’s runtime is based on how far each element is from its final position. This implies that the number of comparisons between elements remains relatively small, and the memory hierarchy is better exploited or utilized.

Insertion Sort’s fine details are outside the scope of this tutorial. Its functioning can be understood by looking closely at the insertionSort function in the above code. It takes the first element of bucket list and assumes it’s sorted. Then the second element is stored in the ‘var’ variable. It compares var with the first element; if the first element is greater than var, var is placed in front of first element such that the first two elements become sorted. It then takes the third element and compares it with elements on the left of it. The third element is placed just behind the element smaller than it. If there is no smaller element, it is placed at the beginning of the list. Similarly, every unsorted element is placed at its correct position. The overall output is the different bucket lists, with elements arranged from smallest to largest in each.

Complexity

The complexity of an algorithm is simply the amount of work the algorithm performs to complete its task

Worst Case Complexity

Worst case occurs when the elements in the input list itself are arranged in the reverse order. When there are elements of close range in the list, which is the typical case for Bucket sort, they are likely to be placed in the same bucket and this may result in some buckets having a greater number of elements than others. This makes the complexity depend on the sorting algorithm used to sort the elements of the bucket, which in our case is insertion sort. Thus, if insertion sort is used to sort elements of the bucket, the time complexity for the worst case becomes O(n2).

Average Case Complexity

It occurs when the elements are distributed randomly in the list. Bucket sort runs in linear time in all cases until the sum of the squares of the bucket sizes is linear in the total number of elements. This makes O(n) to be the average case complexity of Bucket sort algorithm.

Best Case Complexity

Best case occurs when the elements in the input list itself are arranged in the correct order. Also, in the case of Bucket sort, if the elements are uniformly distributed within buckets, i.e. each bucket has the same number of elements as the other. If insertion sort is used to sort elements of a bucket, then the overall complexity in the best case will be linear i.e. O (n+k). O (n) is the complexity for making the buckets and O(k) is the complexity for sorting the elements of the bucket using algorithm having linear time complexity at best case.

Comparison with other sorting algorithms

Counting sort:

Bucket sort can be converted to Counting sort if we reduce the size of each bucket to 1, i.e. each bucket can only hold one value. The variable bucket size of Bucket sort allows it to use O (n) memory instead of O(M) memory, where M is the number of distinct values as is the case in Counting sort. Therefore, Bucket sort gives up Counting sort’s O (n + M) worst-case behavior.

Merge sort:

In Merge sort, a list is distributed into several sub lists to be sorted. However, because of the overlapping value ranges, the sub lists cannot be combined to one as in the final step of the Bucket sort algorithm. Instead, they must be interleaved by a merge algorithm which adds complexity to the algorithm as compared to Bucket sort.

Quick sort:

Bucket sort with two buckets is a form of Quicksort where the pivot value is always selected to be the middle value of the value range. This choice is effective for uniformly distributed inputs, but for other means of choosing the pivot in Quicksort e.g. randomly selected pivots, this makes it more resistant to clustering in the input distribution.

Bucket Sort Applications

Computing histograms in data visualization and analysis. For example, n students might be assigned test scores in the range from 0 to 10 and are then placed into ranges or buckets on the basis of those scores.
Working on data that is uniformly distributed over a close range
Working on floating point or decimal values

Conclusion

To sum it all up, we started off by getting an introduction to what Bucket sort is and went on to discuss what we need to know before we jump into its implementation in Python. We discussed its algorithm and each and every step involved in the sorting process. We talked about Insertion sort and why we chose it as our intermediate algorithm instead of other algorithms. Then the worst, average and best-case complexities of Bucket sort were analyzed. We drew a comparison and a relationship of Bucket sort with other sorting algorithms like counting, merge, and quick sort. Lastly, we covered some common applications of Bucket Sort. Overall, Bucket Sort is an important concept to understand when it comes to algorithms especially if you need to sort a list that is huge to fit into memory.

How to code a classification & regression model in Python

In this article, we are going to first get an introduction to Supervised learning, followed by a little dive into the two most common types of supervised learning algorithms; namely, classification & regression. At the end we will have two coding examples, one for classification and one for regression. Both will use a different dataset and go through the steps in each algorithm.

The Table of Contents is added below. Read it before moving to the next parts, so you can first decide if this article is relevant for you.

Introduction to Supervised Learning
Introduction to Classification & Regression
2.1 Classification
2.2 Regression
Prerequisites for the code examples
Classification Example
4.1 Python Code
Regression Example
5.1 Python Code
Supervised Learning Applications
Conclusion

Introduction to Supervised Learning

Supervised Learning is the most common type of learning in Machine Learning. During training, the algorithm is given the dataset with the correct answers/labels, thus the name ‘supervised’. Then, during testing, model tries to predict the correct output for similar new examples on the basis of what it has learnt from the previous data samples. To put this in a more relatable manner, lets consider a student preparing for a Maths exam. (S)he first does practice questions for which they can see the answers. If they get the wrong answer, they backpropagate to see which ‘step’ they messed up in and try to correct that. In the first go, they might get only 2 out of 10 practice questions correct, in which case, they would re-do them. Once they start getting more than 90% of their practice questions right, they could consider themselves ready for the actual exam. In the exam, they will get questions they haven’t seen or solved before, but would use the concepts learned during practice and try to solve them. That’s supervised learning in a nutshell!

Supervised Learning Algorithms: Classification & Regression

We are going to talk about two most important/commonly-used techniques in supervised learning:

Classification

Target variable consists of categories i.e. used to identify to which category an object belongs to. The output variable is discrete. Consider a dataset of cat and dog images. The classifier would take as input an image and its output would fall into two discrete categories: cat or dog. We can take the digit classifier we are going to code as an example, too. In cat vs dog classifier, there are two classes, in digits classifier there will be 10 i.e. Class 0 to Class 9, since there are a total of 10 digits.

Input: Image containing either a cat or a dog
Output: Probability values for each class (Example: {‘Cat: 0.80’, ‘Dog:0.20’})

Regression

Target variable is continuous i.e. used to predict a continuous valued attribute associated with an object. The output variable is a real value. For example, consider a dataset of house prices in a certain area. The classifier would take as input features of the house like number of rooms, area, furnished (yes/no), etc. and based on that has to output the estimated worth of the house. That is a regression task because price will be a continuous output.

Input: Csv file containing columns like number of bedrooms, area of the house in sq. ft. etc.
Output: Predicted Price or worth of the house (Example: $2501)

Prerequisites for the code examples

Before you go ahead, please note that there are a few prerequisites for understanding the code examples. It’s beginner-friendly but you should have some prior basic knowledge of Machine Learning and programming in general, in any language (but preferably Python). You must also have Python 3.7 & Scikit-learn library installed as we will be using its pre-built Digits dataset for our example. Other than that, the rest of the article is pretty easy to follow. We will also be using Jupyter Notebooks for writing the code. If you do not already have it installed, visit Jupyter Notebook before you begin the tutorial.

Coding Language: Python 3.7
IDE: Jupyter Notebook
Libraries: Sklearn, Matplotlib

Classification Example

We will be building an application to recognize handwritten digits using Digits Dataset which is included in scikit-learn’s datasets module. Each sample in this scikit-learn dataset is an 8×8 image representing a handwritten digit. This is a multiclass image classification problem with 10 classes representing digits from 0 to 9. We wish to classify the handwritten digits into their respective classes from 0 to 9 on the basis of the intensity values within the image which depict the shape of the digit. For more on this dataset, visit Digits Dataset.

Python Code

# Importing dataset, libraries, classifiers and performance metric

from __future__ import division  
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading digits dataset
digits = load_digits()
# Create feature matrix
x = digits.data
# Create target vector
y = digits.target

# First 6 images stored in the images attribute of the dataset
print("First 6 images of the dataset: ")

for x in range (6):

    plt.subplot(330 + 1 + x)
    plt.imshow(digits.images[x], cmap=plt.get_cmap('gray'))

plt.show()

# Flattening the image to apply classifier
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Splitting the data into training and testing
x_train, x_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)

# Creating a classifier. SVM is set as default but you can test out other two as well by commenting out SVM and un-commennting the one you wish to try
clf = svm.SVC (gamma=0.001)

# Decision Tree Classifier
#clf = tree.DecisionTreeClassifier()

# Random Forest Classifier
#clf = RandomForestClassifier()

# Printing the details of the Classifier used
print ("Using: ", clf)

Output:

Using: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# Training
clf.fit(x_train, y_train)
# Predicting
predictions = clf.predict(x_test)
#print ("\nPredictions:", predictions)

score = 0
for i in range(len(predictions)):

    if predictions[i] == y_test[i]:

        score += 1

print ("Accuracy:", (score / len(predictions)) * 100, "%")
# print accuracy_score(test_labels, predictions)

Output:

Accuracy: 96.88542825361512 %

Regression Example

We are going to build a regression model which predicts the rating of board games. Firstly, we will load the dataset and analyze to filter out garbage features. We’ll be doing that through the correlation matrix (strong correlation with the target/label means it’s an important feature as its value varies in a similar manner to the target value, which in our case is the rating). So lets get to it.

Python Code:

# Importing libraries, classifier and performance metric

import pandas
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In the next few cells, we will load our dataset, analyze it, graph the correlation matrix, use the info in the correlation matrix to remove some features/columns from our dataset, and then in the end, proceed with applying our regression model on it.

# Load dataset
games = pandas.read_csv("games.csv") # download link: 

# Names of all features/columns in the dataset
print(games.columns)
print(games.shape)

# Graph a histogram based on the average_rating column/
plt.hist(games["average_rating"])

# Display the plot
plt.show()

# Data Cleaning

# Delete rows which do not contain user reviews
games = games[games["users_rated"] > 0]
# Drop rows which contain missing values 
games = games.dropna(axis=0)

# Graphing the correlation matrix
corr_mat = games.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corr_mat, vmax=.8, square=True)
plt.show()

# Get the list of all columns from the dataframe
columns_list = games.columns.tolist()
# Filter the columns to remove ones we don't want.
cols = [col for col in columns_list if col not in ["bayes_average_rating", "average_rating", "type", "name", "id"]]

# the variable we'll be predicting through regression
target = "average_rating"

Splitting the dataset into training and testing set, followed by fitting the model on the training set.

train = games.sample(frac=0.8, random_state=1) # selecting 80% of the dataset as training set
# Select the rows not in the training set and put them in the testing set
test = games.loc[~games.index.isin(train.index)]
# Initialize the model class
model = LinearRegression()
# Fit the model on the training data
model.fit(train[columns], train[target])

Generating predictions and calculating the Mean Squared Error for the test set.

# Generate our predictions for the test set.
predictions = model.predict(test[columns])
print('Prediction on the first instance in Test Set: ', predictions[0])
# Compute error between our test predictions and the actual values.
print("Mean Square Error Value: ", mean_squared_error(predictions, test[target])

Supervised Learning Applications

Some common applications of Supervised Learning:

Optical Character Recognition
Handwriting Recognition
Object Recognition
Speech Recognition
Pattern Recognition
Spam Classifier
Face Recognition
Predicting Stock Price

Conclusion

To sum it all up, we started off by getting an introduction of what supervised learning is and its two main types which are Regression and Classification. We discussed how the two differ and then we went on to build a multiclass classification application about handwritten digits’ recognition, followed by a regression model to predict the average rating of board games. Lastly, we saw a few other use cases of supervised learning. All in all, we learnt how about the importance and use of Supervised learning algorithms in the world of Machine learning.