Machine learning is far too broad a topic to cover on this website alone. To help you get started, this page discusses the two main approaches, supervised and unsupervised machine learning, and provides an example of at least one technique. We hope that by working through these techniques, you will:
🖥 Learn some machine learning terminology.
🖥 Know that the two most popular programming languages for machine learning are Python and R (often used through the RStudio IDE). (For this demonstration of machine learning, we will be using R.)
🖥 Have some knowledge of the types of machine learning and be able to distinguish between supervised and unsupervised learning.
🖥 Impress your friends at parties with your machine learning prowess!
To start, let's look at some data: the famous Fisher-Anderson Iris Flower Dataset, which ships with R as the built-in iris data frame.
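The exact commands behind this step are not shown on this page. As a minimal sketch, assuming only base R and its built-in iris data frame, one way to inspect the dataset might be:

```r
# Load the built-in iris data frame (part of base R's datasets package)
data(iris)

# Number of observations (rows) and variables (columns)
dim(iris)

# Column names and types, with a short preview of the values
str(iris)

# First six rows of the data
head(iris)

# How many observations were recorded for each species
table(iris$Species)
```

Here dim() reports the size of the dataset, str() lists each variable and its type, and table() counts the observations per species.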
Using a few functions, we find that the dataset contains 150 observations (the rows) and 5 variables (the columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species). Because our dataset is small and is what is known as labeled data (here, data collected by human experts who know how to distinguish between the three species of irises), it is easy for us to see the distinctions between the species and make informed guesses about our data.
This process of assigning observations to categories, such as species, is called CLASSIFICATION. Now imagine this dataset contained thousands of recorded observations. Suddenly, the process of classifying the species of irises gets much more complicated. Thankfully, classification is a great job for a machine learning algorithm! Any method or algorithm that classifies is called a classifier. Classification is a method of supervised machine learning because you are using labeled data "to identify similarities or differences, or calculate trends for predictions" (McSweeney).
Other supervised learning techniques include decision trees and regression modeling. A decision tree uses labeled data to build a flow-chart-like structure that splits observations on their features until it reaches a prediction. Regression, by contrast, predicts the value of a numeric variable from one or more other variables, in the simplest case by fitting a line of best fit. Types of regression include linear, multiple, and logistic (which, despite its name, is typically used for classification).
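To make these supervised techniques a little more concrete, here is a rough sketch of a decision tree classifier and a simple linear regression on the iris data. It is an illustration rather than part of the original demonstration, and it assumes the rpart package (one of R's recommended packages, usually installed alongside R) is available. The choice of Petal.Length as the predictor for Petal.Width is also just an example.

```r
# Assumption: the rpart package is installed (it ships with most R installations)
library(rpart)

# --- Classification with a decision tree ---
# Learn to predict Species from the four measurements (labeled data)
tree_fit <- rpart(Species ~ ., data = iris, method = "class")

# Predict a species for every observation and compare with the true labels.
# Note: this is accuracy on the training data, so it is optimistic.
tree_pred <- predict(tree_fit, newdata = iris, type = "class")
tree_accuracy <- mean(tree_pred == iris$Species)
print(tree_accuracy)

# --- Regression with a line of best fit ---
# Predict a numeric variable (Petal.Width) from another (Petal.Length)
line_fit <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(line_fit))  # intercept and slope of the fitted line

# Predicted petal width for a hypothetical flower with petal length 4
new_flower <- data.frame(Petal.Length = 4)
predicted_width <- predict(line_fit, newdata = new_flower)
print(predicted_width)
```

In practice you would hold back part of the data for testing rather than measuring accuracy on the same rows the model was trained on.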
Now suppose we didn't have a Species variable in our iris dataset. How would we know that this data represented three different species (or groupings) of irises? That's where unsupervised learning comes in.
Unlike supervised machine learning, unsupervised learning examines unlabeled data and looks for structure in it without being told the right answers. There are two main types of unsupervised learning: CLUSTERING and DIMENSIONALITY REDUCTION. Unlike classification, clustering forms groups of observations called clusters, so that observations in the same cluster are as similar as possible and observations in different clusters are as different as possible. Dimensionality reduction, however, looks for ways to reduce the number of variables being considered. One way to do this is feature selection, which keeps only a subset of the variables within a dataset; another is feature extraction, which combines the original variables into a smaller set of new ones.
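As a small illustration of dimensionality reduction, the sketch below applies principal component analysis (PCA) to the four iris measurements using base R's prcomp(). Note that PCA is a feature-extraction technique (it builds new combined variables) rather than feature selection, and it is shown here only as an example, not as the method used elsewhere on this page.

```r
# Use only the four numeric measurements; Species is left out on purpose
measurements <- iris[, 1:4]

# Principal component analysis, with variables centered and scaled
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# The data re-expressed in just the first two components
reduced <- pca$x[, 1:2]
head(reduced)
```

If the first two components capture most of the variance, the four original variables can be summarized by two new ones with little loss of information.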
Below we will look at an example of clustering with the k-means algorithm, which tries to form compact, well-separated clusters, and see how well those clusters match the three species of iris flowers in the Fisher-Anderson Iris Flower Dataset.
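The original code for this step is not shown here. A minimal sketch using base R's kmeans() function might look like the following; the seed value and the nstart setting are assumptions chosen for reproducibility, not values taken from this page.

```r
# k-means starts from random centers, so fix the seed for repeatable results
set.seed(42)

# Cluster on the four measurements only; the Species labels are not used
features <- iris[, 1:4]

# Ask for 3 clusters; nstart = 20 tries 20 random starts and keeps the best
km <- kmeans(features, centers = 3, nstart = 20)

# R's report: cluster sizes, cluster means, assignments, and sums of squares
print(km)

# Individual pieces of the result
km$size     # number of observations in each cluster
km$centers  # mean of each measurement within each cluster
```

Note that the number of clusters is something we supply through the centers argument; k-means does not discover it on its own.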
From R's report, we see that the k-means function split the dataset into the 3 clusters we asked for.
Let's create a table called a confusion matrix, which cross-tabulates each observation's assigned cluster against its actual species so we can see how well the clusters separate the three groups. From the k-means values, we can also create a chart to visualize this separation.
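One possible way to build such a table and chart in base R is sketched below. It re-runs k-means so the block is self-contained. The seed, the nstart value, and the choice to plot the two petal measurements are illustrative assumptions, and the chart shows cluster membership rather than drawn decision boundaries.

```r
# Re-run k-means on the four measurements (seed and nstart are assumptions)
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Confusion matrix: assigned cluster (rows) against actual species (columns)
confusion <- table(Cluster = km$cluster, Species = iris$Species)
print(confusion)

# Scatter plot: color shows the assigned cluster, symbol shows the true species
plot(iris$Petal.Length, iris$Petal.Width,
     col = km$cluster, pch = as.integer(iris$Species),
     xlab = "Petal.Length", ylab = "Petal.Width",
     main = "k-means clusters (k = 3) on the iris data")

# Mark the cluster centers with stars in the matching cluster colors
points(km$centers[, c("Petal.Length", "Petal.Width")],
       col = 1:3, pch = 8, cex = 2)

legend("topleft", legend = levels(iris$Species), pch = 1:3,
       title = "Actual species")
```

The cluster numbers themselves are arbitrary labels, so read the table by looking at which species dominates each row rather than expecting cluster 1 to line up with any particular species.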
The confusion matrix and the decision boundaries plot shown below illustrate that one boundary line cleanly separates the setosa species from the rest. The boundary between the virginica and versicolor species, by contrast, shows that the two species overlap.
From the decision boundaries chart and the clustering chart, we can see the result of k-means clustering, which illustrates how the algorithm performed.