You may ask yourself: what is this machine learning thing all about? Why is it getting so much attention these days, and why should I care? If you are a non-technical person, you may have heard the term many times before, but chances are you do not know what the fuss is about. In this article, I want to give you a quick, hands-on impression of how machine learning can be useful to you and why it is so interesting. You do not need any technical background to follow along.
Machine learning basically attempts to teach a computer new things so that it can make better decisions. For demonstration purposes, we will look at a small data set of around 570 records of patients with breast tumors and try to teach the computer to diagnose whether a tumor is benign (non-cancerous) or malignant (cancerous). If you feel overwhelmed already, don't panic: all you will need is your computer, the Weka tool suite (see the links section) and, most important of all, a nice cup of coffee!
Demystifying the Jargon
Before we get our hands dirty, however, let's clarify a few terms first. The problem we are trying to solve here is what computer scientists call a classification problem. The target class (whether the tumor is benign or malignant) is what we want to teach the computer to predict. The rest of the information (tumor properties such as radius, area and texture) consists of features that we can feed into the computer so that it can learn to identify patterns and diagnose accurately.
We will split our data set into two distinct parts. One we will call the training set, and the other we will call the test set. You may view the training set as the pairs of questions and answers that we give our classifier (the computer) to learn from. Our test set is the exam which the classifier has to go through so that we can evaluate how well it has learned to make predictions.
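The idea of a training/test split can be sketched in a few lines of Python. This is purely illustrative (the toy records and the two-thirds ratio are assumptions for the sketch, not output from Weka):

```python
import random

# A toy dataset: each record is (features, diagnosis).
records = [((0.05 * i, 100.0 + 10 * i), "B" if i < 12 else "M") for i in range(20)]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(records)  # shuffle so the split is not biased by row order

split = int(len(records) * 0.66)   # roughly two thirds for training
training_set = records[:split]     # the "questions and answers" to learn from
test_set = records[split:]         # the "exam" used to evaluate the classifier

print(len(training_set), len(test_set))  # 13 7
```

Shuffling before splitting matters: if the file happened to list all benign patients first, an unshuffled split would train and test on very different populations.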
Baby steps in Weka
OK, enough theory for now! Let's head over to kaggle.com and download our medical dataset (links are at the end of this article). Then we'll head over to waikato.ac.nz and download Weka, a tool suite that will allow us to experiment with our data. If we examine the dataset we just downloaded in Excel, we can see that it contains a row for every patient, and that the columns consist of an ID field, our target class (diagnosis), and a number of features of the patient's tumor.
Let's now open our data in Weka. After launching the application, you should be greeted with a list of Weka applications. Go ahead and click on "Explorer", then on the new screen click "Open file…" and choose the CSV file we downloaded from Kaggle. While this screen may look daunting, trust me: only one or two things here interest us.
The above view is the Pre-processing tab. Under the “Attributes” section, Weka shows us a list of fields which we can use as features (those are the same as the columns we saw in Excel). Clicking on any attribute shows us some information and statistics about the given field on the right in the “Selected Attribute” section.
While there is a plethora of things Weka allows us to do to pre-process our data (automatically filling missing values, aggregating fields, etc..), we will keep things simple here and only go ahead and remove the ID field. Why? Because the ID field is just an arbitrary value which provides us with no useful information to predict tumor type. Leaving it in the data might confuse some machine learning algorithms. In fact, part of the art in machine learning is selecting the right set of features for a given problem. So go ahead and click the checkbox next to ID then click Remove.
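Outside Weka, the same clean-up step is a one-liner in plain Python. Here is a minimal sketch; the three rows and their values are just made-up stand-ins for what the CSV contains:

```python
# Each row is a dictionary of column name -> value, as if read from the CSV.
rows = [
    {"id": 842302,  "diagnosis": "M", "radius_mean": 17.99, "area_mean": 1001.0},
    {"id": 842517,  "diagnosis": "M", "radius_mean": 20.57, "area_mean": 1326.0},
    {"id": 8510426, "diagnosis": "B", "radius_mean": 13.54, "area_mean": 566.3},
]

# Drop the ID column: it is an arbitrary label with no predictive value,
# and leaving it in could mislead some learning algorithms.
cleaned = [{k: v for k, v in row.items() if k != "id"} for row in rows]

print(sorted(cleaned[0].keys()))  # ['area_mean', 'diagnosis', 'radius_mean']
```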
Visualising our Data
Viewing our data in text and numbers is all nice and fancy, but what about a picture? Can we gain more insight by visualizing our data through graphs? Let's find out by clicking on the "Visualize" tab in the Weka Explorer. Under "Plot Matrix" we see a set of graphs plotting several attributes against each other. Each point on a graph represents a row in our data.
While remaining in the "Visualize" tab, click on the Color dropdown menu, select "Diagnosis" and click "Update". The target class (tumor type) is now color-coded: blue points represent malignant tumors and red points represent benign tumors.
Now if we examine the graph of the mean tumor area against the mean concavity, we see something interesting. Down in the lower left corner of the graph, where both values are low, we see a grouping of red dots. As the values of both variables get bigger, we see mostly blue dots. Is there a relationship here? Can we use this to teach the computer to make simple decisions such as: if the mean concavity has a value less than 0.1 and the mean area has a value below 730, then the prediction should be "Benign", otherwise it should be "Malignant"?
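That eyeballed rule translates directly into code. A minimal sketch in Python (the 0.1 and 730 thresholds come straight from the paragraph above; in practice the computer learns such thresholds from the data rather than having us hard-code them):

```python
def diagnose(mean_concavity: float, mean_area: float) -> str:
    """A single hand-written decision rule read off the scatter plot."""
    if mean_concavity < 0.1 and mean_area < 730:
        return "Benign"
    return "Malignant"

print(diagnose(0.05, 500))  # Benign
print(diagnose(0.20, 900))  # Malignant
```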
It turns out that we can use an algorithm called Random Forest and have the computer automatically build structures called Decision Trees for us. These decision trees are – to put it simply – a hierarchy of simple if/else decisions chained together. They represent the set of questions the computer asks itself based on the data you give it in order to come up with a prediction. The random forest consists of several decision trees whose predictions can be averaged in order to get better results for the classifier as a whole. The following diagram shows a very simple representation of how such a decision tree might look.
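To make the averaging idea concrete, here is a toy "forest" of three hand-written trees whose votes are combined by majority. The trees and their thresholds are invented for illustration; a real random forest grows many deeper trees automatically, each from a random subset of the data:

```python
from collections import Counter

# Three tiny "decision trees", each a chain of if/else checks on the features.
def tree_a(concavity, area):
    return "Benign" if concavity < 0.1 else "Malignant"

def tree_b(concavity, area):
    return "Benign" if area < 730 else "Malignant"

def tree_c(concavity, area):
    if concavity < 0.15:
        return "Benign" if area < 900 else "Malignant"
    return "Malignant"

def forest_predict(concavity, area):
    """Majority vote over the individual trees' predictions."""
    votes = [t(concavity, area) for t in (tree_a, tree_b, tree_c)]
    return Counter(votes).most_common(1)[0][0]

print(forest_predict(0.05, 500))  # Benign (all three trees agree)
print(forest_predict(0.12, 600))  # Benign (two of three trees vote Benign)
```

The second example shows why the forest beats any single tree: tree_a alone would have said "Malignant", but it is outvoted by the other two.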
Doing some Classification
Now let’s try to train the computer to make sense of that data we briefly examined. Head over to the Classify tab. This view allows us to try out various classification algorithms. First, we will select the algorithm we want to use by clicking “Choose” then navigating down to “Trees” and selecting “Random Forest”. For the purposes of this article it will be enough to use the default settings.
Now, under the “Test options” section on the left of the screen, we choose “Percentage split” and leave the default value at 66%. This basically tells Weka to use two thirds of our data set for training and the remaining set of records for testing. Finally, we want to change the value of the dropdown menu below the “Test options” section to “Diagnosis”. This tells Weka what our target class is.
We are now ready to train our Random Forest classifier and evaluate its performance on our test data set. Click on the “Start” button and after a few seconds you should see results appearing in the “Classifier Output” section on the right.
If you scroll up the output you will see a line with the text “Correctly Classified Instances” and to the right of it you will see a percentage of about 95.8%. This tells us that our classifier was able to predict the correct diagnosis 95.8% of the time on our test data set.
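The "Correctly Classified Instances" figure is nothing more than the fraction of test records the classifier got right. A tiny sketch with made-up labels (not Weka's actual output):

```python
# Toy example: compare predicted labels against the true diagnoses.
predicted = ["B", "B", "M", "M", "B", "M", "B", "B", "M", "B"]
actual    = ["B", "B", "M", "B", "B", "M", "B", "M", "M", "B"]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(f"{correct}/{len(actual)} correct = {accuracy:.0%}")  # 8/10 correct = 80%
```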
Not too bad for our simple example! Of course, if this was part of a real diagnostic support solution we would have to use much larger datasets and be very careful and rigorous about the quality of our investigations and of the classifier’s predictions. In the end, you should bear in mind however that no classifier is 100% accurate; a doctor should always be consulted!
In this article, I gave you a very quick and simplistic glance at machine learning and the classification problem in particular. For the sake of simplicity, I purposely left out many considerations such as the choice of our training and test dataset sizes, dealing with overfitting problems, or applying different weights to classification errors (e.g., do we accept lower overall accuracy in order to minimize malignant tumors being falsely classified as benign?). We also did not address any of the mathematics behind the algorithms we used.
I hope, however, that you got a taste of why machine learning is such an interesting topic and that I got you thinking about how this fascinating branch of computer science might help you solve your specific problems. If the topic interests you, have a look at the further reading section where you will find links to the data set we used and to the University of Waikato’s Weka tool. You will also find interesting material about Decision Trees and the Random Forest algorithm.
If you have any questions or comments, I would be happy to hear from you, so please do not hesitate to drop me an email. I wish you a happy machine learning journey!