June 23, 2020
Machine Learning with GoLearn
@anthonycorletti

It's really easy to build a k-nearest-neighbors (KNN) classifier in Go using golearn.

I've been looking for ways to write more Go, especially where it offers an alternative to languages and frameworks I already know. Since I tend to rely on Python microservices and Python's rich ecosystem of ML libraries, machine learning libraries in Go were a natural thing to look for.

I'm writing this post using go1.14.4, so let's dive into installing golearn and writing a quick example k-nearest-neighbors implementation.

To install golearn, work through the following steps:

  1. Your machine needs a C++ compiler; check by running $ g++ --version. If you're using a Mac, Xcode should already make this available to you.
  2. Install gonum's BLAS package (a Go package, not the OpenBLAS C library itself): go get github.com/gonum/blas
  3. Install golearn: go get -t -u -v github.com/sjwhitworth/golearn
  4. Complete the installation: cd $GOPATH/src/github.com/sjwhitworth/golearn; go get -t -u -v ./...

To validate the install, check that each of those commands completes without error.

Now let's get building!

cd ~
mkdir go-knn
cd go-knn
go mod init example.com/go-knn
touch main.go

All we've done so far is create a working directory for our package and initialize a Go module in it, using an example placeholder for the module name.

You should only have two files in your tree and your go.mod should look like the following.

$ tree
.
├── go.mod
└── main.go

$ cat go.mod
module example.com/go-knn

go 1.14

Let's start by scaffolding out our main function, with one print statement per step we plan to implement.

package main

import (
	"fmt"
)

func main() {
	fmt.Println("Load our csv data")

	fmt.Println("Initialize our KNN classifier")

	fmt.Println("Perform a training-test split")

	fmt.Println("Calculate the Euclidean distance and return the most popular label")

	fmt.Println("Print our summary metrics")
}

Now if we run go run main.go we should see the following in stdout

$ go run main.go
Load our csv data
Initialize our KNN classifier
Perform a training-test split
Calculate the Euclidean distance and return the most popular label
Print our summary metrics

Great, now download our example dataset, the iris data from golearn's examples.

curl -o iris_headers.csv https://raw.githubusercontent.com/sjwhitworth/golearn/master/examples/datasets/iris_headers.csv
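
Before handing the file to golearn, it can help to look at its shape. The following is a minimal sketch using only the standard library. It assumes the file has a header row, four numeric measurement columns, and the species label in the last column; the column names and sample rows embedded here are illustrative assumptions, not a verified copy of iris_headers.csv, and labelCounts is a hypothetical helper name.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sampleCSV is a hypothetical excerpt in the shape iris_headers.csv is
// assumed to have: four measurements followed by a species label.
const sampleCSV = `Sepal length,Sepal width,Petal length,Petal width,Species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.0,3.6,1.4,0.2,Iris-setosa
`

// labelCounts parses CSV text with a header row and counts how often each
// value appears in the last column, which is treated as the class label.
func labelCounts(csvText string) ([]string, map[string]int, error) {
	records, err := csv.NewReader(strings.NewReader(csvText)).ReadAll()
	if err != nil {
		return nil, nil, err
	}
	if len(records) == 0 {
		return nil, nil, fmt.Errorf("no rows found")
	}
	header := records[0]
	counts := make(map[string]int)
	for _, row := range records[1:] {
		counts[row[len(row)-1]]++
	}
	return header, counts, nil
}

func main() {
	header, counts, err := labelCounts(sampleCSV)
	if err != nil {
		panic(err)
	}
	fmt.Println("columns:", header)
	fmt.Println("label counts:", counts)
}
```

To run this against the real download, read the file's contents into a string (for example with ioutil.ReadFile) and pass that in place of sampleCSV.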

Now let's fill in our imports for golearn and the relevant code

package main

import (
	"fmt"
	"github.com/sjwhitworth/golearn/base"
	"github.com/sjwhitworth/golearn/evaluation"
	"github.com/sjwhitworth/golearn/knn"
)

func main() {
	fmt.Println("Load our csv data")
	rawData, err := base.ParseCSVToInstances("iris_headers.csv", true)
	if err != nil {
		panic(err)
	}

	fmt.Println("Initialize our KNN classifier")
	cls := knn.NewKnnClassifier("euclidean", "linear", 2)

	fmt.Println("Perform a training-test split")
	trainData, testData := base.InstancesTrainTestSplit(rawData, 0.50)
	cls.Fit(trainData)

	fmt.Println("Calculate the Euclidean distance and return the most popular label")
	predictions, err := cls.Predict(testData)
	if err != nil {
		panic(err)
	}
	fmt.Println(predictions)

	fmt.Println("Print our summary metrics")
	confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
	if err != nil {
		panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
	}
	fmt.Println(evaluation.GetSummary(confusionMat))
}

Now run go run main.go and you should see output similar to the following. The train/test split is randomized, so your exact row counts and metrics will differ.

$ go run main.go
Load our csv data
Initialize our KNN classifier
Perform a training-test split
Calculate the Euclidean distance and return the most popular label
Instances with 88 row(s) 1 attribute(s)
Attributes:
*	CategoricalAttribute("Species", [Iris-setosa Iris-versicolor Iris-virginica])

Data:
	Iris-setosa
	Iris-virginica
	Iris-virginica
	Iris-versicolor
	Iris-setosa
	Iris-virginica
	Iris-setosa
	Iris-versicolor
	Iris-setosa
	Iris-setosa
	Iris-versicolor
	Iris-versicolor
	Iris-versicolor
	Iris-setosa
	Iris-virginica
	Iris-setosa
	Iris-setosa
	Iris-setosa
	Iris-virginica
	Iris-versicolor
	Iris-setosa
	Iris-setosa
	Iris-versicolor
	Iris-versicolor
	Iris-virginica
	Iris-virginica
	Iris-setosa
	Iris-virginica
	Iris-versicolor
	Iris-virginica
	...
58 row(s) undisplayed
Print our summary metrics
Reference Class	True Positives	False Positives	True Negatives	Precision	Recall	F1 Score
---------------	--------------	---------------	--------------	---------	------	--------
Iris-setosa	30		0		58		1.0000		1.0000	1.0000
Iris-virginica	28		3		56		0.9032		0.9655	0.9333
Iris-versicolor	26		1		58		0.9630		0.8966	0.9286
Overall accuracy: 0.9545

So what does it all mean? Basically, we're measuring how accurately our classifier identifies the species of iris flowers in the test half of the data we provided.

To do this we can use the F1 score, reported per class in the summary, as our primary metric. From Wikipedia:

The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
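
As a worked check of that definition, here is a minimal sketch that recomputes the Iris-virginica row of the summary table by hand. The true-positive and false-positive counts come straight from the table. The false-negative count is not printed in the summary, so fn = 1 is my inference from the reported recall of 0.9655 (28/29) and should be read as an assumption. The function names are my own, not golearn's.

```go
package main

import "fmt"

// precision is the share of predicted positives that were correct.
func precision(tp, fp int) float64 {
	return float64(tp) / float64(tp+fp)
}

// recall is the share of actual positives that the classifier found.
func recall(tp, fn int) float64 {
	return float64(tp) / float64(tp+fn)
}

// f1 is the harmonic mean of precision p and recall r.
func f1(p, r float64) float64 {
	return 2 * p * r / (p + r)
}

func main() {
	// tp and fp are taken from the Iris-virginica row of the summary.
	// fn is inferred from the reported recall and is an assumption.
	tp, fp, fn := 28, 3, 1

	p := precision(tp, fp)
	r := recall(tp, fn)
	fmt.Printf("precision=%.4f recall=%.4f f1=%.4f\n", p, r, f1(p, r))
	// prints: precision=0.9032 recall=0.9655 f1=0.9333
}
```

The result matches the 0.9333 that golearn printed for that class. Note that this sketch does no guarding against a class with zero predictions, where the divisions would produce NaN.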

KNNs are a good fit for classification problems, where each input needs to be assigned one of a fixed set of labels. Sound identification, image recognition, and classifying internet traffic patterns are all examples.
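
To make the idea behind the classifier concrete, here is a minimal sketch of KNN in plain Go: measure the Euclidean distance from a query point to every training point, take the k closest, and return the most popular label among them. This is not golearn's implementation. The sample type, the predict function, and the two-feature data points are all hypothetical, chosen only to illustrate the technique with k = 2 as in the post.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sample is a hypothetical labeled point; golearn uses its own types.
type sample struct {
	features []float64
	label    string
}

// euclidean returns the straight-line distance between two vectors of
// equal length.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// predict returns the most common label among the k training samples
// closest to query. On a tie, the label that reached the winning count
// first (scanning nearest to farthest) is kept.
func predict(train []sample, query []float64, k int) string {
	// Sort a copy so the caller's slice order is left untouched.
	sorted := make([]sample, len(train))
	copy(sorted, train)
	sort.SliceStable(sorted, func(i, j int) bool {
		return euclidean(sorted[i].features, query) < euclidean(sorted[j].features, query)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	votes := make(map[string]int)
	best := ""
	for _, s := range sorted[:k] {
		votes[s.label]++
		if votes[s.label] > votes[best] {
			best = s.label
		}
	}
	return best
}

func main() {
	// Made-up (petal length, petal width) points, for illustration only.
	train := []sample{
		{[]float64{1.4, 0.2}, "Iris-setosa"},
		{[]float64{1.3, 0.2}, "Iris-setosa"},
		{[]float64{4.7, 1.4}, "Iris-versicolor"},
		{[]float64{4.5, 1.5}, "Iris-versicolor"},
		{[]float64{6.0, 2.5}, "Iris-virginica"},
		{[]float64{5.1, 1.9}, "Iris-virginica"},
	}
	fmt.Println(predict(train, []float64{1.5, 0.3}, 2)) // prints: Iris-setosa
	fmt.Println(predict(train, []float64{5.8, 2.2}, 2)) // prints: Iris-virginica
}
```

This sketch scans every training point for each query, which is fine for a dataset the size of iris but grows linearly with the training set.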

It's super easy to get started writing ML applications with golearn. There's not a whole lot of code involved, and the biggest question you have to ask yourself (and your team) is whether you're ready to switch to a language ecosystem whose ML support is less mature than that of other languages (like Python).