Machine Learning with GoLearn

Anthony Corletti (@anthonycorletti)
It's really easy to build a K-nearest-neighbors implementation in Go using golearn.
I've been searching for ways to write more Go, especially alternatives to familiar languages and frameworks, and wanted to find machine learning libraries in Go because I tend to rely on Python microservices and Python's rich ecosystem of ML libraries.
I'm writing this post using go1.14.4, so let's dive into installing golearn and writing up a quick example K-nearest-neighbors implementation.
To install golearn, make sure the following happens:

- Your machine has to have a C++ compiler; check by running `g++ --version`. If you're using a Mac, Xcode should already have this available to you.
- Install OpenBLAS:

```shell
go get github.com/gonum/blas
```

- Install golearn:

```shell
go get -t -u -v github.com/sjwhitworth/golearn
```

- Complete the installation:

```shell
cd $GOPATH/src/github.com/sjwhitworth/golearn; go get -t -u -v ./...
```

As validation, those instructions should complete without error.
Now let's get building!
```shell
cd ~
mkdir go-knn
cd go-knn
go mod init example.com/go-knn
touch main.go
```
All we've done so far is create a working directory for our package and initialize a Go module in it, using an example placeholder for the module name.
You should only have two files in your tree, and your `go.mod` should look like the following.

```shell
$ tree
.
├── go.mod
└── main.go
$ cat go.mod
module example.com/go-knn

go 1.14
```
Let's start by scaffolding out our main function:
```go
package main

import (
	"fmt"
)

func main() {
	fmt.Println("Load our csv data")
	fmt.Println("Initialize our KNN classifier")
	fmt.Println("Perform a training-test split")
	fmt.Println("Calculate the euclidean distance and return the most popular label")
	fmt.Println("Print our summary metrics")
}
```
Now if we run `go run main.go` we should see the following in stdout:

```shell
$ go run main.go
Load our csv data
Initialize our KNN classifier
Perform a training-test split
Calculate the euclidean distance and return the most popular label
Print our summary metrics
```
Great, now download our test dataset from golearn's examples.
```shell
curl -o iris_headers.csv https://raw.githubusercontent.com/sjwhitworth/golearn/master/examples/datasets/iris_headers.csv
```
Now let's fill in our imports for golearn and the relevant code:
```go
package main

import (
	"fmt"

	"github.com/sjwhitworth/golearn/base"
	"github.com/sjwhitworth/golearn/evaluation"
	"github.com/sjwhitworth/golearn/knn"
)

func main() {
	fmt.Println("Load our csv data")
	rawData, err := base.ParseCSVToInstances("iris_headers.csv", true)
	if err != nil {
		panic(err)
	}

	fmt.Println("Initialize our KNN classifier")
	cls := knn.NewKnnClassifier("euclidean", "linear", 2)

	fmt.Println("Perform a training-test split")
	trainData, testData := base.InstancesTrainTestSplit(rawData, 0.50)
	cls.Fit(trainData)

	fmt.Println("Calculate the euclidean distance and return the most popular label")
	predictions, err := cls.Predict(testData)
	if err != nil {
		panic(err)
	}
	fmt.Println(predictions)

	fmt.Println("Print our summary metrics")
	confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
	if err != nil {
		panic(fmt.Sprintf("Unable to get confusion matrix: %s", err.Error()))
	}
	fmt.Println(evaluation.GetSummary(confusionMat))
}
```
Now run `go run main.go` and you should see output similar to the following:

```shell
$ go run main.go
Load our csv data
Initialize our KNN classifier
Perform a training-test split
Calculate the euclidean distance and return the most popular label
Instances with 88 row(s) 1 attribute(s)
Attributes:
*	CategoricalAttribute("Species", [Iris-setosa Iris-versicolor Iris-virginica])

Data:
	Iris-setosa
	Iris-virginica
	Iris-virginica
	Iris-versicolor
	Iris-setosa
	Iris-virginica
	Iris-setosa
	Iris-versicolor
	Iris-setosa
	Iris-setosa
	Iris-versicolor
	Iris-versicolor
	Iris-versicolor
	Iris-setosa
	Iris-virginica
	Iris-setosa
	Iris-setosa
	Iris-setosa
	Iris-virginica
	Iris-versicolor
	Iris-setosa
	Iris-setosa
	Iris-versicolor
	Iris-versicolor
	Iris-virginica
	Iris-virginica
	Iris-setosa
	Iris-virginica
	Iris-versicolor
	Iris-virginica
	...
58 row(s) undisplayed
Print our summary metrics
Reference Class  True Positives  False Positives  True Negatives  Precision  Recall  F1 Score
---------------  --------------  ---------------  --------------  ---------  ------  --------
Iris-setosa      30              0                58              1.0000     1.0000  1.0000
Iris-virginica   28              3                56              0.9032     0.9655  0.9333
Iris-versicolor  26              1                58              0.9630     0.8966  0.9286
Overall accuracy: 0.9545
```
So what does it all mean? Basically, we're measuring how accurately our classifier identifies types of iris flowers based on the data we provided.
To do this we can use the F1 score as our primary metric for measuring accuracy. From Wikipedia:
> The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
KNNs work really well for classification problems, so any dataset that needs a classification as output, e.g. sound identification, image recognition, or internet traffic patterns, is a great fit.
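To show what the classifier is doing under the hood, here's a toy sketch of the idea in plain Go (not golearn's implementation): compute the Euclidean distance from the query to every training point, then take a majority vote over the k nearest. The 2-D points and labels are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sample is one labeled training point.
type sample struct {
	features []float64
	label    string
}

// euclidean returns the Euclidean distance between two equal-length vectors.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// classify sorts the training set by distance to the query point and
// returns the most common label among the k nearest neighbors.
func classify(train []sample, query []float64, k int) string {
	sort.Slice(train, func(i, j int) bool {
		return euclidean(train[i].features, query) < euclidean(train[j].features, query)
	})
	votes := map[string]int{}
	for _, s := range train[:k] {
		votes[s.label]++
	}
	best, bestCount := "", 0
	for label, n := range votes {
		if n > bestCount {
			best, bestCount = label, n
		}
	}
	return best
}

func main() {
	// Hypothetical 2-D training data standing in for the iris measurements.
	train := []sample{
		{[]float64{1.0, 1.1}, "a"},
		{[]float64{1.2, 0.9}, "a"},
		{[]float64{5.0, 5.2}, "b"},
		{[]float64{4.8, 5.1}, "b"},
	}
	// The query sits next to the two "a" points, so with k=3 the vote is 2-1 for "a".
	fmt.Println(classify(train, []float64{1.1, 1.0}, 3)) // a
}
```

golearn does considerably more (vectorized distance kernels, optional KD-trees, the "linear" search mode we passed to `NewKnnClassifier`), but the core loop is this distance-then-vote pattern.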
It's super easy to get started writing ML applications with golearn. There's not a whole lot of code involved, and the biggest question you have to ask yourself (and your team) is whether or not you're ready to switch into a language ecosystem without as much robust ML support as other languages (like Python).