August 11, 2020
Text Processing in Ruby
@anthonycorletti

A lot of the text-processing and analytics software I've come across is written in Python, but not so much in Ruby, so I've decided to share an example of how to algorithmically parse lots of text with plain Ruby code – no dependencies required!

The problem statement is simple.

Given a newline-delimited text file, with each line split into fields by commas, determine the m most common strings of length n (where a string is n consecutive fields joined together), accounting for multiple equally common strings.
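As a toy illustration (a made-up miniature input, not the real dataset), take n = 2 and m = 1 with this three-line file:

a,b,c,d
b,c,a
a,b,c

Sliding a two-field window across each row yields ab, bc, cd, bc, ca, ab, bc – so the single most common string of length 2 is bc, with three occurrences.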

Let's get hacking 🤓

Start off by defining a Ruby class and some functions.

require 'csv'

class Pathfinder
    def initialize
    end

    def delimited_data
    end

    def find_freq_path
    end

    def limiter
    end
end

We'll be using Ruby's built-in CSV library to do the file parsing, and we'll define a few functions:

  • The initialize function will set a few attributes: :input_filename, :target_length (n), :col_delimiter, and :limit_results (m).
  • The delimited_data function will return the parsed file contents that Ruby has read into memory, so we can iterate over the rows. Let's assume for the purposes of this post that the file fits in memory.
  • find_freq_path will slide a window of n fields across each row to build up a frequency counter of the most common strings of length n.
  • limiter will limit our results to the top m results, the top m being the strings that appear most frequently.

require 'csv'

class Pathfinder
    attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

    def initialize(input_filename: 'data.txt',
                   col_delimiter: ',',
                   target_length: 5,
                   limit_results: 1)
        @input_filename = input_filename
        @col_delimiter = col_delimiter
        @target_length = target_length
        @limit_results = limit_results
    end

    # ...
end

Basically this is telling us that we'll be parsing data.txt (rows delimited by newlines and columns delimited by commas) for the string of length 5 that appears most often.

Additionally, these fields are parameterized in case our delimiters change, we want to use another file, we want a different target length, or we want a different number of results.
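For example, if we later pointed the class at a tab-delimited file (the filename here is hypothetical) and wanted the top 3 strings of 4 fields each, only the constructor call would change:

pathfinder = Pathfinder.new(input_filename: 'other_data.tsv', # hypothetical file
                            col_delimiter: "\t",
                            target_length: 4,
                            limit_results: 3)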

Let's read in our file.

def delimited_data
    CSV.read(@input_filename, col_sep: @col_delimiter)
end
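As a quick sanity check of what CSV.read hands back – assuming a small hypothetical data.txt – the result is an array of rows, each row an array of field strings:

# Given a small hypothetical data.txt containing:
#   a,b,c
#   d,e,f
CSV.read('data.txt', col_sep: ',')
# => [["a", "b", "c"], ["d", "e", "f"]]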

And run that through our find_freq_path function.

def find_freq_path
    result = {}

    delimited_data.each do |row|
        keys = []
        row.each do |col|
            keys.push col
            # Once the window holds @target_length fields, count it and slide forward.
            if keys.length == @target_length
                increment_hash_for_key result, arr_to_str(keys)
                keys.shift
            end
        end
    end

    limiter result
end
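To make the sliding window concrete, here's a trace of a single hypothetical row, assuming @target_length = 2:

# row = ["a", "b", "c", "d"]
# keys = ["a"]           -> window not full yet, keep pushing
# keys = ["a", "b"]      -> count "ab", shift to ["b"]
# keys = ["b", "c"]      -> count "bc", shift to ["c"]
# keys = ["c", "d"]      -> count "cd", shift to ["d"]
# result => { "ab" => 1, "bc" => 1, "cd" => 1 }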

And now let's write our limiter, increment, and hash-key (arr_to_str) functions.

def limiter(hash)
    arr = []

    @limit_results.times do
        # Stop early if we've exhausted the distinct counts.
        break if hash.empty?
        max = hash.max_by { |_key, value| value }
        max_v = max[1]
        # Group every string tied for the current maximum count.
        arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
        hash.delete_if { |_key, value| value == max_v }
    end

    arr
end
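To see the shape of what limiter returns – and how ties are grouped – here's a small worked example on a hypothetical hash, with @limit_results set to 2:

# limiter({ "ab" => 3, "bc" => 3, "cd" => 1 })
#
# Pass 1: max count is 3, so both tied strings are pushed as one group.
# Pass 2: max count is now 1, so "cd" forms the second group.
#
# => [[["ab", 3], ["bc", 3]], [["cd", 1]]]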

def increment_hash_for_key(hash, key)
    hash[key] = hash.fetch(key, 0) + 1
end

def arr_to_str(arr)
    arr.map(&:to_s).join
end
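Both helpers are tiny, but for clarity, here's roughly what they do with sample inputs:

counts = {}
increment_hash_for_key(counts, "abcde")  # illustrative, as if called inside the class
increment_hash_for_key(counts, "abcde")
counts                    # => { "abcde" => 2 }

arr_to_str([1, "b", :c])  # => "1bc"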

When you're all done, it should look like this.

require 'csv'

class Pathfinder
    attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

    def initialize(input_filename: 'data.txt',
                   col_delimiter: ',',
                   target_length: 5,
                   limit_results: 1)
        @input_filename = input_filename
        @col_delimiter = col_delimiter
        @target_length = target_length
        @limit_results = limit_results
    end

    def delimited_data
        CSV.read(@input_filename, col_sep: @col_delimiter)
    end

    def find_freq_path
        result = {}

        delimited_data.each do |row|
            keys = []
            row.each do |col|
                keys.push col
                # Once the window holds @target_length fields, count it and slide forward.
                if keys.length == @target_length
                    increment_hash_for_key result, arr_to_str(keys)
                    keys.shift
                end
            end
        end

        limiter result
    end

    def limiter(hash)
        arr = []

        @limit_results.times do
            # Stop early if we've exhausted the distinct counts.
            break if hash.empty?
            max = hash.max_by { |_key, value| value }
            max_v = max[1]
            # Group every string tied for the current maximum count.
            arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
            hash.delete_if { |_key, value| value == max_v }
        end

        arr
    end

    def increment_hash_for_key(hash, key)
        hash[key] = hash.fetch(key, 0) + 1
    end

    def arr_to_str(arr)
        arr.map(&:to_s).join
    end
end

We can write a test for this class too. In a new file in the same directory as your Pathfinder class, write the following.

require_relative 'pathfinder'

pathfinder = Pathfinder.new
pathfinder.find_freq_path
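If you also want to see what came back, not just time the run, you can print the return value (the nested-array shape comes from limiter):

p pathfinder.find_freq_path
# => e.g. [[["abcde", 12]]] – a hypothetical group of [string, count] pairs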

My data.txt is about 200KB and 10K rows; running that script with target_length and limit_results at their defaults as mentioned above, I got the following result.

$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb  0.20s user 0.05s system 96% cpu 0.258 total

Fairly quick. Let's try it with different parameters.

pathfinder = Pathfinder.new(target_length: 5, limit_results: 5)
pathfinder.find_freq_path

$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb  0.23s user 0.05s system 97% cpu 0.291 total

Negligible time difference here, which is great. We could use limit_results to periodically cache the top results in a frequently read datastore, re-running our text-processing script whenever we've accumulated more rows in our file.
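As a minimal sketch of that caching idea – assuming a flat JSON file stands in for the datastore, and with a made-up cache filename – we could dump the latest results after each run:

require 'json'
require_relative 'pathfinder'

results = Pathfinder.new(limit_results: 5).find_freq_path
# freq_cache.json is a hypothetical cache location.
File.write('freq_cache.json', JSON.generate(results))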

I hope this example of writing a sample text-processing class in Ruby has shown that you can write performant statistical and analytics tools with Ruby – so if you're writing a lot of API code in Rails, you don't always have to switch to Python-based libraries for data analytics.