Text Processing in Ruby

Anthony Corletti
I've come across a lot of text-based processing and analytics software written in python, but not so much in ruby, so I've decided to share an example of how to algorithmically parse lots of text with plain ruby code – no dependencies required!

The problem statement is simple.

Given a newline-delimited text file, with each line's fields separated by commas, determine the m most common strings of length n, accounting for ties when multiple strings are equally common.
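
For example, given a hypothetical data.txt like the one below, with a target length of 2 and a result limit of 1, the most common string would be ab, since the two-field window ["a", "b"] appears twice ("string of length n" here means the concatenation of n consecutive fields in a row):

a,b,c
a,b,d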

Let's get hacking 🤓

Start off by defining a ruby class and some functions.

require 'csv'

class Pathfinder
  def initialize
  end

  def delimited_data
  end

  def find_freq_path
  end

  def limiter
  end
end

We'll be using ruby's built-in csv library to do the file parsing, and defining a few functions:

  • The initialize function will set a few attributes: :input_filename, :target_length (n), :col_delimiter, and :limit_results (m).
  • The delimited_data function will return an object that ruby has read into memory so that we can iterate over the results. Let's assume for the purposes of this post that the file fits into memory.
  • find_freq_path will use a rolling window to build up a frequency counter of the most common strings of length n.
  • limiter will limit our results to the top m results, with the top m being the strings that appear most frequently.

require 'csv'

class Pathfinder
  attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

  def initialize(input_filename: 'data.txt',
                 col_delimiter: ',',
                 target_length: 5,
                 limit_results: 1)
    @input_filename = input_filename
    @col_delimiter = col_delimiter
    @target_length = target_length
    @limit_results = limit_results
  end
  # ...
end

Basically this is telling us that, by default, we'll be parsing data.txt (rows delimited by newlines, columns delimited by commas) for the string of length 5 that appears most often.

Additionally, these fields are parameterized in case our delimiters change, we want to use another file, we change our target length, or we change the number of results we want.
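
For example, a hypothetical tab-separated file (the filename other.tsv is just an illustration) could be targeted like so:

pathfinder = Pathfinder.new(input_filename: 'other.tsv',
                            col_delimiter: "\t",
                            target_length: 3,
                            limit_results: 2)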

Let's read in our file.

def delimited_data
  CSV.read(@input_filename, col_sep: @col_delimiter)
end
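
CSV.read returns an array of rows, each of which is an array of column strings. For a hypothetical two-line file containing a,b,c and d,e,f, it would look like this:

[["a", "b", "c"], ["d", "e", "f"]]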

And run that through our find_freq_path function.

def find_freq_path
  result = {}
  delimited_data.each do |row|
    # Slide a window of @target_length columns across each row,
    # counting each concatenated window in the result hash.
    keys = []
    row.each do |col|
      keys.push col
      if keys.length == @target_length
        increment_hash_for_key result, arr_to_str(keys)
        keys.shift
      end
    end
  end
  limiter result
end
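
To see how the window slides, here's a trace for a single hypothetical row with @target_length set to 2:

row = ["a", "b", "c"]
# keys = ["a"]        -> too short, keep going
# keys = ["a", "b"]   -> count "ab", shift to ["b"]
# keys = ["b", "c"]   -> count "bc", shift to ["c"]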

And now let's write our limiter, increment, and hash-key (arr_to_str) functions.

def limiter(hash)
  arr = []
  @limit_results.times do
    # Stop early if we've exhausted the hash (i.e. there are fewer
    # distinct frequencies than @limit_results).
    break if hash.empty?
    max = hash.max_by { |_key, value| value }
    max_v = max[1]
    # Collect every string tied at this frequency, then remove them
    # so the next iteration finds the next-most-frequent strings.
    arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
    hash.delete_if { |_key, value| value == max_v }
  end
  arr
end

def increment_hash_for_key(hash, key)
  !hash[key] ? hash[key] = 1 : hash[key] += 1
end

def arr_to_str(arr)
  arr.map { |s| s.to_s }.join('')
end
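
The return value of limiter is an array with one entry per frequency tier, where each tier is an array of [string, count] pairs. For instance, a hypothetical run with limit_results: 2 might return two strings tied at the top count, followed by the next tier:

[[["abcde", 12], ["bcdef", 12]], [["cdefg", 9]]]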

When you're all done, it should look like this.

require 'csv'

class Pathfinder
  attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

  def initialize(input_filename: 'data.txt',
                 col_delimiter: ',',
                 target_length: 5,
                 limit_results: 1)
    @input_filename = input_filename
    @col_delimiter = col_delimiter
    @target_length = target_length
    @limit_results = limit_results
  end

  def delimited_data
    CSV.read(@input_filename, col_sep: @col_delimiter)
  end

  def find_freq_path
    result = {}
    delimited_data.each do |row|
      # Slide a window of @target_length columns across each row.
      keys = []
      row.each do |col|
        keys.push col
        if keys.length == @target_length
          increment_hash_for_key result, arr_to_str(keys)
          keys.shift
        end
      end
    end
    limiter result
  end

  def limiter(hash)
    arr = []
    @limit_results.times do
      break if hash.empty?
      max = hash.max_by { |_key, value| value }
      max_v = max[1]
      arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
      hash.delete_if { |_key, value| value == max_v }
    end
    arr
  end

  def increment_hash_for_key(hash, key)
    !hash[key] ? hash[key] = 1 : hash[key] += 1
  end

  def arr_to_str(arr)
    arr.map { |s| s.to_s }.join('')
  end
end

We can write a test for this class too. In a new file in the same dir as your pathfinder class, write the following.

require_relative 'pathfinder'
pathfinder = Pathfinder.new
pathfinder.find_freq_path
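
Note that this script just runs the computation and discards the return value; if you also want to see the result, print it instead:

p pathfinder.find_freq_path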

My data.txt is about 200KB and 10K rows; running that script with target_length and limit_results set as above, I got the following result.

$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb 0.20s user 0.05s system 96% cpu 0.258 total

Fairly quick. Let's try with different parameters.

pathfinder = Pathfinder.new(target_length: 5, limit_results: 5)
pathfinder.find_freq_path

$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb 0.23s user 0.05s system 97% cpu 0.291 total

Negligible time differences here, which is great. We could use limit_results to periodically cache the top results into a current-state, frequently read datastore, and re-run our text processing script whenever we've accumulated more rows in our file.
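
As a minimal sketch of that idea, assuming a plain JSON file stands in for the datastore (the cache_results helper and results.json filename are hypothetical, not part of the class above):

require 'json'
require_relative 'pathfinder'

# Hypothetical helper: recompute the top results and cache them to a
# JSON file that a reader process could poll.
def cache_results(filename: 'results.json')
  results = Pathfinder.new(limit_results: 5).find_freq_path
  File.write(filename, JSON.generate(results))
end

cache_results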

I hope this example of writing a sample text processing class in ruby has shown that you can write performant statistical analytics tools with ruby – so if you're writing a lot of api code in rails, you don't always have to switch to python-based libraries for data analytics.