I've come across a lot of text-based processing and analytics software written in python, but not so much in ruby, so I've decided to share an example of how to algorithmically parse lots of text with plain ruby code – no external dependencies required!
The problem statement is simple.
Given a newline-delimited text file, with each line's fields separated by commas, determine the most common string of length N (for a configurable N), accounting for ties when multiple strings are equally common.
Let's get hacking 🤓
Start off by defining a ruby class and some functions.
require 'csv'

class Pathfinder
  def initialize
  end

  def delimited_data
  end

  def find_freq_path
  end

  def limiter
  end
end
We'll be using ruby's built-in csv library to do the file parsing, and defining a few functions:
- The initialize function will set a few attributes: :input_filename, :target_length (default 5), :col_delimiter, and :limit_results (default 1).
- The delimited_data function will return an object that ruby has read into memory such that we can iterate over the results. Let's assume for the purposes of this post that the file can be read into memory.
- find_freq_path will use a sliding-window algorithm to build up a frequency counter of strings of length :target_length.
- limiter will limit our results to the top :limit_results results, with the top being the strings that appear most frequently.
require 'csv'

class Pathfinder
  attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

  def initialize(input_filename: 'data.txt',
                 col_delimiter: ',',
                 target_length: 5,
                 limit_results: 1)
    @input_filename = input_filename
    @col_delimiter = col_delimiter
    @target_length = target_length
    @limit_results = limit_results
  end

  # ...
end
Basically this is telling us that we'll be parsing data.txt (rows delimited by newlines and columns delimited by commas) for the string of length 5 that appears the most.
Additionally, these fields are keyword arguments, so they can be overridden in case our delimiters change, we want to use another file, we change our target length, or we want a different number of results.
Let's read in our file.
def delimited_data
  CSV.read(@input_filename, col_sep: @col_delimiter)
end
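To make the shape of that return value concrete, here's a small sketch using CSV.parse, which works like CSV.read but takes a string instead of a filename (the sample rows are made up for illustration):

```ruby
require 'csv'

# CSV.parse behaves like CSV.read but reads from a string,
# which keeps this sketch self-contained.
rows = CSV.parse("a,b,c\nd,e,f", col_sep: ',')
# rows is an array of arrays, one inner array per line:
# [["a", "b", "c"], ["d", "e", "f"]]
```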
And run that through our find_freq_path function:
def find_freq_path
  result = {}
  delimited_data.each do |row|
    keys = []
    row.each do |col|
      keys.push col
      if keys.length == @target_length
        increment_hash_for_key result, arr_to_str(keys)
        keys.shift
      end
    end
  end
  limiter result
end
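To see the sliding window in action, here's a standalone sketch of the inner loop over one hypothetical row, with the target length hard-coded to 3 instead of @target_length:

```ruby
# Standalone version of the inner loop: slide a window of 3 columns
# across one row, counting each joined window string.
result = {}
row = %w[a b c a b c]
target_length = 3
keys = []
row.each do |col|
  keys.push col
  if keys.length == target_length
    key = keys.join('')              # same effect as arr_to_str(keys)
    result[key] = (result[key] || 0) + 1
    keys.shift                       # slide the window forward one column
  end
end
# result => {"abc"=>2, "bca"=>1, "cab"=>1}
```

Each column enters the window on the right and the oldest column falls off the left, so every run of 3 consecutive columns gets counted exactly once.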
And now let's write our limiter, increment, and hash-key (arr_to_str) functions.
def limiter(hash)
  arr = []
  @limit_results.times do
    break if hash.empty? # guard against asking for more result tiers than exist
    max = hash.max_by { |_key, value| value }
    max_v = max[1]
    arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
    hash.delete_if { |_key, value| value == max_v }
  end
  arr
end

def increment_hash_for_key(hash, key)
  !hash[key] ? hash[key] = 1 : hash[key] += 1
end

def arr_to_str(arr)
  arr.map { |s| s.to_s }.join('')
end
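Here's how limiter behaves on a hypothetical frequency hash when asking for the top 2 result tiers – note that strings tied at the same count come back grouped together, which is how we account for multiple most-common strings (this is a freestanding version with the limit passed as a parameter rather than read from @limit_results, so it can run outside the class):

```ruby
# Freestanding sketch of limiter: peel off the highest-count entries,
# one frequency tier at a time, up to limit_results tiers.
def limiter(hash, limit_results)
  arr = []
  limit_results.times do
    break if hash.empty?
    max_v = hash.max_by { |_key, value| value }[1]
    arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
    hash.delete_if { |_key, value| value == max_v }
  end
  arr
end

counts = { 'abc' => 3, 'bcd' => 3, 'cde' => 1 }
top_two = limiter(counts, 2)
# => [[["abc", 3], ["bcd", 3]], [["cde", 1]]]
```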
When you're all done it should look like this.
require 'csv'

class Pathfinder
  attr_accessor :input_filename, :target_length, :col_delimiter, :limit_results

  def initialize(input_filename: 'data.txt',
                 col_delimiter: ',',
                 target_length: 5,
                 limit_results: 1)
    @input_filename = input_filename
    @col_delimiter = col_delimiter
    @target_length = target_length
    @limit_results = limit_results
  end

  def delimited_data
    CSV.read(@input_filename, col_sep: @col_delimiter)
  end

  def find_freq_path
    result = {}
    delimited_data.each do |row|
      keys = []
      row.each do |col|
        keys.push col
        if keys.length == @target_length
          increment_hash_for_key result, arr_to_str(keys)
          keys.shift
        end
      end
    end
    limiter result
  end

  def limiter(hash)
    arr = []
    @limit_results.times do
      break if hash.empty? # guard against asking for more result tiers than exist
      max = hash.max_by { |_key, value| value }
      max_v = max[1]
      arr.push hash.map { |k, v| v == max_v ? [k, v] : nil }.compact
      hash.delete_if { |_key, value| value == max_v }
    end
    arr
  end

  def increment_hash_for_key(hash, key)
    !hash[key] ? hash[key] = 1 : hash[key] += 1
  end

  def arr_to_str(arr)
    arr.map { |s| s.to_s }.join('')
  end
end
We can write a test for this class too. In a new file in the same dir as your pathfinder class, write the following.
require_relative 'pathfinder'
pathfinder = Pathfinder.new
pathfinder.find_freq_path
My data.txt is about 200KB and 10K rows; running that script with the target_length and limit_results defaults mentioned above, I got the following result.
$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb 0.20s user 0.05s system 96% cpu 0.258 total
Fairly quick. Let's try with different parameters.
pathfinder = Pathfinder.new(target_length: 5, limit_results: 5)
pathfinder.find_freq_path
$ time ruby pathfinder_test.rb
ruby pathfinder_test.rb 0.23s user 0.05s system 97% cpu 0.291 total
Negligible time difference here, which is great. We could use limit_results
to cache the most frequent results into a frequently-read datastore, and re-run our text processing script whenever we've accumulated more rows in our file.
I hope this example of writing a sample text processing class in ruby has shown that you can write performant statistical analytics tools with ruby – so if you're writing a lot of api code in rails, you don't always have to switch to python-based libraries for data analytics.