Object Oriented Scraper Backed With Tests Pt. 5

Last night I got the Crawler passing its test for #get_words_by_selector. This morning I realized that when someone sends in a junk selector I want to raise an exception of some kind. Since I don’t know much about Ruby exceptions, I’m doing a little digging. Ruby has both throw/catch and raise/rescue, so what’s the difference between the two?

Using throw/catch for control flow

There’s a great guest post by Avdi Grimm on RubyLearning which covers this topic in depth. To summarize, throw/catch is not really error handling at all; it is mainly used for control flow. In other words, if you need to break out of a deeply nested loop or some other expensive operation, you can throw a symbol which is then caught somewhere higher up the call stack. Initially this rubbed me the wrong way, since I know that things like goto and labels are considered bad practice. Someone else raised this point in the comments, to which Avdi responded:

There is a fundamental difference between throw/catch and goto. Goto, in languages which support it, pays no attention to the stack. Any resources which were allocated before the goto are simply left dangling unless they are manually cleaned up.

throw/catch, like exception handling, unwinds the stack, triggering ensure blocks along the way. So, for example, if you throw inside an open() {…} block, the open file will be closed on the way up to the catch() block.
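To convince myself of what Avdi describes, here is a minimal sketch of throw/catch breaking out of a nested loop. The data and the :found symbol are made up for illustration; the point is that the ensure block still runs while the stack unwinds to the catch.

```ruby
log = []

# catch returns the value passed to throw, or the block's own
# value if nothing was thrown.
found = catch(:found) do
  begin
    [[1, 2], [3, 4], [5, 6]].each do |row|
      row.each do |value|
        # Leaves both loops at once and jumps back to catch(:found).
        throw :found, value if value > 3
      end
    end
    nil # reached only when no value matched
  ensure
    # Runs on the way up to the catch, just like with an exception.
    log << 'cleanup ran'
  end
end

p found
p log
```

Nothing here is an exception: no rescue is involved, and the thrown value simply becomes the return value of the catch block.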

Raising exceptions for everything else

With throw/catch out of the way, that leaves raise/rescue to handle everything else. I’m willing to bet that 99% of error code should be raising exceptions, and throw/catch should only be used in situations where you need the control flow behavior. With that knowledge in hand I need to decide between using one of Ruby’s built-in exceptions and defining one of my own. Let’s define one of our own so we can get that experience under our belt.
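For comparison, the built-in route would look something like the sketch below. The validate_selector helper is hypothetical and not part of Tentacles; it only shows raise with a built-in class (ArgumentError) and a rescue that names that class.

```ruby
# Hypothetical helper, not part of the Crawler: rejects blank selectors.
def validate_selector(selector)
  raise ArgumentError, 'selector must not be empty' if selector.to_s.strip.empty?
  selector
end

outcome =
  begin
    validate_selector('   ')
  rescue ArgumentError => e
    # e.message is the string that was passed to raise.
    "#{e.class}: #{e.message}"
  end

p outcome
```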

Creating an exception subclass in Ruby

One tip I picked up while doing my research into raise and throw is that any exception that doesn’t subclass StandardError will not be caught by a bare rescue. A rescue with no class listed only catches StandardError and its subclasses. Here’s an example to illustrate:

###
# First we define an exception class which doesn't
# inherit from StandardError. As a result it won't
# be caught by a simple rescue. Instead we would
# need to rescue by its class name
###
class MyBadException < Exception
end

def miss_bad_exception
  raise MyBadException.new
rescue
  p "I'll never be called :("
end

miss_bad_exception
# MyBadException: MyBadException
#   from (irb):4:in `miss_bad_exception'
#   from (irb):8
#   from /Users/Rob/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'

# See that calling the method produces an uncaught exception...


###
# Next we'll subclass StandardError. As a result
# we won't have to explicitly define our class name
# for a rescue to work.
###
class MyGoodException < StandardError
end

def save_good_exception
  raise MyGoodException.new
rescue
  p "I'm saved! My hero!"
end

save_good_exception
# "I'm saved! My hero!"

# Yay! Our exception was caught!
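To round out the first example, here is a small sketch of what "rescue by its class name" looks like. The catch_bad_exception method is my own illustrative name; the exception class is the same MyBadException as above, redefined so the snippet runs on its own.

```ruby
class MyBadException < Exception
end

def catch_bad_exception
  raise MyBadException.new
rescue MyBadException => e
  # Naming the class explicitly is what lets rescue see it.
  "Caught #{e.class}"
end

p catch_bad_exception
```

Even though this works, rescuing outside of StandardError is usually something to avoid, since that part of the hierarchy also holds things like Interrupt and SystemExit.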

We’ll call our exception SelectionError to indicate that the provided selector did not return any results. I often consult this chart on RubyLearning when I want to see a list of all the available exception classes. In our case we’ll just inherit from StandardError.

tentacles/lib/selection_error.rb
module Tentacles
  class SelectionError < StandardError
  end
end

I don’t think we actually need to do much more than that. The ability to pass a message comes from the superclass, so I think we’re good to go. Here’s our updated spec:

it "should raise an exception if nothing was returned" do
  expect { @crawler.get_words_by_selector('some-gibberish-selector') }.to raise_error(Tentacles::SelectionError, 'The selector did not return any results!')
end


Initially the test fails, so now we need to update our Crawler to check whether anything was returned and raise the custom exception if not.

Here’s our updated Crawler with the additional require and the updated method.

tentacles/lib/crawler.rb
require 'open-uri'
require 'nokogiri'
require_relative 'selection_error'

module Tentacles
  class Crawler

    attr_reader :doc

    def self.from_uri(uri)
      new(uri)
    end

    def initialize(uri)
      @uri = uri
      # On Ruby 3.0 and later, use URI.open(@uri) here instead.
      @doc = Nokogiri::HTML(open(@uri))
      @counts = Hash.new(0)
    end

    def get_words_by_selector(selector)
      entries = doc.css(selector)
      raise Tentacles::SelectionError,
        'The selector did not return any results!' if entries.empty?
      entries.each do |entry|
        words = words_from_string(entry.content)
        count_frequency(words)
      end

      sorted = @counts.sort_by { |word, count| count }
      sorted.reverse!
      sorted.map { |word, count| "#{word}: #{count}"}
    end

    def get_metadata_by_selector(selector)
      # TODO
    end

  private

    def words_from_string(string)
      string.downcase.scan(/[\w']+/)
    end

    def count_frequency(word_list)
      word_list.each { |word| @counts[word] += 1 }
      @counts
    end
  end
end

All tests passing, we’re good to go :)
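For anyone calling the Crawler, here is a rough sketch of what rescuing the new exception could look like. To keep it runnable without Nokogiri, entries_for is a hypothetical stand-in that looks selectors up in a plain Hash, and the message text is made up for the example. It also shows that the message handling really does come for free from StandardError.

```ruby
module Tentacles
  class SelectionError < StandardError
  end
end

# Hypothetical stand-in for the lookup inside the Crawler: a Hash
# plays the role of the parsed document.
def entries_for(selector, page)
  entries = page.fetch(selector, [])
  raise Tentacles::SelectionError, "Nothing matched #{selector.inspect}" if entries.empty?
  entries
end

page = { 'h1' => ['Hello world'] }

result =
  begin
    entries_for('some-gibberish-selector', page)
  rescue Tentacles::SelectionError => e
    "Rescued: #{e.message}"
  end

# With no message given, the exception falls back to its class name.
default_message = Tentacles::SelectionError.new.message

p result
p default_message
```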

  • Mood: Alert, Awake, Anxious
  • Sleep: 8
  • Hunger: 3
  • Coffee: 0