Sublime Snippet Basics

Yesterday I covered some tips and tricks I’ve learned over the past few months of using Sublime. Something I didn’t touch on is Sublime’s Snippet architecture.

Sublime Text 2 Tips and Shortcuts

I’ve been using Sublime Text 2 for probably two months now and in that time I’ve discovered tons of useful tricks. I figured I should start writing them down for anyone who might be interested. I’ll try to explain the bits that seem esoteric because there are a lot of cool commands which only work in certain contexts.

Failing at Ruby

I’m just getting my ass kicked by Ruby tonight so I don’t have much to show. Just trying to get my metadata scraper to output something currently looks like this:

require 'open-uri'
require 'nokogiri'
require 'mechanize'
require_relative 'selection_error'

module Tentacles
  class Crawler

    attr_reader :doc

    def self.from_uri(uri)
      new(uri)
    end

    def initialize(uri)
      # Create a new instance of Mechanize and grab our page
      @agent = Mechanize.new

      @uri = uri
      @page = @agent.get(@uri)
      @counts = Hash.new(0)
    end

    def words_by_selector(selector, ignored_selector = nil)
      # Find all the links on the page that live inside h1 tags
      post_links = @page.links.find_all { |l| l.attributes.parent.name == 'h1' }
      # Get rid of the first anchor since it's the site header
      post_links.shift
      post_links.each do |link|
        post = link.click
        @doc = post.parser
        nodes = nodes_by_selector(selector)
        nodes.each do |node|
          if ignored_selector
            ignored = node.css(ignored_selector)
            ignored.remove
          end
          words = words_from_string(node.content)
          count_frequency(words)
        end
      end

      sorted = @counts.sort_by { |word, count| count }
      sorted.reverse!
      sorted.map! do |word, count|
        { word: word, count: count }
      end
      { word_count: sorted }
    end

    def metadata_by_selector(selector)
      metadata = { posts: [] }

      # Find all the links on the page that live inside h1 tags
      post_links = @page.links.find_all { |l| l.attributes.parent.name == 'h1' }
      # Get rid of the first anchor since it's the site header
      post_links.shift
      post_links.each do |link|
        post = link.click
        @doc = post.parser
        time = @doc.css('time')[0]
        post_data = {}
        post_data[:date] = { date: time['datetime'] }
        post_data[:stats] = []
        nodes = nodes_by_selector(selector)
        nodes.each do |node|
          node.children.each do |child|
            post_data[:stats].push(child.content)
          end
        end
        metadata[:posts].push(post_data)
      end
      p metadata
    end

  private

    def nodes_by_selector(selector)
      nodes = @doc.css(selector)
      raise Tentacles::SelectionError,
        'The selector did not return any results!' if nodes.empty?
      nodes
    end

    def words_from_string(string)
      string.downcase.scan(/[\w']+/)
    end

    def count_frequency(word_list)
      word_list.each { |word| @counts[word] += 1 }
      @counts
    end
  end
end

Really ugly code that still doesn’t work. My biggest problem with Ruby is that I don’t have very good debugging tools and that frustrates the shit out of me. I’m so used to the visual debuggers in the Chrome Dev tools that doing everything with p or puts is just soul-crushing.

Wrapping Up the Word Count Spider

Yeesh, I gotta stop writing so late at night… Last night I was trying to get my spider to follow all the links on the blog’s archive page and then sum up all the words from every post. Unfortunately I was way too tired to get that to actually work. Tonight I finished that step of the process, but it required some ugly code and refactoring of our unit tests. Without further ado…

Quick Spider Example

require 'mechanize'

# Create a new instance of Mechanize and grab our page
agent = Mechanize.new
page = agent.get('http://robdodson.me/blog/archives/')

# Find all the links on the page that are contained within
# h1 tags.
post_links = page.links.find_all { |l| l.attributes.parent.name == 'h1' }

# Get rid of the first link since it's the site header
post_links.shift

# Follow each link and print out its title
post_links.each do |link|
  post = link.click
  doc = post.parser
  p doc.css('.entry-title').text
end

Having a horrible time getting anything to run tonight. The code above is a continuation of yesterday’s post, except this time we’re finding every link on the page, then following each one and spitting out its title. Using this formula you could build something entirely recursive which acts as a full-blown spider.
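The recursive idea can be sketched without touching the network. Here the links hash is a stand-in for calling agent.get on each URL, and the visited array keeps the spider from chasing its own tail; the names and data are mine, not from the real blog:

```ruby
# Fake link graph standing in for fetched pages: each URL maps to the
# URLs found on that page. A real spider would fetch and parse instead.
links = {
  '/archives/' => ['/post-1/', '/post-2/'],
  '/post-1/'   => ['/post-2/', '/archives/'],
  '/post-2/'   => []
}

visited = []

crawl = lambda do |url|
  # Skip anything we've already seen so cycles don't loop forever
  next if visited.include?(url)
  visited << url
  (links[url] || []).each { |next_url| crawl.call(next_url) }
end

crawl.call('/archives/')
p visited
```

Swap the hash lookup for a page fetch and a link scrape and you have the skeleton of a real spider; the visited check is the part that keeps it from running away.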

Unfortunately getting this to integrate into the existing app is not working for me tonight. Coding anything after 11 pm is usually a bad call, so I’m going to shut it down and try again in the morning.

You should follow me on Twitter here.

  • Mood: Tired, Annoyed
  • Sleep: 6
  • Hunger: 0
  • Coffee: 1

Crawling Pages With Mechanize and Nokogiri

Short post tonight because I spent so much time figuring out the code. It’s late and my brain is firing on about 1 cylinder so it took longer than I expected to get everything working.

The scraper that I’m building is supposed to work like a spider and crawl all of the pages of my blog. I wasn’t sure what the best way to do that was so I started Googling and came up with Mechanize. There are other tools built on top of Mechanize, like Wombat, but since my task is so simple I figured I could just write everything I needed with Mechanize and Nokogiri. It’s usually a better idea to work with simple tools when you’re first grasping concepts so you don’t get lost in the weeds of some high-powered framework.