Ruby Webpage Parser

This is how to parse a website with a Ruby script. Whenever I’m searching the Internet for documents (and some fun content, like manga… yeah, I love manga) I find a lot of interesting content, so much that I can’t view all of it; I generally end up writing scripts to download that multimedia content (mostly YouTube videos) and review it later on my computer, with more time. Maybe you or someone else has the same wish, so that’s why I decided to show how I do this: how to parse a website with a Ruby script. In the video I show the webpage parsing process on a manga catalogue website; after refining the script, we can use it to find out whether some content has changed (and then download it… the new content, obviously).

Explanation

I will use the Ruby language because it is my favorite language to program in; it has many plugins, modules, and many, many gems (learn more about gems at [RubyGems](http://www.rubygems.org/)). But this isn’t a tutorial about how to write code in Ruby; you can use this post as a guide to do the same things in other languages, such as Python with the urllib library or PHP with the php5-curl module.

Ruby has many gems to handle HTTP requests, but I like Curb more than the others (yeah, Curb with a B, not cURL); install it with gem install curb.

Basically you’ll need to retrieve the content of a URL, organize the HTML code (split it by new lines, for example), locate a pattern to filter your search, and finally parse the HTML code with a regular expression. For example, this regular expression could be used to extract a YouTube video identifier (YouTube identifies each video with a string of eleven characters made of letters, digits, hyphens and underscores):

'http://www.youtube.com/watch?v=JUwc69DjZPU&featured=list'.match(/http:\/\/www\.youtube\.com\/watch\?v=([a-zA-Z0-9_-]{11})(.*)/)
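
To make the capture groups easier to see, here is a minimal sketch that wraps the same regular expression in a small helper. The names youtube_video_id and YOUTUBE_URL_PATTERN are mine, made up for this illustration; they are not part of the script from the video.

```ruby
# The same pattern as above: group 1 is the eleven-character identifier,
# group 2 is whatever follows it in the URL.
YOUTUBE_URL_PATTERN = /http:\/\/www\.youtube\.com\/watch\?v=([a-zA-Z0-9_-]{11})(.*)/

# Hypothetical helper: returns the video identifier, or nil when the URL
# does not fit the pattern.
def youtube_video_id(url)
  match = url.match(YOUTUBE_URL_PATTERN)
  match && match[1]
end

puts youtube_video_id('http://www.youtube.com/watch?v=JUwc69DjZPU&featured=list')
# prints "JUwc69DjZPU"
```

Note that the pattern only accepts http://www.youtube.com/ addresses; other forms of the URL would return nil and need their own pattern.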

After getting all the matches, you can use a downloader such as the wget or curl commands to fetch the content you want. For YouTube videos I use [youtube-dl](https://github.com/rg3/youtube-dl/) to download them. Yes, I know there are tools much better than a terminal script, but I prefer these utilities because I can control the whole process and improve the code easily without many dependencies. Anyway, I recommend youtube-dl, a Python script written by Ricardo Garcia Gonzalez, Danny Colligan, Benjamin Johnson and many other people; [ClipGrab](http://www.clipgrab.de/), written by Philipp Schmieder; or just use Java Downloader.
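
As a rough sketch of that last step, this is one way the script could hand each URL over to an external downloader. The helper download_command is hypothetical, and I am assuming that wget and youtube-dl are installed and available on the PATH; the sketch only builds and prints the commands, it does not run them.

```ruby
# Hypothetical helper: choose a downloader for a URL and return the command
# as a list of arguments (assumes youtube-dl and wget are installed).
def download_command(url)
  if url =~ /youtube\.com\/watch\?v=/
    ['youtube-dl', url]
  else
    ['wget', url]
  end
end

urls = [
  'http://www.youtube.com/watch?v=JUwc69DjZPU',
  'http://manga.cixtor.com/'
]

urls.each do |url|
  command = download_command(url)
  puts command.join(' ')
  # Uncomment to really download. Passing the arguments as a list means no
  # shell is involved, so characters like & in the URL need no quoting.
  # system(*command)
end
```

Keeping the command as a list instead of one long string is a design choice: it avoids quoting problems with URLs that contain characters such as & or ?.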

Code

This is the code written in the video, in case you want to try it on your machine. It was originally posted encoded in Base64 because the CMS didn’t understand the code inside it; just copy and paste the following code and save it in a file (with any name):

#!/usr/bin/env ruby
# Install (in Debian systems): apt-get install -y rubygems
require 'rubygems'
# Install (in Debian systems): apt-get install -y libcurl3-dev
# And finally: gem install curb
require 'curb'
#
curl = Curl::Easy.new('http://manga.cixtor.com/')
curl.http_get
content = curl.body_str.split("\n")
curl.close
#
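manga_titles = {}
manga_images = {}
content.each do |linetext|
    # Lines ending in "Title (12)</label>" hold the title and the chapter count
    if (match = linetext.match(/(.*) \((\d+)\)<\/label>/))
        manga_titles[ match[1] ] = match[2]
    # The image pattern is empty here; it has to capture the manga name
    # (group 1) and the cover file name (group 2) for the next line to work
    elsif (match = linetext.match(//))
        manga_images[ match[1] ] = "/img/#{match[1]}/#{match[2]}"
    end
end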
# We can debug the manga_titles and manga_images hashes here
# p manga_titles, manga_images
manga_list = Array.new
manga_titles.each_pair do |index, value|
    manga_list.push({
        :manga_title => index,
        :manga_chapters => value,
        :manga_cover => manga_images[index]
    })
end
# p manga_list
manga_list.each do |manga|
    print "Manga: #{manga[:manga_title]}; "
    print "Chapters: #{manga[:manga_chapters]}; "
    puts "Cover: #{manga[:manga_cover]}"
end
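
To try the line-by-line approach without touching the network, and to see how the result could tell us whether some content has changed, here is a minimal sketch. Everything in it is an assumption made for illustration: the sample lines, the label markup, the helper names parse_titles and changed_titles, and the earlier snapshot are all invented, not taken from the real website.

```ruby
# Sample lines standing in for the downloaded page. The <label> markup is an
# assumption for this sketch, not the real markup of the website.
SAMPLE_LINES = [
  '<label>Sample Manga A (12)</label>',
  '<label>Sample Manga B (30)</label>',
  '<p>some unrelated line</p>'
]

# Hypothetical helper: collect { title => chapter count } from the lines.
def parse_titles(lines)
  titles = {}
  lines.each do |line|
    if (match = line.match(/<label>(.*) \((\d+)\)<\/label>/))
      titles[match[1]] = match[2].to_i
    end
  end
  titles
end

# Hypothetical helper: keep the titles that are new or whose chapter count
# differs from the previous snapshot.
def changed_titles(previous, current)
  current.reject { |title, chapters| previous[title] == chapters }
end

# Invented snapshot from an earlier run; a real script would load it from disk.
previous = { 'Sample Manga A' => 11, 'Sample Manga B' => 30 }
current  = parse_titles(SAMPLE_LINES)

changed_titles(previous, current).each do |title, chapters|
  puts "Updated: #{title} now has #{chapters} chapters"
end
# prints "Updated: Sample Manga A now has 12 chapters"
```

Only the titles returned by changed_titles would then need to be downloaded again.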