Scraping Amazon item offers

In my pet project Dealesque, I am trying to compare all offers on a number of Amazon items, the idea being that it can help decide which offers to use to minimize shipping and total cost. Using Amazon Product Advertising API was the logical first step, but it doesn’t return all the offers for an item. It does however return the “more offers URL” for each item. Hence, the old scrapin’ was due, and none too late!

Plain wget-like action would not suffice, since Amazon is taking care to block unwanted traffic. So, mechanize gem to the rescue! It actually allows you to impersonate a real browser:

agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }

After that, you can navigate the site, click away, read any forms etc.
For scraping, what I actually ended up using was to get the content of the “more offers URL” page and parse it using Nokogiri. Something like:

page = agent.get(more_offers_url)
root = Nokogiri::HTML(page.content.strip)
scrape_content(root)

For the current development stage, this is doing just fine. Unfortunately, for production use it will not suffice. There will probably be some traffic throttling from Amazon and some benchmarking will need to be done to determine the limits. Also, proxying the requests will probably be required too. But, I leave this for some other times.

The result of scraping the offers for picked items:

Dealesque picked items

About these ads
Tagged , , , ,

One thought on “Scraping Amazon item offers

  1. delbi says:

    majstore!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 108 other followers

%d bloggers like this: