Tag Archives: Amazon

Scraping Amazon item offers

In my pet project Dealesque, I am trying to compare all offers on a number of Amazon items, the idea being that it can help decide which offers to use to minimize shipping and total cost. Using Amazon Product Advertising API was the logical first step, but it doesn’t return all the offers for an item. It does however return the “more offers URL” for each item. Hence, the old scrapin’ was due, and none too late!

Plain wget-like action would not suffice, since Amazon is taking care to block unwanted traffic. So, mechanize gem to the rescue! It actually allows you to impersonate a real browser:

agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari' }

After that, you can navigate the site, click away, read any forms etc.
For scraping, what I actually ended up using was to get the content of the “more offers URL” page and parse it using Nokogiri. Something like:

page = agent.get(more_offers_url)
root = Nokogiri::HTML(page.content.strip)
scrape_content(root)

For the current development stage, this is doing just fine. Unfortunately, for production use it will not suffice. There will probably be some traffic throttling from Amazon and some benchmarking will need to be done to determine the limits. Also, proxying the requests will probably be required too. But, I leave this for some other times.

The result of scraping the offers for picked items:

Dealesque picked items

Tagged , , , ,

Stubbing Amazon API calls using VCR

When integrating with external services, it is wise to test those interactions. But then again, it soon becomes tedious and slow if you need to repeat tests and wait for those external services responses. So VCR gem comes to the rescue! Much has been written about it so I want go into details, but what put a smile on my face was the solution for the integration test where I needed to test searching Amazon item listings using Vacuum gem. So to test my code I needed to a way to:

  • tell VCR to record the Vacuum gem request and Amazon API response
  • reuse it for subsequent test runs

Two things bothered me:

  • Vacuum uses Excon for HTTP layer
  • Amazon API calls are signed, making two identical search calls have different URI’s – the difference being e.g. with Timestamp part of it

So how does one hook into these layers for test purposes? Fortunately, VCR comes with solutions for both issues.

There is a hook_into VCR configuration option for Excon. Essentially this means VCR can intercept Vacuum calls, great! Configuration is simple, just add :excon hook in spec helper, like in the gist below.

For signed Amazon API requests, VCR magic was needed 🙂 As you probably know, VCR saves the request and response to appropriate cassettes. For tests within cassette it tries to match the request from the test to the saved ones by comparing HTTP method and URI, as explained here. I couldn’t use it since Amazon API requests are signed, remember? And the existing matchers were of no help either. But, VCR also allows for custom matchers. So, I created a custom matcher that compares search keywords from request uri and that was it!

Now, when running tests, on first run the VCR records the Amazon API request and response to the configured location (spec/fixtures/vcr_cassettes in my case). Subsequent runs reuse those calls. This is more than OK for development tasks. If one needs to refresh the Amazon API response, just delete the saved cassette(s) and the sequence is repeated. Another choice I had to make was whether to store those response in SCM or not? In the end I decided not to save them. Search action is not destructive or otherwise dangerous so any developer can repeat the process without cost. Mind you, in some other use case, e.g. when billing some action over payment gateway, it would probably be wise to store the response.

Tagged , , , , , ,