Tuesday, May 26, 2009

Dynamic Link Crawler

So in a typical Web assessment, one of the things one normally does is crawl all or some portion of a target website. It just makes the "big" picture a bit clearer when looking at how the web app is structured. Conventional crawlers follow href references, recursively. Recently I ran into a site I could not crawl. To be more specific, sometimes the site would let me crawl a few links and then stall me; other times it would send my crawler into an infinite loop. Either way, I could not accomplish what I came there to do.
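By "conventional" I mean something along these lines: fetch a page, scrape the href values out of it, and recurse. A minimal sketch (the start URL and depth cap below are made up for illustration):

require 'net/http'
require 'uri'

# Naive recursive href crawler: fetch, extract links, recurse until depth runs out.
def crawl(url, depth = 2, seen = {})
  return if depth.zero? || seen[url]
  seen[url] = true
  html = Net::HTTP.get(URI(url)) rescue return
  html.scan(/href=["']([^"'#]+)["']/i).flatten.each do |link|
    absolute = URI.join(url, link).to_s rescue next
    puts absolute
    crawl(absolute, depth - 1, seen)
  end
end

crawl('http://site.com/')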

It appeared that a Web IDS module was timing the link requests, and if I went too fast it would shut me down for 10 minutes. Additionally, my crawler was near useless at navigating the dynamically generated buttons/links in the AJAX pages. I was a bit stuck. I don't mind manually testing a site, and in fact, I prefer going near-manual once I get through the initial crawl. But I really did not want to spend the whole day manually punching in items (let's call them item numbers, as in a shopping cart scenario) to get through all the listings.
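Pacing the requests handles the timing half of the problem. Something like this keeps the request rate looking human (the 5-15 second window is a guess at what the IDS tolerated, not a measured value), but it does nothing for the dynamically generated links:

require 'net/http'
require 'uri'

# Randomized delay between requests so the timing pattern looks human.
urls = ['http://site.com/index.cfm?contentID=21&type=1']   # placeholder list

urls.each do |url|
  body = Net::HTTP.get(URI(url))
  puts "#{url} : #{body.length} bytes"
  sleep(5 + rand(11))   # pause 5-15 seconds between fetches
end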

Well, we are in Web 2.0. As much as I wanted to stick with the good-ol' techniques or fall back to lazy manual crawling, I wanted to try to automate the form submissions and searches. I needed browser automation, really.

First I wanted to drive IE via the PowerShell COM bridge. It's a nice solution and I will be looking at it in depth some time later. However, I went with the Watir framework to test my crawls and move on to more interesting stuff.
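For the curious, the same InternetExplorer.Application COM object that PowerShell would script is reachable straight from Ruby through win32ole (which is also what Watir wraps underneath). A rough sketch, nothing more:

require 'win32ole'

# Drive the IE COM object directly -- the URL here is just the example target.
ie = WIN32OLE.new('InternetExplorer.Application')
ie.visible = true
ie.navigate('http://site.com/index.cfm?contentID=21&type=1')
sleep 0.5 while ie.busy || ie.readyState != 4   # 4 == READYSTATE_COMPLETE
puts ie.document.title
ie.quit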

In IE (don't you love it when a site only works with MS-made browsers...) the following was accomplished. Suppose there's a search field and a type of search to perform. I was interested in crawling through the people in the catalog.

I also had a driver script (a rough sketch of the idea follows the output below), but manually the search can be invoked as:

C:\Users\dxs\Tools\Ruby\bin\ruby.exe .\wsearch.rb LN "Blakkenship"
LN selects the LastName search field.


Code:

require 'watir'
require 'watir/close_all'

# Map the command-line format flag (KW/LN/FN) to the site's form field name.
if ARGV.length == 2
  criteria = ARGV[1]
  form = case ARGV[0]
         when "KW" then "formKeyword"
         when "LN" then "formLastname"
         when "FN" then "formFirstname"
         else
           raise "Invalid Format Argument: need KW, LN, FN, etc."
         end
else
  puts "Error in arguments: specify Format (KW, LN, FN, etc.) and Search Criteria"
  exit(1)
end

site = "http://site.com/index.cfm?contentID=21&type=1"

# Drive IE through Watir: load the page, fill the search field, submit.
browser = Watir::IE.new
browser.goto site

puts "Searching through for: #{form} , criteria: #{criteria}"
browser.text_field(:name, form).set criteria
browser.button(:name, "Search").click
puts "\n\n"

# Dump every link whose text matches the search criteria.
browser.links.each { |l| puts l.href + "--->" + l.text if l.text =~ /#{criteria}/io }

Watir::IE.close_all



Output:

http://url.reference.to.entities?here&it=comes -> Mnemonic Name
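
The driver script mentioned earlier was nothing fancy; the idea is just a loop over a list of last names feeding wsearch.rb, paced the same way. Roughly (the file name and delay below are placeholders):

# Hypothetical driver sketch: feed each last name in a file to wsearch.rb.
File.readlines('lastnames.txt').map(&:strip).each do |name|
  system('ruby', 'wsearch.rb', 'LN', name)
  sleep(10 + rand(20))   # keep the pace human-looking for the IDS
end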

I have to admit, instrumenting browsers is slow, and I was not nearly as fast as with a conventional crawler. Then again, the code is crude: no threading, no optimization, just plain "hacked-up" in a hurry. Seven hours later, and 100% under the IDS radar, I had emulated a human browsing the site. I had a nice database of stuff to work with come morning...


There is also FireWatir for driving Firefox in case it's needed.
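The switch is mostly a matter of swapping the browser class. A minimal sketch, assuming the FireWatir gem (and the JSSh Firefox extension it needs) is installed:

require 'firewatir'

# Same search flow as the IE version, driving Firefox through JSSh instead.
browser = FireWatir::Firefox.new
browser.goto 'http://site.com/index.cfm?contentID=21&type=1'
browser.text_field(:name, 'formLastname').set 'Blakkenship'
browser.button(:name, 'Search').click
browser.links.each { |l| puts "#{l.href} ---> #{l.text}" }
browser.close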
