So in a typical Web assessment, one of the things you normally do is crawl all or some portion of the target website. It just makes the big picture a bit clearer when looking at how the web app is structured. Conventional crawlers work recursively on href references. Recently I ran into a site I could not crawl. To be more specific, the site would let me crawl a few links and then stall me; other times it would send my crawler into an infinite loop. Either way, I could not accomplish what I came there to do.
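For context, a conventional crawler boils down to a recursive href scraper, something like this throwaway sketch (regex-based link extraction, depth-limited, no politeness; the start URL is a placeholder):

require 'open-uri'
require 'uri'

# Throwaway sketch of a conventional crawler: fetch a page, scrape
# href attributes, recurse. Depth-limited; no robots.txt handling.
def crawl(url, seen = {}, depth = 2)
  return if depth.zero? || seen[url]
  seen[url] = true
  begin
    html = URI.parse(url).read
  rescue StandardError
    return
  end
  html.scan(/href=["']([^"']+)["']/i).flatten.each do |link|
    absolute = (URI.join(url, link).to_s rescue nil)
    next unless absolute
    puts absolute
    crawl(absolute, seen, depth - 1)
  end
end

crawl('http://site.com/')   # placeholder start URL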
It appeared that a Web IDS module was timing the link requests, and if I went too fast it would shut me down for 10 minutes. Additionally, my crawler was near useless at navigating the dynamically generated buttons/links on the AJAX pages. I was a bit stuck. I don't mind manually testing a site, and in fact I prefer going near-manual once I get through the initial crawl, but I really did not want to spend the whole day manually punching in items (let's call them item numbers, as in a shopping-cart scenario) to get through all the listings.
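Since the IDS keyed on request timing, the obvious countermeasure for any automated approach is to pace requests with randomized delays. A minimal sketch; the 5 to 15 second window is my assumption, tune it to the target:

# Pace requests with a randomized delay so the traffic looks human and
# stays under a timing-based IDS threshold. The 5..15s window is a guess.
def polite_pause(min_s = 5, max_s = 15)
  sleep(min_s + rand(max_s - min_s + 1))
end

%w[/listing1 /listing2 /listing3].each do |path|
  puts "fetching #{path}"   # stand-in for the real request
  polite_pause
end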
Well, we are in Web 2.0. As much as I wanted to stick with good ol' techniques or fall back to lazy manual crawling, I wanted to try to automate the form submissions and searches. I needed browser automation, really.
First I wanted to drive IE via the PowerShell COM bridge. It's a nice solution and I will be looking at it in depth some time later. However, I went with the Watir framework to test my crawls and move on to more interesting stuff.
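Incidentally, the same COM bridge is reachable straight from Ruby via the standard win32ole library, which is what Watir itself wraps. A minimal sketch (the URL is a placeholder):

require 'win32ole'

# Driving IE directly over COM from Ruby, the same bridge PowerShell
# would use, and the one Watir builds on.
ie = WIN32OLE.new('InternetExplorer.Application')
ie.visible = true
ie.navigate('http://site.com/index.cfm?contentID=21&type=1')
sleep 0.5 while ie.busy || ie.readyState != 4   # 4 == READYSTATE_COMPLETE
puts ie.document.title
ie.quit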
In IE (don't you love it when a site is only interested in MS-made browsers...) the following was accomplished. Suppose there's a search field and a type of search to perform; I was interested in crawling through the people in the catalog.
I also had a driver script, but the search script can be invoked manually like so:
C:\Users\dxs\Tools\Ruby\bin\ruby.exe .\wsearch.rb LN "Blakkenship"
LN selects the LastName search field (KW and FN map to keyword and first name).
Code:
require 'watir'
require 'watir/close_all'

# Map the command-line format flag to the name of the search form field.
unless ARGV.length == 2
  puts "Error in arguments: specify Format (KW, LN, FN, etc.) and Search Criteria"
  exit(1)
end

criteria = ARGV[1]
form = case ARGV[0]
       when "KW" then "formKeyword"
       when "LN" then "formLastname"
       when "FN" then "formFirstname"
       else raise "Invalid Format Argument: need KW, LN, FN, etc."
       end

browser = Watir::IE.new
site = "http://site.com/index.cfm?contentID=21&type=1"
browser.goto site

puts "Searching through for: #{form} , criteria: #{criteria}"
browser.text_field(:name, form).set criteria
browser.button(:name, "Search").click
puts "\n\n"

# Print every link on the results page whose text matches the criteria.
browser.links.each { |l| puts l.href + " ---> " + l.text if l.text =~ /#{criteria}/i }

Watir::IE.close_all
Output:
http://url.reference.to.entities?here&it=comes ---> Mnemonic Name
I have to admit, instrumenting browsers is slow, and I was nowhere near as fast as with a conventional crawler. Then again, the code is crude: no threading, no optimization, just plain "hacked-up" in a hurry. Seven hours later, and 100% under the IDS radar, I had emulated a human browsing the site. I had a nice database of stuff to work with come morning...
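The script above only prints matches; a hypothetical extension (the file name and columns are my own) that appends each hit to a CSV is what turns an overnight run into that morning dataset:

require 'csv'

# Hypothetical helper: append each matched link to a CSV so the results
# of an overnight run are waiting in the morning. Names are placeholders.
def record_match(href, text, file = 'crawl_results.csv')
  CSV.open(file, 'a') { |csv| csv << [Time.now, href, text] }
end

record_match('http://site.com/person?id=42', 'Mnemonic Name')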
There is also FireWatir for driving Firefox, in case it's needed.
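Its usage mirrors the IE version almost line for line; note that FireWatir of that era needs the JSSh extension installed in Firefox. A rough sketch, reusing the field and button names from the IE script above:

require 'firewatir'

# Same search, driven through Firefox instead of IE.
browser = FireWatir::Firefox.new
browser.goto('http://site.com/index.cfm?contentID=21&type=1')
browser.text_field(:name, 'formLastname').set('Blakkenship')
browser.button(:name, 'Search').click
browser.links.each { |l| puts "#{l.href} ---> #{l.text}" }
browser.close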