Tuesday, June 2, 2009

Google Books Spidering

I don't usually read books on Google Books. The scans are not of the best quality, and they do not let me conveniently print a page or two for offline reading. Google apparently discourages that. The site is supposed to be a "demo" that leads to a purchase, and I am fine with that.
This post is not meant to discuss the merits of "crippled" demos. I actually want to talk a little about the content protection Google chose to employ.

In my mind, if you are trying to prevent content sifting or crawling (and they obviously are, since there is a copyright notice on every page), you should evaluate more methods of protection than obfuscating the JavaScript code that fetches images of scanned pages into the browser. You should not rely on AJAX calls alone to defeat first-generation spidering (following href links). You should not allow incomplete URL parameter randomization, and you SHOULD tie requests to an existing session.
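To make "tie requests to an existing session" concrete, here is a minimal sketch of what I have in mind. It is not how Google signs its URLs (I don't know that); the secret key, the function names, and the session ids are all made up for illustration. The idea is only that the signature covers the session id as well as the book and page, so a URL copied out of one session is useless in another.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real one would live in secure configuration.
SECRET_KEY = b"example-secret-key"


def sign_page(session_id: str, book_id: str, page: str) -> str:
    """Return a signature that is valid only for this session, book, and page."""
    message = "\n".join([session_id, book_id, page]).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def verify_page(session_id: str, book_id: str, page: str, sig: str) -> bool:
    """Recompute the signature for the requesting session and compare in constant time."""
    expected = sign_page(session_id, book_id, page)
    return hmac.compare_digest(expected, sig)


# A signature issued to one (hypothetical) session does not verify for another.
sig = sign_page("session-A", "kq2dBNdAl3IC", "PA102")
print(verify_page("session-A", "kq2dBNdAl3IC", "PA102", sig))
print(verify_page("session-B", "kq2dBNdAl3IC", "PA102", sig))
```

With something like this, a plain GET replayed outside the original session would fail verification.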

So, on to the example.
Suppose I like the book Ruby by Example, I do not agree with Google's TOS, and I want to use the book's content for my own purposes.
Every page of the book scan I am interested in gets fetched with XHR from Google and rendered in the browser. Breaking on the request and following it around lands me at a JSON response in the following format.


Content-Type: application/javascript; charset=UTF-8
Server: OFE/0.1
Content-Length: 2496

{"page":[{"pid":"PR21","src":"http://books.google.com/books?id=kq2dBNdAl3IC\x26pg=PR21\x26img=1\x26zoom=3\x26hl=en\x26sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg","highlights":[{"X":370,"Y":51,"W":26,"H":11},{"X":139,"Y":93,"W":19,"H":10},{"X":218,"Y":119,"W":19,"H":10},{"X":352,"Y":186,"W":26,"H":11},{"X":230,"Y":214,"W":25,"H":11},{"X":417,"Y":255,"W":26,"H":11},{"X":493,"Y":269,"W":23,"H":11},{"X":370,"Y":449,"W":25,"H":11},{"X":402,"Y":490,"W":26,"H":11},{"X":139,"Y":585,"W":22,"H":11},{"X":320,"Y":614,"W":23,"H":11},{"X":146,"Y":681,"W":21,"H":9},{"X":158,"Y":690,"W":20,"H":9},{"X":139,"Y":699,"W":20,"H":9}],"flags":0,"order":22,"uf":"http://books.google.com/books_feedback?id=kq2dBNdAl3IC\x26spid=ygOBAha9Lj5wEmJbb7L0E4AMedYBAAAAEwAAAAvsLgsil0rRCj9QbBB0CmBqRC_Lik05VtZnyTK-XBfQ\x26ftype=0","vq":"ruby by


..... Many more entries go here.

This blob is processed by an obfuscated, long-named JS file, which puts it into the DOM and renders it in the browser. Let's say that part is irrelevant at the moment.

Look at the following snippet from the JSON response:


"src":"http://books.google.com/books?id=kq2dBNdAl3IC\x26pg=PR21\x26img=1\x26zoom=3\x26hl=en\x26sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg


OK, \x26 is really &. Otherwise, it's a valid URL for a zoom=3 image of page PR21 of the book with id=kq2dBNdAl3IC.

Also, there is a dynamic signature of the page at the end: sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg

Every page of this book has a different signature. However, look at the following two requests:
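For the curious, a quick sketch of that decoding step in Python. The helper names are mine, and all it does is swap the \x26 escape for & and split the query string; nothing here touches the network.

```python
from urllib.parse import parse_qs, urlsplit


def decode_src(raw: str) -> str:
    """Turn the JavaScript-style \\x26 escape back into a literal ampersand."""
    return raw.replace("\\x26", "&")


def page_params(url: str) -> dict:
    """Return the query parameters of a page-image URL as a flat dict."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


# The "src" value from the JSON response, kept raw so the backslashes survive.
raw = (
    r"http://books.google.com/books?id=kq2dBNdAl3IC\x26pg=PR21\x26img=1"
    r"\x26zoom=3\x26hl=en\x26sig=ACfU3U2ydqAZXhIBKIH1XKTJhS4Ay2IXkg"
)
params = page_params(decode_src(raw))
print(params["id"], params["pg"], params["zoom"])  # kq2dBNdAl3IC PR21 3
```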


http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA102&img=1&zoom=3&hl=en&sig=ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA103&img=1&zoom=3&hl=en&sig=ACfU3U2itwtHSRsi3gGA_1uqDFYlX76BqA


There is a non-random element at the beginning of the signature: both values start with ACfU3U. I am not going to go into how we could try to brute force or fuzz the signature here, or how to read the client-side JS file to figure out what that signature consists of. The point is that the content navigation is not tied to session cookies or any other UI navigation data. A simple GET on the URL fetches the image of a page.
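You can eyeball the shared part, or let Python do it. This small sketch just compares the two sig values from the requests above character by character; it says nothing about how the signature is actually built.

```python
import os

# The sig values from the two requests shown above.
signatures = [
    "ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ",  # pg=PA102
    "ACfU3U2itwtHSRsi3gGA_1uqDFYlX76BqA",  # pg=PA103
]

# os.path.commonprefix compares plain strings character by character.
prefix = os.path.commonprefix(signatures)
print(prefix, len(prefix))  # ACfU3U 6
```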

All you have to do now is set your favorite web proxy to log URLs for JPEGs conforming to ".*sig=ACfU3U.*" and iterate over the pages. You don't even need to capture the content yet.

Google does the job of fetching all the pages once you start mouse-scrolling in the book DIV. So you scroll through the whole book, then you go to your proxy log and pick up the following records (substituting \x26 -> &).
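If your proxy cannot filter by itself, the same pattern works on a saved log. A rough sketch, assuming a log with one URL per line; that format and the unrelated example.com entry are my assumptions, since real proxy logs vary.

```python
import re

# The same pattern as above, anchored on the fixed start of the signature.
SIG_PATTERN = re.compile(r".*sig=ACfU3U.*")


def matching_urls(log_lines):
    """Keep only the lines that look like signed page-image requests."""
    cleaned = (line.strip() for line in log_lines)
    return [line for line in cleaned if SIG_PATTERN.match(line)]


# Hypothetical log excerpt: one URL per line, with one unrelated request mixed in.
sample_log = [
    "http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA102&img=1&zoom=3&hl=en&sig=ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ\n",
    "http://example.com/unrelated.js\n",
    "http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA103&img=1&zoom=3&hl=en&sig=ACfU3U2itwtHSRsi3gGA_1uqDFYlX76BqA\n",
]
print(len(matching_urls(sample_log)))  # 2
```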


http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA102&img=1&zoom=3&hl=en&sig=ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA103&img=1&zoom=3&hl=en&sig=ACfU3U2itwtHSRsi3gGA_1uqDFYlX76BqA
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA105&img=1&zoom=3&hl=en&sig=ACfU3U0SJesKmEQ2HUl2ntgNVBIrLK7UHQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA106&img=1&zoom=3&hl=en&sig=ACfU3U3i-gOkxdtYfeGLd7CFsRGZiPnT_Q
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA107&img=1&zoom=3&hl=en&sig=ACfU3U0FbGnYvyAY2T6uGV9rA-bY0J4cvw
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA10&img=1&zoom=3&hl=en&sig=ACfU3U3B0rfiUmevGsmVHgLEDN3sxANqkg
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA11&img=1&zoom=3&hl=en&sig=ACfU3U3uXbNxXALDKMG-OZ2bEGVlzN3JaA
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA13&img=1&zoom=3&hl=en&sig=ACfU3U0Bb32Lu4L9KzlCRS1gbURVfNcklA
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA14&img=1&zoom=3&hl=en&sig=ACfU3U1HVLZyKZBfm9y01Ly-Lp6AEo7B8Q
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA15&img=1&zoom=3&hl=en&sig=ACfU3U3aVGlHL9Sph_ttbm7tfSWVNyyFMQ
http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA16&img=1&zoom=3&hl=en&sig=ACfU3U1CFrpu9LiQwuS1HIcsYu6qBrNppg


You now plug them into a script (curl will do, so will wget) to fetch the book's content.

Now, I have not researched it enough, but I wonder whether Watir, Selenium, or other browser automation frameworks could scroll the content for you and automate the process altogether.
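One wrinkle: the log records come out in whatever order the proxy sorted them, so PA102 lands before PA10. A small sketch for putting them in page order first, assuming every pg value is letters followed by digits, which is all I have seen in this log. The key function is my own helper, and prefixes sort alphabetically, so PR front matter would end up after PA pages.

```python
import re
from urllib.parse import parse_qs, urlsplit


def page_key(url: str):
    """Sort key: (page prefix, numeric page number) taken from the pg parameter."""
    pg = parse_qs(urlsplit(url).query)["pg"][0]
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", pg)
    if match is None:
        # Unexpected format: fall back to sorting by the raw value.
        return (pg, 0)
    return (match.group(1), int(match.group(2)))


# Three of the records listed above, in the order they appeared in the log.
urls = [
    "http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA102&img=1&zoom=3&hl=en&sig=ACfU3U0j7KKM_nSZ5HTwPQxpka2gDwJFsQ",
    "http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA10&img=1&zoom=3&hl=en&sig=ACfU3U3B0rfiUmevGsmVHgLEDN3sxANqkg",
    "http://books.google.com/books?id=kq2dBNdAl3IC&pg=PA11&img=1&zoom=3&hl=en&sig=ACfU3U3uXbNxXALDKMG-OZ2bEGVlzN3JaA",
]
ordered = sorted(urls, key=page_key)
print([page_key(url) for url in ordered])
```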

I don't encourage anyone to actually copy Google's content. Go buy the book if you like it, because the people who suffer most from copying are the authors.
However, the question here is: how does Google plan to protect my data tomorrow if it cannot protect something it makes money on today?
