Monday, June 29, 2009

Translating forbidden URLs

If you want to read Web pages written in a language you do not know, you either need to learn the language or use a translation service such as Google Translate. There are two problems with the second approach. First, some sites specifically prohibit translation engines from working directly on their content. You can, of course, save the pages locally and then submit them to the translation engine. Second, translation is limited to a fixed number of characters per request, so a long page has to be split into pieces that are fed to the translator one at a time.
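The splitting step could be sketched as follows. This is only an illustration: `split_into_chunks` is a hypothetical helper, not part of any library, and the limit you pass in is an assumption to check against the service you use. It breaks on whitespace so words stay intact, and it collapses runs of whitespace into single spaces.

```ruby
# Hypothetical helper: split text into pieces of at most `limit` characters.
# Breaks on whitespace where possible; whitespace runs become single spaces.
def split_into_chunks(text, limit)
  raise ArgumentError, 'limit must be positive' unless limit > 0

  chunks = []
  current = ''
  text.split(/\s+/).each do |word|
    next if word.empty?

    # A single word longer than the limit is cut into fixed-size slices.
    while word.length > limit
      unless current.empty?
        chunks << current
        current = ''
      end
      chunks << word[0, limit]
      word = word[limit..-1]
    end

    if current.empty?
      current = word
    elsif current.length + 1 + word.length <= limit
      # The word still fits in the current chunk (plus one space).
      current = "#{current} #{word}"
    else
      chunks << current
      current = word
    end
  end
  chunks << current unless current.empty?
  chunks
end
```

Each returned piece can then be sent to the translator separately and the results joined back together.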

Recently I needed to visit a Russian site, search it for content in Russian and translate what I found. The site disallowed Google, AltaVista and other well-known translation engines. I did not need the full web page, just the relevant parts, but the example illustrates the point.

1. Search for relevant content.


A. You will need


require 'rubygems'
require 'hpricot'
require 'iconv'
require 'open-uri'


B. Open desired URL

# npage is expected to hold the number of the results page to fetch
doc = Hpricot(open("http://www.xxxxx.ru:8568/?text=%DC+%22+&q=-2232&p=#{npage}"))


C. Search for a "DIV" with certain attributes, or for other tags

doc.search("//div[@class='title']").each do |elem|
  if elem.inner_html =~ /some form of pattern you want to search for/
    # ..........
  end
end




2. Normalize to UTF-8 if necessary, for example with Iconv

ic = Iconv.new('UTF-8','WINDOWS-1251')

# Your relevant content
puts ic.iconv(elem.inner_html)
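Iconv was removed from the standard library in Ruby 2.0, so on a newer interpreter the same conversion can be done with the built-in String encoding methods. The sketch below swaps in that technique; `to_utf8` is a hypothetical helper name, and it assumes the raw bytes really are Windows-1251, as in the Iconv call above.

```ruby
# Hypothetical helper: reinterpret raw bytes as `source_encoding`
# (assumed Windows-1251 here) and transcode them to UTF-8.
def to_utf8(raw, source_encoding = 'Windows-1251')
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Six Windows-1251 bytes spelling a short Russian word.
bytes = "\xCF\xF0\xE8\xE2\xE5\xF2".b
puts to_utf8(bytes)
```

If the page declares a different charset, pass that name as the second argument instead.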


3. Submit the text to the Google API for translation.

A. You will need

require 'rubygems'
require 'cgi'
require 'json'
require 'net/http'


B. Use Google Translate JSON API for RU->EN translation

base = 'http://ajax.googleapis.com/ajax/services/language/translate'


params = {
  :langpair => "RU|EN",
  :q => text_you_need_translated,
  :v => 1.0
}

query = params.map{ |k,v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')

# send get request
response = Net::HTTP.get_response( URI.parse( "#{base}?#{query}" ) )

json = JSON.parse( response.body )

if json['responseStatus'] == 200
  json['responseData']['translatedText']
else
  # responseDetails lives in the parsed JSON body, not in the HTTP headers
  raise StandardError, json['responseDetails']
end
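The response handling can also be wrapped in a small method so it can be exercised without touching the network. This is a sketch under assumptions: `extract_translation` is a hypothetical helper, and the sample body only mirrors the three fields the code above reads (`responseStatus`, `responseData`, `responseDetails`); a real response may carry more.

```ruby
require 'json'

# Hypothetical helper: return the translated text from a response body,
# or raise with the service's error details when the status is not 200.
def extract_translation(body)
  json = JSON.parse(body)
  unless json['responseStatus'] == 200
    raise StandardError, json['responseDetails'].to_s
  end
  json['responseData']['translatedText']
end

# Assumed response shape, built from the fields used above.
sample = '{"responseData":{"translatedText":"Hello"},' \
         '"responseDetails":null,"responseStatus":200}'
puts extract_translation(sample)
```

In the script above, the call would be extract_translation(response.body).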
