If you want to read Web pages written in a language you do not know, you either need to learn the language or use a translation service such as Google Translate. The second approach has two problems. First, some sites of interest specifically prohibit translation engines from working directly on the site content. You can, of course, save the pages locally and then submit them to the translation engine. Second, translation is limited to X number of characters, so a long page has to be split into pieces that are fed to the translator one at a time.
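The splitting mentioned above could look roughly like the sketch below. It is only an illustration: `split_into_chunks` and its `limit` parameter are hypothetical names, and the real limit depends on the translation service you use. The sketch breaks on whitespace so that words stay intact, and cuts a word only when that single word is longer than the limit.

```ruby
# Hypothetical helper (not from any library): split text into pieces of at
# most `limit` characters, breaking on whitespace where possible.
# Note: runs of whitespace are collapsed into single spaces.
def split_into_chunks(text, limit)
  raise ArgumentError, 'limit must be positive' unless limit > 0

  chunks = []
  current = ''
  text.split.each do |word|
    # A single word longer than the limit is cut into limit-sized slices.
    word.scan(/.{1,#{limit}}/m).each do |piece|
      if current.empty?
        current = piece
      elsif current.length + 1 + piece.length <= limit
        current = "#{current} #{piece}"
      else
        chunks << current
        current = piece
      end
    end
  end
  chunks << current unless current.empty?
  chunks
end

# Each element can then be submitted to the translator separately.
split_into_chunks('one two three four', 9).each { |chunk| puts chunk }
```

Because `String#length` counts characters rather than bytes, the limit is applied correctly to Cyrillic text once it is in UTF-8.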
Recently I needed to visit a Russian site, search for content in Russian, and translate it. The site disallowed Google, AltaVista and other well-known translation engines. I did not need the full webpage, only the relevant parts, but the example still illustrates the point.
1. Search for relevant content.
A. You will need
require 'rubygems'
require 'hpricot'
require 'iconv'
require 'open-uri'
B. Open desired URL
doc = Hpricot(open("http://www.xxxxx.ru:8568/?text=%DC+%22+&q=-2232&p=#{npage}"))
C. Search for a "DIV" with certain attributes, or for other tags
doc.search("//div[@class='title']").each do |elem|
if elem.inner_html =~ /some form of pattern you want to search for/
..........
2. Normalize to UTF-8 if necessary (here from Windows-1251) via Iconv
ic = Iconv.new('UTF-8','WINDOWS-1251')
# Your relevant content
puts ic.iconv(elem.inner_html)
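Iconv was removed from the Ruby standard library in Ruby 2.0. On Ruby 1.9 and later, the same conversion can be sketched with the built-in `String#encode`. The helper name `to_utf8` and the sample bytes below are only illustrative assumptions, not part of the original script.

```ruby
# Hypothetical helper: reinterpret raw bytes as Windows-1251 text and
# transcode them to UTF-8 using only core String methods.
def to_utf8(bytes, source_encoding = 'Windows-1251')
  # dup so that the caller's string is not relabelled in place
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# The six bytes below spell the Russian word "Привет" in Windows-1251.
raw = "\xCF\xF0\xE8\xE2\xE5\xF2"
puts to_utf8(raw)
```

In the loop above, `to_utf8(elem.inner_html)` would take the place of `ic.iconv(elem.inner_html)`.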
3. Submit the text to the Google API for translation.
A. You will need
require 'rubygems'
require 'cgi'
require 'json'
require 'net/http'
B. Use Google Translate JSON API for RU->EN translation
base = 'http://ajax.googleapis.com/ajax/services/language/translate'
params = {
  :langpair => "RU|EN",
  :q => text_you_need_translated,
  :v => 1.0
}
query = params.map{ |k,v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
# send get request
response = Net::HTTP.get_response( URI.parse( "#{base}?#{query}" ) )
json = JSON.parse( response.body )
if json['responseStatus'] == 200
  json['responseData']['translatedText']
else
  # the error details are in the parsed JSON body, not in the HTTP headers
  raise StandardError, json['responseDetails']
end
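To try out the query building and the response handling without touching the network, the two steps can be pulled into small functions. This is only a sketch: `build_query` and `extract_translation` are hypothetical helper names, and the sample response bodies are made up for illustration, using only the keys that the code above reads.

```ruby
require 'cgi'
require 'json'

# Hypothetical helper: turn a params hash into a URL-encoded query string.
def build_query(params)
  params.map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }.join('&')
end

# Hypothetical helper: return the translated text from a JSON response body,
# or raise with the details reported in that body.
def extract_translation(body)
  json = JSON.parse(body)
  unless json['responseStatus'] == 200
    raise StandardError, json['responseDetails'].to_s
  end
  json['responseData']['translatedText']
end

puts build_query(:langpair => 'RU|EN', :q => 'hello world', :v => 1.0)

# Made-up sample body, shaped after the keys used above.
sample = '{"responseData":{"translatedText":"Hello"},' \
         '"responseDetails":null,"responseStatus":200}'
puts extract_translation(sample)
```

`CGI.escape` turns the `|` in the language pair into `%7C` and spaces into `+`, so the resulting string can be appended to the base URL after a `?`.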