Getting Wikipedia infoboxes into a format that Ruby can understand

I'm trying to get data from a Wikipedia infobox into a hash or something similar so that I can use it in my Ruby on Rails application. In particular, I'm interested in Infobox company and Infobox person. The example I'm working with is the Ford Motor Company. I want to get the company information for it, and the person information for the people associated with it in that infobox.

I tried to figure out how to do this with the Wikipedia or DBpedia APIs, but I had no luck. I know Wikipedia can return formats like JSON, which I could parse with Ruby, but I could not figure out how to get just the infobox. With DBpedia, I'm lost as to how to even query it for information about the Ford Motor Company.

4 answers

I will vote for DBpedia.

A simple explanation:

The DBpedia naming scheme is http://dbpedia.org/resource/WikipediaArticleName (the unique identifier), with spaces replaced by _ .

http://dbpedia.org/page/ArticleName is the HTML view of an article, and http://dbpedia.org/data/ArticleName.json (or .jsod) is the JSON view of its data. (.rdf and the other formats may be confusing for you right now.)

For the Ford Motor Company you would request:

 http://dbpedia.org/data/Ford_Motor_Company.json 

or

 http://dbpedia.org/data/Ford_Motor_Company.jsod 

(whichever is easier for you)

Now, depending on the type of article (person or company), there are different properties describing it, defined by the DBpedia ontology ( http://wiki.dbpedia.org/Ontology ).
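
A minimal sketch in Ruby of reading that JSON view, assuming it keeps DBpedia's usual RDF/JSON shape (a hash keyed by resource URI, where each property URI maps to an array of value hashes). The ontology property names used below (foundingDate, keyPerson) are assumptions; inspect the keys of the returned hash to see what the article actually provides.

    require 'open-uri'   # on Ruby 3+ use URI.open instead of open
    require 'json'

    url      = 'http://dbpedia.org/data/Ford_Motor_Company.json'
    resource = 'http://dbpedia.org/resource/Ford_Motor_Company'

    # Fetch and parse the JSON view, then pick out the Ford resource itself.
    data = JSON.parse(open(url).read)
    ford = data[resource] || {}

    # Each property URI maps to an array of hashes with a "value" key.
    # These property URIs are assumptions; check ford.keys for what exists.
    founded    = ford['http://dbpedia.org/ontology/foundingDate']
    key_people = ford['http://dbpedia.org/ontology/keyPerson']

    puts founded.map    { |v| v['value'] } if founded
    puts key_people.map { |v| v['value'] } if key_people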

A more advanced step would be to use SPARQL queries to retrieve your data.
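
If you go that route, the public endpoint at http://dbpedia.org/sparql takes a query parameter and a format parameter. A rough sketch, again treating the keyPerson property as an assumption:

    require 'open-uri'
    require 'json'
    require 'cgi'

    # Ask DBpedia for the key people of Ford Motor Company.
    query = <<-SPARQL
      SELECT ?person WHERE {
        <http://dbpedia.org/resource/Ford_Motor_Company>
          <http://dbpedia.org/ontology/keyPerson> ?person .
      }
    SPARQL

    url = 'http://dbpedia.org/sparql?query=' + CGI.escape(query) +
          '&format=' + CGI.escape('application/sparql-results+json')

    results = JSON.parse(open(url).read)
    results['results']['bindings'].each do |row|
      puts row['person']['value']
    end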


Do not try to parse HTML with regular expressions.

See: RegEx match open tags except XHTML self-contained tags

Use XPath or something similar.
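
For example, with Nokogiri you could walk the infobox rows via XPath instead of a regex. This is only a sketch and assumes the infobox is a table whose class attribute contains "infobox", as on the Ford article:

    require 'open-uri'
    require 'nokogiri'

    doc = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/Ford_Motor_Company'))

    # Pair each header cell with the data cell in the same infobox row.
    doc.xpath('//table[contains(@class, "infobox")]//tr').each do |row|
      header = row.at_xpath('./th')
      value  = row.at_xpath('./td')
      puts "#{header.text.strip}: #{value.text.strip}" if header && value
    end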


I looked at their APIs, and while there seems to be plenty there, the complexity is an obstacle. For long-term use it would be worth figuring them out, but for something quick and dirty, here is a way to get at the data.

I am using Nokogiri, which is an XML/HTML parser and very flexible. For convenience I am using CSS selectors.

    #!/usr/bin/env ruby

    require 'open-uri'
    require 'nokogiri'
    require 'uri'
    require 'ap'

    URL = 'http://en.wikipedia.org/wiki/Ford_Motor_Company'

    doc = Nokogiri::HTML(open(URL))

    # The infobox is the table with class "infobox vcard".
    infobox = doc.at('table[class="infobox vcard"]')
    infobox_caption = infobox.at('caption').text

    # Collect the people linked from the "agent" cells, resolving their
    # relative hrefs against the page URL.
    uri = URI.parse(URL)
    infobox_agents = Hash[
      *infobox.search('td.agent a').map { |a|
        [a.text, uri.merge(a['href']).to_s]
      }.flatten
    ]

    ap infobox_caption
    ap infobox_agents

The result is as follows:

 "Ford Motor Company" { "Henry Ford" => "http://en.wikipedia.org/wiki/Henry_Ford", "William C. Ford, Jr." => "http://en.wikipedia.org/wiki/William_Clay_Ford,_Jr.", "Executive Chairman" => "http://en.wikipedia.org/wiki/Chairman", "Alan R. Mulally" => "http://en.wikipedia.org/wiki/Alan_Mulally", "President" => "http://en.wikipedia.org/wiki/President", "CEO" => "http://en.wikipedia.org/wiki/Chief_executive_officer" } 

So it pulls out the caption text and returns a hash of the people associated with the company, where the keys are their names and the values are the URLs of their Wikipedia pages.


You can use open-uri to load the HTML of a single wiki page and then parse it with a Regexp. Take a look:

    require 'open-uri'

    infobox = {}

    open('http://en.wikipedia.org/wiki/Wikipedia') do |page|
      page.read.scan(/<th scope="row" style="text-align:left;">(.*?)<\/th>.<td class="" style="">(.*?)<\/td>/m) do |key, value|
        # Removes tags (such as hyperlinks) from the captured text
        infobox[key.gsub(/<.*?>/, '').strip] = value.gsub(/<.*?>/, '').strip
      end
    end

    infobox["Slogan"]                #=> "The free encyclopedia that anyone can edit."
    infobox["Available language(s)"] #=> "257 active editions (276 in total)"

There must be a better way, but it does work.

