Using Nokogiri to Separate Content on BR Tags

Question

I have an HTML snippet that I am trying to parse with Nokogiri. It looks like this:

    <td class="j">
        <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
        <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
        <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
    </td>

I have access to the source of td.j using something like this: data_items = doc.css("td.j")

My goal is to split each of these lines into an array of hashes. The only logical split point that I see is to split by BR and then use some regular expression in the string.

I was wondering if there is a better way to do this, perhaps using only Nokogiri? Even if I could use Nokogiri to pull out the 3 rows, that would make my job easier, since I could then run regular expressions on each .content result.

I'm not sure how to use Nokogiri to grab the lines ending in br, though. Should I be using XPaths? Any direction is appreciated. Thanks!

ruby parsing xpath screen-scraping nokogiri

Mario zigliotto Aug 14 '11 at 18:50

2 answers

I'm not sure what the point of an array of hashes is here, and without an example of the output you want I can't suggest one. However, to split the text on the <br> tags, I would do it like this:

    require 'nokogiri'

    doc = Nokogiri::HTML('<td class="j"> <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br> <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br> <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br> </td>')

    # Replace every <br> with a newline so the cell text can be split on it.
    doc.search('br').each do |n|
      n.replace("\n")
    end

    # Select the cell with 'td.j'; there is no <tr> with class "j".
    doc.at('td.j').text.split("\n")
    # => ["", " Link 1 (info1), Blah 1,", "Link 2 (info1), Blah 1,", "Link 3 (info2), Blah 1 Foo 2,"]

This gets you closer to a hash:

    Hash[*doc.at('td.j').text.split("\n")[1 .. -1].map{ |t| t.strip.split(',')[0 .. 1] }.flatten]
    # => {"Link 1 (info1)"=>" Blah 1", "Link 2 (info1)"=>" Blah 1", "Link 3 (info2)"=>" Blah 1 Foo 2"}

the tin man Aug 14 '11 at 21:26

mu is too short · Accepted Answer · 2011-08-14T20:10:37+0000

If your data really is that regular and you don't need the attributes from the <a> elements, you could parse the text form of each table cell without worrying about the <br> elements at all.

Given some HTML like this in a variable named html:

    <table>
        <tbody>
            <tr>
                <td class="j">
                    <a title="title text1" href="http://link1.com">Link 1</a> (info1), Blah 1,<br>
                    <a title="title text2" href="http://link2.com">Link 2</a> (info1), Blah 1,<br>
                    <a title="title text2" href="http://link3.com">Link 3</a> (info2), Blah 1 Foo 2,<br>
                </td>
                <td class="j">
                    <a title="title text1" href="http://link4.com">Link 4</a> (info1), Blah 2,<br>
                    <a title="title text2" href="http://link5.com">Link 5</a> (info1), Blah 2,<br>
                    <a title="title text2" href="http://link6.com">Link 6</a> (info2), Blah 2 Foo 2,<br>
                </td>
            </tr>
            <tr>
                <td class="j">
                    <a title="title text1" href="http://link7.com">Link 7</a> (info1), Blah 3,<br>
                    <a title="title text2" href="http://link8.com">Link 8</a> (info1), Blah 3,<br>
                    <a title="title text2" href="http://link9.com">Link 9</a> (info2), Blah 3 Foo 2,<br>
                </td>
                <td class="j">
                    <a title="title text1" href="http://linkA.com">Link A</a> (info1), Blah 4,<br>
                    <a title="title text2" href="http://linkB.com">Link B</a> (info1), Blah 4,<br>
                    <a title="title text2" href="http://linkC.com">Link C</a> (info2), Blah 4 Foo 2,<br>
                </td>
            </tr>
        </tbody>
    </table>

You can do this:

    require 'nokogiri'

    doc = Nokogiri::HTML(html)
    chunks = doc.search('.j').map { |td| td.text.strip.scan(/[^,]+,[^,]+/) }

and have this:

    [
        [ "Link 1 (info1), Blah 1", "Link 2 (info1), Blah 1", "Link 3 (info2), Blah 1 Foo 2" ],
        [ "Link 4 (info1), Blah 2", "Link 5 (info1), Blah 2", "Link 6 (info2), Blah 2 Foo 2" ],
        [ "Link 7 (info1), Blah 3", "Link 8 (info1), Blah 3", "Link 9 (info2), Blah 3 Foo 2" ],
        [ "Link A (info1), Blah 4", "Link B (info1), Blah 4", "Link C (info2), Blah 4 Foo 2" ]
    ]

in chunks. You can then convert that into whatever hash form you need.
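As one possible conversion, each chunk string can be matched against a small pattern with named captures. This is only a sketch: the LINE constant and the keys :name, :info and :desc are hypothetical, and the sample chunks are copied from the first two cells of the output shown in this answer. The pattern tolerates leading whitespace, because scan can leave the whitespace that followed a comma at the start of later entries.

```ruby
# Sample data: the first two cells of the `chunks` output shown above.
chunks = [
  ['Link 1 (info1), Blah 1', 'Link 2 (info1), Blah 1', 'Link 3 (info2), Blah 1 Foo 2'],
  ['Link 4 (info1), Blah 2', 'Link 5 (info1), Blah 2', 'Link 6 (info2), Blah 2 Foo 2']
]

# Hypothetical pattern: "<name> (<info>), <desc>", ignoring surrounding whitespace.
LINE = /\A\s*(?<name>.+?)\s*\((?<info>[^)]*)\)\s*,\s*(?<desc>.*?)\s*\z/m

# Build one hash per entry; entries that do not match the pattern are skipped.
records = chunks.flatten.filter_map do |line|
  m = LINE.match(line)
  next unless m

  { name: m[:name], info: m[:info], desc: m[:desc] }
end

records.each { |record| p record }
```

Dropping the flatten call and mapping over each inner array instead would keep the per-cell grouping, if that matters for the final structure. Note that filter_map needs Ruby 2.7 or later.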
