How to parse consecutive tags using Nokogiri?
I have this HTML:

    <div id="first">
      <dt>Label1</dt>
      <dd>Value1</dd>
      <dt>Label2</dt>
      <dd>Value2</dd>
      ...
    </div>

My code is not working:

    doc.css("first").each do |item|
      label = item.css("dt")
      value = item.css("dd")
    end

It shows all the <dt> tags and then all the <dd> tags, but I need "label: value" pairs.
First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:

    <div id="first">
      <dl>
        <dt>Label1</dt>
        <dd>Value1</dd>
        <dt>Label2</dt>
        <dd>Value2</dd>
        ...
      </dl>
    </div>

but that will not change how you parse it. You want to find the <dt> elements and iterate over them; for each <dt> you can use next_element to get its <dd>. Something like this:
    require 'nokogiri'

    doc = Nokogiri::HTML('<div id="first"><dl>...')
    doc.css('#first').search('dt').each do |node|
      # next_element skips whitespace text nodes and returns the next tag
      puts "#{node.text}: #{node.next_element.text}"
    end

This should work as long as the structure matches your example.
Assuming that some <dt> may have multiple <dd>, you want to find all the <dt> and then, for each one, find the <dd> elements that follow it up to the next <dt>. This is pretty easy to do in pure Ruby, but more fun to do using only XPath. ;)
Given this setup:

    require 'nokogiri'

    html = '<dl id="first">
      <dt>Label1</dt><dd>Value1</dd>
      <dt>Label2</dt><dd>Value2</dd>
      <dt>Label3</dt><dd>Value3a</dd><dd>Value3b</dd>
      <dt>Label4</dt><dd>Value4</dd>
    </dl>'
    doc = Nokogiri::HTML(html)

Using no XPath:
    doc.css('dt').each do |dt|
      dds = []
      n = dt.next_element
      # Collect consecutive <dd> siblings until the next <dt> or the end;
      # checking before appending also handles a <dt> with no <dd> at all
      while n && n.name == 'dd'
        dds << n
        n = n.next_element
      end
      p [dt.text, dds.map(&:text)]
    end
    #=> ["Label1", ["Value1"]]
    #=> ["Label2", ["Value2"]]
    #=> ["Label3", ["Value3a", "Value3b"]]
    #=> ["Label4", ["Value4"]]

Using a little XPath:
    doc.css('dt').each do |dt|
      # Group the following siblings into runs of the same tag name
      # and keep the nodes of the first run
      dds = dt.xpath('following-sibling::*').chunk { |n| n.name }.first.last
      p [dt.text, dds.map(&:text)]
    end
    #=> ["Label1", ["Value1"]]
    #=> ["Label2", ["Value2"]]
    #=> ["Label3", ["Value3a", "Value3b"]]
    #=> ["Label4", ["Value4"]]

Using lotsa XPath:
    doc.css('dt').each do |dt|
      # Number of <dt> siblings that come after this one
      ct = dt.xpath('count(following-sibling::dt)')
      # A following <dd> belongs to this <dt> if it has that same number of <dt> after it
      dds = dt.xpath("following-sibling::dd[count(following-sibling::dt)=#{ct}]")
      p [dt.text, dds.map(&:text)]
    end
    #=> ["Label1", ["Value1"]]
    #=> ["Label2", ["Value2"]]
    #=> ["Label3", ["Value3a", "Value3b"]]
    #=> ["Label4", ["Value4"]]

After looking at the other answer, here is a less efficient way of doing the same thing:
    require 'nokogiri'

    a = Nokogiri::HTML('<div id="first"><dt>Label1</dt><dd>Value1</dd><dt>Label2</dt><dd>Value2</dd></div>')

    dt = []
    dd = []

    # Collect the label and value texts into two parallel arrays
    a.css("#first").each do |item|
      item.css("dt").each { |t| dt << t.text }
      item.css("dd").each { |t| dd << t.text }
    end

    # Pair them up by position
    dt.each_index do |i|
      puts "#{dt[i]}: #{dd[i]}"
    end

In CSS, to select by ID you need to put the # symbol in front of the name. For a class, it is the . symbol.