If this were my project, I would consider using an HTML parser, something like jsoup (although others are available). There is a tutorial on the jsoup website, and after playing with it for a while, you will most likely find it quite easy to use.
For example, for an HTML table, for example:
jsoup can parse it like this:
import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; public class TableEg { public static void main(String[] args) { String html = "http://publib.boulder.ibm.com/infocenter/iadthelp/v7r1/topic/" + "com.ibm.etools.iseries.toolbox.doc/htmtblex.htm"; try { Document doc = Jsoup.connect(html).get(); Elements tableElements = doc.select("table"); Elements tableHeaderEles = tableElements.select("thead tr th"); System.out.println("headers"); for (int i = 0; i < tableHeaderEles.size(); i++) { System.out.println(tableHeaderEles.get(i).text()); } System.out.println(); Elements tableRowElements = tableElements.select(":not(thead) tr"); for (int i = 0; i < tableRowElements.size(); i++) { Element row = tableRowElements.get(i); System.out.println("row"); Elements rowItems = row.select("td"); for (int j = 0; j < rowItems.size(); j++) { System.out.println(rowItems.get(j).text()); } System.out.println(); } } catch (IOException e) { e.printStackTrace(); } } }
The result of the following output:
headers ACCOUNT NAME BALANCE row 0000001 Customer1 100.00 row 0000002 Customer2 200.00 row 0000003 Customer3 550.00
source share