How to parse HTML tags in matlab using regexp?

Question


I am short on time and specifically want to extract a string like the one shown below. The problem is that the tag is not of the simple <a> data </a> form.

Given

 s = '<em style="font-size:medium"> 5,888 </em>';

how can I extract only 5888 in MATLAB?

+4

html regex tags matlab

stackoverflow Nov 05 '12 at 6:42


2 answers

You can find useful information here or here; both came from the first page of Google results, and searching would have been faster than asking a question here.

Anyway, a quick and dirty way: locate the < and > characters and keep only the text that sits between a closing > and the next opening <:

 >> s = '<em style="font-size:medium"> 5,888 </em> <sometag> test </sometag>';
 >> a = regexp(s, '[<>]');
 >> s( cell2mat(arrayfun(@(x,y)x:y, a(2:2:end-1)+1, a(3:2:end)-1, 'uni',false)) )
 ans =
  5,888   test

Or, a bit more robust and cleaner: replace every tag (the angle brackets and everything between them) with an empty string:

 >> s = regexprep(s, '<.*?>', '')
 s =
  5,888   test
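To get from the stripped text to the number 5888 that the question asks for, one possible follow-up is to trim the whitespace, drop the thousands separator, and convert the result. This is a minimal sketch, assuming the string holds only the single <em> element from the question; the variable names txt and n are illustrative.

```matlab
% Sample input from the question (a single <em> element is assumed).
s = '<em style="font-size:medium"> 5,888 </em>';

% Remove every tag; the lazy quantifier .*? stops at the first '>'.
txt = regexprep(s, '<.*?>', '');

% Trim surrounding whitespace and drop the thousands separator.
txt = strrep(strtrim(txt), ',', '');

% Convert to a number; str2double returns NaN if txt is not numeric.
n = str2double(txt);
```

If the string contains several tagged values, as in the two-tag example above, the stripped text would have to be split first (for instance on whitespace) before converting each piece.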

+3

Rody Oldenhuis Nov 05 '12 at 6:52


stackoverflow · Accepted Answer · 2012-11-05T08:26:29+0000

Thank you all for your help. I am mainly trying to get the population of a US county in MATLAB. I thought I would share my code, although it is not the most elegant. It may help some soul. :)

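 county = 'morris';
 state = 'ks';
 % Replace spaces so multi-word county names are valid in the query string.
 county = strrep(county, ' ', '+');
 str = sprintf('https://www.google.com/search?&q=population+%s+%s', county, state);
 s = urlread(str);
 % Capture the text inside every <em ...> ... </em> element.
 pop = regexp(s, '<em[^>]*>(.*?)</em>', 'tokens');
 pop = char(pop{:});
 % Drop the thousands separator and convert to a number.
 pop = strrep(pop, ',', '');
 pop = str2num(pop);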
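The 'tokens' option is what does the work in the code above: instead of the whole match, regexp returns only the text captured by the parenthesised group. A minimal offline sketch of that step follows. The sample string is hypothetical and merely stands in for the downloaded page, and the 'once' option and the empty-match guard are additions that are not in the original code.

```matlab
% Hypothetical page fragment standing in for the output of urlread.
s = 'Population: <em style="font-size:medium"> 5,888 </em> (2011)';

% With 'tokens' and 'once', regexp returns the captured groups of the
% first match only, as a cell array (empty if nothing matched).
tok = regexp(s, '<em[^>]*>(.*?)</em>', 'tokens', 'once');

if isempty(tok)
    % No <em> element found; signal this with NaN.
    pop = NaN;
else
    % tok{1} is the captured text; trim it, drop the comma, convert.
    pop = str2double(strrep(strtrim(tok{1}), ',', ''));
end
```

Taking only the first match avoids building a multi-row character array when the page contains several <em> elements, at the cost of assuming that the first one is the value of interest.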
