How to use pattern matching for two or more regular expressions in Scala

Question

I do not understand how to use pattern matching for two or more regular expressions. For example, I wrote the following program:

    import scala.io.Source.{fromInputStream}
    import java.io._
    import java.net._

    object craw {
      def main(args: Array[String]) {
        val url = new URL("http://contentexplore.com/iphone-6-amazing-looks/")
        val content = fromInputStream(url.openStream).getLines.mkString("\n")
        val x = "<a href=(\"[^\"]*\")[^<]".r
          .findAllIn(content)
          .toList
          .map(x => x.substring(16, x.length() - 2))
          .mkString("")
          .split("/")
          .mkString("")
          .split(".com")
          .mkString("")
          .split("www.")
          .mkString("")
          .split(".html")
          .toList
        print(x)
      }
    }

The program above reads the text of all anchor tags.

    import scala.io.Source.{fromInputStream}
    import java.io._
    import java.net._

    object new1 {
      def main(args: Array[String]) {
        val url = new URL("http://contentexplore.com/iphone-6-amazing-looks/")
        val content = fromInputStream(url.openStream).getLines.mkString("\n")
        val x = "<p>.*?</p>".r
          .findAllIn(content)
          .toList
          .map(x => x.substring(3, x.length() - 4))
          .mkString("")
          .split("</strong>")
          .mkString("")
          .split("</em>")
          .mkString("")
          .split(";")
          .mkString("")
          .split("<em>")
          .mkString("")
          .split("<strong>")
          .mkString("")
          .split("&nbsp")
          .toList
        print(x)
      }
    }

The program above reads the text of all paragraph tags.

I want to combine these two regular expressions into one program using pattern matching. Can anyone advise how to use two or more regular expressions this way?

NOTE: This question is about combining regular expressions, not about how to parse HTML effectively.

scala web-crawler

shashank May 26 '13 at 3:30

1 answer

Johnny everson · Accepted Answer · 2013-05-26T13:29:54+0000

As noted in the comments, parsing HTML with regular expressions (or with any other hand-rolled technique) is not recommended unless you are sure that you cannot or do not want to use an existing parser, for example jsoup.

For educational purposes, here is one way to combine regular expressions with pattern matching, using the regular expressions as extractors:

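    val LinkPattern = "<a href=(\"[^\"]*\")[^<]".r
    // capture group added: without one, `case ParagraphPattern(d)` can never match
    val ParagraphPattern = "<p>(.*?)</p>".r

    xmlNodeString match {
      case LinkPattern(c)      => // c is bound to the capture group here
      case ParagraphPattern(d) => // d is bound to the capture group here
      case _                   => // neither pattern matched
    }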

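Note: this assumes that each xmlNodeString holds exactly one node, so you will need to traverse the document and match the nodes one at a time. A regular expression used as an extractor succeeds only when it matches the entire string; call .unanchored on the pattern if it should match anywhere inside the string instead.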
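To make the idea concrete, here is a minimal, self-contained sketch of how the two patterns could be dispatched with a single match. The names `RegexDispatch`, `Node`, `Link`, `Paragraph`, `Other`, `classify`, `scan` and `samplePage` are made up for this illustration, and the sample page is invented rather than fetched from the URL in the question. The sketch assumes each paragraph sits on one line, since `.` does not match line breaks by default.

```scala
import scala.util.matching.Regex

object RegexDispatch {
  // Hypothetical result types, introduced only for this illustration.
  sealed trait Node
  final case class Link(href: String) extends Node
  final case class Paragraph(text: String) extends Node
  case object Other extends Node

  // Pattern sources: the link pattern from the question, and the paragraph
  // pattern with a capture group added so the extractor can bind a value.
  private val linkSource      = """<a href=("[^"]*")[^<]"""
  private val paragraphSource = """<p>(.*?)</p>"""

  val LinkPattern: Regex      = linkSource.r
  val ParagraphPattern: Regex = paragraphSource.r

  // Alternation of both patterns, used only to locate fragments in page order.
  val Combined: Regex = (linkSource + "|" + paragraphSource).r

  // Extractors are anchored: the fragment must match a pattern in full.
  def classify(fragment: String): Node = fragment match {
    case LinkPattern(quoted) =>
      // The capture group includes the surrounding quotes, so strip them.
      Link(quoted.stripPrefix("\"").stripSuffix("\""))
    case ParagraphPattern(text) => Paragraph(text)
    case _                      => Other
  }

  // Find every fragment matched by either pattern, then classify each one.
  def scan(page: String): List[Node] =
    Combined.findAllIn(page).map(classify).toList

  // Invented sample input standing in for the downloaded page.
  val samplePage: String =
    """<p>Intro</p> <a href="http://example.com/one">one</a> <p>Outro</p>"""

  def main(args: Array[String]): Unit =
    scan(samplePage).foreach(println)
}
```

Because `findAllIn` returns each fragment as an exact match of one alternative, the anchored extractors in `classify` are sufficient here. A paragraph that contains a link is reported as one `Paragraph`, because the link pattern cannot match a fragment that starts with `<p>`.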
