I have found simple regular expressions to be very intuitive and easy to use when working with good websites, and IMDB is a good website.
For example, the movie rating on the HTML page of an IMDB movie sits in a <DIV> with class="star-box-giga-star". This is VERY easy to extract using regex. The following regex extracts the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It is not very pretty, but it does the job. The regular expression searches for the class identifier star-box-giga-star, then skips ahead to the > that closes the opening DIV tag, and then captures everything up to the next < .
To create a new regular expression, use a web browser that lets you inspect elements (such as Chrome or Opera). In Chrome, open the webpage, right-click the element you want to capture, choose Inspect element, and look for easily identifiable attributes that you can build a good regular expression around. In this case the "star-box-giga-star" class is obviously easy to identify! You usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS needs IDs or classes to style elements correctly.
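As a minimal sketch of how the regex above could be applied, here is a short Python example. The function name extract_rating and the sample_html string are made up for illustration; the sample is not real IMDB markup, and the actual page structure may differ or change over time.

```python
import re

# The regex from the answer: find the class name, skip to the end of
# the opening tag, then capture everything up to the next '<'.
RATING_RE = re.compile(r'star-box-giga-star[^>]*>([^<]*)<')

def extract_rating(html):
    """Return the text captured in group 1, or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    match = RATING_RE.search(html)
    if match is None:
        return None
    # Surrounding whitespace is part of the capture, so strip it.
    return match.group(1).strip()

# Made-up HTML fragment standing in for a downloaded movie page.
sample_html = '<div class="titlePageSprite star-box-giga-star"> 8.3 </div>'
print(extract_rating(sample_html))
```

Returning None on a failed match makes it easy to notice when the site changes its markup and the pattern needs to be rebuilt with the element inspector.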