What does HTML parsing mean?

I have heard of HTML parser libraries such as Simple HTML DOM and HTML Parser. I have also seen questions that mention HTML parsing. What does HTML parsing mean?

+6
2 answers

Unlike what Spudley said, parsing basically means resolving something (a sentence) into its component parts and describing their syntactic roles.

According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).

In your case, HTML parsing basically means taking in HTML code and extracting relevant information such as the page title, paragraphs, headings, links, bold text, etc.
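To make "extracting relevant information" concrete, here is a minimal sketch that pulls the page title and link targets out of an HTML string using Python 3's standard html.parser module. The class name TitleAndLinkExtractor and the helper extract_title_and_links are made up for this illustration; only HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class TitleAndLinkExtractor(HTMLParser):
    """Collects the <title> text and every <a href="..."> value (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text only counts as the title while we are inside <title>...</title>
        if self._in_title:
            self.title += data


def extract_title_and_links(html):
    """Hypothetical helper: returns (title, list_of_hrefs) for an HTML string."""
    extractor = TitleAndLinkExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.title, extractor.links
```

Calling extract_title_and_links on a page's HTML returns the title text and the href values in document order. Real-world pages are messier than this sketch assumes, which is why dedicated libraries exist.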

Parsers:

A computer program that parses content is called a parser. In general, there are two types of parsers:

Top-down parsing. Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.

Bottom-up parsing. A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.
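The difference between the two approaches is easier to see on a toy grammar. The sketch below is only an illustration, not a general parser: it assumes a tiny language of whole numbers joined by "+", takes the input as an already-tokenized list of strings, and uses made-up function names. Both functions build the same nested-tuple parse tree, one by expanding a grammar rule from the top, the other by shifting tokens onto a stack and reducing them.

```python
def parse_top_down(tokens):
    """Top-down: start from the rule  expr -> NUM ('+' NUM)*  and expand it."""
    pos = 0

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdecimal():
            raise SyntaxError(f"expected a number at position {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    tree = number()
    while pos < len(tokens) and tokens[pos] == "+":
        pos += 1  # consume '+'
        tree = ("+", tree, number())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree


def parse_bottom_up(tokens):
    """Bottom-up: shift tokens onto a stack, reduce  expr '+' NUM  to expr."""
    stack = []
    for tok in tokens:
        # Shift: push the next token (numbers are pushed as ints)
        if tok.isdecimal():
            stack.append(int(tok))
        elif tok == "+":
            stack.append(tok)
        else:
            raise SyntaxError(f"unexpected token {tok!r}")
        # Reduce: replace "operand + operand" on top of the stack with one node
        if (len(stack) >= 3 and stack[-2] == "+"
                and stack[-1] != "+" and stack[-3] != "+"):
            right = stack.pop()
            stack.pop()  # discard the '+'
            left = stack.pop()
            stack.append(("+", left, right))
    # A valid input reduces to exactly one item that is not a bare '+'
    if len(stack) != 1 or stack[0] == "+":
        raise SyntaxError("input does not reduce to a single expression")
    return stack[0]
```

For the token list ["1", "+", "2", "+", "3"], both functions return ("+", ("+", 1, 2), 3). Production LL and LR parsers are driven by tables or generated code rather than hand-written checks like these.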

Some parsers:

Top-down parsers:

Bottom-up parsers:

Example parser:

Here is an example HTML parser in Python:

# Python 3: the module is html.parser (it was named HTMLParser in Python 2)
from html.parser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Here is the output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html


+10

Parsing in general applies to any computer language. It is the process of taking code as text and producing a structure in memory that the computer can understand and work with.

Specifically for HTML, parsing is the process of taking raw HTML code, reading it, and generating a DOM tree object structure from it.
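As a rough sketch of what "generating a tree of objects" means, the following builds a very simplified tree from an HTML string with Python 3's standard html.parser module. Node, TreeBuilder, build_tree and outline are names invented for this example, and the result is a stand-in for a real DOM, not an implementation of it. It assumes well-formed input where every start tag has a matching end tag, so unclosed void elements such as a bare <br> would nest incorrectly.

```python
from html.parser import HTMLParser


class Node:
    """One element in the simplified tree (illustrative, not a real DOM node)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Turns start/end tag events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        # A start tag opens a new child of the current node and descends into it
        node = Node(tag, self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        # Assumption: tags are properly nested, so an end tag closes the current node
        if self._current.parent is not None:
            self._current = self._current.parent

    def handle_data(self, data):
        self._current.text += data


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Returns the tree as a list of indented tag names, one per node."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Printing "\n".join(outline(build_tree(html))) shows the nesting of elements. Browsers and libraries that follow the HTML specification also repair malformed markup while building the tree, which this sketch does not attempt.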

+6

Source: https://habr.com/ru/post/959460/

