XPaths Generalization

I would like to ask for help with a problem that I am trying to solve with XPaths.

I am trying to generalize a few XPaths provided by the user to get the XPath that best "fits" all the provided examples. This is for a web scraping system I am creating.

For example, suppose the user provides the following XPaths (each one points to a link in the "Spotlight" section of the Google News page):

Good examples:

 /html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[1]/div[2]/a[@id='MAE4AUgAUABgAmoCdXM']/span
 /html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[6]/div[2]/a[@id='MAE4AUgFUABgAmoCdXM']/span
 /html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[12]/div[2]/a[@id='MAE4AUgLUABgAmoCdXM']/span

Bad examples: (pointing to a link in another section)

 /html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='lt-col']/div[2]/div[@id='replaceable-section-blended']/div[1]/div[4]/div/h2/a[@id='MAA4AEgFUABgAWoCdXM']/span 

The system should be able to generalize these and create an XPath expression that selects all the links in the Spotlight section. (It should also be able to throw out the bad XPath.)

Generalized XPath

 /html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div/div[2]/a[@id='MAE4AUgLUABgAmoCdXM']/span 
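
To make the goal concrete, here is a minimal sketch of one possible position-by-position generalization. It is not a known library API: the helper names `splitSteps`, `tagOf`, `generalize` and `covers` are hypothetical, the example paths are shortened stand-ins for the real ones, and the sketch assumes that all good examples have the same number of steps and that no `/` appears inside a predicate. The `covers` check compares strings only and does not evaluate anything against a DOM.

```javascript
// Split an absolute XPath into its location steps: "/a/b[1]" -> ["a", "b[1]"].
// Assumption: no "/" occurs inside a predicate.
function splitSteps(xpath) {
  return xpath.split("/").map((s) => s.trim()).filter((s) => s.length > 0);
}

// Tag name of a step: "div[3]" -> "div".
function tagOf(step) {
  const i = step.indexOf("[");
  return i === -1 ? step : step.slice(0, i);
}

// Generalize equally long paths one position at a time:
// identical steps are kept, steps with the same tag lose their predicate,
// and anything else becomes the wildcard "*".
function generalize(xpaths) {
  const paths = xpaths.map(splitSteps);
  const length = paths[0].length;
  if (!paths.every((p) => p.length === length)) {
    throw new Error("this sketch only handles paths with the same number of steps");
  }
  const steps = [];
  for (let i = 0; i < length; i++) {
    const first = paths[0][i];
    if (paths.every((p) => p[i] === first)) {
      steps.push(first);
    } else if (paths.every((p) => tagOf(p[i]) === tagOf(first))) {
      steps.push(tagOf(first));
    } else {
      steps.push("*");
    }
  }
  return "/" + steps.join("/");
}

// String-level check: would the generalized path select this concrete path?
function covers(general, concrete) {
  const g = splitSteps(general);
  const c = splitSteps(concrete);
  return (
    g.length === c.length &&
    g.every(
      (step, i) =>
        step === "*" ||
        step === c[i] ||
        (tagOf(step) === step && step === tagOf(c[i]))
    )
  );
}

// Shortened, hypothetical example paths.
const good = [
  "/html/body/div[@id='news']/div[1]/a[@id='x1']/span",
  "/html/body/div[@id='news']/div[6]/a[@id='x2']/span",
];
const bad = "/html/body/div[@id='other']/div[4]/a[@id='y1']/span";

const general = generalize(good);
console.log(general);              // /html/body/div[@id='news']/div/a/span
console.log(covers(general, bad)); // false
```

A bad example can then serve as a check: if `covers` returns true for it, the candidate is too general and a predicate that was dropped needs to be kept.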

Could you advise me on how to do this? I thought about using a Longest Common Substring strategy, but it would generalize too much if a bad example were given (for example, the fourth example above). Are there any libraries or open source software written for this purpose?

I have seen several similar questions ("Finding a common ancestor from a group of XPaths?" and "How to find the first XPath common ancestor in JavaScript?"). However, they are about the longest common ancestor.

I am writing this in JavaScript as a Firefox extension.

Thanks for your time and any help would be greatly appreciated!

1 answer

This is an automaton minimization problem. You have (XPath1 | XPath2 | XPath3), and you would like to get a minimal automaton XPath4 that matches the same nodes. There is also the question of whether the minimization may lose information, as JPEG does. For exact minimization, you can search for "finite state automata minimization algorithms".

The simplest approach is to map each XPath step to a character and then search for a common subsequence across the resulting list of strings. So we have, for example,

adcba, acba, adba --common subsequence→ aba --general regexp→ a.*b.*a --convert back to XPath→ ...

You can also try substituting something less general for .*
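
The pipeline above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: the function names `lcs`, `commonSubsequence` and `toXPath` are hypothetical, the common subsequence of several inputs is approximated by folding a pairwise longest-common-subsequence over the list (which is not guaranteed to be optimal for more than two inputs), and every gap is turned into the descendant axis `//` as the XPath counterpart of `.*`.

```javascript
// Longest common subsequence of two token arrays (classic dynamic programming).
function lcs(a, b) {
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      dp[i][j] =
        a[i] === b[j]
          ? dp[i + 1][j + 1] + 1
          : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the table to recover one longest common subsequence.
  const out = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push(a[i]);
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return out;
}

// Heuristic for many inputs: fold the pairwise LCS over the list.
function commonSubsequence(tokenLists) {
  return tokenLists.reduce((acc, tokens) => lcs(acc, tokens));
}

// Convert common steps back to XPath, using "//" wherever ".*" would go.
function toXPath(steps) {
  return "/" + steps.join("//");
}

// The character example from the answer.
const common = commonSubsequence(
  ["adcba", "acba", "adba"].map((s) => s.split(""))
);
console.log(common.join(""));   // aba
console.log(common.join(".*")); // a.*b.*a

// The same idea with XPath steps as tokens (hypothetical short paths).
const stepLists = ["/html/body/div[1]/a", "/html/body/table/tr/a"].map((p) =>
  p.split("/").filter((s) => s.length > 0)
);
console.log(toXPath(commonSubsequence(stepLists))); // /html//body//a
```

Putting `//` between every pair of steps is the most general choice. A less general variant would keep a plain `/` wherever the two steps were adjacent in every input.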


Source: https://habr.com/ru/post/1342817/

