I would like to ask for help with a problem that I am trying to solve with XPaths.
I am trying to generalize several XPaths provided by the user into the single XPath that best "fits" all of the provided examples. This is for a web scraping system I am building.
For example, suppose the user provides the following XPaths (each one points to a link in the "Spotlight" section of the Google News page):
Good examples:
/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[1]/div[2]/a[@id='MAE4AUgAUABgAmoCdXM']/span
/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[6]/div[2]/a[@id='MAE4AUgFUABgAmoCdXM']/span
/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div[12]/div[2]/a[@id='MAE4AUgLUABgAmoCdXM']/span
Bad example (pointing to a link in another section):
/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='lt-col']/div[2]/div[@id='replaceable-section-blended']/div[1]/div[4]/div/h2/a[@id='MAA4AEgFUABgAWoCdXM']/span
The system should be able to generalize from these and create an XPath expression that selects all the links in the Spotlight section. (It should also be able to discard the bad XPath.)
Generalized XPath:
/html/body/div[@id='page']/div/div[@id='main-wrapper']/div[@id='main']/div/div/div[3]/div[1]/table[@id='main-am2-pane']/tbody/tr/td[@id='rt-col']/div[3]/div[@id='s_en_us:ir']/div[2]/div/div[2]/a[@id='MAE4AUgLUABgAmoCdXM']/span
Could you advise me how to do this? I thought about using the Longest Common Substring strategy, but it would over-generalize if a bad example is given (for example, the fourth example above). Are there any libraries or open source projects that address this problem?
I saw several similar questions ("search for a common ancestor from the xpath group?" and "How to find the first XPath common ancestor in Javascript?"). However, they only deal with finding the longest common ancestor.
I am writing this in JavaScript, as a Firefox extension.
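To make the idea more concrete, here is a rough sketch of a step-by-step comparison I could use instead of Longest Common Substring. The helper names (`splitSteps`, `parseStep`, `generalize`, `matches`) are my own and not from any library. It assumes that all good examples have the same number of steps and that no predicate contains a `/`. It is only a sketch under those assumptions, not a complete solution.

```javascript
// Hypothetical helpers for illustration only.

// Split an absolute XPath into its location steps.
// Assumption: no predicate contains a '/' character.
function splitSteps(xpath) {
  return xpath.split('/').filter(function (s) { return s.length > 0; });
}

// Split one step such as "div[@id='page']" into a name and an optional predicate.
function parseStep(step) {
  var m = /^([^\[]+)(?:\[(.*)\])?$/.exec(step);
  return { name: m[1], predicate: m[2] === undefined ? null : m[2] };
}

// Build one pattern from several good examples: keep a predicate only
// when every example agrees on it, otherwise keep just the element name.
// Returns null when the examples cannot be aligned step by step.
function generalize(xpaths) {
  var parsed = xpaths.map(function (x) { return splitSteps(x).map(parseStep); });
  var len = parsed[0].length;
  if (!parsed.every(function (p) { return p.length === len; })) return null;
  var out = [];
  for (var i = 0; i < len; i++) {
    var first = parsed[0][i];
    var sameName = parsed.every(function (p) { return p[i].name === first.name; });
    if (!sameName) return null;
    var samePredicate = parsed.every(function (p) {
      return p[i].predicate === first.predicate;
    });
    out.push(samePredicate && first.predicate !== null
      ? first.name + '[' + first.predicate + ']'
      : first.name);
  }
  return '/' + out.join('/');
}

// Check whether a concrete XPath is covered by a generalized pattern.
// Useful for verifying that a bad example is rejected.
function matches(pattern, xpath) {
  var p = splitSteps(pattern).map(parseStep);
  var x = splitSteps(xpath).map(parseStep);
  return p.length === x.length && p.every(function (s, i) {
    return s.name === x[i].name &&
      (s.predicate === null || s.predicate === x[i].predicate);
  });
}
```

Calling `generalize` on the three good examples above would drop the differing positions and ids, so the result ends in `/div[2]/div/div[2]/a/span`, and `matches` returns false for the bad example because the number of steps is different. What I do not know is how to handle good examples that have different depths, or bad examples that differ only in one predicate.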
Thanks for your time and any help would be greatly appreciated!