Web scraper without knowledge of page structure

I am trying to teach myself some concepts by writing a script. Basically, I am trying to write a Python script that, given a few keywords, will crawl web pages until it finds the data I need. For example, say I want to find a list of venomous snakes that live in the US. I might run the script with the keywords list, venomous, snakes, US, and I want to be able to trust with at least 80% confidence that it will return the list of snakes in the US.

I already know how to implement the web spider part; I just want to find out how I can determine the relevance of a web page without knowing anything about the page's structure. I have researched web scraping methods, but they all seem to rely on knowing the structure of the page's HTML. Is there a certain algorithm that would allow me to pull data from a page and determine its relevance?

Any pointers would be greatly appreciated. I am using Python with urllib and BeautifulSoup.
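To be concrete, the fetch-and-flatten step I have in mind is something like this rough sketch (assuming BeautifulSoup 4; the URL is only a placeholder):

    import urllib
    from bs4 import BeautifulSoup

    # Download a page and flatten it to plain text, ignoring its structure.
    html = urllib.urlopen("http://en.wikipedia.org/wiki/Snakes").read()
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ")
    print text[:200]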

2 answers

Use a crawler like scrapy (used here only to handle concurrent downloads); you can write a simple spider like the one below, and Wikipedia is probably a good starting point. The script is a complete example using scrapy, nltk and whoosh. It never stops, and it indexes the links for later searching with whoosh. It is a little Google:

    __author__ = 'Farsheed Ashouri'
    import os
    import sys
    import re

    ## Spider libraries
    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from main.items import MainItem
    from scrapy.http import Request
    from urlparse import urljoin

    ## Indexer libraries
    from whoosh.index import create_in, open_dir
    from whoosh.fields import *

    ## HTML-to-text conversion module
    import nltk


    def open_writer():
        # Create the whoosh index on first use, otherwise reopen it.
        if not os.path.isdir("indexdir"):
            os.mkdir("indexdir")
            schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
            ix = create_in("indexdir", schema)
        else:
            ix = open_dir("indexdir")
        return ix.writer()


    class Main(BaseSpider):
        name = "main"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["http://en.wikipedia.org/wiki/Snakes"]

        def parse(self, response):
            writer = open_writer()  ## for indexing
            sel = Selector(response)
            email_validation = re.compile(r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$')
            #general_link_validation = re.compile(r'')
            # Already-crawled links are stored in this set
            crawledLinks = set()
            titles = sel.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
            contents = sel.xpath('//body/div[@id="content"]').extract()
            if contents:
                content = contents[0]
            if titles:
                title = titles[0]
            else:
                return
            links = sel.xpath('//a/@href').extract()
            for link in links:
                # If it is a proper link and is not crawled yet, yield it to the spider
                url = urljoin(response.url, link)
                #print url
                ## The url must not have any ":" character in it (e.g. /wiki/Talk:Company)
                if not url in crawledLinks and re.match(r'http://en.wikipedia.org/wiki/[^:]+$', url):
                    crawledLinks.add(url)
                    #print url, depth
                    yield Request(url, self.parse)
            item = MainItem()
            item["title"] = title
            print '*' * 80
            print 'crawled: %s | it has %s links.' % (title, len(links))
            #print content
            print '*' * 80
            item["links"] = list(crawledLinks)
            writer.add_document(title=title, content=nltk.clean_html(content))  ## only the text of the content is saved
            #print crawledLinks
            writer.commit()
            yield item
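Once the spider has filled the index, it can be queried later with whoosh, roughly like this (a small sketch; the query string is only an example, and "indexdir" is the directory created by the spider above):

    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    ix = open_dir("indexdir")
    with ix.searcher() as searcher:
        # Parse the user's keywords against the indexed page text.
        query = QueryParser("content", ix.schema).parse(u"venomous snakes list")
        for hit in searcher.search(query, limit=10):
            print hit["title"]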

(A file showing an example of a completed crawl was attached here.)


You are basically asking, "How do I write a search engine?" That is... not trivial.

The right way to do this is to just use the Google Search API (or Bing's, or Yahoo!'s, or ...) and show the top results. But if you are just working on a personal project to teach yourself some concepts (not sure exactly which ones those would be), then here are a few suggestions:

  • search the text content of the relevant tags (<p>, <div>, etc.) for the relevant keywords (duh); a rough scoring sketch combining these suggestions is shown after this list
  • use the relevant keywords to check for tags that might contain what you are looking for. For example, if you are looking for a list of things, then a page containing <ul> or <ol> or even <table> might be a good candidate
  • build a synonym dictionary and search each page for synonyms of your keywords as well. Limiting yourself to "USA" might mean an artificially low rating for a page that only says "America"
  • keep a list of words that are not in your set of keywords, and give a higher rating to pages that contain the most of them. Those pages are (possibly) more likely to contain the answer you are looking for
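
Here is a rough sketch of what a page-scoring function combining the first three suggestions might look like (assuming BeautifulSoup 4; the keywords, synonyms and weights are made up for illustration and would need tuning):

    from bs4 import BeautifulSoup

    KEYWORDS = {"list", "venomous", "snakes", "us"}
    SYNONYMS = {"us": {"usa", "america"},
                "venomous": {"poisonous"}}

    def relevance(html):
        # Flatten the page to plain text; no knowledge of its structure needed.
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(" ").lower()
        words = set(text.split())

        score = 0.0
        for kw in KEYWORDS:
            if kw in words:
                score += 1.0                      # direct keyword hit
            elif words & SYNONYMS.get(kw, set()):
                score += 0.5                      # a synonym hit counts for less
        # The "list" keyword makes list/table markup a good sign.
        if "list" in KEYWORDS and soup.find(["ul", "ol", "table"]) is not None:
            score += 1.0
        return score / (len(KEYWORDS) + 1)        # normalise to roughly 0..1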

Good luck (you'll need it)!


Source: https://habr.com/ru/post/970005/

