Scraping HTML using Python, or ...

One of the arguments I make to my students (in Microbiology and Genetics) is that real-world data is messy, and Python can help with that (other languages can too, of course). So here is a practical case of web data collection.

I notice that several of the users with the highest reputation on Stack Overflow answer a lot of Python-related questions. Some questions naturally arise:

I want to record the current reputation and reputation growth rate of the highest-rated Pythonistas on Stack Overflow, to predict whether and when Alex Martelli will overtake S.Lott or Greg Hewgill. And what about Konrad Rudolph? Or is this trivial, because these users' growth is pinned to the daily reputation cap?

More generally, in the absence of an API for such requests (and I believe there is none), is there any alternative to inspecting the page URLs for patterns, loading those pages with Python, and then scraping the HTML? I understand there is probably no fully general approach, but I am interested in how people attack this problem.
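The load-and-scrape approach described above can be sketched with the standard library alone. The HTML snippet and the `class="reputation"` markup below are hypothetical stand-ins for a real profile page (which you would fetch with `urllib.request.urlopen`); the regex would need adjusting to whatever the live page actually contains.

```python
import re

# Hypothetical snippet standing in for a downloaded profile page;
# a real page would come from urllib.request.urlopen(url).read().decode().
sample_html = """
<div class="user-info">
  <span class="reputation">245,312</span>
  <span class="user-name">Alex Martelli</span>
</div>
"""

def extract_reputation(html: str) -> int:
    """Pull the first reputation figure out of a page, commas stripped."""
    match = re.search(r'class="reputation">([\d,]+)<', html)
    if match is None:
        raise ValueError("no reputation figure found in page")
    return int(match.group(1).replace(",", ""))

print(extract_reputation(sample_html))
```

Running this once a day and appending the result to a file would already give a growth-rate series, but the fragility is obvious: any markup change breaks the regex, which is exactly why a structured data source beats scraping when one exists.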

Edit: @fitzgeraldsteele: In general, yes. SO is just a handy (contrived) example.

1 answer

There is a wonderfully usable monthly "data dump" of Stack Overflow content under a Creative Commons license — see, for example, here (just the first of the many links about it that came to hand; a new dump appears at least once a month). For an analysis such as "the average weekly reputation of some set of posters", those monthly data dumps are much more useful than screen scraping.
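Working from a dump instead of scraping looks roughly like this. The two-row XML below is a tiny stand-in for the dump's users file, and the attribute names (`DisplayName`, `Reputation`) are assumptions about its schema — check the schema shipped with the actual dump before relying on them.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the monthly dump's users file; the real file has the
# same one-<row>-per-user shape but millions of rows. Attribute names
# and the reputation figures here are assumed/illustrative, not real data.
users_xml = """<users>
  <row Id="1" DisplayName="Alex Martelli" Reputation="245312" />
  <row Id="2" DisplayName="S.Lott" Reputation="198765" />
</users>"""

def top_reputations(xml_text: str, n: int = 10):
    """Return (name, reputation) pairs sorted by reputation, highest first."""
    root = ET.fromstring(xml_text)
    users = [(row.get("DisplayName"), int(row.get("Reputation")))
             for row in root.findall("row")]
    return sorted(users, key=lambda user: user[1], reverse=True)[:n]

print(top_reputations(users_xml))
```

Diffing the top-N list between two consecutive monthly dumps gives exactly the growth-rate data the question asks for, with no scraping at all.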

If you do want to screen-scrape some (other ;-) site without violating its policies or its robots.txt file, Python is one of several excellent options — start with scrapy, for example, and you will save yourself a lot of extra work.


Source: https://habr.com/ru/post/1730955/

