How can I collect many instances of sentences of the form "[subject] are ..." from the Internet?

I am trying to compile sentences from the Internet that match the following construction:

[subject] [are/is] [rest of sentence]. 

So, for example, I want to search for and collect all sentences starting with "Computers are [rest of sentence]", which can lead to things like:

  • Computers are annoying.
  • Computers are great.
  • Computers are expensive.
  • Etc.

What I want to collect is everything from the beginning of a sentence to a period (preferably sorted by frequency of occurrence).

Is there a way to do this with existing search engines, or will I need to build a bot / scraper?

+6
4 answers

You will need to start with a list of the nouns that interest you, and then pull the matching sentences for each.

Does it need to be from the Internet? There are several English-language text corpora that you could work through: http://en.wikipedia.org/wiki/Text_corpus

You still have to write some regular expressions to filter out what you don't want.
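
For the regex-plus-frequency part of the question, here is a minimal Python sketch, assuming the scraped text is already sitting in a local file (corpus.txt is a hypothetical name):

    # A minimal sketch: extract "[subject] is/are ..." sentences from a
    # text file and sort them by frequency of occurrence.
    # corpus.txt is a placeholder filename.
    import re
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read()

    # Match from the subject up to the next period, e.g. "Computers are great."
    pattern = re.compile(r"\bComputers\s+(?:are|is)\s+[^.]*\.", re.IGNORECASE)

    counts = Counter(match.group(0).strip() for match in pattern.finditer(text))
    for sentence, n in counts.most_common(10):
        print(n, sentence)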

You can also use the Google Search API and query for things like "computers are *", though you will still have to filter the data.
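
The classic Google Web Search API has since been shut down; the closest current equivalent is the Custom Search JSON API. A rough sketch in Python, where YOUR_API_KEY and YOUR_CSE_ID are placeholders for credentials you would create in the Google Cloud and Programmable Search Engine consoles:

    # Sketch only: YOUR_API_KEY and YOUR_CSE_ID must be replaced with
    # real credentials before this will run.
    import requests

    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": "YOUR_API_KEY",
            "cx": "YOUR_CSE_ID",
            "q": '"computers are *"',  # quoted wildcard phrase search
        },
    )
    resp.raise_for_status()

    # Each result snippet may contain a matching sentence; further regex
    # filtering (as above) is still required.
    for item in resp.json().get("items", []):
        print(item.get("snippet", ""))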

+2

It does not give you specific numbers, but you can get popular (often funny) results using the Google Suggest API

e.g.:

 http://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=Computers%20are 

... which returns something like:

 <toplevel>
   <CompleteSuggestion>
     <suggestion data="computers are your future"/>
   </CompleteSuggestion>
   <CompleteSuggestion>
     <suggestion data="computers are racist"/>
   </CompleteSuggestion>
   <CompleteSuggestion>
     <suggestion data="computers are us"/>
   </CompleteSuggestion>
   <CompleteSuggestion>
     <suggestion data="computers are stupid"/>
   </CompleteSuggestion>
   <CompleteSuggestion>
     <suggestion data="computers are illegal in florida"/>
   </CompleteSuggestion>
   [...]
 </toplevel>

It is worth noting that Google will apply its usual magic to try to improve the results; for example, if you search with the typo "Compuuter is", it will be corrected to "Computer is".
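
A minimal Python sketch for fetching and parsing that Suggest response, using only the standard library and the endpoint shown above:

    # Fetch Google Suggest completions for "Computers are" and print the
    # suggestion strings out of the XML response shown above.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    url = (
        "http://suggestqueries.google.com/complete/search"
        "?output=toolbar&hl=en&q=" + urllib.parse.quote("Computers are")
    )
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)

    for node in tree.iter("suggestion"):
        print(node.get("data"))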

+2

If you don't mind using Ruby, there is a library called spidr that can crawl pages. There is also an NLP library called Treat.
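
If you prefer Python, a rough stand-in for the spidr idea using the requests and beautifulsoup4 packages (the seed URL is a placeholder, and this fetches just one page rather than crawling):

    # Fetch one page, strip the markup, and pull out candidate sentences.
    # https://example.com is a placeholder seed URL.
    import re

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com", timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")

    for match in re.finditer(r"\bComputers\s+are\s+[^.]*\.", text, re.IGNORECASE):
        print(match.group(0).strip())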

I would also take a look at yubnub.

+1

You can access a massive corpus of web pages using Common Crawl. Write a Hadoop MapReduce job to run on AWS and retrieve the pages that interest you. Details and tutorials are available on their website.
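
Short of a full Hadoop job, here is a minimal local sketch using the warcio package, assuming you have already downloaded a single WARC segment from Common Crawl (segment.warc.gz is a placeholder filename):

    # Stream one Common Crawl WARC file and grep HTTP responses for the
    # target construction. segment.warc.gz is a placeholder filename.
    import re

    from warcio.archiveiterator import ArchiveIterator

    pattern = re.compile(rb"\bComputers\s+are\s+[^.<]*\.", re.IGNORECASE)

    with open("segment.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            body = record.content_stream().read()
            for match in pattern.findall(body):
                print(match.decode("utf-8", errors="replace"))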

0
