How to get a subset of Wikipedia pages?

How can I get a subset (say, 100 MB) of Wikipedia's pages? I found that you can get the whole dataset as an XML dump, but that is more like 1 or 2 GB; I don't need that much.

I want to experiment with implementing a MapReduce algorithm.

Having said that, if I could just find 100 megabytes of textual sample data from anywhere, that would be fine too. For example, the Stack Overflow database, if it is available, would be a good size. I am open to suggestions.

Edit: Are there any that aren't torrents? I can't get those to work.

+3
6 answers

The stackoverflow data dump is available for download.

+4

, , " " Wikipedia, 100 -: http://en.wikipedia.org/wiki/Special:Random. , , , ( -, ). .

+2

If you want a copy of the stackoverflow database, you can get it from the Creative Commons data dump.

Out of curiosity, what are you using all this data for?

+1

One option is to download the entire dump and then use only part of it. You can either decompress the whole thing and then split the resulting file into smaller pieces with a simple script, or, if you are worried about disk space, write a script that decompresses and splits on the fly so that you can stop the decompression at whatever point you want (see the sketch after this answer). Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly if you are comfortable with python (look at mparser.py).

If you do not want to download the whole thing, you are left with the option of scraping. The Export feature may be useful for this, and the wikipediabot was also suggested in this context.
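
The sketch mentioned above: stream the compressed pages-articles dump and stop once about 100 MB of XML has been decompressed, so you never unpack the whole thing. The dump and output file names are placeholders; the result is truncated XML (it will usually end mid-page), which tends to be fine for word-count style MapReduce experiments, or you can trim it back to the last </page> tag yourself.

    import bz2

    TARGET_BYTES = 100 * 1024 * 1024          # stop after ~100 MB of decompressed XML
    CHUNK = 1024 * 1024                       # decompress one megabyte at a time

    written = 0
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as src, \
         open("wiki_subset.xml", "wb") as dst:
        while written < TARGET_BYTES:
            chunk = src.read(CHUNK)
            if not chunk:                     # reached the end of the dump early
                break
            dst.write(chunk)
            written += len(chunk)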

0

Could you use a web crawler and scrape 100 MB of data?
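
If you do go the crawler route, a rough sketch might look like the following; the seed article, one-second delay, and output file are illustrative choices, and it only follows plain /wiki/ article links:

    from collections import deque
    from html.parser import HTMLParser
    import time
    import urllib.parse
    import urllib.request

    TARGET_BYTES = 100 * 1024 * 1024
    SEED = "https://en.wikipedia.org/wiki/MapReduce"   # arbitrary starting article

    class LinkParser(HTMLParser):
        """Collects /wiki/ links, skipping namespaced pages like Special: or File:."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if href.startswith("/wiki/") and ":" not in href:
                    self.links.append(href)

    queue, seen, collected = deque([SEED]), {SEED}, 0
    with open("crawl_sample.html", "wb") as out:
        while queue and collected < TARGET_BYTES:
            url = queue.popleft()
            req = urllib.request.Request(url, headers={"User-Agent": "crawl-sample/0.1"})
            with urllib.request.urlopen(req) as resp:
                page = resp.read()
            out.write(page)
            collected += len(page)
            parser = LinkParser()
            parser.feed(page.decode("utf-8", errors="replace"))
            for href in parser.links:
                full = urllib.parse.urljoin(url, href)
                if full not in seen:
                    seen.add(full)
                    queue.append(full)
            time.sleep(1)                      # be polite to the servers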

0

There are a lot of Wikipedia dumps available. Why do you want to choose the biggest one (the English wiki)? The Wikinews archives are much smaller.
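
For example, the English Wikinews dump can be fetched directly from dumps.wikimedia.org; the file name below follows the usual dump naming convention and is an assumption about what is currently published, so check the directory listing first:

    import urllib.request

    # Assumed URL: the standard "latest pages-articles" layout for enwikinews.
    url = ("https://dumps.wikimedia.org/enwikinews/latest/"
           "enwikinews-latest-pages-articles.xml.bz2")
    urllib.request.urlretrieve(url, "enwikinews-latest-pages-articles.xml.bz2")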

0
