How to get a subset of Wikipedia pages?

How can I get a subset (say, 100 MB) of Wikipedia's pages? I found that you can get the whole dataset as an XML dump, but that is more like 1 or 2 GB; I don't need that much.

I want to experiment with implementing a MapReduce algorithm.

Having said that, if I could just find 100 megabytes of textual sample data from anywhere, that would be fine too. For example, the Stack Overflow database, if it is available, would be a good size. I am open to suggestions.

Edit: Are there any that aren't torrents? I can't get those to work.

+3
6 answers

The stackoverflow data dump is available for download.

+4

, , " " Wikipedia, 100 -: http://en.wikipedia.org/wiki/Special:Random. , , , ( -, ). .

+2

If you want a copy of the stackoverflow database, you can get it from the Creative Commons data dump.

Out of curiosity, what are you using all this data for?

+1

One option is to download the entire dump and then use only part of it. You can either decompress the whole thing and then split the resulting file into smaller pieces with a simple script, or, if you are worried about disk space, write a script that decompresses and splits on the fly so that you can stop the decompression at whatever point you want (see the sketch after this answer). Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly if you are comfortable with python (look at mparser.py).

If you do not want to download the whole thing, you are left with the option of scraping. The Export feature may be useful for this, and the wikipediabot was also suggested in this context.
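
The sketch mentioned above: stream the compressed pages-articles dump and stop once about 100 MB of XML has been decompressed, so you never unpack the whole thing. The dump and output file names are placeholders; the result is truncated XML (it will usually end mid-page), which tends to be fine for word-count style MapReduce experiments, or you can trim it back to the last </page> tag yourself.

    import bz2

    TARGET_BYTES = 100 * 1024 * 1024          # stop after ~100 MB of decompressed XML
    CHUNK = 1024 * 1024                       # decompress one megabyte at a time

    written = 0
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as src, \
         open("wiki_subset.xml", "wb") as dst:
        while written < TARGET_BYTES:
            chunk = src.read(CHUNK)
            if not chunk:                     # reached the end of the dump early
                break
            dst.write(chunk)
            written += len(chunk)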

0

Could you use a web crawler and scrape 100 MB of data?
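
If you do go the crawler route, a rough sketch might look like the following; the seed article, one-second delay, and output file are illustrative choices, and it only follows plain /wiki/ article links:

    from collections import deque
    from html.parser import HTMLParser
    import time
    import urllib.parse
    import urllib.request

    TARGET_BYTES = 100 * 1024 * 1024
    SEED = "https://en.wikipedia.org/wiki/MapReduce"   # arbitrary starting article

    class LinkParser(HTMLParser):
        """Collects /wiki/ links, skipping namespaced pages like Special: or File:."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if href.startswith("/wiki/") and ":" not in href:
                    self.links.append(href)

    queue, seen, collected = deque([SEED]), {SEED}, 0
    with open("crawl_sample.html", "wb") as out:
        while queue and collected < TARGET_BYTES:
            url = queue.popleft()
            req = urllib.request.Request(url, headers={"User-Agent": "crawl-sample/0.1"})
            with urllib.request.urlopen(req) as resp:
                page = resp.read()
            out.write(page)
            collected += len(page)
            parser = LinkParser()
            parser.feed(page.decode("utf-8", errors="replace"))
            for href in parser.links:
                full = urllib.parse.urljoin(url, href)
                if full not in seen:
                    seen.add(full)
                    queue.append(full)
            time.sleep(1)                      # be polite to the servers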

0

There are a lot of Wikipedia dumps available. Why do you want to choose the biggest one (the English wiki)? The Wikinews archives are much smaller.
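
For example, the English Wikinews dump can be fetched directly from dumps.wikimedia.org; the file name below follows the usual dump naming convention and is an assumption about what is currently published, so check the directory listing first:

    import urllib.request

    # Assumed URL: the standard "latest pages-articles" layout for enwikinews.
    url = ("https://dumps.wikimedia.org/enwikinews/latest/"
           "enwikinews-latest-pages-articles.xml.bz2")
    urllib.request.urlretrieve(url, "enwikinews-latest-pages-articles.xml.bz2")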

0
