Small Datasets for Hadoop-MapReduce

I am trying to get familiar with Hadoop MapReduce. Having studied the theory behind these concepts, I now want to practice them.

However, I could not find small datasets (up to 3 GB) for this technology. Where can I find datasets to practice on?

Or, how can I practice Hadoop MapReduce? In other words, is there any tutorial or website that offers exercises?

3 answers

There are public datasets that you can download and play with. Below are a few examples.

http://www.netflixprize.com/index - As part of the contest, Netflix released a set of user ratings to encourage people to develop better recommendation algorithms. The uncompressed data comes to 2 GB+. It contains 100M+ movie ratings from 480K users on 17K movies.

http://aws.amazon.com/publicdatasets/ - For example, one of the biology datasets is the annotated human genome data, at approximately 550 GB. Under economics, you can find datasets such as the 2000 US Census (approximately 200 GB).

http://boston.lti.cs.cmu.edu/Data/clueweb09/ - Carnegie Mellon University's Language Technologies Institute released the ClueWeb09 dataset to support large-scale web research. It is a crawl of a billion web pages in 10 languages. The uncompressed dataset is 25 TB.


Why not create some datasets yourself?

A very simple task is to fill a file with millions of random numbers and then use Hadoop to find duplicates, triplicates, prime numbers, numbers that have repeated factors, and so on.

Of course, it's not as much fun as finding mutual friends on Facebook, but it is enough to get some Hadoop practice.
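To make the idea concrete, here is a minimal sketch in Python of the "random numbers, then find duplicates" exercise. It is not Hadoop code: run_local only simulates the map, shuffle and reduce phases in a single process so the logic can be checked before porting it to a real job. All function names, the default upper bound and the seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def generate_numbers(path, count, upper=1_000_000, seed=42):
    """Write `count` random integers in [1, upper], one per line, to `path`."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(count):
            f.write(f"{rng.randint(1, upper)}\n")


def mapper(line):
    """Map step: emit (number, 1) for one input line."""
    yield int(line), 1


def reducer(key, values):
    """Reduce step: emit (number, count) only for numbers seen more than once."""
    total = sum(values)
    if total > 1:
        yield key, total


def run_local(lines):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for key, value in mapper(line):
            groups[key].append(value)

    results = {}
    for key in sorted(groups):
        for number, count in reducer(key, groups[key]):
            results[number] = count
    return results
```

A typical run would call generate_numbers("numbers.txt", 1_000_000) once and then pass the open file to run_local. The same mapper and reducer logic is what you would rewrite as a Hadoop job; changing the condition in reducer (for example, total == 3) turns it into the triplicates variant.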


Or, how can I practice Hadoop MapReduce? In other words, is there any tutorial or website that offers exercises?

Here are some toy problems. Also check out "Data-Intensive Text Processing with MapReduce"; it has pseudo-code for some algorithms implemented in MapReduce, such as PageRank, joins and indexing.
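As an illustration of what such a toy problem can look like, here is a minimal sketch of inverted indexing written as a map step and a reduce step in plain Python. It is not the pseudo-code from the book, and it does not run on Hadoop; the function names and the whitespace tokenization are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    """Map step: emit (word, doc_id) for every word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def index_reducer(word, doc_ids):
    """Reduce step: emit the word with its sorted, de-duplicated document list."""
    yield word, sorted(set(doc_ids))


def build_index(docs):
    """Simulate map, shuffle and reduce locally over a {doc_id: text} dict."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in index_mapper(doc_id, text):
            groups[word].append(emitted_id)

    index = {}
    for word in sorted(groups):
        for key, postings in index_reducer(word, groups[word]):
            index[key] = postings
    return index
```

Calling build_index on a small dictionary of documents returns a mapping from each word to the documents that contain it, which is a convenient way to check the logic on a few lines of text before trying it on a larger dataset.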

Here are some public datasets I have collected over time. You may need to dig a little to find the small ones.

http://wiki.gephi.org/index.php/Datasets
Download Big Data for Hadoop
http://datamob.org/datasets
http://konect.uni-koblenz.de/
http://snap.stanford.edu/data/
http://archive.ics.uci.edu/ml/
https://bitly.com/bundles/hmason/1
http://www.inside-r.org/howto/finding-data-internet
https://docs.google.com/document/pub?id=1CNBmPiuvcU8gKTMvTQStIbTZcO_CTLMvPxxBrs0hHCg
http://ftp3.ncdc.noaa.gov/pub/data/noaa/1990/
http://data.cityofsantacruz.com/


Source: https://habr.com/ru/post/1440046/

