What is the typical ratio of compute to storage for large-scale Hadoop clusters?

Question

I am sizing a large cluster (10k cores) that needs to support both deep analytic processing and I/O-bound big data workloads, and I would like to hear from people who have built big data clusters how they sized compute relative to local disk storage. I am assuming a direct-attached storage architecture, as advocated by MapReduce-based online datastores.

Looking at some anno 2012 mid-density blade hardware, such as dual Xeon 5650s, I can put about 2 TB per server as direct-attached storage. That would give me about 100 GFlops to 2 TB of storage, or a 5:1 ratio. Low-density gear can be as low as 1:1, while higher-density gear can reach 10:1.

I would be interested to learn what ratios other big data shops are running.

+4

mapreduce data-warehouse

Ravenwater Jan 01 '11 at 20:32

2 answers

From the third article in Praveen's answer, written by Eric Baldeschwieler of Hortonworks in September 2011:

We get asked a lot of questions about how to select Apache Hadoop node hardware. During my time at Yahoo!, we bought many nodes with 6 * 2 TB SATA disks, 24 GB of RAM and 8 cores in a dual-socket configuration. This turned out to be a pretty good configuration. This year, I have seen systems with 12 * 2 TB SATA disks, 48 GB of RAM and 8 cores in a dual-socket configuration. We will see a move to 3 TB disks this year.

Which configuration makes sense for a given organization depends on factors such as the ratio of storage to compute that its workloads require, and on other considerations that cannot be answered in a general way. In addition, the hardware industry moves fast. In this article, I will try to describe the principles that have generally guided Hadoop hardware configuration choices over the past six years. All of these thoughts are aimed at designing medium and large Apache Hadoop clusters. Scott Carey made a good case for small machines for small clusters the other day on the Apache mailing list.
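To relate the two node configurations quoted above to the 10k-core target in the question, the arithmetic can be sketched as follows. This is only a rough illustration: the names NodeConfig and nodes_for_cores are made up for the example, "8 cores" is assumed to be the total per node across both sockets, and usable capacity assumes the HDFS default replication factor of 3 with no space reserved for temporary data or the operating system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """One worker node, using the figures from the quoted article."""
    name: str
    disks: int
    tb_per_disk: float
    ram_gb: int
    cores: int  # assumption: total cores per node, across both sockets

    @property
    def raw_tb(self) -> float:
        """Raw disk capacity of one node in TB."""
        return self.disks * self.tb_per_disk

    @property
    def tb_per_core(self) -> float:
        """Raw storage per core, one way to express storage vs. compute."""
        return self.raw_tb / self.cores


def nodes_for_cores(target_cores: int, node: NodeConfig) -> int:
    """Number of nodes needed to reach at least target_cores (ceiling division)."""
    return -(-target_cores // node.cores)


def usable_tb(raw_tb: float, replication: int = 3) -> float:
    """Capacity left after replication; 3 is the HDFS default."""
    return raw_tb / replication


OLDER = NodeConfig("6 x 2 TB", disks=6, tb_per_disk=2.0, ram_gb=24, cores=8)
NEWER = NodeConfig("12 x 2 TB", disks=12, tb_per_disk=2.0, ram_gb=48, cores=8)

for node in (OLDER, NEWER):
    count = nodes_for_cores(10_000, node)
    cluster_raw = count * node.raw_tb
    print(
        f"{node.name}: {node.tb_per_core:.1f} TB/core, "
        f"{node.ram_gb / node.cores:.0f} GB RAM/core, "
        f"{count} nodes, {cluster_raw:.0f} TB raw, "
        f"{usable_tb(cluster_raw):.0f} TB usable"
    )
```

Under these assumptions the older configuration works out to 1.5 TB of raw disk per core and the newer one to 3 TB per core, and a 10k-core cluster needs 1250 nodes in either case.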

+1

Ravenwater Jan 13 '12 at 21:40

Praveen sripati · Accepted Answer · 2012-01-02T01:47:25+0000

Here are some articles (1, 2, 3) to get you started with your Hadoop hardware setup.
