Access files from other file systems along with HDFS files in a Hadoop map reduce application

I know that we can invoke a map reduce job from a regular Java application. In my case the map reduce job has to work with files on HDFS as well as files on another file system. Is it possible in Hadoop to access files from another file system while also using files on HDFS?

So basically the intent is that I have one large file that I want to put in HDFS for parallel computing, and then compare the blocks of this file with some other files, which I do not want to move into HDFS because they need to be accessed as full-length files at one time.

+4
source share
2 answers

You can use the distributed cache to distribute files to your mappers; they can open and read the files in their configure() method (do not read them in map(), because it will be called many times).

Edit:

In order to access files from the local file system in your map reduce job, you can add those files to the distributed cache when setting up your job configuration.

 JobConf job = new JobConf();
 DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);

The MapReduce framework will ensure that these files are available to your mappers.

 public void configure(JobConf job) {
   // Get the cached archives/files
   Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
   // open, read and store for use in the map phase.
 }

and will delete the files when your job is done.
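For illustration only, here is a minimal sketch (using the old mapred API with JobConf) of a mapper that loads the cached lookup.dat into memory once in configure() and consults it in map(). The lookup file format (tab-separated key/value pairs), the class name, and the field names are assumptions, not part of the original answer.

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;

 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reporter;

 public class LookupMapper extends MapReduceBase
     implements Mapper<LongWritable, Text, Text, Text> {

   // In-memory copy of the cached lookup file, loaded once per task attempt.
   private final Map<String, String> lookup = new HashMap<String, String>();

   @Override
   public void configure(JobConf job) {
     try {
       Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
       // Assumed format: one "key<TAB>value" pair per line of lookup.dat.
       BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
       String line;
       while ((line = reader.readLine()) != null) {
         String[] parts = line.split("\t", 2);
         if (parts.length == 2) {
           lookup.put(parts[0], parts[1]);
         }
       }
       reader.close();
     } catch (IOException e) {
       throw new RuntimeException("Failed to read cached lookup file", e);
     }
   }

   @Override
   public void map(LongWritable key, Text value,
                   OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
     // Consult the in-memory lookup built in configure() instead of re-reading the file.
     String match = lookup.get(value.toString());
     if (match != null) {
       output.collect(value, new Text(match));
     }
   }
 }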

+1
source

Accessing a file system other than HDFS should be possible from mapper/reducer tasks, just like from any other task. One thing to note is that if there are, say, 1000 mappers and each of them tries to open the non-HDFS file, this could become a bottleneck depending on the type of the external file system. The same applies to mappers pulling data from a database.
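As a rough illustration of that point, a task can open a side file through the normal Hadoop FileSystem API (or plain java.io for local paths); the URI and path below are placeholders, and whether this scales depends on how many tasks hit the external system at once.

 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.net.URI;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class ExternalFileReader {

   // Reads a file that lives outside the job's default HDFS, e.g. on the
   // task node's local disk; the URI and path are placeholders.
   public static String readExternal(Configuration conf) throws IOException {
     FileSystem fs = FileSystem.get(URI.create("file:///data/reference/side-input.txt"), conf);
     StringBuilder contents = new StringBuilder();
     FSDataInputStream in = fs.open(new Path("/data/reference/side-input.txt"));
     try {
       BufferedReader reader = new BufferedReader(new InputStreamReader(in));
       String line;
       while ((line = reader.readLine()) != null) {
         contents.append(line).append('\n');
       }
     } finally {
       in.close();
     }
     return contents.toString();
   }
 }

If every one of those 1000 mappers calls something like this against a shared mount or database for each record, that is exactly the bottleneck described above; doing the read once per task in configure()/setup() keeps the load manageable.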

+2
source


