Here we use MaxMind GeoIP.
We put the GeoIPCity.dat file on HDFS and pass its location in as an argument when we launch the job. The code where we retrieve the GeoIPCity.dat file and create a new LookupService is:
    if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
        List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
        for (Path localFile : localFiles) {
            if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
                m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
            }
        }
    }
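For context, here is a minimal sketch of the kind of mapper that snippet lives in, assuming the org.apache.hadoop.mapreduce API and the legacy com.maxmind.geoip classes. The class name, key/value types, and the getLocation() call in map() are my illustration, not the original project's code:

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import com.maxmind.geoip.Location;
    import com.maxmind.geoip.LookupService;

    public class GeoIpMapper extends Mapper<LongWritable, Text, Text, Text> {

        private LookupService m_geoipLookupService;

        @Override
        protected void setup(Context context) throws IOException {
            // Find GeoIPCity.dat among the files the distributed cache placed locally.
            Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (localFiles != null) {
                for (Path localFile : localFiles) {
                    if ("GeoIPCity.dat".equalsIgnoreCase(localFile.getName())) {
                        m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Illustrative only: assume each input line is an IP address and emit its city.
            Location location = m_geoipLookupService.getLocation(value.toString());
            if (location != null) {
                context.write(value, new Text(location.city));
            }
        }
    }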
Below is an abridged version of the command that we use to start our process.
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar
The parts of this that are critical to running the MaxMind component are -files and -libjars. These are generic options handled by GenericOptionsParser:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
I assume that Hadoop uses GenericOptionsParser because I can't find a reference to it anywhere in my project. :)
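For what it's worth, the usual way GenericOptionsParser comes into play is through ToolRunner: if your driver implements Tool and you launch it with ToolRunner.run(), the generic options (-files, -libjars, -D, and so on) are parsed and stripped before your run() method sees the remaining arguments. A minimal driver sketch, with hypothetical class names and job wiring:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class GeoIpDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already reflects whatever -files/-libjars/-D registered.
            Job job = new Job(getConf(), "geoip-lookup");
            job.setJarByClass(GeoIpDriver.class);
            job.setMapperClass(GeoIpMapper.class);
            // ... input/output paths and formats would be configured here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner applies GenericOptionsParser to args before calling run().
            System.exit(ToolRunner.run(new Configuration(), new GeoIpDriver(), args));
        }
    }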
If you put GeoIPCity.dat on HDFS and specify it with the -files argument, it will be placed in the local cache, which the mapper can then get at in its setup function. It does not have to be in setup, but setup only runs once per mapper, so it is a great place to put it. Then use the -libjars argument to specify geoiplookup.jar (or whatever you named yours) and it will be able to use it. We do not put geoiplookup.jar on HDFS; I am operating on the assumption that Hadoop will distribute the jar as needed.
I hope that all makes sense. I am fairly familiar with Hadoop/MapReduce, but I didn't write the pieces of the project that use the MaxMind GeoIP component, so I had to do some digging to understand it well enough to give the explanation here.
EDIT: Additional description of -files and -libjars:

-files
The files argument is used to distribute files through the Hadoop distributed cache. In the example above, we are distributing the MaxMind GeoIP data file through the distributed cache. We need access to the GeoIP data file to map a user's IP address to the corresponding country, region, city, and time zone. The API requires the data file to be present locally, which is not feasible in a distributed processing environment (we have no guarantee of which nodes in the cluster will process the data). To get the relevant data to the processing node, we use the Hadoop distributed cache infrastructure. GenericOptionsParser and ToolRunner automatically facilitate this via the -files argument. Note that the file we distribute should already be available in HDFS.

-libjars
The -libjars argument is used to distribute any additional dependencies required by the map-reduce job. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will run. GenericOptionsParser and ToolRunner automatically facilitate this via the -libjars argument.
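One related detail (a general property of the distributed cache, not something from our code): files shipped with -files are also symlinked into each task's working directory under their base name, so a setup() method can usually open the file directly instead of scanning getLocalCacheFiles(). A sketch, which you should verify against your Hadoop version before relying on it:

    // Hedged alternative for setup(): the distributed cache creates a symlink
    // named GeoIPCity.dat in the task's current working directory.
    @Override
    protected void setup(Context context) throws IOException {
        File geoDb = new File("GeoIPCity.dat");
        if (geoDb.exists()) {
            m_geoipLookupService = new LookupService(geoDb);
        }
    }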