Here we use MaxMind GeoIP.
We put the GeoIPCity.dat file on HDFS and pass its location in as an argument when we launch the job. The code where we retrieve the GeoIPCity.dat file and create a new LookupService is:
    if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
        List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
        for (Path localFile : localFiles) {
            if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
                m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
            }
        }
    }
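For context, here is a minimal sketch of the kind of mapper that snippet lives in, assuming the org.apache.hadoop.mapreduce API and the legacy com.maxmind.geoip classes. The class name, key/value types, and the getLocation() call in map() are my illustration, not the original project's code:

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import com.maxmind.geoip.Location;
    import com.maxmind.geoip.LookupService;

    public class GeoIpMapper extends Mapper<LongWritable, Text, Text, Text> {

        private LookupService m_geoipLookupService;

        @Override
        protected void setup(Context context) throws IOException {
            // Find GeoIPCity.dat among the files the distributed cache placed locally.
            Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (localFiles != null) {
                for (Path localFile : localFiles) {
                    if ("GeoIPCity.dat".equalsIgnoreCase(localFile.getName())) {
                        m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Illustrative only: assume each input line is an IP address and emit its city.
            Location location = m_geoipLookupService.getLocation(value.toString());
            if (location != null) {
                context.write(value, new Text(location.city));
            }
        }
    }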
Below is an abridged version of the command that we use to start our process.
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar
The parts of this that are critical to running the MaxMind component are -files and -libjars. These are generic options handled by GenericOptionsParser:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
I assume that Hadoop uses GenericOptionsParser because I can't find a reference to it anywhere in my project. :)
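For what it's worth, the usual way GenericOptionsParser comes into play is through ToolRunner: if your driver implements Tool and you launch it with ToolRunner.run(), the generic options (-files, -libjars, -D, and so on) are parsed and stripped before your run() method sees the remaining arguments. A minimal driver sketch, with hypothetical class names and job wiring:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class GeoIpDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // getConf() already reflects whatever -files/-libjars/-D registered.
            Job job = new Job(getConf(), "geoip-lookup");
            job.setJarByClass(GeoIpDriver.class);
            job.setMapperClass(GeoIpMapper.class);
            // ... input/output paths and formats would be configured here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner applies GenericOptionsParser to args before calling run().
            System.exit(ToolRunner.run(new Configuration(), new GeoIpDriver(), args));
        }
    }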
If you put GeoIPCity.dat on HDFS and specify it with the -files argument, it will be placed in the local cache, which the mapper can then get at in its setup function. It does not have to be in setup, but setup only runs once per mapper, so it is a great place to put it. Then use the -libjars argument to specify geoiplookup.jar (or whatever you named yours) and it will be able to use it. We do not put geoiplookup.jar on HDFS; I am operating on the assumption that Hadoop will distribute the jar as needed.
I hope that all makes sense. I am fairly familiar with Hadoop/MapReduce, but I didn't write the pieces of the project that use the MaxMind GeoIP component, so I had to do some digging to understand it well enough to give the explanation here.
EDIT: Additional description of -files and -libjars:

-files
The files argument is used to distribute files through the Hadoop distributed cache. In the example above, we are distributing the MaxMind GeoIP data file through the distributed cache. We need access to the GeoIP data file to map a user's IP address to the corresponding country, region, city, and time zone. The API requires the data file to be present locally, which is not feasible in a distributed processing environment (we have no guarantee of which nodes in the cluster will process the data). To get the relevant data to the processing node, we use the Hadoop distributed cache infrastructure. GenericOptionsParser and ToolRunner automatically facilitate this via the -files argument. Note that the file we distribute should already be available in HDFS.

-libjars
The -libjars argument is used to distribute any additional dependencies required by the map-reduce job. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will run. GenericOptionsParser and ToolRunner automatically facilitate this via the -libjars argument.
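One related detail (a general property of the distributed cache, not something from our code): files shipped with -files are also symlinked into each task's working directory under their base name, so a setup() method can usually open the file directly instead of scanning getLocalCacheFiles(). A sketch, which you should verify against your Hadoop version before relying on it:

    // Hedged alternative for setup(): the distributed cache creates a symlink
    // named GeoIPCity.dat in the task's current working directory.
    @Override
    protected void setup(Context context) throws IOException {
        File geoDb = new File("GeoIPCity.dat");
        if (geoDb.exists()) {
            m_geoipLookupService = new LookupService(geoDb);
        }
    }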