Apache Tika and JSON

I am using Apache Tika to determine a file's type from its content. XML files are detected correctly, but JSON files are not: when the content is JSON, Tika returns "text/plain" instead of "application/json".

Any help?

    public static String tiKaDetectMimeType(final File file) throws IOException {
        TikaInputStream tikaIS = null;
        try {
            tikaIS = TikaInputStream.get(file);
            final Metadata metadata = new Metadata();
            return DETECTOR.detect(tikaIS, metadata).toString();
        } finally {
            if (tikaIS != null) {
                tikaIS.close();
            }
        }
    }
2 answers

JSON is a plain-text format, so it is not at all surprising that Tika reports it as plain text when it is given only the bytes to work with.

Your problem is that you did not also supply the file name, so Tika had nothing else to go on. If you had, Tika could have reasoned bytes=plain text + filename=json => json and given you the answer you expected.

The line you are missing is:

 metadata.set(Metadata.RESOURCE_NAME_KEY, filename); 

So the corrected part of the code would look like this:

    tikaIS = TikaInputStream.get(file);
    final Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
    return DETECTOR.detect(tikaIS, metadata).toString();

With that change, you will get "application/json" back, as you expected.
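To make the "bytes=plain text + filename=json => json" reasoning concrete, here is a minimal standard-library sketch of how a content-based guess can be refined by a file-name hint. It is only an illustration of the idea and not Tika's actual implementation: the class MimeSketch, its methods, and the tiny extension table are all hypothetical names made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is NOT how Tika is implemented.
final class MimeSketch {

    // Assumed, deliberately tiny extension table for the sketch.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "json", "application/json",
            "csv", "text/csv");

    // Content-only guess: bytes with no ASCII control characters
    // (other than tab, LF, CR) are treated as plain text.
    static String detectFromBytes(byte[] content) {
        for (byte b : content) {
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (b >= 0 && b < 0x20 && !whitespace) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    // Name-only guess; null when there is no name or no known extension.
    static String detectFromName(String fileName) {
        if (fileName == null) {
            return null;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(ext);
    }

    // In this sketch the name hint may refine "text/plain",
    // but it never overrides a binary verdict from the bytes.
    static String detect(byte[] content, String fileName) {
        String fromBytes = detectFromBytes(content);
        String fromName = detectFromName(fileName);
        if (fromName != null && fromBytes.equals("text/plain")) {
            return fromName;
        }
        return fromBytes;
    }

    public static void main(String[] args) {
        byte[] json = "{\"key\": \"value\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(json, null));        // text/plain
        System.out.println(detect(json, "data.json")); // application/json
    }
}
```

Without a name, the JSON bytes are indistinguishable from any other text, which matches what the question observed; once the name is supplied, the extension narrows the answer down.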


For those who are not dealing with a file, I found it easiest to run the payload through Jackson and see whether it can be parsed. If Jackson can parse it, you know that 1) you are dealing with JSON and 2) the JSON is valid.

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static boolean isValidJSON(final String json) {
        boolean valid = true;
        try {
            MAPPER.readTree(json);
        } catch (IOException e) {
            valid = false;
        }
        return valid;
    }

Source: https://habr.com/ru/post/956139/

