How to reliably determine file types?

Purpose: given a file, determine whether it is of a given type (XML, JSON, properties, etc.).

Consider the XML case. Until we ran into this problem, the following sample approach worked fine:

    try {
        saxReader.read(f);
    } catch (DocumentException e) {
        logger.warn(" - File is not XML: " + e.getMessage());
        return false;
    }
    return true;

As expected, when the XML is well-formed, the test passes and the method returns true. If something bad happens and the file cannot be parsed, false is returned.

This breaks down, however, when we are dealing with a malformed XML file (which is still an XML file).
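To make the failure mode concrete, here is a minimal sketch that reproduces it with the JDK's built-in DOM parser. This is an assumption made for illustration: the question itself uses dom4j's SAXReader, and the class and method names below are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: stands in for the SAXReader-based check in the question.
final class ParseBasedXmlCheck {

    // Returns true only if the content parses as well-formed XML.
    static boolean parsesAsXml(String content) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                    content.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Malformed XML and non-XML content both end up here.
            return false;
        }
    }
}
```

With this check, a malformed document such as `<a><b></a>` and a properties line such as `key=value` both yield false, so the result alone cannot tell "broken XML" apart from "not XML at all".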

I would rather not rely on the .xml extension (it does not always work), on looking for the string <?xml version="1.0" encoding="UTF-8"?> inside the file, etc.

Is there any other way to handle this?

What would you look for inside the file to suspect that it might be XML even though a DocumentException was caught? This is needed for parsing purposes.
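One possible heuristic, sketched here under stated assumptions rather than taken from the thread, is to look only at the start of the file: skip an optional byte order mark and leading whitespace, then check whether the first character is `<`. The class name, method names, and the 512-byte window are hypothetical, the content is assumed to be UTF-8, and `readNBytes(int)` requires Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: a cheap "might be XML" test that does not parse the file.
final class XmlSniffer {

    // Assumed window size; only the start of the file is inspected.
    private static final int HEAD_SIZE = 512;

    // Reads the first bytes of the file and applies the string heuristic.
    static boolean looksLikeXml(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(HEAD_SIZE); // Java 11+
            return looksLikeXml(new String(head, StandardCharsets.UTF_8));
        }
    }

    // True if the first character after an optional BOM and whitespace is '<'.
    static boolean looksLikeXml(String head) {
        int i = 0;
        if (!head.isEmpty() && head.charAt(0) == '\uFEFF') {
            i = 1; // skip the byte order mark
        }
        while (i < head.length() && Character.isWhitespace(head.charAt(i))) {
            i++;
        }
        return i < head.length() && head.charAt(i) == '<';
    }
}
```

This is deliberately loose: HTML and other angle-bracket formats also pass, and UTF-16 files are not handled. It only serves as a second opinion once the parser has already rejected the file.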

3 answers

Apache Tika gives me the fewest problems and, unlike Java 7's Files.probeContentType, is not platform-specific.

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    File inputFile = ...
    // detect(File) can throw IOException
    String type = new Tika().detect(inputFile);
    System.out.println(type);

For an XML file I got "application/xml".

For a properties file I got "text/plain".

However, you can pass your own Detector to new Tika().

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.xx</version>
    </dependency>

For those who do not need very precise detection, there is the Java 7 Files.probeContentType method mentioned by rjdkolb:

    Path filePath = Paths.get("/path/to/your/file.jpg");
    String contentType = Files.probeContentType(filePath);
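Note that Files.probeContentType returns null when the type cannot be determined, and the result depends on the file type detectors installed on the platform. A minimal sketch of a null-safe wrapper follows; the class name, method name, and fallback value are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper around Files.probeContentType with a fallback value.
final class ProbeExample {

    // Returns the probed content type, or the fallback if none could be determined.
    static String contentTypeOrDefault(Path file, String fallback) throws IOException {
        String type = Files.probeContentType(file); // may be null
        return type != null ? type : fallback;
    }
}
```

A call such as `ProbeExample.contentTypeOrDefault(filePath, "application/octet-stream")` then always yields a usable string, although the exact value may differ between operating systems.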

Source: https://habr.com/ru/post/910905/

