Correct way to distinguish .xls from .doc file?

I was looking for a way to detect whether a file is an .xls, and I found a solution similar to this one (but not deprecated) in POIFSFileSystem:

@Deprecated
@Removal(version="4.0")
public static boolean hasPOIFSHeader(InputStream inp) throws IOException {
    return FileMagic.valueOf(inp) == FileMagic.OLE2;
}

But this also returns true for Microsoft Word documents, for example .doc files.

Is there a way to detect an .xls document specifically?

3 answers

Both .doc and .xls documents can be stored in the OLE2 storage format. org.apache.poi.poifs.filesystem.FileMagic only identifies the file's storage format, and that alone is not enough to distinguish .doc from .xls files.
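To illustrate why a container-level check cannot tell the two formats apart, here is a minimal JDK-only sketch. The class `Ole2Sniffer` and its method are hypothetical names made up for this example and are not part of POI; the eight bytes are the standard OLE2 compound-file signature, which .doc and .xls files share.

```java
import java.util.Arrays;

// Hypothetical helper (not a POI class): checks only the container signature.
final class Ole2Sniffer {
    // Standard OLE2 compound-file signature: D0 CF 11 E0 A1 B1 1A E1
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // 'header' is assumed to hold at least the first 8 bytes of the file.
    static boolean looksLikeOle2(byte[] header) {
        if (header.length < OLE2_MAGIC.length) {
            return false; // too short to carry the signature
        }
        byte[] prefix = Arrays.copyOf(header, OLE2_MAGIC.length);
        return Arrays.equals(prefix, OLE2_MAGIC);
    }
}
```

A legacy Word file and a legacy Excel file both pass this check, so it can only serve as a first filter before inspecting what the container actually holds.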

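As far as I can tell, the POI library has no direct API for determining the document type (Excel document or otherwise) of a given stream or file.

In my case, the example below checks whether a given stream is a valid Excel file (.xls or .xlsx), with the caveat that it reads the whole input stream and closes it.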

    // Reads the entire content of the given input stream and closes it
    public static boolean isExcelFile(InputStream in) throws IOException {
        try {
            // WorkbookFactory consumes (slurps) the whole input stream
            Workbook workbook = org.apache.poi.ss.usermodel.WorkbookFactory.create(in);
            workbook.close();
            return true;

        } catch (java.lang.IllegalArgumentException | org.apache.poi.openxml4j.exceptions.InvalidFormatException e) {
            return false;
        }
    }

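This approach reads the whole Excel file and relies on an exception to detect the document type.

Following gagravarr's suggestion, I switched to Apache Tika, which can detect the file type: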

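import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;

public class TikaBasedFileTypeDetector {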
    private Tika tika;
    private TemporaryResources temporaryResources;

    public void init() {
        this.tika = new Tika();
        this.temporaryResources = new TemporaryResources();
    }

    // clean up all the temporary resources
    public void destroy() throws IOException {
        temporaryResources.close();
    }

    // return content mime type
    public String detectType(InputStream in) throws IOException {
        TikaInputStream tikaInputStream = TikaInputStream.get(in, temporaryResources);

        return tika.detect(tikaInputStream);
    }

    public boolean isExcelFile(InputStream in) throws IOException{
        // see /questions/12204/what-is-a-correct-mime-type-for-docx-pptx-etc/85600#85600 for information on mimetypes
        String type = detectType(in);
        return type.startsWith("application/vnd.ms-excel") || // legacy Microsoft Excel (.xls)
                type.startsWith("application/vnd.openxmlformats-officedocument.spreadsheetml"); // Office Open XML spreadsheet (.xlsx)
    }
}

See the answer linked in the code comment for more information on MIME types.


You can use Apache POI's HSSF.
It is the implementation for xls files (for xlsx there is XSSF).
...

InputStream ExcelFileToRead = new FileInputStream("FileNameWithLink.xls");
HSSFWorkbook wb = new HSSFWorkbook(ExcelFileToRead);
HSSFSheet sheet = wb.getSheetAt(0);

... if the workbook loads, the file is xls; if the constructor throws, it is not.
XSSF is for .xlsx files, HSSF is for .xls files.


You can use docx4j. OpcPackage.load() detects the format for you, as its Javadoc describes.

OpcPackage.load()

 * Convenience method to create a WordprocessingMLPackage
 * or PresentationMLPackage
 * from an inputstream (.docx/.docxm, .ppxtx or Flat OPC .xml).
 * It detects the convenient format inspecting two first bytes of stream (magic bytes). 
 * For office 2007 'x' formats, these two bytes are 'PK' (same as zip file)  

load() returns an OpcPackage, the base class that GloxPackage, PresentationMLPackage, SpreadsheetMLPackage and WordprocessingMLPackage extend. So this should work for Word, Excel and PowerPoint documents.
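The 'PK' check mentioned in the Javadoc can be sketched with the JDK alone. This is not how docx4j does it internally; the class `OoxmlSniffer` is a hypothetical name, and the part names it looks for are the conventional locations of the main part, which a package is not strictly required to use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of docx4j or POI.
final class OoxmlSniffer {
    // Returns "xlsx", "docx", "pptx" or "unknown", based on conventional part names.
    static String guessKind(byte[] data) {
        // Office 2007+ 'x' formats are zip packages, so they start with 'P' 'K'.
        if (data.length < 2 || data[0] != 'P' || data[1] != 'K') {
            return "unknown";
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                if (name.equals("xl/workbook.xml")) return "xlsx";
                if (name.equals("word/document.xml")) return "docx";
                if (name.equals("ppt/presentation.xml")) return "pptx";
            }
        } catch (IOException ex) {
            return "unknown"; // corrupt or truncated archive
        }
        return "unknown"; // a zip file, but no known main part found
    }
}
```

Note that this only covers the zip-based formats; a legacy .xls file has no 'PK' header and comes back as "unknown" here.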

Basic check

public final String XLSX_FILE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml";
public final String WORD_FILE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
public final String UNKNOWN_FILE = "UNKNOWN";



public boolean isFileXLSX(String fileLocation) {
    return getContentTypeFromFile(fileLocation).equals(XLSX_FILE);
}


public String getContentTypeFromFile(String fileLocation) {
    try {
        return OpcPackage.load(new File(fileLocation)).getContentType();
    } catch (Docx4JException e) {
        return UNKNOWN_FILE;
    }
}

You should see values like

application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml

Source: https://habr.com/ru/post/1686188/

