Convert ppt file to pptx in Python

Is there a way to convert .ppt files to .pptx files in Python?

Purpose: I need to extract text from a table (with column names like name, address, contact number, email address, etc.) from .ppt files. For this, I took this approach:

I converted the .ppt file to PDF and then extracted the data from the PDF using PDFMiner. The text extracted from the PDF is not separated by any delimiter, so it is very difficult to distinguish names from the other fields in the table.

The likely solution I'm working on is:

  • Convert .ppt files to .pptx
  • Parse the xml of the .pptx file to get formatted text
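For illustration, the second step might be sketched as follows. A .pptx file is a ZIP archive whose slides are stored as XML parts under ppt/slides/, and tables appear there as DrawingML a:tbl elements made of a:tr rows and a:tc cells. The function names below are hypothetical, and the sketch assumes simple tables without merged cells; it uses only the standard library.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for tables and text runs inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def table_rows_from_slide_xml(xml_bytes):
    """Return every table on one slide as a list of rows of cell strings."""
    root = ET.fromstring(xml_bytes)
    tables = []
    for tbl in root.iter(A_NS + "tbl"):
        rows = []
        for tr in tbl.findall(A_NS + "tr"):
            row = []
            for tc in tr.findall(A_NS + "tc"):
                # A cell may hold several paragraphs, each split into runs.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(A_NS + "t"))
                    for p in tc.iter(A_NS + "p")
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables


def tables_from_pptx(path):
    """Collect the tables of all slides in a .pptx file, in slide-number order."""
    with zipfile.ZipFile(path) as archive:
        names = sorted(
            (n for n in archive.namelist() if SLIDE_RE.match(n)),
            key=lambda n: int(SLIDE_RE.match(n).group(1)),
        )
        tables = []
        for name in names:
            tables.extend(table_rows_from_slide_xml(archive.read(name)))
    return tables
```

Each table comes back as a list of rows, so the first row can be treated as the column names (name, address, and so on) and the remaining rows as records.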

I am stuck on the first step: I could not find a way to convert the .ppt file format to .pptx in Python.

1 answer

For macOS Homebrew users: install Apache Tika (brew install tika).

The command line interface works as follows:

tika --text something.ppt > something.txt

To use it inside a Python script:

import os
os.system("tika --text temp.ppt > temp.txt")
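As a possible variation on the os.system call above, the same command can be run through subprocess, which returns the extracted text directly instead of going through a temporary file and raises an error if the command fails. This is only a sketch: extract_text is a hypothetical helper name, and it assumes the tika command is on the PATH as in the Homebrew setup.

```python
import subprocess


def extract_text(path, command=("tika", "--text")):
    """Run a text-extraction command on `path` and return its standard output.

    `command` defaults to the Tika call shown above; it is a parameter
    only so that another extractor can be swapped in.
    """
    result = subprocess.run(
        [*command, path],      # argument list, so no shell quoting is needed
        capture_output=True,   # collect stdout and stderr instead of printing
        text=True,             # decode the output to str
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling extract_text("temp.ppt") would then give the same text that the shell redirection writes to temp.txt.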

This works, and it is the only solution I have found so far. Note that it extracts the text directly from the .ppt file rather than converting it to .pptx.
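If an actual .pptx file is still needed, one option that is not part of the approach above is LibreOffice's headless converter. The sketch below assumes LibreOffice is installed and that its soffice executable is on the PATH; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir, soffice="soffice"):
    """Build the LibreOffice command line that converts `src` to .pptx."""
    return [
        soffice,
        "--headless",              # run without opening a window
        "--convert-to", "pptx",    # target format
        "--outdir", str(outdir),   # directory for the converted file
        str(src),
    ]


def convert_ppt_to_pptx(src, outdir, soffice="soffice"):
    """Convert one .ppt file and return the expected path of the result."""
    subprocess.run(build_convert_command(src, outdir, soffice), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(outdir) / (Path(src).stem + ".pptx")
```

On macOS the executable usually lives inside the application bundle rather than on the PATH, so the soffice argument may need to be given as a full path.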


Source: https://habr.com/ru/post/1683648/

