Select only the first page of each PDF with PyPDF2

I am trying to extract only the first page from several PDF files and merge those pages into one file. (I get 150 PDF files per day; the first page is the invoice I need, and the next 3 to 12 pages are just backup that I don't need.) So the input is 150 PDF files of various sizes, and the output I want is one PDF file containing only the first page of each of the 150 files.

What I seem to have done instead is merge all the pages EXCEPT the first page (which is the only one I need).

import PyPDF2, os

pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key=str.lower)
pdfWriter = PyPDF2.PdfFileWriter()

for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

for pageNum in range(1 , pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)


pdfOutput = open('CombinedFirstPages.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()
1 answer

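PyPDF2 numbers pages from 0, so range(1, pdfReader.numPages) starts at the second page and skips exactly the page you want. As posted, the page loop also sits outside the file loop, so it only ever sees the last file that was opened. Take getPage(0) from each file inside the file loop instead. (Note that PyPDF2 3.x removed these names: PdfFileReader, PdfFileWriter, getPage and addPage became PdfReader, PdfWriter, reader.pages[0] and add_page.)

Try the following: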

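import os
import PyPDF2

# Folder to search; "." is the current working directory.
# (With an empty string, os.walk finds nothing.)
your_target_folder = "."

# Collect the full path of every PDF under the target folder,
# including subfolders.
pdf_files = []
for dirpath, _, filenames in os.walk(your_target_folder):
    for name in filenames:
        if name.lower().endswith(".pdf"):
            pdf_files.append(os.path.abspath(os.path.join(dirpath, name)))

pdf_files.sort(key=str.lower)
pdfWriter = PyPDF2.PdfFileWriter()

# The source files must stay open until the writer has produced its output.
open_files = []
for files_address in pdf_files:
    pdfFileObj = open(files_address, 'rb')
    open_files.append(pdfFileObj)
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Page indices start at 0, so getPage(0) is the first page.
    pdfWriter.addPage(pdfReader.getPage(0))

with open("CombinedFirstPages.pdf", "wb") as output:
    pdfWriter.write(output)

for pdfFileObj in open_files:
    pdfFileObj.close()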
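
To see why the loop in the question drops the invoice page, here is a small sketch that needs no PDF library at all. Each "document" is just a list of page labels (made-up data for illustration), and the two helper functions are hypothetical stand-ins for the question's range(1, numPages) loop and for getPage(0).

```python
# Stand-ins for PDFs: each "document" is a list of page labels (made-up data).
documents = {
    "a.pdf": ["a-invoice", "a-backup-1", "a-backup-2"],
    "b.pdf": ["b-invoice", "b-backup-1"],
}


def pages_from_index_one(pages):
    """Mimic the question's loop: range(1, numPages) never visits index 0."""
    return [pages[i] for i in range(1, len(pages))]


def first_page_only(pages):
    """Mimic getPage(0): keep only the page at index 0."""
    return [pages[0]]


skipped = []  # what the question's approach collects
kept = []     # what the answer's approach collects
for name in sorted(documents, key=str.lower):
    skipped.extend(pages_from_index_one(documents[name]))
    kept.extend(first_page_only(documents[name]))

print(skipped)  # ['a-backup-1', 'a-backup-2', 'b-backup-1']
print(kept)     # ['a-invoice', 'b-invoice']
```

Starting the range at 1 keeps everything except the invoice; asking for index 0 once per file keeps only the invoice.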

Good luck.
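
One caveat if you run this every day from the same folder: CombinedFirstPages.pdf is itself a PDF, so a second run may pick it up as an input. A possible way to collect the input files while skipping the output is sketched below. The helper name list_input_pdfs and its output_name parameter are my own for illustration, and unlike os.walk it only looks at files directly inside the folder, as os.listdir does in the question.

```python
from pathlib import Path


def list_input_pdfs(folder, output_name="CombinedFirstPages.pdf"):
    """Return PDFs directly inside `folder`, sorted case-insensitively,
    leaving out the combined output file (hypothetical helper)."""
    folder = Path(folder)
    pdfs = [
        p for p in folder.iterdir()
        if p.is_file()
        and p.suffix.lower() == ".pdf"
        and p.name.lower() != output_name.lower()
    ]
    return sorted(pdfs, key=lambda p: p.name.lower())


if __name__ == "__main__":
    # List the candidate input files in the current directory.
    for path in list_input_pdfs("."):
        print(path.name)
```

The resulting paths can be passed to open(path, 'rb') in the same way as the strings in pdf_files.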


Source: https://habr.com/ru/post/1688770/

