Get the number of pages in a PDF document

Question

This question is for reference and comparison. The solution is the accepted answer below.

I spent many hours searching for a quick, easy, and above all accurate way to get the number of pages in a PDF document. I work for a graphics and printing company that handles a lot of PDF files, and the number of pages in a document must be known before it is processed. The PDF documents come from many different clients, so they are not created with the same application and do not all use the same compression method.

Here are some of the answers I found insufficient or simply NOT working:

Using Imagick (PHP extension)

Imagick takes a lot of effort to install, Apache has to be restarted, and when I finally got it working, processing took an amazingly long time (2-3 minutes per document) and it always returned 1 page for every document (I have yet to see a working copy of Imagick), so I threw it away. That was with both the getNumberImages() and identifyImage() methods.

Using FPDI (PHP library)

FPDI is easy to use and install (just extract the files and call a PHP script), BUT many compression techniques are not supported by FPDI. In that case it returns an error:

    FPDF error: This document (test_1.pdf) probably uses a compression technique which is not supported by the free parser shipped with FPDI.

Opening a stream and searching with a regular expression:

This opens the PDF file as a stream and searches the raw content for a string that contains the page count or something similar.

    $f = "test1.pdf";
    $stream = fopen($f, "rb");
    if (!$stream) return 0;
    $content = fread($stream, filesize($f));
    fclose($stream);
    if (!$content) return 0;

    $count = 0;
    // Regular expressions found by Googling (all linked to SO answers):
    $regex  = "/\/Count\s+(\d+)/";
    $regex2 = "/\/Page\W*(\d+)/";
    $regex3 = "/\/N\s+(\d+)/";

    if (preg_match_all($regex, $content, $matches))
        // $matches[1] holds the captured numbers; take the largest one
        $count = max(array_map('intval', $matches[1]));

    return $count;

/\/Count\s+(\d+)/ (looks for /Count <number>) doesn't work because only a few documents contain a readable /Count entry, so most of the time it returns nothing.
/\/Page\W*(\d+)/ (looks for /Page<number>) doesn't return the page count; it mostly matches other data.
/\/N\s+(\d+)/ (looks for /N <number>) doesn't work either, because a document can contain several /N values, and most, if not all, of them are not the page count.
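The two failure modes described for the /Count pattern can be reproduced without a real PDF. The following is a minimal Python sketch, not taken from any of the answers; the byte strings are made-up fragments, and `count_candidates` is a hypothetical helper name. It shows that a nested page tree yields several /Count values, and that a file whose page tree is stored inside a compressed stream yields none.

```python
import re

# Same idea as the PHP pattern /\/Count\s+(\d+)/, applied to raw bytes
COUNT_RE = re.compile(rb"/Count\s+(\d+)")


def count_candidates(pdf_bytes: bytes) -> list:
    """Return every number that follows a /Count key in the raw bytes."""
    return [int(number) for number in COUNT_RE.findall(pdf_bytes)]


# Made-up fragment: a root page tree node with 5 pages split over two child nodes.
# Every node carries its own /Count, so the regex finds three candidates.
nested_tree = (
    b"1 0 obj << /Type /Pages /Count 5 /Kids [2 0 R 3 0 R] >> endobj\n"
    b"2 0 obj << /Type /Pages /Count 3 /Kids [4 0 R 5 0 R 6 0 R] >> endobj\n"
    b"3 0 obj << /Type /Pages /Count 2 /Kids [7 0 R 8 0 R] >> endobj\n"
)

# Made-up fragment: the page tree lives inside a compressed stream,
# so no readable /Count key appears anywhere in the raw bytes.
compressed = (
    b"9 0 obj << /Type /ObjStm /N 4 /First 30 /Filter /FlateDecode >>\n"
    b"stream\nx\x9c\x03\x00\nendstream endobj\n"
)

print(count_candidates(nested_tree))  # [5, 3, 2]
print(count_candidates(compressed))   # []
```

Taking the maximum works for the first fragment, but nothing in the raw bytes guarantees that the largest /Count belongs to the page tree, which is why a real PDF parser is the safer route.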

So, what works reliably and accurately?
See the answer below.

+58

php pdf

Richard de Wit Feb 01 '13 at 10:33

10 answers

The easiest way is to use ImageMagick.

Here is some sample code:

    $image = new Imagick();
    $image->pingImage('myPdfFile.pdf');
    echo $image->getNumberImages();

Otherwise, you can also use a PDF library for PHP such as mPDF or TCPDF.

+18

Kuldeep Dangi Dec 30 '15 at 15:29

If you cannot install additional packages, you can use this simple one-liner:

    foundPages=$(strings < "$PDF_FILE" | sed -n 's|.*Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' | sort -rn | head -n 1)

+1

Muad'Dib Sep 25 '14 at 5:10

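This seems to work very well, with no extra PHP packages and no need to parse the command output (identify is ImageMagick's command-line tool, so ImageMagick itself must be installed).

    <?php
    $target_pdf = "multi-page-test.pdf";
    // escapeshellarg() keeps file names with spaces or shell metacharacters safe
    $cmd = sprintf("identify %s", escapeshellarg($target_pdf));
    exec($cmd, $output);
    // identify prints one line per page
    $pages = count($output);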

+1

dhildreth Jun 01 '17 at 21:40

If you have access to a shell, the simplest approach (though it does not work on 100% of PDF files) is to use grep.

This should only return the number of pages:

 grep -m 1 -aoP '(?<=\/N )\d+(?=\/)' file.pdf

Example: https://regex101.com/r/BrUTKn/1

Description of switches:

-m 1 is necessary because some files contain more than one match for the pattern (volunteers are welcome to replace it with a regex that matches only the page count)
-a is needed to treat the binary file as text
-o shows only the matched part
-P uses Perl-compatible regular expressions

Regex explanation:

starting "delimiter": (?<=\/N ) is a lookbehind for /N followed by a space (the trailing space is easy to miss here)
actual result: \d+ is one or more digits
ending "delimiter": (?=\/) is a lookahead for /

Nota bene: if in some cases no match is found, it is safe to assume that there is only one page.
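For readers without GNU grep, the same pattern can be tried in Python. This is only a sketch under assumptions: the byte strings are made-up fragments, and `first_n_value` is a hypothetical helper name. The pattern hits the /N entry that a linearized ("fast web view") PDF stores in its linearization dictionary near the start of the file. Object streams also carry an /N key with a different meaning, and a non-linearized file may have no match at all.

```python
import re
from typing import Optional

# Same pattern as the grep call: digits preceded by "/N " and followed by "/"
N_RE = re.compile(rb"(?<=/N )\d+(?=/)")


def first_n_value(pdf_bytes: bytes) -> Optional[int]:
    """Return the first /N value found in the raw bytes, or None if nothing matches."""
    match = N_RE.search(pdf_bytes)
    return int(match.group()) if match else None


# Made-up header of a linearized file: /N holds the page count here
linearized = (
    b"%PDF-1.6\n1 0 obj\n"
    b"<</Linearized 1/L 17569259/O 3/E 4096/N 13/T 17569000/H [ 500 200 ]>>\n"
    b"endobj\n"
)

# Made-up header of a non-linearized file: there is no /N entry to find
plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(first_n_value(linearized))  # 13
print(first_n_value(plain))       # None
```

Returning None instead of silently assuming one page lets the caller decide how to handle files the pattern cannot read.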

+1

Saran Jun 21 '17 at 15:57

Since you can use command line utilities, you can use cpdf (Microsoft Windows / Linux / Mac OS X). To get the number of pages in a single PDF:

 cpdf.exe -pages "my file.pdf"

+1

Franck Dernoncourt May 19 '19 at 2:06

You can use qpdf as shown below. If file_name.pdf contains 100 pages:

    $ qpdf --show-npages file_name.pdf
    100

+1

SuperNova Aug 19 '19 at 19:26

Here is an R function that reports the number of pages in a PDF file using the pdfinfo command.

    pdf.file.page.number <- function(fname) {
        # shQuote() protects file names that contain spaces
        a <- pipe(paste("pdfinfo", shQuote(fname), "| grep Pages | cut -d: -f2"))
        page.number <- as.numeric(readLines(a))
        close(a)
        page.number
    }
    if (FALSE) {
        pdf.file.page.number("a.pdf")
    }

0

Feiming Chen Aug 13 '15 at 19:41

Here is a Windows command script that uses Ghostscript to report the number of pages in a PDF file.

    @echo off
    echo.
    rem
    rem this file: getlastpagenumber.cmd
    rem version 0.1 from commander 2015-11-03
    rem need Ghostscript, e.g. download and install from http://www.ghostscript.com/download/
    rem Install path "C:\prg\ghostscript" for using the script without changes \\ and have less problems with UAC
    rem

    :vars
    set __gs__="C:\prg\ghostscript\bin\gswin64c.exe"
    set __lastpagenumber__=1
    set __pdffile__="%~1"
    set __pdffilename__="%~n1"
    set __datetime__=%date%%time%
    set __datetime__=%__datetime__:.=%
    set __datetime__=%__datetime__::=%
    set __datetime__=%__datetime__:,=%
    set __datetime__=%__datetime__:/=%
    set __datetime__=%__datetime__: =%
    set __tmpfile__="%tmp%\%~n0_%__datetime__%.tmp"

    :check
    if %__pdffile__%=="" goto error1
    if not exist %__pdffile__% goto error2
    if not exist %__gs__% goto error3

    :main
    %__gs__% -dBATCH -dFirstPage=9999999 -dQUIET -dNODISPLAY -dNOPAUSE -sstdout=%__tmpfile__% %__pdffile__%
    rem read the Ghostscript output captured in the temp file (the original read a leftover test.txt)
    FOR /F " tokens=2,3* usebackq delims=:" %%A IN (`findstr /i "number" %__tmpfile__%`) DO set __lastpagenumber__=%%A
    set __lastpagenumber__=%__lastpagenumber__: =%
    if exist %__tmpfile__% del %__tmpfile__%

    :output
    echo The PDF-File: %__pdffilename__% contains %__lastpagenumber__% pages
    goto end

    :error1
    echo no pdf file selected
    echo usage: %~n0 PDFFILE
    goto end

    :error2
    echo no pdf file found
    echo usage: %~n0 PDFFILE
    goto end

    :error3
    echo.can not find the ghostscript bin file
    echo. %__gs__%
    echo.please download it from:
    echo. http://www.ghostscript.com/download/
    echo.and install to "C:\prg\ghostscript"
    goto end

    :end
    exit /b

0

commander Nov 03 '15 at 0:17

The R package pdftools provides the pdf_info() function, which reports the number of pages in a PDF file.

    library(pdftools)
    pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
    info <- pdf_info(pdf_file)
    nbpages <- info[2]
    nbpages

    $pages
    [1] 65

0

emeryville Jan 18 '17 at 22:03

Richard de Wit · Accepted Answer · 2013-02-01 10:33

Simple command line executable: pdfinfo.

It is available for Linux and Windows. You download a compressed file containing several small PDF-related programs. Extract it somewhere.

One of these files is pdfinfo (or pdfinfo.exe for Windows). Example output from running it on a PDF document:

    Title:          test1.pdf
    Author:         John Smith
    Creator:        PScript5.dll Version 5.2.2
    Producer:       Acrobat Distiller 9.2.0 (Windows)
    CreationDate:   01/09/13 19:46:57
    ModDate:        01/09/13 19:46:57
    Tagged:         yes
    Form:           none
    Pages:          13    <-- This is what we need
    Encrypted:      no
    Page size:      2384 x 3370 pts (A0)
    File size:      17569259 bytes
    Optimized:      yes
    PDF version:    1.6

I have not yet seen a PDF document for which it returned a wrong page count. It is also very fast: even with large documents of 200+ MB, the response time is a few seconds or less.

There is an easy way to extract the page count from the output, here in PHP:

    // Make a function for convenience
    function getPDFPages($document)
    {
        // Keep the line that matches your platform
        $cmd = "/path/to/pdfinfo";              // Linux
        // $cmd = "C:\\path\\to\\pdfinfo.exe";  // Windows

        // Parse entire output
        // Surround with double quotes if file name has spaces
        exec("$cmd \"$document\"", $output);

        // Iterate through lines
        $pagecount = 0;
        foreach ($output as $op) {
            // Extract the number
            if (preg_match("/Pages:\s*(\d+)/i", $op, $matches) === 1) {
                $pagecount = intval($matches[1]);
                break;
            }
        }

        return $pagecount;
    }

    // Use the function
    echo getPDFPages("test 1.pdf");  // Output: 13

Of course, this command line tool can be used in other languages that can parse output from an external program, but I use it in PHP.
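As an illustration of calling pdfinfo from another language, here is a minimal Python sketch of the same idea. It is not part of the original answer: the function names are hypothetical, and it assumes a `pdfinfo` executable is on the PATH (pass a full path otherwise). Parsing is kept separate from the process call so the parsing can be checked against sample output without pdfinfo installed.

```python
import re
import subprocess

# Matches the "Pages:" line of pdfinfo's output, but not "Page size:"
PAGES_RE = re.compile(r"^Pages:\s*(\d+)", re.MULTILINE)


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from pdfinfo's text output; 0 if it is missing."""
    match = PAGES_RE.search(pdfinfo_output)
    return int(match.group(1)) if match else 0


def get_pdf_pages(document: str, pdfinfo: str = "pdfinfo") -> int:
    """Run pdfinfo on a document and return its page count (0 on failure)."""
    # Passing the arguments as a list avoids shell quoting problems
    # with file names that contain spaces.
    result = subprocess.run(
        [pdfinfo, document],
        capture_output=True,
        text=True,
        errors="replace",  # metadata fields are not always valid UTF-8
        check=False,
    )
    return parse_page_count(result.stdout)


# Sample output in the format shown in the answer
sample = (
    "Title:          test1.pdf\n"
    "Pages:          13\n"
    "Page size:      2384 x 3370 pts (A0)\n"
)
print(parse_page_count(sample))  # 13
```

With pdfinfo installed, `get_pdf_pages("test 1.pdf")` would run the tool and return the number from its Pages line.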

I know it's not pure PHP, but external programs are much better suited to PDF processing (as the question shows).

I hope this helps people. I spent a lot of time looking for a solution and came across many questions about PDF page counts that did not have the answer I was looking for, which is why I posted this question and answered it myself.
