
extractTextFromPDF()

Overview

Parses a .pdf file from disk and returns all its readable text as a plain string. Each page's text is extracted in order and separated by a newline. Use this as the first step before scanning the text for patterns such as PAN numbers, GST numbers, or any other structured data.

Internally uses the smalot/pdfparser library.

!!! note "Requirement"
    Run `composer require smalot/pdfparser` before using this function.


Function Signature

function extractTextFromPDF(string $filePath): string
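The behaviour described in the overview could be implemented roughly as follows. This is a minimal sketch, not the project's actual source: the name `extractTextFromPDFSketch` and the exception message are assumptions made for illustration. It uses the `Parser::parseFile()`, `Document::getPages()` and `Page::getText()` calls from smalot/pdfparser.

```php
<?php
// Hypothetical sketch, not the project's actual implementation.
// Assumes smalot/pdfparser is installed and Composer's autoloader is loaded.

use Smalot\PdfParser\Parser;

function extractTextFromPDFSketch(string $filePath): string
{
    // Fail early with the documented exception type when the file is missing.
    if (!is_file($filePath)) {
        throw new \RuntimeException("PDF file not found: {$filePath}");
    }

    $parser = new Parser();
    $pdf    = $parser->parseFile($filePath); // may throw \Exception on unreadable input

    // Collect each page's text in order, then join the pages with a newline.
    $pageTexts = [];
    foreach ($pdf->getPages() as $page) {
        $pageTexts[] = $page->getText();
    }

    return implode("\n", $pageTexts);
}
```

Checking for the file before constructing the parser keeps the "file not found" case separate from errors raised inside the parser.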

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `$filePath` | `string` | Yes | Absolute path to the .pdf file on disk. Relative paths are not recommended; always use `__DIR__` or a configured upload path. |

Return Value

| Type | Description |
|---|---|
| `string` | The full extracted text from all pages, concatenated with `\n` between pages. Returns an empty string if the PDF has no extractable text layer (e.g. scanned image-only PDFs). |

Exceptions

| Exception | When thrown |
|---|---|
| `\RuntimeException` | File does not exist at the given `$filePath`. |
| `\Exception` (smalot internal) | File is corrupted, password-protected, or unreadable by the parser. |
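One way to handle the two exception cases separately is sketched below. `tryExtractText()` is a hypothetical helper, not part of this codebase; it takes the extractor as a callable so the example runs on its own, and returning `null` on failure is an assumption about what the caller wants.

```php
<?php
// Sketch only: $extractor stands in for extractTextFromPDF so the
// example is self-contained; tryExtractText() is a hypothetical helper.

function tryExtractText(callable $extractor, string $filePath): ?string
{
    try {
        return $extractor($filePath);
    } catch (\RuntimeException $e) {
        // Documented case: no file exists at $filePath.
        error_log('PDF missing: ' . $e->getMessage());
    } catch (\Exception $e) {
        // Parser failure: corrupted, password-protected, or unreadable file.
        error_log('PDF unreadable: ' . $e->getMessage());
    }

    return null; // signals to the caller that no text was extracted
}

// Usage, assuming extractTextFromPDF() is loaded:
// $text = tryExtractText('extractTextFromPDF', '/var/www/uploads/customer_kyc.pdf');
```

The `\RuntimeException` branch must come first, because `\RuntimeException` extends `\Exception` and PHP uses the first matching `catch`.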

Example

<?php
require_once '/var/www/html/vendor/autoload.php';

$text = extractTextFromPDF('/var/www/uploads/customer_kyc.pdf');

echo $text;
// Output: full text content of the PDF, page by page

Usage with PAN Scanner

$text   = extractTextFromPDF('/var/www/uploads/invoice.pdf');
$pans   = extractPANsWithDetails($text);

foreach ($pans as $item) {
    echo $item['pan'] . ' — ' . $item['entity_type'] . PHP_EOL;
}

Notes & Caveats

!!! warning "Scanned / Image PDFs"
    If the PDF was created by scanning a physical document (no text layer), extractTextFromPDF() returns an empty string. Those files need OCR instead, for example with thiagoalessio/tesseract_ocr after converting each page to an image. To check for a text layer before processing, run `pdffonts yourfile.pdf` on the command line: a PDF that lists no fonts usually has no text layer.
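A caller can detect the scanned-PDF case from the returned text and route the file to OCR. The helper below is a hypothetical sketch; it treats whitespace-only output as empty, on the assumption that a result could contain nothing but the newlines that separate pages.

```php
<?php
// Hypothetical helper: decides whether extracted text is effectively empty.

function looksLikeScannedPdf(string $extractedText): bool
{
    // Whitespace-only output (e.g. just page-separator newlines) counts as empty.
    return trim($extractedText) === '';
}

// Usage, assuming extractTextFromPDF() is loaded:
// $text = extractTextFromPDF('/var/www/uploads/customer_kyc.pdf');
// if (looksLikeScannedPdf($text)) {
//     // hand the file to an OCR pipeline instead of scanning $text
// }
```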

!!! tip "Large PDFs"
    For very large PDFs (100+ pages), memory usage can spike. Consider raising `memory_limit` in php.ini or processing page-by-page using `$pdf->getPages()` directly.
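Page-by-page processing might look like the generator below. This is a sketch under stated assumptions: `iteratePdfPageTexts()` is a hypothetical name, and smalot/pdfparser must be installed and autoloaded. The parser still loads the whole document in `parseFile()`, so this only avoids holding the concatenated text of every page at once.

```php
<?php
// Sketch, assuming smalot/pdfparser is installed and autoloaded.
// iteratePdfPageTexts() is a hypothetical name, not part of this codebase.

use Smalot\PdfParser\Parser;

function iteratePdfPageTexts(string $filePath): \Generator
{
    $parser = new Parser();
    $pdf    = $parser->parseFile($filePath);

    $pageNumber = 0;
    foreach ($pdf->getPages() as $page) {
        $pageNumber++;
        // Yield one page at a time so the caller never holds the full text.
        yield $pageNumber => $page->getText();
    }
}

// Usage: scan each page as it arrives instead of building one large string.
// foreach (iteratePdfPageTexts('/var/www/uploads/invoice.pdf') as $pageNo => $text) {
//     $pans = extractPANsWithDetails($text);
// }
```

A pattern that spans a page break would be missed with this approach, so it suits identifiers that sit on a single line.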


See Also


# Add to mkdocs.yml nav:
- Developer Zone:
  - File Parsing Reference:
    - extractTextFromPDF: dev/extract_text_from_pdf.md