extractTextFromPDF()
Overview
Parses a .pdf file from disk and returns all its readable text as a plain string. Each page's text is extracted in order and separated by a newline. Use this as the first step before scanning the text for patterns such as PAN numbers, GST numbers, or any other structured data.
Internally uses the smalot/pdfparser library.
!!! note "Requirement"
    Run `composer require smalot/pdfparser` before using this function.
Function Signature
function extractTextFromPDF(string $filePath): string
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `$filePath` | `string` | Yes | Absolute path to the `.pdf` file on disk. Relative paths are not recommended; build the path from `__DIR__` or a configured upload directory. |
Return Value
| Type | Description |
|---|---|
| `string` | The full extracted text from all pages, with `\n` between pages. Returns an empty string if the PDF has no extractable text layer (e.g. scanned image-only PDFs). |
Exceptions
| Exception | When thrown |
|---|---|
| `\RuntimeException` | The file does not exist at the given `$filePath`. |
| `\Exception` (smalot internal) | The file is corrupted, password-protected, or otherwise unreadable by the parser. |
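
The behaviour described above can be approximated with a few calls to smalot/pdfparser. The following is a minimal sketch, not the actual implementation: the name `extractTextFromPdfSketch` and the sample path are hypothetical, and it assumes Composer's autoloader has already been loaded. It also shows one way to handle the two documented exception types separately.

```php
<?php
// Assumption: vendor/autoload.php has already been required by the caller.
use Smalot\PdfParser\Parser;

/**
 * Hypothetical sketch of the documented behaviour (not the real function).
 */
function extractTextFromPdfSketch(string $filePath): string
{
    // Fail early with the documented exception type when the file is missing.
    if (!is_file($filePath)) {
        throw new \RuntimeException("PDF not found: {$filePath}");
    }

    // The parser raises \Exception for files it cannot read
    // (for example corrupted or password-protected PDFs).
    $pdf = (new Parser())->parseFile($filePath);

    // Collect each page's text in order, then join pages with a newline.
    $pageTexts = [];
    foreach ($pdf->getPages() as $page) {
        $pageTexts[] = $page->getText();
    }

    return implode("\n", $pageTexts);
}

// Usage: the sample path below is hypothetical.
try {
    echo extractTextFromPdfSketch(__DIR__ . '/uploads/sample.pdf');
} catch (\RuntimeException $e) {
    // Missing file: the path is wrong or the upload never arrived.
    error_log('Missing PDF: ' . $e->getMessage());
} catch (\Exception $e) {
    // Anything the parser itself could not handle.
    error_log('Unreadable PDF: ' . $e->getMessage());
}
```

Catching `\RuntimeException` before `\Exception` matters because PHP uses the first matching `catch` block, and `\RuntimeException` is a subclass of `\Exception`.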
Example
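<?php
// Load Composer's autoloader so smalot/pdfparser is available.
require_once '/var/www/html/vendor/autoload.php';
// extractTextFromPDF() must already be defined or included at this point.
$text = extractTextFromPDF('/var/www/uploads/customer_kyc.pdf');
echo $text;
// Output: full text content of the PDF, page by page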
Usage with PAN Scanner
$text = extractTextFromPDF('/var/www/uploads/invoice.pdf');
$pans = extractPANsWithDetails($text);
foreach ($pans as $item) {
echo $item['pan'] . ' — ' . $item['entity_type'] . PHP_EOL;
}
Notes & Caveats
!!! warning "Scanned / Image PDFs"
    If the PDF was created by scanning a physical document and has no text layer, `extractTextFromPDF()` returns an empty string. Those files need OCR, for example with `thiagoalessio/tesseract_ocr`. To check for a text layer before processing, run `pdffonts yourfile.pdf` on the command line: a PDF that lists no fonts almost certainly has no text layer.
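
The empty-result case can also be detected in code after extraction. The helper below is a hypothetical sketch (the name `looksLikeScannedPdf` and the `$minChars` threshold are assumptions, not part of this API). It treats text that contains only whitespace, such as the newlines inserted between pages, as a sign that the file should be routed to OCR.

```php
<?php
/**
 * Hypothetical helper: true when the extracted text has fewer than
 * $minChars non-whitespace bytes, which suggests a scanned PDF.
 */
function looksLikeScannedPdf(string $extractedText, int $minChars = 1): bool
{
    // Remove all whitespace, including the "\n" separators between pages.
    $visible = preg_replace('/\s+/', '', $extractedText) ?? '';

    return strlen($visible) < $minChars;
}

// A result made only of page separators counts as empty.
var_dump(looksLikeScannedPdf("\n\n \n"));       // bool(true)
// Any real content passes the default threshold.
var_dump(looksLikeScannedPdf("ABCDE1234F\n")); // bool(false)
```

Raising `$minChars` may help with scans that carry a few stray characters of embedded text, but the right threshold depends on the documents being processed.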
!!! tip "Large PDFs"
    For very large PDFs (100+ pages), memory usage can spike. Consider raising `memory_limit` in `php.ini` or processing page-by-page using `$pdf->getPages()` directly.
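
Page-by-page processing might look like the generator below. This is a sketch under assumptions: the function name `iteratePdfPageText` and the sample path are hypothetical, and Composer's autoloader is assumed to be loaded already. Note that the parser still reads the whole file when it is opened, so this mainly avoids holding the full concatenated text in memory at once.

```php
<?php
// Assumption: vendor/autoload.php has already been required by the caller.
use Smalot\PdfParser\Parser;

/**
 * Hypothetical generator: yields 1-based page number => page text.
 */
function iteratePdfPageText(string $filePath): \Generator
{
    if (!is_file($filePath)) {
        throw new \RuntimeException("PDF not found: {$filePath}");
    }

    $pdf = (new Parser())->parseFile($filePath);

    // Keep our own counter rather than relying on the array keys
    // returned by getPages().
    $pageNumber = 0;
    foreach ($pdf->getPages() as $page) {
        yield ++$pageNumber => $page->getText();
    }
}

// Usage: handle each page's text as it arrives (hypothetical path).
try {
    foreach (iteratePdfPageText(__DIR__ . '/uploads/large_report.pdf') as $pageNumber => $pageText) {
        printf("Page %d: %d bytes of text\n", $pageNumber, strlen($pageText));
    }
} catch (\Exception $e) {
    error_log('Could not read PDF: ' . $e->getMessage());
}
```

Because the function is a generator, the file check and the parse only run when iteration starts, so the `try` block has to wrap the `foreach` loop rather than just the function call.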
See Also
# Add to mkdocs.yml nav:
- Developer Zone:
    - File Parsing Reference:
        - extractTextFromPDF: dev/extract_text_from_pdf.md