Help / Documentation - Merciglobal Cloud ERP

Extract Text from PDF

`extractTextFromPDF()`

Overview

Parses a .pdf file from disk and returns all its readable text as a plain string. Each page's text is extracted in order and separated by a newline. Use this as the first step before scanning the text for patterns such as PAN numbers, GST numbers, or any other structured data.

Internally uses the smalot/pdfparser library.

!!! note "Requirement"
    Run `composer require smalot/pdfparser` before using this function.
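To make the behaviour described above concrete, here is a minimal sketch of how such a function could be built on smalot/pdfparser. It is not the actual Merciglobal implementation: the name `extractTextFromPDFSketch` is hypothetical, and the code assumes Composer's autoloader has already been loaded so that the `Smalot\PdfParser\Parser` class is available.

```php
<?php

/**
 * Hypothetical sketch, not the shipped function.
 * Reads every page in order and joins the page texts with "\n".
 */
function extractTextFromPDFSketch(string $filePath): string
{
    // Fail early with the documented exception type if the file is missing.
    if (!is_file($filePath)) {
        throw new \RuntimeException("PDF file not found: {$filePath}");
    }

    // Parser::parseFile() returns a Document; it may throw \Exception
    // for corrupted or otherwise unreadable files.
    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile($filePath);

    // Collect the text of each page in document order.
    $pageTexts = [];
    foreach ($pdf->getPages() as $page) {
        $pageTexts[] = $page->getText();
    }

    // One newline between pages, as described in the overview.
    return implode("\n", $pageTexts);
}
```

The file check runs before the parser is created, so a missing file is reported as `\RuntimeException` rather than as a parser error.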

Function Signature

function extractTextFromPDF(string $filePath): string

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `$filePath` | `string` | Yes | Absolute path to the `.pdf` file on disk. Relative paths are not recommended; build the path from `__DIR__` or a configured upload path. |
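One way to follow the absolute-path advice is sketched below. The helper name `buildUploadPath` and the `uploads` directory next to the script are assumptions made for illustration; they are not part of the documented API.

```php
<?php

/**
 * Hypothetical helper: joins a configured upload directory and a file name
 * into an absolute path. basename() drops any directory components, so a
 * value such as "../../etc/passwd" cannot point outside the upload directory.
 */
function buildUploadPath(string $uploadDir, string $fileName): string
{
    return rtrim($uploadDir, '/') . '/' . basename($fileName);
}

// Assumed layout: an "uploads" directory next to this script.
$filePath = buildUploadPath(__DIR__ . '/uploads', 'customer_kyc.pdf');
```

The resulting `$filePath` can then be passed to `extractTextFromPDF()`.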

Return Value

| Type | Description |
|---|---|
| `string` | The full extracted text from all pages, concatenated with `\n` between pages. Returns an empty string if the PDF has no extractable text layer (e.g. scanned image-only PDFs). |

Exceptions

| Exception | When thrown |
|---|---|
| `\RuntimeException` | File does not exist at the given `$filePath`. |
| `\Exception` (smalot internal) | File is corrupted, password-protected, or unreadable by the parser. |
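The two exception types in the table can be handled separately, as in the sketch below. `tryExtractText` and its result keys are hypothetical names; the optional `$extractor` argument exists only so the wrapper can be exercised without a real PDF, and it defaults to the documented `extractTextFromPDF()`.

```php
<?php

/**
 * Hypothetical wrapper: converts the documented exceptions into a result array.
 */
function tryExtractText(string $filePath, ?callable $extractor = null): array
{
    // Fall back to the documented function when no extractor is supplied.
    $extractor = $extractor ?? 'extractTextFromPDF';

    try {
        return ['ok' => true, 'text' => $extractor($filePath), 'error' => null];
    } catch (\RuntimeException $e) {
        // Documented case: no file at $filePath. Note that any other
        // RuntimeException subclass would also be caught here.
        return ['ok' => false, 'text' => '', 'error' => 'file_missing', 'message' => $e->getMessage()];
    } catch (\Exception $e) {
        // Documented case: corrupted, password-protected, or unreadable PDF.
        return ['ok' => false, 'text' => '', 'error' => 'parse_failed', 'message' => $e->getMessage()];
    }
}
```

The `\RuntimeException` branch has to come first: `\RuntimeException` extends `\Exception`, so a leading `catch (\Exception $e)` would swallow both cases.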

Example

<?php
require_once '/var/www/html/vendor/autoload.php';

$text = extractTextFromPDF('/var/www/uploads/customer_kyc.pdf');

echo $text;
// Output: full text content of the PDF, page by page

Usage with PAN Scanner

$text   = extractTextFromPDF('/var/www/uploads/invoice.pdf');
$pans   = extractPANsWithDetails($text);

foreach ($pans as $item) {
    echo $item['pan'] . ' — ' . $item['entity_type'] . PHP_EOL;
}
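`extractPANsWithDetails()` is documented separately and its internals are not shown here. Purely to illustrate what scanning the extracted text for a pattern looks like, the sketch below uses a simplified, hypothetical stand-in named `findPanCandidates`. It assumes the standard ten-character PAN layout (five uppercase letters, four digits, one uppercase letter) and performs no further validation.

```php
<?php

/**
 * Hypothetical, simplified scanner: returns the unique PAN-shaped tokens
 * found in the text, in order of first appearance. It does not check the
 * entity-type character or verify that the PAN exists.
 */
function findPanCandidates(string $text): array
{
    // \b keeps the match from starting or ending inside a longer
    // alphanumeric token.
    preg_match_all('/\b[A-Z]{5}[0-9]{4}[A-Z]\b/', $text, $matches);

    return array_values(array_unique($matches[0]));
}

$sample     = "Vendor PAN: ABCDE1234F\nBuyer PAN: AAAPL1234C (repeat: ABCDE1234F)";
$candidates = findPanCandidates($sample);
// $candidates is ['ABCDE1234F', 'AAAPL1234C']
```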

Notes & Caveats

!!! warning "Scanned / Image PDFs"
    If the PDF was created by scanning a physical document (no text layer), `extractTextFromPDF()` returns an empty string. Use an OCR library such as `thiagoalessio/tesseract_ocr` for those files. To check whether a PDF has a text layer before processing, run `pdffonts yourfile.pdf` on the command line; an empty font list usually means there is no text layer.
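Because an image-only PDF yields an empty string, callers can test the result before running any pattern scan. The check below is a minimal sketch; `looksLikeScannedPdf` is a hypothetical name, and treating whitespace-only output as "no text layer" is an assumption that goes slightly beyond the documented empty-string behaviour.

```php
<?php

/**
 * Hypothetical check: true when the extracted text is empty or contains
 * only whitespace, which suggests the PDF needs OCR.
 */
function looksLikeScannedPdf(string $extractedText): bool
{
    return trim($extractedText) === '';
}

// Sample inputs standing in for the return value of extractTextFromPDF().
$samples = ['', " \n\n ", "GSTIN 27ABCDE1234F1Z5"];
$flags   = array_map('looksLikeScannedPdf', $samples);
// $flags is [true, true, false]
```

In practice the argument would be the return value of `extractTextFromPDF()`, with a `true` result routing the file to an OCR step instead of the pattern scanner.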

!!! tip "Large PDFs"
    For very large PDFs (100+ pages), memory usage can spike. Consider raising `memory_limit` in `php.ini` or processing page by page using `$pdf->getPages()` directly.
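The page-by-page option in the tip above could look roughly like the generator below. The name `iteratePdfPageTexts` is hypothetical, and the code assumes Composer's autoloader is already loaded. One caveat: `parseFile()` still parses the whole document up front, so this sketch mainly avoids building one large concatenated string; how much memory it saves in practice is not something this page confirms.

```php
<?php

/**
 * Hypothetical generator: yields page number => page text, one page at a
 * time, instead of returning all pages as a single string.
 */
function iteratePdfPageTexts(string $filePath): \Generator
{
    if (!is_file($filePath)) {
        throw new \RuntimeException("PDF file not found: {$filePath}");
    }

    $pdf = (new \Smalot\PdfParser\Parser())->parseFile($filePath);

    // Use our own counter: the array keys returned by getPages() are not
    // guaranteed to be sequential page numbers.
    $pageNumber = 0;
    foreach ($pdf->getPages() as $page) {
        $pageNumber++;
        yield $pageNumber => $page->getText();
    }
}
```

A caller would loop with `foreach (iteratePdfPageTexts($path) as $pageNumber => $text) { ... }` and scan each page's text before moving to the next. Because this is a generator, the missing-file check runs on the first iteration, not when the function is called.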

See Also

# Add to mkdocs.yml nav:
- Developer Zone:
  - File Parsing Reference:
    - extractTextFromPDF: dev/extract_text_from_pdf.md