`extractTextFromDOCX()`

Overview

Reads a `.docx` Word file from disk and returns all readable text as a plain string. Extracts content from body paragraphs, tables (formatted as pipe-separated rows), and list items across all sections of the document.

Internally uses the `phpoffice/phpword` library. Works with `.docx` only; legacy `.doc` files must be converted first.

!!! note "Requirement"
    Run `composer require phpoffice/phpword` before using this function.

Function Signature

function extractTextFromDOCX(string $filePath): string

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `$filePath` | `string` | Yes | Absolute path to the `.docx` file on disk. Do not pass `.doc` (legacy binary format); the library will throw an exception. |
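
Before calling the function, it can help to reject obviously unsuitable paths early. The sketch below shows one way to do that. `isLikelyDocxPath()` is a hypothetical helper, not part of the documented API, and it assumes Unix-style absolute paths that begin with `/`.

```php
<?php
// Hypothetical pre-check, not part of extractTextFromDOCX() itself.
// Assumes Unix-style absolute paths that start with "/".
function isLikelyDocxPath(string $filePath): bool
{
    $isAbsolute = $filePath !== '' && $filePath[0] === '/';
    $extension  = strtolower(pathinfo($filePath, PATHINFO_EXTENSION));

    return $isAbsolute && $extension === 'docx';
}

var_dump(isLikelyDocxPath('/var/www/uploads/kyc_form.docx')); // bool(true)
var_dump(isLikelyDocxPath('/var/www/uploads/old_form.doc'));  // bool(false)
```

This only inspects the file name. A renamed `.doc` file would pass the check and still fail when the library tries to parse it.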

Return Value

| Type | Description |
|---|---|
| `string` | Full plain-text content of the document. Table rows are formatted as `Cell 1 \| Cell 2 \| Cell 3`. Each paragraph and row ends with a newline (`\n`). |

Exceptions

| Exception | When thrown |
|---|---|
| `\RuntimeException` | File does not exist at the given `$filePath`. |
| `\PhpOffice\PhpWord\Exception\Exception` | File is not a valid `.docx`, is corrupted, or is a legacy `.doc` binary. |
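
The two documented exceptions can be handled separately, for example to distinguish a missing upload from an unreadable one. The following is a minimal sketch, not part of the documented API: `tryExtractText()` is a hypothetical wrapper, and it takes the extractor as a callable so the pattern can be tried without the library installed.

```php
<?php
// Hypothetical wrapper: returns the text, or null if extraction failed.
// $extractor is any callable with the same signature as extractTextFromDOCX().
function tryExtractText(callable $extractor, string $filePath): ?string
{
    try {
        return $extractor($filePath);
    } catch (\RuntimeException $e) {
        // Documented case: no file at $filePath.
        error_log('DOCX not found: ' . $e->getMessage());
    } catch (\PhpOffice\PhpWord\Exception\Exception $e) {
        // Documented case: invalid, corrupted, or legacy .doc input.
        error_log('DOCX could not be parsed: ' . $e->getMessage());
    }

    return null;
}
```

In an application this would be called as `tryExtractText('extractTextFromDOCX', $path)`. Returning `null` on failure is a design choice of the sketch; rethrowing a domain-specific exception would work equally well.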

Example

<?php
require_once '/var/www/html/vendor/autoload.php';

$text = extractTextFromDOCX('/var/www/uploads/vendor_agreement.docx');

echo $text;
// Output: all paragraphs and table cells as plain text

Table Output Format

If the .docx contains a table like:

| Name | PAN | City |
|---|---|---|
| Ramesh Kumar | ABCPK1234F | Surat |

The extracted text will be:

Name | PAN | City
Ramesh Kumar | ABCPK1234F | Surat
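
If downstream code needs the individual cells again, each extracted row can be split on the pipe character. A minimal sketch follows; `splitTableRow()` is a hypothetical helper and not part of the documented API.

```php
<?php
// Hypothetical helper: split one extracted table row back into trimmed cells.
function splitTableRow(string $line): array
{
    return array_map('trim', explode('|', $line));
}

$cells = splitTableRow('Ramesh Kumar | ABCPK1234F | Surat');
// $cells is ['Ramesh Kumar', 'ABCPK1234F', 'Surat']
print_r($cells);
```

Because the extracted text is plain, a cell that itself contains a `|` character cannot be told apart from a separator, so this approach suits only simple tables.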

Usage with PAN Scanner

$text = extractTextFromDOCX('/var/www/uploads/kyc_form.docx');
$pans = extractPANsWithDetails($text);

foreach ($pans as $item) {
    echo $item['pan'] . ' — ' . $item['entity_type'] . PHP_EOL;
}

Helper Function: `extractElementText()`

`extractTextFromDOCX()` relies on a recursive helper, `extractElementText($element)`, that handles nested document structures. You do not call it directly, but it supports:

| Element Type | How It Is Extracted |
|---|---|
| `TextRun` / `Paragraph` | Recursively extracts child text nodes, adds `\n` |
| `Text` | Returns the raw text string |
| `Table` | Iterates rows → cells → elements, joins with `\|` |
| `ListItem` | Extracts the list item's text object and adds `\n` |
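
The dispatch described above could look roughly like the sketch below. This is an illustration under assumptions, not the actual helper: the name `sketchExtractElementText()` is hypothetical, the element classes and getters are those of `phpoffice/phpword`, and details such as trimming cell text are guesses about behaviour.

```php
<?php
// Illustrative sketch only; the real extractElementText() may differ.
function sketchExtractElementText(object $element): string
{
    if ($element instanceof \PhpOffice\PhpWord\Element\Text) {
        // Leaf node: return the raw text.
        return (string) $element->getText();
    }

    if ($element instanceof \PhpOffice\PhpWord\Element\TextRun) {
        // Container: recurse into children, then end the paragraph.
        $out = '';
        foreach ($element->getElements() as $child) {
            $out .= sketchExtractElementText($child);
        }
        return $out . "\n";
    }

    if ($element instanceof \PhpOffice\PhpWord\Element\Table) {
        $out = '';
        foreach ($element->getRows() as $row) {
            $cells = [];
            foreach ($row->getCells() as $cell) {
                $cellText = '';
                foreach ($cell->getElements() as $child) {
                    $cellText .= sketchExtractElementText($child);
                }
                // Assumption: surrounding whitespace is trimmed per cell.
                $cells[] = trim($cellText);
            }
            $out .= implode(' | ', $cells) . "\n";
        }
        return $out;
    }

    if ($element instanceof \PhpOffice\PhpWord\Element\ListItem) {
        // A list item wraps a single Text object.
        return (string) $element->getTextObject()->getText() . "\n";
    }

    // Images and other unsupported elements contribute no text.
    return '';
}
```

Checking `Text` first keeps the recursion's base case explicit; every other branch either recurses into child elements or returns an empty string.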

Notes & Caveats

!!! warning "Legacy .doc Files"
    `.doc` (the older binary Word format) is not supported by this function. Convert to `.docx` first using LibreOffice: `soffice --headless --convert-to docx yourfile.doc`
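
When the conversion has to happen from PHP, the LibreOffice command can be assembled with shell-escaped arguments. The sketch below only builds the command string. `buildDocToDocxCommand()` is a hypothetical helper, and the use of LibreOffice's `--outdir` option to choose the output directory is an addition not mentioned above.

```php
<?php
// Hypothetical helper: builds the LibreOffice command, does not run it.
function buildDocToDocxCommand(string $docPath, string $outDir): string
{
    return sprintf(
        'soffice --headless --convert-to docx --outdir %s %s',
        escapeshellarg($outDir),
        escapeshellarg($docPath)
    );
}

$command = buildDocToDocxCommand('/var/www/uploads/legacy_form.doc', '/var/www/uploads');
echo $command . PHP_EOL;
// To actually convert, pass $command to exec() and check the exit code.
```

Escaping both paths matters because uploaded file names are user-controlled and may contain spaces or shell metacharacters.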

!!! tip "Headers and Footers"
    Standard headers and footers are accessible via each section, but their text may not always be extracted if the document uses complex master-page structures. Test with your specific templates.

!!! note "Embedded Images"
    Images embedded in the `.docx` are ignored; only text elements are extracted.
