extractTextFromDOCX()

Overview

Reads a .docx Word file from disk and returns all readable text as a plain string. Extracts content from body paragraphs, tables (formatted as pipe-separated rows), and list items across all sections of the document.

Internally uses the phpoffice/phpword library. Works with .docx only — legacy .doc files must be converted first.

!!! note "Requirement"
    Run `composer require phpoffice/phpword` before using this function.


Function Signature

function extractTextFromDOCX(string $filePath): string

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `$filePath` | `string` | Yes | Absolute path to the `.docx` file on disk. Do not pass `.doc` (legacy binary format); the library will throw an exception. |

Return Value

| Type | Description |
|---|---|
| `string` | Full plain-text content of the document. Table rows are formatted as `Cell 1 \| Cell 2 \| Cell 3`. Each paragraph and row ends with a newline (`\n`). |

Exceptions

| Exception | When thrown |
|---|---|
| `\RuntimeException` | File does not exist at the given `$filePath`. |
| `\PhpOffice\PhpWord\Exception\Exception` | File is not a valid `.docx`, is corrupted, or is a legacy `.doc` binary. |
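
A caller will usually want to tell the two failure modes apart. The sketch below is one way to do that; `tryExtract` and its status strings are hypothetical names introduced for illustration, not part of the documented API. It takes the extractor as a callable so the pattern can be read on its own.

```php
<?php
/**
 * Hypothetical wrapper (not part of the documented API): runs an extractor
 * and maps the two documented exceptions onto a small status array.
 *
 * @param callable $extract Extractor taking a path, e.g. 'extractTextFromDOCX'
 */
function tryExtract(callable $extract, string $filePath): array
{
    try {
        return ['status' => 'ok', 'text' => $extract($filePath), 'error' => null];
    } catch (\RuntimeException $e) {
        // Documented case: no file exists at $filePath.
        return ['status' => 'missing_file', 'text' => '', 'error' => $e->getMessage()];
    } catch (\PhpOffice\PhpWord\Exception\Exception $e) {
        // Documented case: invalid, corrupted, or legacy .doc input.
        return ['status' => 'invalid_docx', 'text' => '', 'error' => $e->getMessage()];
    }
}
```

With the real function loaded, a call might look like `tryExtract('extractTextFromDOCX', '/var/www/uploads/vendor_agreement.docx')`, after which the caller can branch on `status`.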

Example

<?php
require_once '/var/www/html/vendor/autoload.php';

$text = extractTextFromDOCX('/var/www/uploads/vendor_agreement.docx');

echo $text;
// Output: all paragraphs and table cells as plain text

Table Output Format

If the .docx contains a table like:

| Name | PAN | City |
|---|---|---|
| Ramesh Kumar | ABCPK1234F | Surat |

The extracted text will be:

Name | PAN | City
Ramesh Kumar | ABCPK1234F | Surat
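
If downstream code needs the cells back as arrays, the pipe-separated rows can be split again. The following is a minimal sketch, assuming cells are joined with `" | "` as shown above; `parsePipeRows` is a hypothetical helper, not something this project is known to ship.

```php
<?php
/**
 * Hypothetical helper: splits extracted text into table rows, assuming
 * cells are joined with " | ". Lines without that separator (ordinary
 * paragraphs, blank lines) are skipped.
 *
 * @return array<int, array<int, string>>
 */
function parsePipeRows(string $text): array
{
    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        if (strpos($line, ' | ') === false) {
            continue; // not a multi-cell table row
        }
        $rows[] = array_map('trim', explode(' | ', $line));
    }
    return $rows;
}

$rows = parsePipeRows("Name | PAN | City\nRamesh Kumar | ABCPK1234F | Surat\n");
// $rows[1] is ['Ramesh Kumar', 'ABCPK1234F', 'Surat']
```

The split is only a heuristic: a paragraph that happens to contain `" | "` would be treated as a row, and a single-column table has no separator to detect.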

Usage with PAN Scanner

$text = extractTextFromDOCX('/var/www/uploads/kyc_form.docx');
$pans = extractPANsWithDetails($text);

foreach ($pans as $item) {
    echo $item['pan'] . ' — ' . $item['entity_type'] . PHP_EOL;
}

Helper Function: extractElementText()

extractTextFromDOCX() relies on a recursive helper, extractElementText($element), to walk nested document structures. You do not call it directly. It handles the following element types:

| Element Type | How It Is Extracted |
|---|---|
| `TextRun` / Paragraph | Recursively extracts child text nodes, adds `\n` |
| `Text` | Returns the raw text string |
| `Table` | Iterates rows → cells → elements, joins with `\|` |
| `ListItem` | Extracts the list item's text object and adds `\n` |
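
The dispatch in the table above can be sketched roughly as follows. This is a reconstruction from the table, not the project's actual source, so the real helper may differ in details such as whitespace handling. It assumes the PHPWord element classes `Text`, `TextRun`, `Table`, and `ListItem`, and that Composer's autoloader is already loaded, as in the Example section. Trimming each cell before joining is an assumption made here to keep rows on one line.

```php
<?php
use PhpOffice\PhpWord\Element\ListItem;
use PhpOffice\PhpWord\Element\Table;
use PhpOffice\PhpWord\Element\Text;
use PhpOffice\PhpWord\Element\TextRun;

/**
 * Sketch of the recursive helper, reconstructed from the table above.
 * Unknown element types (images, page breaks, ...) yield an empty string.
 */
function extractElementText($element): string
{
    if ($element instanceof Text) {
        // Leaf node: return the raw string, no newline added.
        return (string) $element->getText();
    }
    if ($element instanceof TextRun) {
        // Paragraph-like container: concatenate children, then end the line.
        $text = '';
        foreach ($element->getElements() as $child) {
            $text .= extractElementText($child);
        }
        return $text . "\n";
    }
    if ($element instanceof Table) {
        $text = '';
        foreach ($element->getRows() as $row) {
            $cells = [];
            foreach ($row->getCells() as $cell) {
                $cellText = '';
                foreach ($cell->getElements() as $child) {
                    $cellText .= extractElementText($child);
                }
                // Assumption: trim so nested newlines do not break the row.
                $cells[] = trim($cellText);
            }
            $text .= implode(' | ', $cells) . "\n";
        }
        return $text;
    }
    if ($element instanceof ListItem) {
        return (string) $element->getTextObject()->getText() . "\n";
    }
    return '';
}
```

Because each branch returns a string, nested structures such as a `TextRun` inside a table cell fall out of the recursion without special cases.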

Notes & Caveats

!!! warning "Legacy .doc Files"
    `.doc` (older binary Word format) is not supported by phpoffice/phpword directly. Convert to `.docx` first using LibreOffice: `soffice --headless --convert-to docx yourfile.doc`
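
When the conversion has to happen from PHP, the command can be assembled with escaped arguments. This is a minimal sketch, assuming `soffice` is on the `PATH`; `buildDocToDocxCommand` and the file paths are hypothetical, and `--outdir` is LibreOffice's option for choosing the output directory.

```php
<?php
/**
 * Hypothetical helper: builds the LibreOffice command that converts a
 * legacy .doc file to .docx. Both paths are shell-escaped because upload
 * file names are untrusted input.
 */
function buildDocToDocxCommand(string $docPath, string $outDir): string
{
    return sprintf(
        'soffice --headless --convert-to docx --outdir %s %s',
        escapeshellarg($outDir),
        escapeshellarg($docPath)
    );
}

// Example paths are placeholders.
$command = buildDocToDocxCommand('/var/www/uploads/old_contract.doc', '/var/www/uploads');
// Pass $command to exec() and check the exit code before reading the result.
```

LibreOffice writes the converted file into the output directory under the original base name with a `.docx` extension, and that path can then be passed to extractTextFromDOCX().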

!!! tip "Headers and Footers"
    Standard headers and footers are accessible via sections, but their text may not always be extracted if the document uses complex master-page structures. Test with your specific templates.

!!! note "Embedded Images"
    Images embedded in the `.docx` are ignored; only text elements are extracted.



# Add to mkdocs.yml nav:
- Developer Zone:
  - File Parsing Reference:
    - extractTextFromDOCX: dev/extract_text_from_docx.md