extractTextFromDOCX()
Overview
Reads a .docx Word file from disk and returns all readable text as a plain string. Extracts content from body paragraphs, tables (formatted as pipe-separated rows), and list items across all sections of the document.
Internally uses the phpoffice/phpword library. Works with .docx only — legacy .doc files must be converted first.
!!! note "Requirement"
Run composer require phpoffice/phpword before using this function.
Function Signature
function extractTextFromDOCX(string $filePath): string
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| $filePath | string | Yes | Absolute path to the .docx file on disk. Do not pass .doc (legacy binary format); the library will throw an exception. |
Return Value
| Type | Description |
|---|---|
| string | Full plain-text content of the document. Table rows are formatted as Cell 1 \| Cell 2 \| Cell 3. Each paragraph and row ends with a newline \n. |
Exceptions
| Exception | When thrown |
|---|---|
| \RuntimeException | File does not exist at the given $filePath. |
| \PhpOffice\PhpWord\Exception\Exception | File is not a valid .docx, is corrupted, or is a legacy .doc binary. |
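The two documented exceptions can be handled in one place. The following is a minimal sketch, not part of the library or of this function: tryExtractDocxText() and its $error parameter are hypothetical names, and the extractor is passed in as a callable so the wrapper does not depend on how extractTextFromDOCX() is loaded.

```php
<?php
/**
 * Hypothetical wrapper: returns the extracted text, or null when extraction fails.
 * $extractor is any callable with the documented signature (string): string.
 */
function tryExtractDocxText(callable $extractor, string $filePath, ?string &$error = null): ?string
{
    $error = null;
    try {
        return $extractor($filePath);
    } catch (\PhpOffice\PhpWord\Exception\Exception $e) {
        // Documented case: invalid, corrupted, or legacy .doc input.
        $error = 'Not a readable .docx: ' . $e->getMessage();
    } catch (\RuntimeException $e) {
        // Documented case: no file at $filePath.
        $error = 'File not found: ' . $e->getMessage();
    }
    return null;
}
```

Assuming extractTextFromDOCX() is already defined, a call would look like tryExtractDocxText('extractTextFromDOCX', $path, $error), with $error holding the reason whenever the result is null.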
Example
<?php
require_once '/var/www/html/vendor/autoload.php';
$text = extractTextFromDOCX('/var/www/uploads/vendor_agreement.docx');
echo $text;
// Output: all paragraphs and table cells as plain text
Table Output Format
If the .docx contains a table like:
| Name | PAN | City |
|---|---|---|
| Ramesh Kumar | ABCPK1234F | Surat |
The extracted text will be:
Name | PAN | City
Ramesh Kumar | ABCPK1234F | Surat
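Because rows come back as pipe-separated lines, they can be split into cells again when needed. This is a sketch only: parsePipeRows() is a hypothetical helper, and it assumes cells are joined with " | " as in the output above and contain no literal pipe characters of their own.

```php
<?php
/**
 * Hypothetical helper: turns pipe-separated lines back into arrays of cell strings.
 * Lines without the " | " separator are treated as ordinary paragraphs and skipped.
 */
function parsePipeRows(string $text): array
{
    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        if (strpos($line, ' | ') === false) {
            continue;
        }
        $rows[] = array_map('trim', explode(' | ', $line));
    }
    return $rows;
}

$text = "Name | PAN | City\nRamesh Kumar | ABCPK1234F | Surat\n";
$rows = parsePipeRows($text);
// $rows[0] is the header row; $rows[1][1] holds the PAN cell.
```

A paragraph that happens to contain " | " would also be picked up as a row, so this heuristic suits documents where the table layout is known in advance.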
Usage with PAN Scanner
$text = extractTextFromDOCX('/var/www/uploads/kyc_form.docx');
$pans = extractPANsWithDetails($text);
foreach ($pans as $item) {
    echo $item['pan'] . ' — ' . $item['entity_type'] . PHP_EOL;
}
Helper Function: extractElementText()
extractTextFromDOCX() relies on a recursive helper, extractElementText($element), to walk nested document structures. It is not meant to be called directly. It handles the following element types:
| Element Type | How It Is Extracted |
|---|---|
| TextRun / Paragraph | Recursively extracts child text nodes, adds \n |
| Text | Returns the raw text string |
| Table | Iterates rows → cells → elements, joins with \| |
| ListItem | Extracts the list item's text object and adds \n |
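The dispatch in the table above can be illustrated with a short recursive walker. This is a sketch under stated assumptions, not the actual extractElementText(): sketchElementText() is a hypothetical name, and it dispatches on method names (getRows, getCells, getElements, getTextObject, getText) rather than on concrete classes. Those names are believed to match PhpWord's Table, Row, Cell, TextRun, ListItem and Text elements, but should be checked against the installed version.

```php
<?php
/**
 * Sketch of a recursive text walker (hypothetical, duck-typed on method names).
 */
function sketchElementText(object $element): string
{
    // Table: rows -> cells -> elements, cells joined with " | ".
    if (method_exists($element, 'getRows')) {
        $out = '';
        foreach ($element->getRows() as $row) {
            $cells = [];
            foreach ($row->getCells() as $cell) {
                $cellText = '';
                foreach ($cell->getElements() as $child) {
                    $cellText .= sketchElementText($child);
                }
                $cells[] = trim($cellText);
            }
            $out .= implode(' | ', $cells) . "\n";
        }
        return $out;
    }
    // List item: read its text object, then end the line.
    if (method_exists($element, 'getTextObject')) {
        return sketchElementText($element->getTextObject()) . "\n";
    }
    // Container such as a text run: concatenate children, then end the line.
    if (method_exists($element, 'getElements')) {
        $out = '';
        foreach ($element->getElements() as $child) {
            $out .= sketchElementText($child);
        }
        return $out . "\n";
    }
    // Leaf text node: return the raw string.
    if (method_exists($element, 'getText')) {
        return (string) $element->getText();
    }
    // Images and other non-text elements contribute nothing.
    return '';
}
```

Containers are checked before leaf text nodes because some container elements may also expose getText(); checking getElements first keeps the recursion going through their children.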
Notes & Caveats
!!! warning "Legacy .doc Files"
.doc (older binary Word format) is not supported by phpoffice/phpword directly. Convert to .docx first using LibreOffice: soffice --headless --convert-to docx yourfile.doc
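When the conversion has to run from PHP, the command can be assembled with escaped arguments. A minimal sketch, assuming soffice is on the PATH: buildDocToDocxCommand() and the file paths are hypothetical, and --outdir is LibreOffice's option for choosing where the converted file is written.

```php
<?php
/**
 * Hypothetical helper: builds the LibreOffice command that converts a .doc to .docx.
 * Arguments are shell-escaped so paths with spaces or quotes stay intact.
 */
function buildDocToDocxCommand(string $docPath, string $outDir): string
{
    return sprintf(
        'soffice --headless --convert-to docx --outdir %s %s',
        escapeshellarg($outDir),
        escapeshellarg($docPath)
    );
}

// Example paths are placeholders, not real files.
$cmd = buildDocToDocxCommand('/var/www/uploads/old_form.doc', '/var/www/uploads');
// Run with exec($cmd), then confirm the .docx exists in $outDir before extracting.
```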
!!! tip "Headers and Footers"
Standard headers and footers are accessible via sections but may not always render if the document uses complex master-page structures. Test with your specific templates.
!!! note "Embedded Images"
Images embedded in the .docx are ignored — only text elements are extracted.
See Also
# Add to mkdocs.yml nav:
- Developer Zone:
    - File Parsing Reference:
        - extractTextFromDOCX: dev/extract_text_from_docx.md