Extracting text from Word Documents via PHP and COM

I was recently working on an enterprise project in which I needed to extract the text from a Word document. I could have stripped all the non-standard characters out of the .doc file and hoped that something reasonable was left at the end. I could have tried to run Word 2007 from the command line to save the file as a .docx. Or I could just talk to any copy of MS Word via COM and have it do all the dirty work for me.

Naturally, I chose the last option. Here are ten lines of code that do just that.

Communicating via COM in PHP is straightforward. If you come from a VB background, where driving MS applications through complex tasks is a piece of cake, you will feel right at home in PHP. In fact, VB COM calls can be converted to PHP COM calls with a few simple search-and-replace passes.

Without further ado, here’s the code. I’ll point out a gotcha I noticed in just a moment.

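<?php
// realpath() returns false for a missing file, and Open(false) fails with a
// "Parameter 0: Type mismatch" com_exception, so check the path first.
$path = realpath("Sample.doc") or die("Could not find Sample.doc.");
$word = new COM("word.application") or die("Could not initialise MS Word object.");
$word->Documents->Open($path);

// Extract content. The explicit cast makes PHP/COM resolve the variant to its text.
$content = (string) $word->ActiveDocument->Content;

echo $content;

// Close the document without saving changes.
$word->ActiveDocument->Close(false);

// Shut down WINWORD.exe, then release the COM object.
$word->Quit();
$word = null;
unset($word);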

We first create a new COM object for word.application, which provides access to core MS Word functionality. We then tell it to open “Sample.doc” in the current directory. There’s a bit of a bug/feature when trying to get the content out, however. If you debug this code, you’ll find that $word->ActiveDocument->Content appears to be an empty object (a variant). If you assign the value to a variable you’ll get an empty string, as the variant object has no real __toString(). The workaround in PHP is to explicitly cast the value to a string and let PHP/COM take care of finding the real value. If you try this in a VB macro run inside Word, this is not necessary, because Content returns a Range object and VB resolves its default Text property for you: a MsgBox(ActiveDocument.Content) works fine.
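To make the pattern easier to reuse, here is a minimal sketch that wraps the same steps in a function. The name extractWordText() is my own invention for illustration, and the sketch assumes PHP 5.5 or later (for finally) on a Windows machine with Word installed and the COM extension enabled. It reads the Range’s Text property explicitly and makes sure Quit() runs even if opening the document fails.

```php
<?php
/**
 * Hypothetical helper (not part of PHP or Word): returns the plain text
 * of a Word document by driving MS Word through COM.
 */
function extractWordText(string $file): string
{
    // realpath() returns false for a missing file; fail early with a clear message.
    $path = realpath($file);
    if ($path === false) {
        throw new InvalidArgumentException("File not found: $file");
    }
    // The COM class only exists on Windows builds with the COM extension loaded.
    if (!class_exists('COM')) {
        throw new RuntimeException('The COM class is not available on this server.');
    }

    $word = new COM('word.application'); // throws com_exception on failure
    try {
        $word->Documents->Open($path);
        // Content is a Range object; Text holds its plain text.
        $text = (string) $word->ActiveDocument->Content->Text;
        $word->ActiveDocument->Close(false); // close without saving
        return $text;
    } finally {
        $word->Quit(); // always shut WINWORD.exe down
        $word = null;
    }
}
```

Calling extractWordText("Sample.doc") should then behave like the script above, except that problems surface as exceptions rather than through die().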

We also need to be aware of performance considerations on Windows servers: creating a COM object starts a full instance of WINWORD.exe, with a memory footprint of 10-15MB. To clean up, we first call the Quit() method on the Word COM object, then null the variable and unset it to be safe. Watch Task Manager while running this script and you’ll notice WINWORD.exe appear briefly, then exit.
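Given that start-up cost, one option when you have several files to process is to start Word once and reuse it. The sketch below is only an illustration: extractMany() is a hypothetical name, and it assumes you pass in an object created with new COM("word.application"). It relies on Documents->Open() returning the opened Document, so it never needs ActiveDocument.

```php
<?php
/**
 * Hypothetical helper: extracts text from several files using one Word
 * instance. $word is assumed to be a COM("word.application") object.
 *
 * Returns an array keyed by the input file name; missing files map to null.
 */
function extractMany($word, array $files): array
{
    $texts = [];
    foreach ($files as $file) {
        $path = realpath($file);
        if ($path === false) {
            $texts[$file] = null; // skip rather than pass false to Open()
            continue;
        }
        $doc = $word->Documents->Open($path); // Open() returns the Document
        $texts[$file] = (string) $doc->Content->Text;
        $doc->Close(false); // close without saving
    }
    return $texts;
}

// Usage on a Windows machine with Word installed:
// $word = new COM("word.application");
// try {
//     $texts = extractMany($word, ["a.doc", "b.doc"]);
// } finally {
//     $word->Quit(); // one WINWORD.exe for the whole batch
// }
```

The caller stays responsible for Quit(), so the single WINWORD.exe process lives exactly as long as the batch does.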

So, in just ten lines of code you can get the text out of an MS Word document, easy as ever!

If you want to explore this approach further, load up MS Word and open the Visual Basic Editor: press Alt+F11, then F2. (If that doesn’t work, try Tools > Macro > Visual Basic Editor, or “Visual Basic” under the Developer tab in Word 2007, then press F2.) From the library drop-down box in the “Object Browser”, which might say “<All Libraries>” by default, select Word. Just about everything you see there is available via COM in PHP.
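As one small example of what the Object Browser turns up, a Document has a Paragraphs collection, and each Paragraph has a Range with a Text property. Here is a rough sketch that reads a document paragraph by paragraph. The function name paragraphTexts() is hypothetical, and $doc is assumed to be a Word Document object such as $word->ActiveDocument.

```php
<?php
/**
 * Hypothetical helper: returns the text of each paragraph in a Word
 * document. $doc is assumed to be a Document object obtained via COM.
 */
function paragraphTexts($doc): array
{
    $out = [];
    // PHP can iterate COM collections with foreach.
    foreach ($doc->Paragraphs as $paragraph) {
        // Word ends each paragraph's text with a carriage return (the paragraph mark).
        $out[] = rtrim((string) $paragraph->Range->Text, "\r");
    }
    return $out;
}
```

The same pattern should carry over to the other collections and properties listed in the Object Browser.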

  • http://www.advertisingmediaworks.com John

    Have you figured out a way to do this on a unix box?

  • http://bitmeta.org/ Akash Mehta

    @John: The best way is to install a Word-file-reading application on the server, and call the binary via exec(). Try http://directory.fsf.org/project/catdoc/, wv or antiword. This blog post – http://tech.forumone.com/archives/53-Extracting-text-from-Office-and-PDF-files.html – may also help.

  • Andrew

    // Extract content.
    $content = (string) $word->ActiveDocument->Content;

    This command does work; however, it only brings back the plain text of the document, and you lose all the formatting of the original document.

    Do you know how to get the formatted text (spacing, tabs, bolding, fonts…) of the Word document back into PHP?

    Thanks.

  • shanker

    Extracting text from Word Documents via PHP and COM :( I have implemented your code in my application and it works properly on my localhost.
    But on the server I am getting the following error:
    Fatal error: Class ‘COM’ not found in /www/htdocs/v138698/LSP/QuickQuote/index2.php on line 223
    The code is not working on the Linux server.
    Please tell me how to run this code without errors on a Linux server.

    Thanks in advance,

    Shanker

  • Pankaj

    Extracting text from Word Documents via PHP and COM :( I have implemented your code in my application and it works properly on my localhost.
    But on the server I am getting the following error:
    Fatal error: Class ‘COM’ not found in /home/iworklac/public_html/projects/chh/beta/view_cv.php on line 87
    The code is not working on the Linux server.
    Please tell me how to run this code without errors on a Linux server.

    Thanks,
    Pankaj

  • SOPHY SEM

    I tested your code on localhost, but nothing happens. The browser just keeps loading.

    Thanks.

  • Kirubakaran

    The same error occurs for me.
    The code works properly on localhost but not on the Linux server. I am also using the PHP code.
    Please solve the problem.

    Thanks,
    Kirubakaran.

  • Adam David

    I use the Zend IDE and I am new to PHP development. I tried your code and got this error!

    Debug Error: PHPDocument1 line 3 – Uncaught exception ‘com_exception’ with message ‘Parameter 0: Type mismatch.
    ‘ in PHPDocument1:3
    Stack trace:
    #0 PHPDocument1(3): variant->Open(false)
    #1 C:\Program Files\Zend\ZendStudio-5.5.1\bin\php5\dummy.php(1): include(‘PHPDocument1′)
    #2 {main}
    thrown

    Please tell me what I should do next. Please, it’s my FYP!

  • Rishi Jain

    It’s working yay, thanks buddy

  • http://pulse.yahoo.com/_S7EXDVJJ34JN2EHHXSMZ2KG4HE ian

    Any idea if this can be done with Access as well?

  • http://pulse.yahoo.com/_3UKL2AVCBY4MEK6UEPRAKJ2Q6U ganbca

    I had this error.
    How do I solve this?

    Fatal error: Uncaught exception ‘com_exception’ with message ‘Parameter 0: Type mismatch. ‘ in D:wampwwwi2psgetcv.php:18 Stack trace: #0 D:wampwwwi2psgetcv.php(18): variant->Open(false) #1 {main} thrown in D:wampwwwi2psgetcv.php on line 18
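
Several of the comments above report “Class ‘COM’ not found” on Linux servers. COM is a Windows technology, so the script in this post cannot run there. Following the suggestion earlier in the thread to call a Word-reading binary via exec(), here is a minimal sketch using antiword. It assumes antiword is installed and on the PATH, that it prints the document text to standard output, and that it exits with a non-zero status on failure. Both function names are hypothetical.

```php
<?php
/**
 * Hypothetical helper: builds the shell command used to run antiword.
 * escapeshellarg() guards against spaces and shell metacharacters in the path.
 */
function antiwordCommand(string $path): string
{
    return 'antiword ' . escapeshellarg($path);
}

/**
 * Hypothetical helper: extracts text from a .doc file by calling antiword.
 * Assumes the antiword binary is installed and on the PATH.
 */
function extractWithAntiword(string $file): string
{
    $path = realpath($file);
    if ($path === false) {
        throw new InvalidArgumentException("File not found: $file");
    }

    $lines = [];
    $status = 0;
    exec(antiwordCommand($path), $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("antiword exited with status $status");
    }
    return implode("\n", $lines);
}
```

A script could check class_exists('COM') first and fall back to a helper like this when it returns false.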