Extracting text from Word Documents via PHP and COM

By Akash Mehta | on Mar 14, 2008 | 12 Comments
PHP Tutorials

I was recently working on an enterprise project in which I needed to extract the text from a Word document. Now, I could have stripped all the non-text characters out of the .doc file and hoped that something reasonable was left at the end. I could have tried to run Word 2007 from the command line to save the file as a .docx. Or I could just talk to any installed copy of MS Word via COM and have it do all the dirty work for me.

Naturally, I chose the last option. Here’s a short script that does just that.

Communicating via COM in PHP is easy. If you come from a VB background, where driving Microsoft applications is a piece of cake, you will feel right at home in PHP. In fact, VB COM calls can be converted to PHP COM calls with just a few simple search-and-replace operations.

Without further ado, here’s the code. I’ll point out a gotcha I noticed in just a moment.

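<?php
// realpath() returns false when the file is missing, and passing false
// to Open() fails with a "Type mismatch" COM error, so check it first.
$path = realpath("Sample.doc");
if ($path === false) {
    die("Could not find Sample.doc.");
}

// Start an instance of MS Word and open the document by absolute path.
$word = new COM("word.application") or die ("Could not initialise MS Word object.");
$word->Documents->Open($path);

// Extract content. The explicit (string) cast is required.
$content = (string) $word->ActiveDocument->Content;

echo $content;

// Close the document without saving changes.
$word->ActiveDocument->Close(false);

// Shut down Word and release the COM object.
$word->Quit();
$word = null;
unset($word);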

We first create a new COM object for word.application, which provides access to core MS Word functionality. We then tell it to open “Sample.doc” in the current directory. The path goes through realpath() so that Word receives an absolute path; note that realpath() returns false if the file does not exist, in which case Open() fails with a “Type mismatch” error. There’s a bit of a bug/feature when trying to get the content out, however. If you debug this code, you’ll find that $word->ActiveDocument->Content looks like an empty object (a variant). If you assign the value to a variable you’ll get an empty string, as the variant object has no real __toString(). The workaround in PHP is to explicitly cast the value to a string and let PHP/COM take care of finding the real value. If you try this in a VB macro run inside Word, this is not necessary: MsgBox(ActiveDocument.Content) works fine.
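The cast most likely works because Text is the default member of Word's Range object, so another option is to ask for that property by name. The sketch below does that and tidies up the result. Both readWordText() and normaliseWordText() are names invented for this example, and the control-character handling is an assumption based on how Word usually marks paragraphs, manual line breaks, and table cells.

```php
<?php
// Hypothetical helper: convert Word's control characters to plain newlines.
function normaliseWordText(string $text): string
{
    // Word ends each paragraph with a carriage return ("\r") and uses a
    // vertical tab ("\x0B") for manual line breaks; map both to "\n".
    $text = str_replace(["\r\n", "\r", "\x0B"], "\n", $text);

    // Table cells end with a bell character ("\x07"); drop it.
    return str_replace("\x07", '', $text);
}

// Hypothetical helper: $word is a COM("word.application") object that
// already has a document open.
function readWordText($word): string
{
    // Ask for the Range's Text property by name instead of casting the Range.
    return normaliseWordText((string) $word->ActiveDocument->Content->Text);
}
```

With a document open, echo readWordText($word); could stand in for the cast-and-echo lines of the script.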

We also need to be aware of performance considerations on Windows servers: creating a COM object launches a fully fledged instance of WINWORD.exe, with a memory footprint of 10-15MB. That is why the script calls the Quit() method on the Word COM object, then nulls the variable and unsets it to be safe. Watch Task Manager while running this script and you’ll see WINWORD.exe appear briefly, then exit.
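Since a stray WINWORD.exe costs memory, it may be worth making sure Word is shut down even when something goes wrong halfway through. Here is a minimal sketch, assuming PHP 7.1 or later. The withWord() function and its $factory parameter are invented for this example; the factory exists only so the function can be tried out without Word installed.

```php
<?php
// Hypothetical wrapper: start Word, run $task with it, and always shut it down.
function withWord(callable $task, ?callable $factory = null)
{
    // The default factory starts a real Word instance (Windows with Word only).
    if ($factory === null) {
        $factory = function () {
            return new COM("word.application");
        };
    }

    $word = $factory();
    try {
        return $task($word);
    } finally {
        // Runs whether $task returned normally or threw an exception.
        // 0 is wdDoNotSaveChanges: quit without saving any open documents.
        $word->Quit(0);
        $word = null;
    }
}

// Usage sketch (needs Windows with Word installed):
// echo withWord(function ($word) {
//     $word->Documents->Open(realpath("Sample.doc"));
//     return (string) $word->ActiveDocument->Content;
// });
```

The try/finally block is the point here: an exception thrown by Open() or by the extraction step no longer skips the call to Quit().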

So, with just a short script you can get the text out of an MS Word document.

If you want to explore this approach further, load up MS Word and open the Visual Basic Editor: press Alt+F11, then F2 to bring up the Object Browser. (If that doesn’t work, try Tools > Macro > Visual Basic Editor, or “Visual Basic” under the Developer tab in Word 2007, then press F2.) From the library drop-down box in the Object Browser, which might say “<All Libraries>” by default, select Word. Just about everything you see there is available via COM in PHP.
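As a small taste of what the Object Browser lists, the sketch below reads a few other members of the Document object. Name, Paragraphs, and Words are members of Word's object model; the describeDocument() function itself is made up for this example.

```php
<?php
// Hypothetical helper: $doc is a Word Document object, for example
// $word->ActiveDocument while a document is open.
function describeDocument($doc): array
{
    return [
        'name'       => (string) $doc->Name,           // file name of the document
        'paragraphs' => (int) $doc->Paragraphs->Count, // number of paragraphs
        'words'      => (int) $doc->Words->Count,      // items in the Words collection
    ];
}

// Usage sketch (needs Windows with Word installed):
// print_r(describeDocument($word->ActiveDocument));
```

The same arrow-for-dot pattern should apply to most other members you find in the browser.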


Tags: php COM, php tips, php tricks, PHP Tutorials


12 Responses to “Extracting text from Word Documents via PHP and COM”

  1. April 16, 2008

    John

    Have you figured out a way to do this on a unix box?

  2. April 16, 2008

    Akash Mehta

    @John: The best way is to install a Word-file-reading application on the server, and call the binary via exec(). Try http://directory.fsf.org/project/catdoc/, wv or antiword. This blog post – http://tech.forumone.com/archives/53-Extracting-text-from-Office-and-PDF-files.html – may also help.

  3. August 9, 2008

    Andrew

    // Extract content.
    $content = (string) $word->ActiveDocument->Content;

    This command does work; however, it only brings back the plain text of the document, so you lose all the formatting of the original document.

    Do you know how you can get the formatted text (spacing, tabs, bolding, fonts…) of the Word document back into PHP?

    Thanks.

  4. September 26, 2008

    shanker

    I have implemented your code in my application, and it works properly on my localhost.
    But on the server I am getting the following error:
    Fatal error: Class 'COM' not found in /www/htdocs/v138698/LSP/QuickQuote/index2.php on line 223
    The code is not working on a Linux server.
    Please tell me how to run this code without errors on a Linux server.

    Thanks in advance,

    Shanker

  5. September 10, 2009

    Pankaj

    I have implemented your code in my application, and it works properly on my localhost.
    But on the server I am getting the following error:
    Fatal error: Class 'COM' not found in /home/iworklac/public_html/projects/chh/beta/view_cv.php on line 87
    The code is not working on a Linux server.
    Please tell me how to run this code without errors on a Linux server.

    Thanks,
    Pankaj

  6. October 21, 2009

    SOPHY SEM

    I tested your code on localhost, but nothing happens. The browser just keeps running.

    Thanks.

  7. October 21, 2009

    Kirubakaran

    The same error occurs for me.
    The code works properly on localhost, but it's not working on the Linux server. I am also using the PHP code.
    Please solve the problem.

    Thanks,
    Kirubakaran.

  8. March 13, 2010

    Adam David

    I am using the Zend IDE and I am new to PHP development. I tried your code and got this error:

    Debug Error: PHPDocument1 line 3 – Uncaught exception 'com_exception' with message 'Parameter 0: Type mismatch.
    ' in PHPDocument1:3
    Stack trace:
    #0 PHPDocument1(3): variant->Open(false)
    #1 C:\Program Files\Zend\ZendStudio-5.5.1\bin\php5\dummy.php(1): include('PHPDocument1')
    #2 {main}
    thrown

    Please tell me what I should do next. Please, it's my FYP!

  9. April 20, 2010

    Rishi Jain

    It’s working yay, thanks buddy

  10. December 16, 2010

    ian

    Any idea if this can be done with Access as well?

  11. January 12, 2011

    ganbca

    I had this error.
    How do I solve this?

    Fatal error: Uncaught exception ‘com_exception’ with message ‘Parameter 0: Type mismatch. ‘ in D:wampwwwi2psgetcv.php:18 Stack trace: #0 D:wampwwwi2psgetcv.php(18): variant->Open(false) #1 {main} thrown in D:wampwwwi2psgetcv.php on line 18

  12. September 20, 2012

    rizal aditya

    Thank you.. it works….

    © 2003 - 2013 DeveloperTutorials.com. All Rights Reserved. Privacy Policy.