Reading .doc File In Python Using Antiword In Windows (also .docx)

November 26, 2023

I tried reading a .doc file like this: with open('file.doc', errors='ignore') as f: text = f.read(). It did read the file, but with a huge amount of junk. I can't remove that junk as I don't

Solution 1:

You can use the antiword command-line utility to do this. I know most of you will have tried it already, but I still wanted to share.

Download antiword from here

Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
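Before going further, it may help to confirm that Python can actually see antiword on PATH. The following is a minimal sketch using the standard library's shutil.which; the helper name find_antiword is made up for illustration and is not part of antiword or any library.

```python
import shutil


def find_antiword(command="antiword"):
    """Return the full path of the executable, or None if it is not on PATH.

    find_antiword is a hypothetical helper name used only in this sketch.
    """
    return shutil.which(command)


location = find_antiword()
if location is None:
    print("antiword was not found on PATH")
else:
    print("antiword found at", location)
```

If this reports that antiword was not found right after you edited PATH, try running it again from a newly opened terminal, since an already open console keeps the old value.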

Here is a sample of how to use it, handling docx and doc files:

import os
import docx2txt

def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file; other files yield an empty string."""
    full_path = os.path.join(filepath, file)
    text = ''
    if file.endswith('.docx'):
        text = docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text, which is redirected to a temporary
        # file named like the .doc with an extra 'x' (it is not a real .docx)
        temp_file = full_path + 'x'
        if not os.path.exists(temp_file):
            os.system('antiword "' + full_path + '" > "' + temp_file + '"')
            with open(temp_file) as f:
                text = f.read()
            os.remove(temp_file)  # only needed for reading, so delete it
        else:
            # a different file with the same name and a .docx extension
            # already exists, so the temporary file cannot be created
            print('Info: a .docx file with the same name as the .doc exists, so it cannot be read')
    return text

Now call this function:

filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)

This could be a good alternative way to read .doc files in Python on Windows.
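A variation on the same idea is to capture antiword's output directly instead of redirecting it to a temporary file. This is only a sketch, not the method used above: it swaps os.system for subprocess.run, the function name read_doc_with_antiword is made up for illustration, and decoding the output as UTF-8 with replacement characters is an assumption that may not match how your antiword installation is configured.

```python
import subprocess


def read_doc_with_antiword(doc_path, command="antiword"):
    """Return the text antiword prints for doc_path, or '' if the call fails.

    Hypothetical helper for illustration; assumes antiword is on PATH.
    """
    try:
        # Passing a list avoids quoting problems with spaces in the path
        result = subprocess.run([command, doc_path], capture_output=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # The command is missing, or antiword exited with an error code
        return ""
    # Assumption: treat the output as UTF-8 and replace undecodable bytes
    return result.stdout.decode("utf-8", errors="replace")
```

Because nothing is written to disk, there is no clash with an existing .docx file of the same name.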

Hope it helps, Thanks.

Solution 2:

Mithilesh's example (Solution 1) is good, but it's simpler to use textract directly once you have antiword installed. Download antiword and extract the antiword folder to C:\. Then add the antiword folder to your PATH environment variable (instructions for adding to PATH here). Open a new terminal or command console to reload your PATH environment variable. Install textract with pip install textract.

Then you can use textract (which uses antiword for .doc files) like this:

import textract
raw = textract.process('filename.doc')  # returns a bytestring
text = raw.decode('utf-8')  # converts from bytestring to string

If you are getting errors, try running the command antiword from a terminal/console to make sure it works. Also be sure the filepath to the .doc file is correct (e.g. use os.path.exists('filename.doc')).
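The file path check can be folded into a small wrapper so that a wrong path fails with a clear message before textract is called at all. This is a rough sketch under the assumptions already stated: textract and antiword are installed, and the wrapper name extract_text is invented for this example.

```python
import os


def extract_text(path):
    """Return the text of a Word file as a str, using textract.

    extract_text is a hypothetical wrapper name used only in this sketch.
    """
    if not os.path.exists(path):
        raise FileNotFoundError("No such file: " + path)
    # Imported here so the path check still works if textract is not installed
    import textract
    raw = textract.process(path)  # bytestring
    return raw.decode("utf-8")
```

Calling extract_text('filename.doc') then either returns the text or tells you straight away that the path is wrong, which separates path mistakes from antiword problems.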
