Reading .doc File In Python Using Antiword In Windows (also .docx)
Solution 1:
You can use the antiword command-line utility to do this. Many of you will have tried it already, but I still wanted to share.
- Download antiword.
- Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
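One way to confirm that the PATH change took effect is to ask Python where it finds the executable. This is a minimal sketch, assuming the executable is named antiword (antiword.exe on Windows); the helper name find_tool is made up for illustration.

```python
import shutil

def find_tool(name="antiword"):
    """Return the full path of `name` if it is on PATH, else None.

    find_tool is a hypothetical helper, not part of antiword.
    """
    # shutil.which searches the directories listed in PATH; on Windows it
    # also tries the extensions in PATHEXT, so "antiword" can match antiword.exe.
    return shutil.which(name)

if __name__ == "__main__":
    location = find_tool()
    if location is None:
        print("antiword not found; check PATH and open a new console")
    else:
        print("antiword found at", location)
```

If this reports that antiword is missing right after you edited PATH, the console was probably started before the change and needs to be reopened.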
Here is a sample of how to use it, handling docx and doc files:
import os
import docx2txt

def get_doc_text(filepath, file):
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        # docx2txt needs the full path, not just the file name
        return docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; it is redirected into a temporary
        # file that only borrows the .docx name (no real conversion happens)
        doc_file = full_path
        docx_file = full_path + 'x'
        if not os.path.exists(docx_file):
            # quote the paths so names containing spaces survive the shell
            os.system('antiword "' + doc_file + '" > "' + docx_file + '"')
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # docx_file was just to read, so deleting
        else:
            # a file with the same name as the .doc but a .docx extension
            # already exists, which means it is a different file, so we can't read it
            print('Info : file with same name of doc exists having docx extension, so we cant read it')
            text = ''
        return text
    return ''  # neither .doc nor .docx
Now call this function:
filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
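The temporary file can be avoided by capturing what antiword prints instead of redirecting it. The following is only a sketch of that variation, not the method used above: the function name antiword_text and its runner parameter are hypothetical, and decoding the output as UTF-8 is an assumption, since antiword's output encoding depends on how it is configured.

```python
import subprocess

def antiword_text(doc_path, runner=subprocess.run):
    """Return the plain text that antiword prints for `doc_path`.

    `runner` is a hypothetical hook so a fake can be swapped in for
    testing; by default it is subprocess.run (Python 3.7+).
    """
    # Passing the command as a list avoids shell quoting problems
    # with spaces in the path.
    result = runner(["antiword", doc_path], capture_output=True, check=True)
    # Assumption: the output is UTF-8 compatible. Bytes that cannot be
    # decoded are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

With check=True, a non-zero exit status from antiword raises subprocess.CalledProcessError, and a missing executable raises FileNotFoundError, so failures are not silently turned into empty text.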
This could be a good alternative way to read a .doc file in Python on Windows.
Hope it helps, Thanks.
Solution 2:
Mithilesh's example is good, but it's simpler to use textract directly once you have antiword installed. Download antiword and extract the antiword folder to C:\. Then add the antiword folder to your PATH environment variable. Open a new terminal or command console to reload your PATH environment variable. Install textract with pip install textract.
Then you can use textract (which uses antiword for .doc files) like this:
import textract
text = textract.process('filename.doc')
text = text.decode('utf-8')  # converts from bytestring to string
If you are getting errors, try running the command antiword from a terminal/console to make sure it works. Also be sure the file path to the .doc file is correct (e.g. use os.path.exists('filename.doc')).
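A relative path such as 'filename.doc' is resolved against the current working directory, which is a common reason for os.path.exists returning False. The sketch below shows one way to see what Python is actually looking at; describe_path is a hypothetical helper name, not part of textract or antiword.

```python
import os

def describe_path(path):
    """Summarise how Python resolves `path` (hypothetical helper)."""
    return {
        # where a relative path really points, given the working directory
        "absolute": os.path.abspath(path),
        # whether anything exists at that location
        "exists": os.path.exists(path),
        # lower-cased extension, e.g. ".doc" or ".docx"
        "extension": os.path.splitext(path)[1].lower(),
    }

if __name__ == "__main__":
    print(describe_path("filename.doc"))
```

If "exists" is False, compare the "absolute" value with the folder that really holds the file before looking for problems in antiword or textract.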