Spark - Missing 1 Required Positional Argument (lambda Function)
I'm trying to distribute text extraction from PDFs across multiple servers using Spark. It uses a custom Python module I made and is an implementation of this question.
Solution 1:
If you only want the path, not the content, then you should not use sc.binaryFiles. In that case you should parallelize the paths and then have the Python code load each file individually, like so:
paths = ['/path/to/file1', '/path/to/file2']
input = sc.parallelize(paths)  # distribute the paths themselves, not the file contents
processed = input.map(lambda path: (path, processFile(path)))
This assumes, of course, that each executor's Python process can access the files directly. This wouldn't work with HDFS or S3, for instance. Can your library not take binary content directly?
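For illustration, here's what processFile might look like if your extractor takes a path; this is just a sketch using the pypdf package (your STE.extractTextFromPdf would take its place):

from pypdf import PdfReader

def processFile(path):
    # Open the PDF from a path visible to this executor and join
    # the extracted text of every page.
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)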
Solution 2:
map takes a function of a single argument, but you're passing a function of two arguments:

input.map(lambda filename, content: (STE.extractTextFromPdf(filename, 'ste-config.yaml'), content))

Each RDD element is a single (filename, content) tuple, and Python 3 no longer unpacks tuples in function signatures (PEP 3113), so the lambda's second parameter is never supplied; hence the "missing 1 required positional argument" error.
Use either
input.map(lambda fc: (STE.extractTextFromPdf(fc[0], 'ste-config.yaml'), fc[1]))
or
def process(x):
    filename, content = x
    return STE.extractTextFromPdf(filename, 'ste-config.yaml'), content
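then pass the named function to map (the processed name follows the first snippet):

processed = input.map(process)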
Note that it won't work in general unless:
- STE.extractTextFromPdf can use a Hadoop-compliant file system, or
- you use a POSIX-compliant file system.
If it doesn't, you can try:
- using pseudo-files like io.BytesIO (if it supports reading from file-like objects at some level), or
- writing content to a temporary file on the local file system and reading it from there (a sketch of this follows).
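A minimal sketch of the temporary-file approach, assuming the pairs come from sc.binaryFiles and that STE.extractTextFromPdf accepts a local path as in the question (the extract_from_pair name and the input path are hypothetical):

import os
import tempfile

def extract_from_pair(pair):
    filename, content = pair  # (path, bytes) as yielded by sc.binaryFiles
    # If the extractor accepted file-like objects, io.BytesIO(content)
    # would avoid the temporary file entirely.
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    try:
        text = STE.extractTextFromPdf(tmp_path, 'ste-config.yaml')
    finally:
        os.remove(tmp_path)
    return filename, text

processed = sc.binaryFiles('/path/to/pdfs').map(extract_from_pair)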