
Spark - Missing 1 Required Positional Argument (lambda Function)

I'm trying to distribute some text extraction from PDFs across multiple servers using Spark. This uses a custom Python module I wrote and is an implementation of the approach from this question.

Solution 1:

If you only want the path, not the content, then you should not use sc.binaryFiles. In that case you should parallelize the paths and then have the Python code load each file individually, like so:

paths = ['/path/to/file1', '/path/to/file2']
input = sc.parallelize(paths)
processed = input.map(lambda path: (path, processFile(path)))

This assumes of course that each executor Python process can access the files directly. This wouldn't work with HDFS or S3 for instance. Can your library not take binary content directly?

Solution 2:

map takes a function of a single argument, but you're passing a function of two arguments:

input.map(lambda filename, content: (STE.extractTextFromPdf(filename, 'ste-config.yaml'), content))

Use either

input.map(lambda fc: (STE.extractTextFromPdf(fc[0], 'ste-config.yaml'), fc[1]))

or

def process(x):
    filename, content = x
    return STE.extractTextFromPdf(filename, 'ste-config.yaml'), content

and pass it with input.map(process).
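The two single-argument forms above are equivalent; the following sketch shows both against plain Python map (which behaves like RDD.map here). `extract` is a hypothetical stand-in for STE.extractTextFromPdf:

```python
def extract(filename, config):
    # Placeholder for the real PDF extraction call.
    return f"text-of-{filename}"

# Each RDD element is a (filename, content) tuple.
pairs = [("a.pdf", b"%PDF"), ("b.pdf", b"%PDF")]

# Form 1: one-argument function that unpacks the tuple inside.
def process(x):
    filename, content = x
    return extract(filename, "ste-config.yaml"), content

# Form 2: one-argument lambda using indexing.
by_index = list(map(lambda fc: (extract(fc[0], "ste-config.yaml"), fc[1]), pairs))
by_unpack = list(map(process, pairs))

assert by_index == by_unpack
```

A two-argument lambda fails because map hands the whole tuple to the function as one argument; it never unpacks it, hence the "missing 1 required positional argument" error.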

Note that it won't work in general unless either:

  • STE.extractTextFromPdf can use a Hadoop-compliant file system, or
  • you use a POSIX-compliant file system.

If it doesn't, you can try:

  • Using pseudofiles like io.BytesIO (if the library supports reading from file-like objects at some level).
  • Writing the content to a temporary file on a local FS and reading it from there.
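Both workarounds can be sketched as follows, assuming `content` arrives as raw PDF bytes from sc.binaryFiles; the extractor call in the comment is hypothetical:

```python
import io
import os
import tempfile

content = b"%PDF-1.4 ..."  # bytes delivered by Spark per file

# Option 1: wrap the bytes in a pseudofile, usable if the library
# accepts file-like objects instead of paths.
pseudo = io.BytesIO(content)
first_bytes = pseudo.read(4)  # behaves like reading an open file

# Option 2: spill the bytes to a temporary file on the local FS and
# hand the library a real path.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(content)
    local_path = tmp.name
# e.g. STE.extractTextFromPdf(local_path, 'ste-config.yaml')
os.remove(local_path)  # clean up once extraction is done
```

Option 1 avoids the extra disk round-trip, so prefer it when the library allows it; option 2 works with any path-only API at the cost of local I/O per file.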
