Hadoop Streaming With Python: Keeping Track Of Line Numbers

June 25, 2024

I am trying to do what should be a simple task: I need to convert a text file to upper case using Hadoop streaming with Python. I want to do it by using the TextInputFormat which p

Solution 1:

If your job is just to upper-case a single file, then Hadoop isn't really going to give you anything over streaming the file to a single machine, performing the upper-casing there, and writing the contents back up to HDFS. Even with a huge file (say 1 TB), you still need to get everything to a single reducer so that, when it is written back to HDFS, it is stored as a single contiguous file.

In this case I would configure your streaming job to have a single mapper per file (set the split min and max size to something huge, larger than the file itself) and run a map-only job.
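
A minimal sketch of what the mapper for such a map-only job could look like, assuming Python 3 is available on the task nodes and that the script is shipped with the job as mapper.py (a hypothetical file name, not something from the question). It simply reads lines from standard input and writes them back upper-cased:

```python
#!/usr/bin/env python3
"""Hypothetical mapper.py for a map-only Hadoop streaming job."""
import sys


def upper_case_lines(lines):
    """Yield each input line upper-cased, keeping its trailing newline."""
    for line in lines:
        yield line.upper()


def main():
    # Hadoop streaming feeds the input split on stdin and collects stdout.
    # Assumption: the job runs with zero reducers and a single split per
    # file, so the lines are written out in the order they were read.
    sys.stdout.writelines(upper_case_lines(sys.stdin))


if __name__ == "__main__":
    main()
```

The script can be checked locally with something like `cat input.txt | python3 mapper.py` before submitting it. On the job side, the settings described above would typically be passed as `-D` options; on recent Hadoop versions the relevant properties are likely `mapreduce.job.reduces=0` and `mapreduce.input.fileinputformat.split.minsize`, but the exact names should be checked against the version in use.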
