Skip to content Skip to sidebar Skip to footer

Hadoop Streaming With Python: Keeping Track Of Line Numbers

I am trying to do what should be a simple task: I need to convert a text file to upper case using Hadoop streaming with Python. I want to do it by using the TextInputFormat which p

Solution 1:

If your job is just to upper case a single file, then Hadoop isn't really going to give you anything that streaming the file to a single machine, performing the upper case and then writing the contents back up to HDFS. Even with a huge file (say 1TB), you are still going to need to get everything to a single reducer such that when it is written back to HDFS it's stored in a single contiguous file.

In this case i would configure your streaming job to have a single mapper per file (set the split min and max size to something huge, larger than the file itself), and run a map only job.

Post a Comment for "Hadoop Streaming With Python: Keeping Track Of Line Numbers"