Hadoop Streaming With Python: Keeping Track Of Line Numbers
I am trying to do what should be a simple task: I need to convert a text file to upper case using Hadoop streaming with Python. I want to do it by using the TextInputFormat which p
Solution 1:
If your job is just to upper case a single file, then Hadoop isn't really going to give you anything that streaming the file to a single machine, performing the upper case and then writing the contents back up to HDFS. Even with a huge file (say 1TB), you are still going to need to get everything to a single reducer such that when it is written back to HDFS it's stored in a single contiguous file.
In this case i would configure your streaming job to have a single mapper per file (set the split min and max size to something huge, larger than the file itself), and run a map only job.
Post a Comment for "Hadoop Streaming With Python: Keeping Track Of Line Numbers"