
Running Hadoop Jar Using Luigi Python

I need to run a Hadoop jar job using Luigi from Python. I searched and found examples of writing a mapper and reducer in Luigi, but nothing that directly runs a Hadoop jar.

Solution 1:

You need to use the luigi.contrib.hadoop_jar module.

In particular, you need to extend HadoopJarJobTask. For example, like this:

from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget

class TextExtractorTask(HadoopJarJobTask):
    def output(self):
        return HdfsTarget('data/processed/')

    def jar(self):
        return 'jobfile.jar'

    def main(self):
        return 'com.ololo.HadoopJob'

    def args(self):
        return ['--param1', '1', '--param2', '2']
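
To actually trigger the task, you can use Luigi's command-line runner or call luigi.build from Python. Here is a minimal sketch of the programmatic route, assuming the class above lives in the current module:

if __name__ == '__main__':
    import luigi

    # Run the task with the local scheduler (no central luigid needed for a quick test)
    luigi.build([TextExtractorTask()], local_scheduler=True)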

You can also include building the jar file with Maven in the workflow:

import subprocess

import luigi
from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget
from luigi.file import LocalTarget


class BuildJobTask(luigi.Task):
    def output(self):
        return LocalTarget('target/jobfile.jar')

    def run(self):
        # Build the jar with Maven; the artifact it produces is this task's output
        subprocess.check_call(['mvn', 'clean', 'package', '-DskipTests'])


class YourHadoopTask(HadoopJarJobTask):
    def output(self):
        return HdfsTarget('data/processed/')

    def jar(self):
        # Path of the jar built by BuildJobTask
        return self.input().path

    def main(self):
        return 'com.ololo.HadoopJob'

    def args(self):
        return ['--param1', '1', '--param2', '2']

    def requires(self):
        return BuildJobTask()
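
To run the whole chain (Maven build first, then the Hadoop job), a common pattern is to expose the module as a Luigi entry point. A sketch, where your_module is just a placeholder for wherever you saved these tasks:

# Running YourHadoopTask makes Luigi run BuildJobTask first because of requires().
# From a shell you could then do:
#
#   python your_module.py YourHadoopTask --local-scheduler
#
if __name__ == '__main__':
    luigi.run()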
