
Numpy And Static Linking

I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes. Hence, I bundled numpy with my program.

Solution 1:

There are at least two problems with your approach and both can be reduced to a simple fact that NumPy is a heavyweight dependency.

  • First of all, the Debian packages come with multiple dependencies, including libgfortran, libblas, liblapack and libquadmath. So you cannot simply copy a NumPy installation and expect things to work (to be honest, you shouldn't do anything like this even if that weren't the case). Theoretically you could try to build it with static linking and ship it with all of its dependencies, but that runs into the second issue.

  • NumPy is pretty large by itself. While 20 MB doesn't look particularly impressive, and with all the dependencies it shouldn't be more than 40 MB, it has to be shipped to the workers each time you start your job. The more workers you have, the worse it gets. If you decide you need SciPy or scikit-learn, it can get much worse.

Arguably, this makes NumPy a really bad candidate for being shipped with the pyFiles method.

If you didn't have direct access to the workers, but all the dependencies, including header files and a static library, were present, you could try to install NumPy into the user site-packages from the task itself (this assumes that pip is installed as well) with something like this:

try:
    import numpy as np

except ImportError:
    # NumPy is missing on this worker: install it into the
    # user site-packages, then retry the import.
    import pip
    pip.main(["install", "--user", "numpy"])
    import numpy as np

You'll find other variants of this method in How to install and import Python modules at runtime?
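Note that `pip.main` was removed from pip's public API in pip 10, so on newer installations the snippet above will fail. A more future-proof variant (a sketch, using the supported `python -m pip` subprocess invocation) looks like this:

```python
import importlib
import subprocess
import sys

def ensure_package(name):
    """Import `name`, installing it into the user site-packages first
    if it is missing. Calling `python -m pip` in a subprocess is the
    supported interface; `pip.main` is no longer public API."""
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", name]
        )
        return importlib.import_module(name)

# Usage inside a task running on a worker:
# np = ensure_package("numpy")
```

The same caveats apply: every worker pays the download and build cost on first use, which is exactly why a pre-provisioned environment is preferable.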

Since you have access to the workers, a much better solution is to create a separate Python environment. Probably the simplest approach is to use Anaconda, which can package non-Python dependencies as well and doesn't depend on the system-wide libraries. You can easily automate this task using tools like Ansible or Fabric; it doesn't require administrative privileges, and all you really need is bash and some way to fetch the basic installers (wget, curl, rsync, scp).
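As a rough provisioning sketch (to be run on each worker, e.g. via Ansible or Fabric): the installer URL, install prefix, and environment name below are placeholders, not anything mandated by Spark.

```shell
set -euo pipefail

# Placeholder install location -- adjust for your cluster.
PREFIX="$HOME/miniconda3"

# Fetch and run the Miniconda installer in batch (non-interactive) mode;
# no administrative privileges are needed for a home-directory install.
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p "$PREFIX"

# Create an environment (here named spark-env) containing NumPy.
"$PREFIX/bin/conda" create -y -n spark-env numpy

# Point the Spark workers at this interpreter, e.g. in conf/spark-env.sh.
export PYSPARK_PYTHON="$PREFIX/envs/spark-env/bin/python"
```

Because the environment lives on each worker's local disk, nothing heavyweight has to be shipped per job; the pyFiles mechanism is then left for the small, pure-Python modules it handles well.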

See also: shipping python modules in pyspark to other nodes?
