
Write Output Directly From A Dask Worker

I have a pipeline that transforms (maps) a dataframe. The output is large: rows in the input dataframe contain audio in binary format, and rows in the output dataframe contain extr…

Solution 1:

Your understanding is incorrect: the workers read from and write to shared storage or cloud/network services directly; this is the normal way Dask computations are carried out.

import dask.dataframe as dd

df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)
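
If you want explicit control over how each partition is written, for example one file per partition written by whichever worker holds it, you can turn the dataframe into delayed partitions and write each one as its own task. A minimal sketch; the "/shared/output" path and per-partition file naming are illustrative assumptions, not part of the original question:

import dask
import dask.dataframe as dd

@dask.delayed
def write_partition(part, i):
    # Executes on a worker; "part" is an ordinary pandas DataFrame.
    # The destination must be storage that every worker can reach.
    part.to_parquet(f"/shared/output/part-{i}.parquet")
    return len(part)

df = dd.read_parquet(url)
df_out = do_work(df)
tasks = [write_partition(part, i) for i, part in enumerate(df_out.to_delayed())]
row_counts = dask.compute(*tasks)  # only small per-partition row counts come back

The plain to_parquet call above does essentially this for you; the delayed form is only worth it when you need a custom file format or naming scheme.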

In either case, the data is never seen by the scheduler or the client. They do communicate, though: the client loads metadata about the dataset so that it can work out how to split up the work, and the scheduler talks to both the client and the workers to farm out those task specifications and check when they are done.
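
As a concrete illustration, when running against a distributed cluster the only change on the client side is connecting a Client; the scheduler address below is a placeholder:

from dask.distributed import Client
import dask.dataframe as dd

client = Client("tcp://scheduler-address:8786")  # placeholder address

df = dd.read_parquet(url)    # the client reads only the dataset's metadata here
df_out = do_work(df)         # lazy: builds a task graph, no data moves yet
df_out.to_parquet(url2)      # workers read, transform and write their own partitions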

You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with

local_df = df.compute()

but this is not recommended when the data is bigger than the client's memory. You rarely need to do this for the whole dataset; more often you only pull back some aggregate result that is much smaller than the original. Even in that case, the scheduler itself never stores the results.
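
For instance, pulling back a single aggregate keeps the transfer tiny; the "duration" column here is a hypothetical stand-in for whatever your output contains:

# Only the reduced value returns to the client; the full dataframe stays on the workers.
mean_duration = df_out["duration"].mean().compute()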
