Write Output Directly From A Dask Worker
Solution 1:
Your understanding is incorrect: the workers read from and write to shared storage or cloud/network services directly; this is the normal way that things are computed. For example:
import dask.dataframe as dd

df = dd.read_parquet(url)    # workers read their partitions directly from storage
df_out = do_work(df)         # do_work stands in for your own transformation
df_out.to_parquet(url2)      # workers write their output partitions directly back
In this case, the data itself is never seen by the scheduler or the client. They do communicate, though: the client loads metadata about the dataset so that it can work out how to split up the work, and the scheduler talks to both the client and the workers to farm out the task specifications and check when they are done.
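To make that separation concrete, here is a minimal sketch against a distributed cluster; the scheduler address and the S3 URLs are hypothetical, and reading/writing S3 assumes s3fs is installed. The client only ships the task graph to the scheduler; the bytes of the dataframe move between the workers and the storage service.

from dask.distributed import Client
import dask.dataframe as dd

# Hypothetical scheduler address; the client exchanges only task
# metadata with the scheduler, never the dataframe partitions.
client = Client("tcp://scheduler-host:8786")

df = dd.read_parquet("s3://my-bucket/input/*.parquet")  # hypothetical URL
df.to_parquet("s3://my-bucket/output/")                 # hypothetical URL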
You can optionally bring the whole dataset into the memory of the client as a pandas dataframe with
local_df = df.compute()
but this is not recommended where the data is bigger than memory. You rarely need to do this for the whole dataset; more often you only pull back some aggregate result much smaller than the original. Even in that case, the scheduler itself does not store the results.
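As a sketch of that aggregate pattern, continuing from df_out above (the column name "amount" is hypothetical): the full data stays distributed on the workers, and only a single scalar comes back to the client.

# Reduce on the workers; only the small result is transferred.
total = df_out["amount"].sum().compute()
print(total)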