
Pyspark: Concat Function Generated Columns Into New Dataframe

I have a PySpark dataframe (df) with n columns. I would like to generate another dataframe of n columns, where each column records the percentage difference between consecutive rows in the corresponding column of the original dataframe.
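For reference, a tiny dataframe of this shape can be built as follows; the values are illustrative assumptions, chosen so they reproduce the output shown in the answer below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data (assumed): an index column plus three value columns.
df = spark.createDataFrame(
    [(1, 1.0, 10.0, 100.0),
     (2, 2.0, 20.0, 200.0),
     (3, 3.0, 30.0, 300.0)],
    ["index", "col1", "col2", "col3"],
)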

Solution 1:

In this case, you can use a list comprehension inside a call to select().

To make the code a little more compact, we can first get the columns we want to diff in a list:

diff_columns = [c for c in df.columns if c != 'index']
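The snippet below assumes that func is the usual alias for pyspark.sql.functions and that w is a window ordered by the index column, so that lag() refers to the previous row:

import pyspark.sql.functions as func
from pyspark.sql import Window

# lag() needs an ordering; order by the index so "previous row" is well defined.
w = Window.orderBy("index")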

Next, select the index and iterate over diff_columns to compute the new columns. Use .alias() to rename each resulting column:

df_diff = df.select(
    'index',
    # For each column, subtract the log of the previous row's value from the log of the current value.
    *[(func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
      for c in diff_columns]
)
df_diff.show()
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
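Note that this computes the log difference (the log of the ratio of consecutive values), which approximates the percentage change for small moves. If an exact percentage change is wanted instead, the same select/list-comprehension pattern works with lag() directly; this is a sketch along the same lines, not part of the original answer:

# Sketch: exact percentage change between consecutive rows, reusing the same window.
df_pct = df.select(
    'index',
    *[((func.col(c) / func.lag(func.col(c)).over(w) - 1) * 100).alias(c + "_pct")
      for c in diff_columns]
)
df_pct.show()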
