How do you Convert VectorUDTs into Columns in PySpark?
In PySpark, you may need to extract the individual dimensions of a vector column (type VectorUDT, as produced by pyspark.ml) into separate top-level columns. The right approach depends on your Spark version.
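All of the snippets below assume a DataFrame df with a string column word and a VectorUDT column vector, plus an active SparkSession named spark. A minimal setup for experimenting might look like this (the data values are purely illustrative):
<code class="python">from pyspark.ml.linalg import Vectors

# Illustrative DataFrame: a string column plus a VectorUDT column.
df = spark.createDataFrame([
    ("assert", Vectors.dense([1.0, 2.0, 3.0])),
    ("require", Vectors.sparse(3, {1: 2.0})),
], ["word", "vector"])</code>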
Spark >= 3.0.0
Spark 3.0.0 ships a built-in function, vector_to_array, for exactly this task:
<code class="python">from pyspark.ml.functions import vector_to_array df.withColumn("xs", vector_to_array("vector")).select(["word"] + [col("xs")[i] for i in range(3)])</code>
This converts the vector into a plain array column and then projects each of its elements into a separate column.
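With the illustrative DataFrame above, calling .show() on the result would print something like the following (the xs[i] column names come from the array indexing expressions):
<code class="python"># +-------+-----+-----+-----+
# |   word|xs[0]|xs[1]|xs[2]|
# +-------+-----+-----+-----+
# | assert|  1.0|  2.0|  3.0|
# |require|  0.0|  2.0|  0.0|
# +-------+-----+-----+-----+</code>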
Spark < 3.0.0
Pre-3.0.0 Spark versions require more intricate approaches:
RDD Conversion:
<code class="python">df.rdd.map(lambda row: (row.word,) + tuple(row.vector.toArray().tolist())).toDF(["word"])</code>
UDF Method:
<code class="python">from pyspark.sql.functions import udf, col from pyspark.sql.types import ArrayType, DoubleType def to_array(col): return udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))(col) df.withColumn("xs", to_array(col("vector"))).select(["word"] + [col("xs")[i] for i in range(3)])</code>
Note: For better performance, mark the UDF as nondeterministic with asNondeterministic(), which keeps the optimizer from evaluating it more than once per row (requires Spark 2.3 or later).
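On Spark 2.3+, the to_array helper from the previous snippet could be rewritten like this; the behavior is identical, the UDF is just flagged as nondeterministic:
<code class="python">def to_array(col):
    # asNondeterministic() is available from Spark 2.3 onward.
    return udf(lambda v: v.toArray().tolist(),
               ArrayType(DoubleType())).asNondeterministic()(col)</code>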
Scala Equivalent
For the Scala equivalent of these approaches, refer to "Spark Scala: How to convert Dataframe[vector] to DataFrame[f1: Double, ..., fn: Double]".