
How to Split a Vector Column into Individual Columns in PySpark?

Mary-Kate Olsen
2024-11-03 12:25:29

PySpark: Split Vector Into Columns

A PySpark DataFrame may contain a vector column that you need to split into separate columns, one per dimension. The right approach depends on your Spark version:

For Spark >= 3.0.0

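Starting with Spark 3.0.0, the most convenient way to extract vector components is the vector_to_array function from pyspark.ml.functions. It converts the vector column into an array column, whose elements can then be selected by index: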

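<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Convert the vector column into an array column
df = df.withColumn("xs", vector_to_array("vector"))

# Pick the first three dimensions for illustration
result = df.select(["word"] + [col("xs")[i] for i in range(3)])</code>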

For Spark < 3.0.0

Method 1: RDD Conversion

One approach involves converting the DataFrame to an RDD and extracting the vector components manually:

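<code class="python"># tolist() converts the NumPy array into plain Python floats that Spark can infer
rdd = df.rdd.map(lambda row: (row.word,) + tuple(row.vector.toArray().tolist()))
# Only "word" is named; Spark auto-names the remaining columns _2, _3, ...
result = rdd.toDF(["word"])</code>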

Method 2: UDF Creation

Alternatively, you can create a user-defined function (UDF) and apply it to the vector column:

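<code class="python">from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, DoubleType

@udf(ArrayType(DoubleType()))
def to_array(vector):
    # Convert the vector into a list of plain Python floats
    return vector.toArray().tolist()

result = df.withColumn("xs", to_array(col("vector"))).select(["word"] + [col("xs")[i] for i in range(3)])</code>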
