How to Split a Vector Column into Rows in PySpark?

Patricia Arquette
2024-10-31 20:10:01

Splitting a Vector Column into Rows in PySpark

Splitting a column of vector values into a separate column for each dimension is a common task in PySpark. The examples below assume a DataFrame df with a word column and a vector column named vector. Which approach to use depends on your Spark version:

Spark 3.0.0 and Above

Spark 3.0.0 introduced the vector_to_array function, simplifying this process:

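<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Convert the ML vector column into a plain array<double> column
df = df.withColumn("xs", vector_to_array("vector"))</code>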

You can then select the individual elements as columns (here, the first three):

<code class="python">df.select(["word"] + [col("xs")[i] for i in range(3)])</code>
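
Selecting col("xs")[i] directly produces columns named xs[0], xs[1], and so on. If more readable names are wanted, one option is to build SQL expressions and pass them to selectExpr. The following is only a minimal sketch: element_exprs and the x_ prefix are hypothetical names, not part of PySpark, and the helper assumes the array column and prefix are simple identifiers that need no quoting.

```python
def element_exprs(array_col, n_dims, prefix="x"):
    """Build one SQL expression per array element, e.g. "xs[0] AS x_0".

    Hypothetical helper: assumes array_col and prefix are simple
    identifiers that need no backtick quoting.
    """
    return [f"{array_col}[{i}] AS {prefix}_{i}" for i in range(n_dims)]


# Usage with the DataFrame from above (columns "word" and "xs"):
# df.selectExpr("word", *element_exprs("xs", 3))
print(element_exprs("xs", 3))
```

The number of dimensions is still passed in by hand, as with range(3) above, so it has to match the length of the vectors in your data.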

Spark Versions Before 3.0.0

Approach 1: Converting to RDD

<code class="python">def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())

df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...</code>
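
Passing only ["word"] to toDF leaves the remaining columns with positional names. A minimal sketch of supplying every name up front might look like this; column_names and the x_ prefix are hypothetical, and the vector length is assumed to be known and the same for every row.

```python
def extract(row):
    # Label first, then each vector element; tolist() turns the
    # NumPy values returned by toArray() into plain Python floats.
    return (row.word,) + tuple(row.vector.toArray().tolist())


def column_names(n_dims, label="word", prefix="x"):
    """Hypothetical helper: the label name plus one name per dimension."""
    return [label] + [f"{prefix}_{i}" for i in range(n_dims)]


# Usage, assuming `df` holds 3-dimensional vectors:
# df.rdd.map(extract).toDF(column_names(3))
print(column_names(3))
```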

Approach 2: Using a UDF

<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(column):
    def to_array_(v):
        # Convert the ML vector to a list of Python floats
        return v.toArray().tolist()
    # asNondeterministic() requires Spark 2.3+; it can be dropped on older versions
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(column)

df = df.withColumn("xs", to_array(col("vector")))</code>

Select the desired columns:

<code class="python">df.select(["word"] + [col("xs")[i] for i in range(3)])</code>
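
The inner function in the UDF above calls toArray() unconditionally, so a null in the vector column would raise an error inside the UDF. A null-safe variant might look like the sketch below. The names vector_to_list and to_array_safe are hypothetical, and the PySpark imports sit inside the wrapper only so that the plain-Python helper can be exercised without a Spark installation.

```python
def vector_to_list(v):
    """Convert an ML vector to a list of floats, passing nulls through."""
    if v is None:
        return None
    return v.toArray().tolist()


def to_array_safe(column):
    # Local imports keep vector_to_list usable without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    return udf(vector_to_list, ArrayType(DoubleType()))(column)


# Usage: df = df.withColumn("xs", to_array_safe(col("vector")))
```

Rows with a null vector then get a null xs, and each selected element of xs is null as well.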

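Each of these methods splits a vector column into individual columns, which makes the data easier to work with and analyze. On Spark 3.0.0 and above, vector_to_array is usually the better choice, since it avoids the overhead of a Python UDF or an RDD conversion.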
