
How to Split a Vector Column into Columns in PySpark?


Splitting a Vector Column into Columns Using PySpark

You have a PySpark DataFrame with two columns: word and vector, where vector is a VectorUDT column. Your goal is to split the vector column into multiple columns, each representing one dimension of the vector.
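For reference, a DataFrame of this shape can be built as follows. This is a minimal sketch: the words and vector values are illustrative, and a SparkSession named spark is assumed.

<code class="python">from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data: one dense and one sparse vector of length 3
df = spark.createDataFrame([
    ("assert", Vectors.dense([1.0, 2.0, 3.0])),
    ("require", Vectors.sparse(3, {1: 2.0})),
], ["word", "vector"])</code>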

Solution:

Spark >= 3.0.0

In Spark versions 3.0.0 and above, you can use the vector_to_array function to achieve this:

<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>

This keeps the word column and adds one new column per vector dimension, named xs[0], xs[1], xs[2], and so on, holding the values of the original vector.
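If you prefer plain column names instead of xs[0], xs[1], ..., you can alias each extracted element. A minimal sketch, assuming three-dimensional vectors; the names x0, x1, x2 are illustrative:

<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i].alias(f"x{i}") for i in range(3)])
    .show())</code>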

Spark < 3.0.0

For older Spark versions, you can follow these approaches:

Convert to RDD and Extract

<code class="python">from pyspark.ml.linalg import Vectors

# Example data: one dense and one sparse vector of length 3
df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])

def extract(row):
    # Convert the vector to a plain Python list and prepend the word
    return (row.word, ) + tuple(row.vector.toArray().tolist())

df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...</code>
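If you pass a name for every position to toDF, the generated _2, _3, ... columns get readable names instead. A short sketch, again assuming three-dimensional vectors; x0, x1, x2 are illustrative names:

<code class="python">result = df.rdd.map(extract).toDF(["word", "x0", "x1", "x2"])
result.show()</code>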

Create a UDF:

<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    # Important: asNondeterministic requires Spark 2.3 or later
    # It can be safely removed i.e.
    # return udf(to_array_, ArrayType(DoubleType()))(col)
    # but at the cost of decreased performance
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>
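If you do not want to hard-code range(3), you can read the vector length off the first row instead. This is a sketch that assumes all vectors share the same length; the x0, x1, ... names are illustrative:

<code class="python"># Infer the vector length from the first row instead of hard-coding it
n = len(df.first()["vector"])

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i].alias(f"x{i}") for i in range(n)])
    .show())</code>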

Any of these approaches produces a DataFrame with a separate column for each dimension of the original vector, making the data easier to work with.

