Home >Database >Mysql Tutorial >How Can I Preserve Null Values During Apache Spark Joins?

How Can I Preserve Null Values During Apache Spark Joins?

DDDOriginal: 2024-12-31 17:36:11303browse

Preserving Null Values in Apache Spark Joins

By default, Apache Spark omits rows with null values when performing joins. To include these values in the join output, Spark provides several options.

NULL-Safe Equality Operator (<=>)

Spark 1.6 introduced a special NULL-safe equality operator that allows you to include null values in your join criteria.

numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))

Column.eqNullSafe (PySpark 2.3.0 )

In PySpark 2.3.0 and later, you can use Column.eqNullSafe to perform NULL-safe equality checks.

numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))

%<=>% (SparkR)

SparkR offers a %<=>% operator for NULL-safe equality checks.

numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))

head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))

IS NOT DISTINCT FROM (SQL)

In SQL (Spark 2.2.0 ), you can use IS NOT DISTINCT FROM to preserve null values in joins.

SELECT * FROM numbers JOIN letters 
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers

This operator can also be used with the DataFrame API:

numbersDf.alias("numbers")
  .join(lettersDf.alias("letters"))
  .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")

The above is the detailed content of How Can I Preserve Null Values During Apache Spark Joins?. For more information, please follow other related articles on the PHP Chinese website!

sql NULL for include operator default this column spark apache

Statement：

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Previous article：Why Can't I Use Aliased Aggregates in SQL's HAVING Clause?Next article：Why Can't I Use Aliased Aggregates in SQL's HAVING Clause?

See more

How Can I Preserve Null Values During Apache Spark Joins?

Preserving Null Values in Apache Spark Joins

Related articles