How Can I Preserve Null Values During Apache Spark Joins?
By default, Apache Spark's equality join condition evaluates to false (strictly, NULL) when either side is NULL, so rows with null join keys are dropped from the join output. To match null keys against each other, Spark provides several NULL-safe alternatives.
NULL-Safe Equality Operator (<=>)
Spark 1.6 introduced the NULL-safe equality operator <=>, which treats two null values as equal and therefore lets null keys match in join criteria.
numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))
Column.eqNullSafe (PySpark 2.3.0+)
In PySpark 2.3.0 and later, you can use Column.eqNullSafe to perform NULL-safe equality checks.
numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
%<=>% (SparkR)
SparkR offers a %<=>% operator for NULL-safe equality checks.
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))

letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))

head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
IS NOT DISTINCT FROM (SQL)
In Spark SQL (2.2.0 and later), you can use IS NOT DISTINCT FROM to preserve null matches in joins.
SELECT *
FROM numbers JOIN letters
  ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This operator can also be used with the DataFrame API:
numbersDf.alias("numbers")
  .join(lettersDf.alias("letters"))
  .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")

Note that the unconditioned join here produces a cross join that is then filtered by the where clause; depending on your Spark version and configuration, you may need to enable spark.sql.crossJoin.enabled for this form to run.