
Problem with PostgreSQL when trying to connect from PySpark in Jupyter Notebook on Docker

王林
2024-02-11 20:00:11

A user reported a PostgreSQL connection error when writing a PySpark DataFrame from a Jupyter Notebook running in Docker. This article reproduces the question and the workaround that resolved it.

Problem content

I get this error when I run the PySpark code below in a Jupyter Notebook: Py4JJavaError: An error occurred while calling o124.save. : org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections. The notebook and Spark run in Docker, while PostgreSQL is installed on my local machine (Windows).

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
import pyspark.sql.functions as f

# Start Spark with the PostgreSQL JDBC driver jar on the classpath
spark = SparkSession.builder.appName("ETL Pipeline").config("spark.jars", "./postgresql-42.7.1.jar").getOrCreate()
df = spark.read.text("./Data/WordData.txt")

# Split each line on spaces, emit one row per word, then count each word
df2 = df.withColumn("splitedData", f.split("value"," "))
df3 = df2.withColumn("words", explode("splitedData"))
wordsDF = df3.select("words")
wordCount = wordsDF.groupBy("words").count()

# JDBC connection settings (this localhost URL is the one that gets refused)
driver = "org.postgresql.Driver"
url = "jdbc:postgresql://localhost:5432/local_database"
table = "word_count"
user = "postgres"
password = "12345"

# The save mode is set with .mode(), not with a JDBC option
wordCount.write.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .mode("append") \
    .save()

spark.stop()

I tried editing postgresql.conf to add "listen_addresses='localhost'" and editing pg_hba.conf to add "host all all 0.0.0.0/0 md5", but neither change worked, so I don't know what to do.
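
A likely reason for the refusal is that, inside a container, localhost refers to the container itself rather than the Windows host, so nothing is listening on port 5432 there. The sketch below is one way to check this from a notebook cell using only the standard library. The helper name can_connect is hypothetical, and host.docker.internal is the name Docker Desktop usually provides for the host machine; whether it resolves in a given setup is an assumption.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and names that do not resolve
        return False


if __name__ == "__main__":
    # Compare the container's own loopback with the Docker Desktop host alias
    for host in ("localhost", "host.docker.internal"):
        print(host, can_connect(host, 5432))
```

If the host alias is also unreachable, the server may be listening only on the loopback interface, which is what listen_addresses='localhost' configures.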

Workaround

I solved the problem by running PostgreSQL in Docker as well (using this image, https://hub.docker.com/_/postgres/, to create a container just for postgres) and creating a network shared by the PySpark container and the PostgreSQL container with this command:

docker network create my_network

This command is for the postgres container:

docker run --name postgres_container --network my_network -e POSTGRES_PASSWORD=12345 -d -p 5432:5432 postgres:latest

This command is for the Jupyter-PySpark container:

docker run --name jupyter_container --network my_network -it -p 8888:8888 -v C:\home\work\path:/home/jovyan/work jupyter/pyspark-notebook:latest
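
With both containers on my_network, the notebook should be able to reach the database by container name instead of localhost, because Docker resolves container names on user-defined networks. The sketch below shows how the JDBC options might look under that assumption. The helper name build_jdbc_options is hypothetical, and the database local_database from the question is assumed to have been created inside the new container.

```python
def build_jdbc_options(host, database, table, user, password, port=5432):
    """Return the option dict for Spark's JDBC data source."""
    return {
        "driver": "org.postgresql.Driver",
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Host is the postgres container's name on my_network, not localhost
options = build_jdbc_options(
    host="postgres_container",
    database="local_database",  # assumed to exist in the container
    table="word_count",
    user="postgres",
    password="12345",
)
print(options["url"])  # jdbc:postgresql://postgres_container:5432/local_database
```

In the notebook, these options could then be passed to the writer with wordCount.write.format("jdbc").options(**options).mode("append").save().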


Statement:
This article is reproduced from stackoverflow.com.