Troubleshooting
JDBC connector
Problem:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
Solution:
Copy the connector .jar file (downloaded from the MySQL website) into Spark's jars directory:
cp /usr/share/java/mysql-connector-java-5.1.45.jar ~/spark-3.0.1-bin-hadoop2.7/jars/
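Alternatively, a minimal sketch under the assumption that you build the SparkSession yourself: the connector can be put on the classpath at session creation via the spark.jars config, using the same jar path as the cp command above.
from pyspark.sql import SparkSession

# Point Spark at the MySQL connector jar when building the session,
# instead of copying it into the Spark installation's jars directory
spark = SparkSession.builder \
    .appName('jdbc-example') \
    .config('spark.jars', '/usr/share/java/mysql-connector-java-5.1.45.jar') \
    .getOrCreate()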
Problem: overwrite mode in Spark doesn't work
In this project, we use the latest_data table to track the last updated data. Before extracting data, we read from the latest_data table. After extracting new data, we update the latest date.
latest_df = spark.read.format('jdbc').options(
    url=jdbc_mysql_url,
    driver=driver_name,
    dbtable=LATEST_DATA_TABLE_NAME,
    user=config['DATABASE']['MYSQL_USER'],
    password=config['DATABASE']['MYSQL_PASSWORD']).load()
...
latest_df.write.format('jdbc').options(
    url=jdbc_mysql_url,
    driver=driver_name,
    dbtable=LATEST_DATA_TABLE_NAME,
    user=config['DATABASE']['MYSQL_USER'],
    password=config['DATABASE']['MYSQL_PASSWORD'])\
    .option('truncate', True)\
    .mode('overwrite').save()
Cause: Spark uses lazy evaluation, so reading from the table only fetches metadata, not the actual rows. When we then write in overwrite mode with truncate=True, Spark truncates the table before the rows are ever read, so the existing data is lost before the new update is actually written.
Solution: Force Spark to materialize the data in main memory using .cache() together with an action such as count():
# Read data from database
# <Read code>
latest_df = latest_df.cache()
latest_df.count()
# Write data to database
# <Write code>
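Putting the pieces together, a minimal end-to-end sketch of the workaround; it reuses jdbc_mysql_url, driver_name, config, and LATEST_DATA_TABLE_NAME from the snippets above:
# Read data from database
latest_df = spark.read.format('jdbc').options(
    url=jdbc_mysql_url,
    driver=driver_name,
    dbtable=LATEST_DATA_TABLE_NAME,
    user=config['DATABASE']['MYSQL_USER'],
    password=config['DATABASE']['MYSQL_PASSWORD']).load()

# cache() alone is lazy, so trigger an action to pull the rows
# into memory before the table is truncated
latest_df = latest_df.cache()
latest_df.count()

# ... update latest_df with the newly extracted date ...

# Write data back; overwrite with truncate empties the table but keeps its schema
latest_df.write.format('jdbc').options(
    url=jdbc_mysql_url,
    driver=driver_name,
    dbtable=LATEST_DATA_TABLE_NAME,
    user=config['DATABASE']['MYSQL_USER'],
    password=config['DATABASE']['MYSQL_PASSWORD'])\
    .option('truncate', True)\
    .mode('overwrite').save()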
From MySQL 8.0 onward, MySQL uses caching_sha2_password as the default authentication plugin, which older connectors such as the 5.1.x series do not support, so the JDBC connection fails. Switch the affected user back to the legacy plugin (user, host, and password below are placeholders):
mysql> ALTER USER 'username'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';
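Alternatively, an option not covered by the steps above: upgrading to MySQL Connector/J 8.x avoids changing the authentication plugin, since it supports caching_sha2_password. Note that its driver class name differs from the 5.1 series:
# Connector/J 8.x driver class; the old com.mysql.jdbc.Driver name is deprecated
driver_name = 'com.mysql.cj.jdbc.Driver'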