Forums

PySpark exception handling for database operations

We are replacing our legacy ETL tool with PySpark code. Our ETL program fetches rows from source databases (Oracle) and then inserts the final transformed dataset into an Oracle database. We are planning to use DataFrames and temporary tables in Spark for the ETL processing. When we write the final output to the Oracle table, we want to log bad records to a text file if we hit any database exception and continue processing the remaining records. For example, we might hold 10 rows in our Spark temporary table and try to insert them into the final Oracle table; if a few rows fail due to a database constraint violation, we need to write those failed rows to a text file and continue to process the other rows. Please let me know how this can be achieved.
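
[Editor's note: a minimal sketch of one way this pattern is commonly handled, assuming the rows are written with cx_Oracle from inside foreachPartition rather than through Spark's JDBC writer (which offers no per-row error hook). The table name TARGET_TABLE, its columns, the connection details, and the failure-file path are all hypothetical placeholders, and each executor writes its own local failure file, so those files would still need to be gathered up afterwards.]

```python
from pyspark.sql import SparkSession
import cx_Oracle

spark = SparkSession.builder.appName("oracle-etl-sketch").getOrCreate()

# Stand-in for the real transformed dataset produced by the ETL steps.
final_df = spark.createDataFrame(
    [(1, "alice", 10.0), (2, "bob", 20.0)],
    ["ID", "NAME", "AMOUNT"],
)

# Placeholders for the real target environment.
DSN = "dbhost:1521/ORCLPDB1"
USER, PASSWORD = "etl_user", "changeme"
FAILED_ROWS_PATH = "/tmp/failed_rows.txt"   # executor-local file

def write_partition(rows):
    """Insert one partition row by row, logging constraint violations and moving on."""
    conn = cx_Oracle.connect(USER, PASSWORD, DSN)
    cursor = conn.cursor()
    with open(FAILED_ROWS_PATH, "a") as bad_file:
        for row in rows:
            try:
                cursor.execute(
                    "INSERT INTO TARGET_TABLE (ID, NAME, AMOUNT) VALUES (:1, :2, :3)",
                    (row["ID"], row["NAME"], row["AMOUNT"]),
                )
                conn.commit()                  # commit per row so earlier successes are kept
            except cx_Oracle.DatabaseError as exc:
                conn.rollback()                # drop only the failed statement
                bad_file.write("{}\t{}\n".format(row, exc))
    cursor.close()
    conn.close()

final_df.foreachPartition(write_partition)
```

Committing after every row is slow for large volumes; batching the inserts and only falling back to row-by-row handling when a batch fails is a common refinement of the same idea.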

These forums are for the PythonAnywhere hosting environment -- your question looks like it's more of a general programming query, which would be best asked over at Stack Overflow. One keyword you might find useful to include is "atomicity" or "atomic", which is the database terminology for the "all-or-nothing" kind of behaviour it sounds like you need.
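
[Editor's note: for contrast with the per-row sketch above, here is a rough illustration of the "all-or-nothing" (atomic) behaviour the reply refers to, under the same assumed cx_Oracle connection and hypothetical TARGET_TABLE: either every row in the batch is inserted, or a single failure rolls the whole batch back.]

```python
import cx_Oracle

def atomic_load(rows, user="etl_user", password="changeme", dsn="dbhost:1521/ORCLPDB1"):
    """All-or-nothing load: one transaction, one commit, full rollback on any error."""
    conn = cx_Oracle.connect(user, password, dsn)
    cursor = conn.cursor()
    try:
        cursor.executemany(
            "INSERT INTO TARGET_TABLE (ID, NAME, AMOUNT) VALUES (:1, :2, :3)",
            list(rows),              # e.g. [(1, "alice", 10.0), (2, "bob", 20.0)]
        )
        conn.commit()                # the whole batch becomes visible together
    except cx_Oracle.DatabaseError:
        conn.rollback()              # any failure undoes every insert in the batch
        raise
    finally:
        cursor.close()
        conn.close()
```

cx_Oracle's executemany also accepts a batcherrors option that collects per-row errors from a single batch instead of aborting it, which may be worth checking in its documentation for the original use case.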