Question Details

No question body available.

Tags

json apache-spark pyspark apache-spark-sql ndjson

Answers (1)

August 6, 2025 Score: 1 Rep: 20,205

As far as I can tell, Spark does not consider missing fields (or mismatches in type) to be corrupted records, but it does enforce syntax errors in the JSON itself. Also, if you want a corrupted record to appear, you need to add the corrupt-record column (here, corruptrecord) as a field in the schema.

I created the following NDJSON file for a little more coverage:

{"name": "foo", "id": "1"}
{"name": "bar"}
{"name": "foobar",}
{"name": "foobar","id": 2}

Record 3 is clearly corrupt (trailing comma), and when reading in this file, Spark will write that raw line to the corruptrecord field. Record 2, however, will populate id with NULL as you observed, and Record 4 will have its numeric id cast to a string.
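You can see the same distinction with a plain JSON parser, no Spark required: only Record 3 is a syntax error, while Record 2 parses fine and merely lacks the id key. This is a minimal sketch using the standard library; the classify helper and its labels are my own, not Spark API.

```python
import json

lines = [
    '{"name": "foo", "id": "1"}',
    '{"name": "bar"}',
    '{"name": "foobar",}',
    '{"name": "foobar","id": 2}',
]

def classify(line):
    # A line Spark routes to the corrupt-record column fails to parse at all;
    # a record that is merely missing a field parses fine without an "id" key.
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return "syntax error"
    return "ok" if "id" in record else "missing id"

results = [classify(line) for line in lines]
# Record 3 is the only syntax error; Record 2 is only missing a field.
```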

I am not sure if this will work for your use case, but we could create a rule that detects whether any field in a list of required fields is NULL, and then write an entry to the corruptrecord field to indicate this. For example:

from functools import reduce

from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

babySchema = StructType([
    StructField("name", StringType(), True),
    StructField("id", StringType(), False),
    StructField("corruptrecord", StringType(), True),
])

requiredfields = ["id", "name"]

anynullcondition = reduce(
    lambda acc, colname: acc | F.col(colname).isNull(),
    requiredfields[1:],
    F.col(requiredfields[0]).isNull(),
)
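The reduce call just folds a list of per-column null checks into one OR'd condition, seeded with the check for the first required field. The same pattern in plain Python, with dicts standing in for rows (any_null is an illustrative helper, not part of the answer's Spark code):

```python
from functools import reduce

def any_null(record, required_fields):
    # Fold per-field "is missing or None" checks into a single OR'd result,
    # mirroring how the Spark Column expression is built above.
    return reduce(
        lambda acc, name: acc or record.get(name) is None,
        required_fields[1:],
        record.get(required_fields[0]) is None,
    )

required = ["id", "name"]
any_null({"name": "foo", "id": "1"}, required)  # False: all fields present
any_null({"name": "bar"}, required)             # True: "id" is missing
```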

df = (
    spark.read
    .schema(babySchema)
    .option("mode", "permissive")
    .option("columnNameOfCorruptRecord", "corruptrecord")
    .json("sample.ndjson")
    .withColumn(
        "corruptrecord",
        F.coalesce(
            F.col("corruptrecord"),
            F.when(anynullcondition, F.lit("Invalid field")),
        ),
    )
)

Result:

+------+----+-------------------+
|name  |id  |corruptrecord      |
+------+----+-------------------+
|foo   |1   |NULL               |
|bar   |NULL|Invalid field      |
|NULL  |NULL|{"name": "foobar",}|
|foobar|2   |NULL               |
+------+----+-------------------+