Question Details

No question body available.

Tags

json apache-spark pyspark apache-spark-sql ndjson

Answers (1)

August 6, 2025 Score: 1 Rep: 20,205

As far as I can tell, Spark does not consider missing fields (or mismatches in type) to be corrupted records, but it does enforce syntax errors in the JSON itself. Also, if you want a corrupted record to appear, you need to add the corrupt-record column (here, corruptrecord) as a field in the schema.

I created the following NDJSON file for a little more coverage:

{"name": "foo", "id": "1"}
{"name": "bar"}
{"name": "foobar",}
{"name": "foobar","id": 2}

Record 3 is clearly corrupt (trailing comma), and when reading in this file, Spark will write that raw line to the corruptrecord field. Record 2, however, will populate id with NULL as you observed, and Record 4 will have its numeric id cast to a string.
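You can see the same distinction with a plain JSON parser, no Spark required: only Record 3 is a syntax error, while Record 2 parses fine and merely lacks the id key. This is a minimal sketch using the standard library; the classify helper and its labels are my own, not Spark API.

```python
import json

lines = [
    '{"name": "foo", "id": "1"}',
    '{"name": "bar"}',
    '{"name": "foobar",}',
    '{"name": "foobar","id": 2}',
]

def classify(line):
    # A line Spark routes to the corrupt-record column fails to parse at all;
    # a record that is merely missing a field parses fine without an "id" key.
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return "syntax error"
    return "ok" if "id" in record else "missing id"

results = [classify(line) for line in lines]
# Record 3 is the only syntax error; Record 2 is only missing a field.
```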

I am not sure if this will work for your use case, but we could create a rule that detects whether any field in a list of required fields is NULL, and then write an entry to the corruptrecord field to indicate this. For example:

from functools import reduce

from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

babySchema = StructType([
    StructField("name", StringType(), True),
    StructField("id", StringType(), False),
    StructField("corruptrecord", StringType(), True),
])

requiredfields = ["id", "name"]

anynullcondition = reduce(
    lambda acc, colname: acc | F.col(colname).isNull(),
    requiredfields[1:],
    F.col(requiredfields[0]).isNull(),
)
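The reduce call just folds a list of per-column null checks into one OR'd condition, seeded with the check for the first required field. The same pattern in plain Python, with dicts standing in for rows (any_null is an illustrative helper, not part of the answer's Spark code):

```python
from functools import reduce

def any_null(record, required_fields):
    # Fold per-field "is missing or None" checks into a single OR'd result,
    # mirroring how the Spark Column expression is built above.
    return reduce(
        lambda acc, name: acc or record.get(name) is None,
        required_fields[1:],
        record.get(required_fields[0]) is None,
    )

required = ["id", "name"]
any_null({"name": "foo", "id": "1"}, required)  # False: all fields present
any_null({"name": "bar"}, required)             # True: "id" is missing
```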

df = (
    spark.read
    .schema(babySchema)
    .option("mode", "permissive")
    .option("columnNameOfCorruptRecord", "corruptrecord")
    .json("sample.ndjson")
    .withColumn(
        "corruptrecord",
        F.coalesce(
            F.col("corruptrecord"),
            F.when(anynullcondition, F.lit("Invalid field")),
        ),
    )
)

Result:

+------+----+-------------------+
|name  |id  |corruptrecord      |
+------+----+-------------------+
|foo   |1   |NULL               |
|bar   |NULL|Invalid field      |
|NULL  |NULL|{"name": "foobar",}|
|foobar|2   |NULL               |
+------+----+-------------------+