Question Details

No question body available.

Tags

apache-spark

Answers (2)

March 20, 2026 Score: 2 Rep: 950

While convenient for ad-hoc analysis, relying on inferSchema: true introduces significant overhead (Spark has to make an extra pass over the data to sample the types) and fragility (the inferred types can drift from one batch of files to the next) into your data pipelines. Try to avoid it in production.

You could use DDL strings instead. Rather than letting Spark guess, tell it exactly what to expect using a DDL-formatted schema string. It's concise and readable.

schemaddl = "id INT, name STRING, metadata STRUCT<key: STRING, value: STRING>"  # the struct's fields are illustrative; the original left them unspecified
df = spark.read.schema(schemaddl).json("path/to/data.json")
March 20, 2026 Score: 1 Rep: 6,243

It seems that inferring dates is not supported at the moment (Spark 4.1.1). Although there is a "dateFormat" option, it is only used to parse strings into dates when the schema is specified manually.

There is an inferTimestamp option, though. It can be combined with timestampFormat to at least get a timestamp type (see JSONOptions.scala):

spark.read.option("inferTimestamp","true").option("timestampFormat","yyyy-MM-dd").json("myfile.json").printSchema
root
 |-- d: timestamp (nullable = true)
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)

It is not optimal, because now I cannot distinguish between dates and timestamps, but it is at least better than strings.

Let's see if I can create a PR for this.