
Complexity in Big Data — Process nested json with changing schema tags

You may have seen various cases of reading JSON data, ranging from nested structures to corrupt records. But let's see how to process a nested JSON whose schema tags change incrementally. We will use the Spark DataFrame API in Scala, Spark's native language, to solve this problem.

Let's see how to process a sample JSON structure like the one below:

{ "_id" : "595ada836ef4fb4fe47d8c01",
  "response" : { "0" : { "currency" : "JPY", "rate" : 112.27 },
                 "1" : { "currency" : "AUD", "rate" : 1.30078 },
                 "2" : { "currency" : "EUR", "rate" : 0.87544 },
                 "3" : { "currency" : "GBP", "rate" : 0.76829 },
                 "4" : { "currency" : "CNY", "rate" : 6.77907 } },
  "ratesDate" : "2017-07-03",
  "createdAt" : "2017-07-04T00:00:03.421Z",
  "updatedAt" : "2017-07-04T00:00:03.421Z" }

We can see in the JSON above that the API response is a nested struct whose keys are incremental tags ranging from 0 to n. What makes this problem complex yet still easily solvable is that we know the tags are incremented by 1.

Let's start by reading the data and printing the JSON schema to understand its complexity:

val dataframe = spark.read.option("multiLine", true).json("hdfs://path/of/json")
dataframe.printSchema

root
 |-- _id: string (nullable = true)
 |-- createdAt: string (nullable = true)
 |-- ratesDate: string (nullable = true)
 |-- response: struct (nullable = true)
 |    |-- 0: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- rate: double (nullable = true)
 |    |-- 1: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- rate: double (nullable = true)
 |    |-- 2: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- rate: double (nullable = true)
 |    |-- 3: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- rate: double (nullable = true)
 |    |-- 4: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- rate: double (nullable = true)
 |-- updatedAt: string (nullable = true)

We can see that the schema above is quite complex, but let's take up the challenge of converting it to a more structured format.
First, we need to find out how many responses arrive in a given JSON record, so we loop through the nested field names and keep a counter of the numeric tags.
Below is a code snippet to do that:

import scala.util.Try
import org.apache.spark.sql.DataFrame

// Count the numeric tags ("0", "1", ...) under the response struct
var counts = 0
for (name <- dataframe.select("response.*").schema.fieldNames) {
  if (Try(name.toInt).isSuccess) {
    counts += 1
  }
}
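The counting loop above can also be written as a single collection operation. A minimal Spark-free sketch, assuming the field names have already been pulled out of the schema (the sample `fieldNames` sequence here is hypothetical, matching the JSON above):

```scala
import scala.util.Try

// Hypothetical field names, as dataframe.select("response.*").schema.fieldNames
// would return them for the sample record
val fieldNames = Seq("0", "1", "2", "3", "4")

// Count only the names that parse as integers, i.e. the incremental response tags
val counts = fieldNames.count(name => Try(name.toInt).isSuccess)
```

This also silently skips any non-numeric field that might appear alongside the tags.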

Now that we know how many responses we have, we can add each response tag's contents as new columns, with column names built by concatenating the fixed nested path. See the code snippet below for the actual implementation:

import org.apache.spark.sql.functions.col

var dataframeflattened = dataframe
for (c <- 0 until counts) {
  // Build the nested paths "response.<c>.currency" and "response.<c>.rate"
  val concatstringcurrency = "response." + c.toString + ".currency"
  val concatstringcurrencyrate = "response." + c.toString + ".rate"
  dataframeflattened = dataframeflattened
    .withColumn(concatstringcurrency, col(concatstringcurrency))
    .withColumn(concatstringcurrencyrate, col(concatstringcurrencyrate))
}
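The path construction in that loop can also be done up front, producing all the nested paths in one pass. A small Spark-free sketch, assuming `counts` responses numbered from 0 as computed in the previous step:

```scala
val counts = 5 // hypothetical value, as computed from the sample record above

// Build "response.<i>.currency" and "response.<i>.rate" paths for every tag
val paths = (0 until counts).flatMap { c =>
  Seq(s"response.$c.currency", s"response.$c.rate")
}
// paths: response.0.currency, response.0.rate, ..., response.4.rate
```

In Spark, such a list could then be fed to a single `select(paths.map(col): _*)` call instead of repeated `withColumn` calls, which is generally cheaper since each `withColumn` adds a projection to the plan.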

And, finally, we arrive at the required flattened dataframe:

dataframeflattened.drop("response").show

Alternatively, the data can be flattened using explode(), as shown below:

import org.apache.spark.sql.functions.{array, col, explode}

var dataframeflattenedalttmp = dataframe
  .withColumn("temp", explode(array(col("response.*"))))
var dataframeflattenedalt = dataframeflattenedalttmp
  .withColumn("currency", explode(array(col("temp.currency"))))
  .withColumn("rate", explode(array(col("temp.rate"))))
  .drop("temp", "response")
dataframeflattenedalt.show

Great, we have solved the major part of it. In the next part of the Complexity in Big Data series, we will see how to parse multiple JSON files with entirely different schemas, unify all the resulting dataframes into a single dataframe, and write it to a target path or a Hive table.

Special thanks to Priyanshu and Subhasish for contributing different ideas for solving it.

Stay tuned, and keep shining!


Saurav Agarwal
