
Data Masking in Big Data [Spark]


[vc_row][vc_column][vc_column_text]

We often face the challenge of masking data in our Big Data pipelines so that sensitive data is hidden from unauthorized users. These users can be developers, business analysts, data engineers, or anyone else exploring the data.

Simply replacing every value with XXXX makes the data useless, so I have come up with a generic masking method that works for all types of data and keeps the masked data usable.

[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][vc_column_text]

Data masking differs from encryption in both usage and purpose: masking is applied mainly to PII data, while encryption can also be applied to the entire dataset.

[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][vc_column_text]

Data can come in many formats, e.g. email, credit card, number, string, date, etc. We need to preserve these formats so that users at least know what kind of data to expect.

Take the following Deals dataset, defined as DealsDF, as an example:
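To make the idea of preserving the format a bit more concrete, here is a minimal sketch in plain Scala. The object `FormatPreservingMask` and its two helpers are hypothetical names introduced only for illustration and are not part of the method described in this post. They hide the sensitive characters while keeping the shape of the value, and they could probably be wrapped in a Spark UDF if needed.

```scala
// Hypothetical helpers (an assumption, not from the original post):
// they mask the sensitive characters but keep the overall format.
object FormatPreservingMask {

  // Keep the first character of the local part and the whole domain,
  // e.g. "john.doe@example.com" becomes "jxxxxxxx@example.com".
  def maskEmail(email: String): String = {
    val at = email.indexOf('@')
    if (at <= 0) "xxxx" // not a recognisable email, so mask everything
    else email.take(1) + "x" * (at - 1) + email.drop(at)
  }

  // Replace every digit except the last four with 'x' and keep the
  // separators. Values with four digits or fewer are returned unchanged.
  def maskCardNumber(card: String): String = {
    val totalDigits = card.count(_.isDigit)
    var seen = 0
    card.map { c =>
      if (c.isDigit) {
        seen += 1
        if (seen <= totalDigits - 4) 'x' else c
      } else c
    }
  }
}
```

For example, `FormatPreservingMask.maskCardNumber("4111-1111-1111-1234")` keeps the dashes and the last four digits, so a reader can still tell that the column holds card numbers.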

+-------+---------------+----------+-----------+
|deal_id|discount_amount|product_id|active_flag|
+-------+---------------+----------+-----------+
|deal1  |1.0            |product1  |Y          |
|deal2  |1.2            |product2  |Y          |
|deal1  |1.5            |product3  |N          |
|deal1  |1.2            |product4  |Y          |
|deal1  |1.2            |product2  |Y          |
+-------+---------------+----------+-----------+

The task is to mask the deal_id, discount_amount and product_id columns in every row. The following snippet does this.

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.lit

// Overwrites each column listed in colmask with a fixed placeholder chosen by its mask type.
// colmask maps a mask type ("email", "string", ...) to the columns that should receive that mask.
def mask(inputDF: Dataset[Row], colmask: Map[String, List[String]], customMaskString: String = ""): Dataset[Row] = {
  var initialDF = inputDF
  colmask.foreach {
    case ("email", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit("aXXX@XXXX.com")) }
    case ("custom", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit(customMaskString)) }
    case ("creditcard", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit("xxxx-xxxx-xxxx-1234")) }
    // The random number is drawn once per column, so every row of that column gets the same value.
    case ("randomnumber", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit(scala.util.Random.nextInt(10000))) }
    case ("number", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit(0)) }
    case ("string", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit("xxxx")) }
    case ("date", value) => value.foreach { v => initialDF = initialDF.withColumn(v, lit("01-01-1900")) }
    case _ => () // unknown mask types leave the DataFrame unchanged
  }
  initialDF
}
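One detail of the "randomnumber" case is worth spelling out: `scala.util.Random.nextInt(10000)` is evaluated once on the driver before it is passed to `lit`, so every row of a masked column ends up with the same number. If a different number per row is wanted, a possible variation is to let Spark generate the value with `rand()`. The sketch below assumes that behaviour is desired; `RandomNumberMask` and `maskWithRandom` are hypothetical names, not part of the snippet above.

```scala
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.rand

// Hypothetical variation (an assumption, not the author's code).
object RandomNumberMask {

  // Overwrite each listed column with a random integer in [0, 10000).
  // rand() is evaluated by Spark for every row, unlike a value wrapped in lit().
  def maskWithRandom(df: Dataset[Row], columns: List[String]): Dataset[Row] =
    columns.foldLeft(df) { (acc, c) =>
      acc.withColumn(c, (rand() * 10000).cast("int"))
    }
}
```

The `foldLeft` threads the DataFrame through each `withColumn` call, which avoids the mutable `var` while doing the same kind of column-by-column replacement.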

Running the example below produces the masked data.

val maskeddf = mask(DealsDF, Map("string" -> List("deal_id", "product_id"), "number" -> List("discount_amount")))
maskeddf.show(false)

+-------+---------------+----------+-----------+
|deal_id|discount_amount|product_id|active_flag|
+-------+---------------+----------+-----------+
|xxxx   |0              |xxxx      |Y          |
|xxxx   |0              |xxxx      |Y          |
|xxxx   |0              |xxxx      |N          |
|xxxx   |0              |xxxx      |Y          |
|xxxx   |0              |xxxx      |Y          |
+-------+---------------+----------+-----------+

We can also create a config table and fetch the Map, e.g. Map("string" -> List("deal_id", "product_id"), "number" -> List("discount_amount")), from that table instead of hard-coding it.

It's as simple as that. In the next post I will cover different types of encryption algorithms that can be mixed and matched with masking to secure your big data pipelines.
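As a rough illustration of the config-table idea, the sketch below turns a list of rules into the Map that mask expects. The case class `MaskRule`, its fields, and the object `MaskConfig` are hypothetical; they assume a config table with one row per column to mask, whose rows have already been collected to the driver.

```scala
// Hypothetical shape of one row in the masking config table (an assumption).
final case class MaskRule(maskType: String, columnName: String)

object MaskConfig {

  // Group the rules by mask type, giving mask type -> list of columns.
  // The order of columns within each mask type follows the input order.
  def toColMask(rules: Seq[MaskRule]): Map[String, List[String]] =
    rules.groupBy(_.maskType).map { case (maskType, rs) =>
      maskType -> rs.map(_.columnName).toList
    }

  // Rules as they might look after reading the config table.
  val exampleRules: Seq[MaskRule] = Seq(
    MaskRule("string", "deal_id"),
    MaskRule("string", "product_id"),
    MaskRule("number", "discount_amount")
  )
}
```

Calling `MaskConfig.toColMask(MaskConfig.exampleRules)` yields the same Map as the hard-coded one, so adding a new column to mask becomes a change to the table rather than to the code.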

Happy Coding!

[/vc_column_text][/vc_column][/vc_row]

 

Saurav Agarwal


    © BeingDatum. All rights reserved.