Data formats refer to how the data fields are formatted, or structured, within a file. In the previous section, we compared the CSV and TSV file formats. Both save data in a flat way, meaning there are no nested relationships between data records within the file. For this reason, you’ll hear CSV and TSV files referred to as flat files with a fixed structure.
In cases where the data records contain more complex relationships, we need to store the data in a more flexible format that allows for these relationships to be maintained.
For example, what if we wanted to store data for someone’s name and type of car:
| Name | Car |
| --- | --- |
| Serge | Ford |
| Colby | Saturn |
But what if one person has two cars? Would we add a third column (see table below) and leave the other person’s cell blank (an inefficient use of space)? What if someone had five cars?
| Name | Car | Second Car |
| --- | --- | --- |
| Serge | Ford | |
| Colby | Saturn | Honda |
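To make the blank-cell problem concrete, here is a minimal Python sketch that reads the wider table as CSV text. The CSV string is written by hand for illustration (an assumption, not a file from this course), and it uses only the standard `csv` and `io` modules.

```python
import csv
import io

# The wider table, written by hand as CSV text for illustration.
# Serge has no second car, so his last field is empty.
wide_csv = "Name,Car,Second Car\nSerge,Ford,\nColby,Saturn,Honda\n"

# DictReader maps each row to the column names in the header line.
rows = list(csv.DictReader(io.StringIO(wide_csv)))

# Every row must carry every column, so a missing car shows up as an empty string.
empty_cells = sum(1 for row in rows for value in row.values() if value == "")

print(rows[0])       # Serge's row, including the empty "Second Car" field
print(empty_cells)   # number of blank cells in the whole table
```

Each extra car would need another column, and every person with fewer cars would get another blank cell.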
This is where semi-structured, or nested, data formats come into play. A popular data format for this type of use case is JSON (JavaScript Object Notation). JSON is made up of key-value pairs: the names on the left are the keys and the fields on the right are the values, with each set of pairs enclosed in curly braces. A key’s value can be a list in square brackets holding any number of items, so there is no need to add columns.
The above example in JSON would look like this:
[
  {"Name": "Serge", "Car": ["Ford"]},
  {"Name": "Colby", "Car": ["Saturn", "Honda"]}
]
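As a rough illustration of how this nested structure is read in practice, the sketch below parses the same records with Python’s standard `json` module. The variable names are chosen here for illustration only. Note that JSON requires straight double quotes around keys and string values.

```python
import json

# The two records from the example, as a JSON string.
records_json = """
[
  {"Name": "Serge", "Car": ["Ford"]},
  {"Name": "Colby", "Car": ["Saturn", "Honda"]}
]
"""

# json.loads turns the text into a Python list of dictionaries.
records = json.loads(records_json)

# Each person's "Car" value is a list, so it can hold one car or many
# without changing the structure of the other records.
cars_per_person = {record["Name"]: len(record["Car"]) for record in records}

print(cars_per_person)
```

Adding a third car for Colby would only lengthen that one list; Serge’s record stays untouched.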
The point here isn’t to become an expert at decoding JSON; rather, it’s simply to know that there are various ways of storing and structuring data. For instance, in addition to JSON, some other common formats are Avro, ORC, and Parquet.