Tuesday, November 8, 2016

Data Serialization Comparison: JSON, YAML, BSON, MessagePack

JSON is the de facto standard for data exchange on the web, but it has its drawbacks, and there are other formats that may be more suitable for certain scenarios. I’ll compare the pros and cons of the alternatives, including ease of use and performance.

Note: I won't cover implementation details here, but if you're a Ruby programmer, check out this article, where Dhaivat writes about implementing some serialization formats in Ruby.

What Is Data Serialization

According to Wikipedia, serialization is:

the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.

Let's say you want to collect certain data about a group of people: first name, last name, nickname, date of birth, and the instruments they play. You could easily set up a spreadsheet, define some columns, and make every row an entry. You could go a little further and specify that the date of birth column must be a number and that the instruments column holds a list of options. It'd look like this:

name     last name  dob   nickname  instruments
William  Bailey     1962  Axl Rose  vocals, piano
Saul     Hudson     1965  Slash     guitar

More or less, what you did there was define a data structure, and you'll do just fine if you only need it in a spreadsheet format. The problem is that, if you ever want to exchange this information with a database or a website, the mechanics by which these data structures are implemented on those platforms will be dramatically different, even if the underlying semantics are broadly the same. You can't just plug-n-play a spreadsheet into a web application unless the application has been specifically designed for it. And you can't transfer that info from the website to the database unless you have some sort of export tool or gateway for it.

Let's assume that our website already has these data structures implemented in its internal logic, and that it simply cannot deal with a spreadsheet format. To solve this problem, you can translate the data structures into a format that can be easily shared across different applications, architectures, or what have you: you serialize them. By doing so, you ensure not only that you can transfer the data across platforms, but also that it can be reconstructed by the reverse process, called deserialization. Furthermore, if the data is exchanged back from the website to the spreadsheet, you'll get a semantically identical clone of the original object, that is, a row that looks exactly the same as the one you originally sent.

In short: serializing data is finding some sort of universal format that can be easily shared across different applications.
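The round trip described above can be sketched in a few lines of Python. This is only an illustration, assuming JSON as the shared format and Python's standard json module as the tool; the variable names are made up for the example.

```python
import json

# Rows from the spreadsheet above, as plain Python data structures.
people = [
    {"name": "William", "last name": "Bailey", "dob": 1962,
     "nickname": "Axl Rose", "instruments": ["vocals", "piano"]},
    {"name": "Saul", "last name": "Hudson", "dob": 1965,
     "nickname": "Slash", "instruments": ["guitar"]},
]

# Serialize: data structure -> text that can be stored or sent elsewhere.
serialized = json.dumps(people)

# Deserialize: text -> a new, equivalent data structure.
restored = json.loads(serialized)

# The restored object is a separate copy with the same content.
round_trip_ok = (restored == people) and (restored is not people)
```

Here `serialized` is an ordinary string, so it can be written to a file or sent over a network, and `restored` is the "semantically identical clone" the text refers to.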

The Formats

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It's easy for humans to read and write; it's easy for machines to parse and generate.

JSON is the most widespread format for data serialization, and it has the following features:

  • (Mostly) human readable code: even if the code has been obscured or minified, you can always indent with tools such as JSONLint and make it readable again.
  • Very simple and straightforward specification: a summary of the whole spec fits on a single page (as displayed on the JSON site).
  • Widespread support: virtually every programming language has a JSON library, and many web service APIs offer JSON as a means of data interchange.
  • Based on a subset of JavaScript syntax, it supports the following data types:
    • string
    • number
    • object
    • array
    • true and false
    • null

This is how our previous spreadsheet would look, after being serialized in JSON:

[
  {
    "name": "William",
    "last name": "Bailey",
    "dob": 1962,
    "nickname": "Axl Rose",
    "instruments": [
      "vocals",
      "piano"
    ]
  },
  {
    "name": "Saul",
    "last name": "Hudson",
    "dob": 1965,
    "nickname": "Slash",
    "instruments": [
      "guitar"
    ]
  }
]
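As a rough illustration of the points above, here is how a JSON text might be parsed, inspected, minified, and re-indented with Python's standard json module. The fields "active" and "label" are hypothetical additions, included only so that true and null appear in the example.

```python
import json

# One record from the example above, plus two hypothetical fields
# ("active", "label") added only to show true/false and null.
text = ('{"name": "Saul", "dob": 1965, "instruments": ["guitar"], '
        '"active": true, "label": null}')

# Parse the JSON text into Python objects.
record = json.loads(text)

# Which Python type the json module picked for each JSON value.
types = {key: type(value).__name__ for key, value in record.items()}

# Minified form: no whitespace between tokens.
minified = json.dumps(record, separators=(",", ":"))

# Readable form: the same data, indented by two spaces.
pretty = json.dumps(record, indent=2)
```

Both `minified` and `pretty` parse back to the same data; only the whitespace differs, which is why a minified document can always be made readable again.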

BSON

BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. … It also contains extensions that allow representation of data types that are not part of the JSON spec.

JSON is a plain text format, and while binary data can be encoded in text (typically as Base64), this has certain limitations: Base64 alone inflates the data by about a third, which can make JSON files very big. BSON is designed to deal with these problems.
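To make the size cost concrete, here is a small sketch that embeds binary data in JSON. It assumes Base64 as the text encoding, which is a common choice but not something JSON itself prescribes, and it uses random bytes as a stand-in for a real image or attachment.

```python
import base64
import json
import os

# 3000 random bytes standing in for an image or attachment.
payload = os.urandom(3000)

# JSON strings cannot hold raw bytes, so encode them as Base64 text first.
encoded = base64.b64encode(payload).decode("ascii")
document = json.dumps({"attachment": encoded})

# Ratio of encoded size to original size (4 output chars per 3 input bytes).
overhead = len(encoded) / len(payload)

# The receiver has to decode again to get the original bytes back.
decoded = base64.b64decode(json.loads(document)["attachment"])
```

The encoded text is one third larger than the payload, and both sides pay for an extra encode and decode step. BSON avoids this by storing the bytes directly in a binary field.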

It has the following features:

  • convenient storage of binary information: better suited for exchanging images and attachments
  • designed for fast in-memory manipulation
  • simple specification: like JSON, BSON has a very short and simple spec
  • primary data representation for MongoDB: BSON is designed to be traversed easily
  • extra data types:
    • double (64-bit IEEE 754 floating point number)
    • date (integer number of milliseconds since the Unix epoch)
    • byte array (binary data)
    • BSON object and BSON array
    • JavaScript code
    • MD5 binary data
    • regular expressions
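To give a feel for how simple the binary layout is, below is a hand-rolled sketch of a BSON encoder for a handful of types, written against the public BSON spec using only Python's struct module. It is not MongoDB's implementation or any library's API. The function names are made up for this example, only flat documents are handled, and integers are assumed to fit in 32 bits.

```python
import struct


def encode_element(name: str, value) -> bytes:
    """Encode one key/value pair for a small subset of BSON types."""
    key = name.encode("utf-8") + b"\x00"  # element names are C strings
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return b"\x08" + key + (b"\x01" if value else b"\x00")
    if isinstance(value, float):  # 64-bit IEEE 754 double
        return b"\x01" + key + struct.pack("<d", value)
    if isinstance(value, int):  # assumed to fit in a signed 32-bit integer
        return b"\x10" + key + struct.pack("<i", value)
    if isinstance(value, str):  # length-prefixed UTF-8, length includes the NUL
        data = value.encode("utf-8") + b"\x00"
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, bytes):  # raw bytes, generic binary subtype 0x00
        return b"\x05" + key + struct.pack("<i", len(value)) + b"\x00" + value
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def encode_document(doc: dict) -> bytes:
    """Encode a flat dict as a BSON document."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    # The leading int32 is the total size, counting its own four bytes.
    return struct.pack("<i", len(body) + 4) + body


hello = encode_document({"hello": "world"})
blob = encode_document({"b": b"\x01\x02"})
```

Every value carries a type byte and, where needed, a length prefix, so a reader can skip over fields without parsing them, which is what makes the format easy to traverse. Binary data such as `b"\x01\x02"` is stored as is, with no text encoding step.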

by Lucero del Alba via SitePoint