Wednesday, February 3, 2016

Validating Data With JSON-Schema, Part 1

When you’re dealing with complex and structured data, you need to determine whether the data is valid or not. JSON-Schema is a standard for JSON documents that describe the structure and the requirements of your JSON data. In this two-part series, you’ll learn how to use JSON-Schema to validate data.

Let’s say you have a database of users where each record looks similar to this example:

The question we are going to deal with is how to determine whether a record like the one above is valid or not.

Examples are very useful but not sufficient when describing your data requirements. JSON-Schema comes to the rescue. This is one of the possible schemas describing a user record:

Have a look at the schema above and the user record it describes (that is valid according to this schema). There is a lot of explaining to do here.

JavaScript code to validate the user record against the schema could be:

or, for better performance:

    var validate = ajv.compile(userSchema);
    var valid = validate(userData);
    if (!valid) console.log(validate.errors);

All the code samples are available in the GitHub repo tutsplus-json-schema. You can also try it in the browser.

Ajv, the validator used in the example, is the fastest JSON-Schema validator for JavaScript. I created it, so I am going to use it in this tutorial.

Before we continue, let’s quickly deal with all the whys.

Why Validate Data as a Separate Step?

  • to fail fast
  • to avoid data corruption
  • to simplify processing code
  • to use validation code in tests

Why JSON (and not XML)?

  • as wide adoption as XML
  • easier to process and more concise than XML
  • dominates web development because of JavaScript

Why Use Schemas?

  • declarative
  • easier to maintain
  • can be understood by non-coders
  • no need to write code; third-party open-source libraries can be used

Why JSON-Schema?

  • the widest adoption among all standards for JSON validation
  • very mature (current version is 4, there are proposals for version 5)
  • covers a big part of validation scenarios
  • uses easy-to-parse JSON documents for schemas
  • platform independent
  • easily extensible
  • 30+ validators for different languages, including 10+ for JavaScript, so no need to code it yourself

Tasks

This tutorial includes several relatively simple tasks to help you better understand JSON-Schema and how it can be used. There are simple JavaScript scripts to check that you’ve done them correctly. To run them, you will need to install Node.js (no experience with it is needed). Just install nvm (Node Version Manager) and a recent Node.js version:

You also need to clone the repo and run npm install (it will install Ajv validator).

Let’s Dive Into the Schemas!

A JSON-schema is always an object. Its properties are called “keywords”. Some of them describe the rules for the data (e.g., “type” and “properties”), and some describe the schema itself (“$schema”, “id”, “title”, “description”); we will get to the second group later.

The data is valid according to the schema if it is valid according to all keywords in this schema—that’s really simple.

Data Properties

Because most JSON data consists of objects with multiple properties, the keyword “properties” is probably the most commonly used keyword. It only applies to objects (see the next section about what “apply” means).

You might have noticed in the example above that each property inside the “properties” keyword describes the corresponding property in your data.

The value of each property is itself a JSON-schema—JSON-schema is a recursive standard. Each property in the data should be valid according to the corresponding schema in the “properties” keyword.

The important thing here is that the “properties” keyword doesn’t make any property required; it only defines schemas for the properties that are present in the data.

For example, if our schema is:

then objects with or without property “foo” can be valid according to this schema:

and only objects that have property foo that is not a string are invalid:
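A minimal sketch of this example, assuming the schema simply declares foo as a string. The isValid function is a hypothetical hand-written stand-in that understands only this one schema shape; it is not how Ajv or any other validator is implemented.

```javascript
// Assumed schema: "foo", when present, must be a string.
const schema = { properties: { foo: { type: 'string' } } };

// Hypothetical stand-in for a validator. It understands only
// "properties" whose sub-schemas are { type: 'string' }.
function isValid(schema, data) {
  // "properties" applies to objects only, so other data passes.
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return true;
  }
  return Object.keys(schema.properties).every((name) => {
    // A property that is absent is never checked.
    if (!Object.prototype.hasOwnProperty.call(data, name)) return true;
    return typeof data[name] === schema.properties[name].type;
  });
}

console.log(isValid(schema, {}));             // true
console.log(isValid(schema, { foo: 'bar' })); // true
console.log(isValid(schema, { foo: 1 }));     // false
```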

Try this example in the browser.

Data Type

You’ve already figured out what the keyword “type” does. It is probably the most important keyword. Its value (a string or array of strings) defines what type (or types) the data must be to be valid.

As you can see in the example above, the user data must be an object.

Most keywords apply to certain data types—for example, the keyword “properties” only applies to objects, and the keyword “pattern” only applies to strings.

What does “apply” mean? Let’s say we have a really simple schema:

You might expect that, to be valid according to this schema, the data must be a string matching the pattern:

But the JSON-schema standard specifies that if a keyword doesn’t apply to the data type, then the data is valid according to this keyword. That means that any data that is not of type “string” is valid according to the schema above—numbers, arrays, objects, boolean, and even null. If you want only strings matching the pattern to be valid, your schema should be:
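The difference between the two schemas can be sketched as follows. The pattern itself is an illustrative assumption, and isValid is a hypothetical hand-written check that covers only the “type” and “pattern” keywords.

```javascript
// Assumed example schemas; the pattern is illustrative only.
const patternOnly = { pattern: '^[a-z]+$' };
const stringWithPattern = { type: 'string', pattern: '^[a-z]+$' };

// Hypothetical stand-in for a validator. It handles only
// "type": "string" and "pattern".
function isValid(schema, data) {
  if (schema.type === 'string' && typeof data !== 'string') return false;
  // "pattern" applies to strings only; other types pass this keyword.
  if (schema.pattern !== undefined && typeof data === 'string') {
    return new RegExp(schema.pattern).test(data);
  }
  return true;
}

console.log(isValid(patternOnly, 'abc'));    // true
console.log(isValid(patternOnly, 'ABC'));    // false
console.log(isValid(patternOnly, 42));       // true, the keyword does not apply
console.log(isValid(stringWithPattern, 42)); // false, the type is enforced
```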

Because of this, you can make very flexible schemas that will validate multiple data types.

Look at the property “id” in the user example. It should be valid according to this schema:

This schema requires that the data, to be valid, should be either a “string” or an “integer”. There is also the keyword “pattern” that applies only to strings; it requires that the string should consist of digits only and not start with 0. There is the keyword “minimum” that applies only to numbers; it requires that the number should be not less than 1.
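As a sketch, the schema described here could look like the object below. The exact regular expression is an assumption reconstructed from the description, and isValidId is a hypothetical hand-written check that mirrors the three keywords one by one.

```javascript
// Schema reconstructed from the description of the "id" property.
// The regular expression is an assumption.
const idSchema = {
  type: ['string', 'integer'],
  pattern: '^[1-9][0-9]*$',
  minimum: 1
};

// Hypothetical hand-written check mirroring the three keywords.
function isValidId(data) {
  const isString = typeof data === 'string';
  const isInteger = Number.isInteger(data);
  // "type": the data must be a string or an integer.
  if (!isString && !isInteger) return false;
  // "pattern": applies to strings only.
  if (isString && !new RegExp(idSchema.pattern).test(data)) return false;
  // "minimum": applies to numbers only.
  if (isInteger && data < idSchema.minimum) return false;
  return true;
}

console.log(isValidId('123'));  // true
console.log(isValidId('0123')); // false
console.log(isValidId(7));      // true
console.log(isValidId(0));      // false
```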

Another, more verbose, way to express the same requirement is:

But because of the way JSON-schema is defined, this schema is equivalent to the first one, which is shorter and faster to validate in most validators.

Data types you can use in schemas are “object”, “array”, “number”, “integer”, “string”, “boolean”, and “null”. Note that “number” includes “integer”—all integers are numbers too.

Numbers Validation

There are several keywords to validate numbers. All the keywords in this section apply to numbers only (including integers).

“minimum” and “maximum” are self-explanatory. In addition to them, there are the keywords “exclusiveMinimum” and “exclusiveMaximum”. In our user example, the user age is required to be an integer that is 13 or bigger. If the schema for the user age were:

then this schema would require that the user age is strictly bigger than 13, i.e. the lowest allowed age would be 14.
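A sketch of the two variants, assuming the draft 4 form in which “exclusiveMinimum” is a boolean that modifies “minimum”. The isValidAge function is a hypothetical hand-written check, not part of any validator.

```javascript
// Draft 4 form: "exclusiveMinimum" is a boolean that modifies "minimum".
const ageSchema = { type: 'integer', minimum: 13 };
const strictAgeSchema = { type: 'integer', minimum: 13, exclusiveMinimum: true };

// Hypothetical hand-written check for these two schemas only.
function isValidAge(schema, data) {
  if (!Number.isInteger(data)) return false;
  return schema.exclusiveMinimum === true
    ? data > schema.minimum
    : data >= schema.minimum;
}

console.log(isValidAge(ageSchema, 13));       // true
console.log(isValidAge(strictAgeSchema, 13)); // false
console.log(isValidAge(strictAgeSchema, 14)); // true
```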

Another keyword to validate numbers is “multipleOf”. Its name also explains what it does, and you can check out the JSON-schema keywords reference to see how it works.

Strings Validation

There are also several keywords to validate strings. All the keywords in this section apply to strings only.

“maxLength” and “minLength” require that the string is not longer or not shorter than the given number of characters. The JSON-schema standard requires that a Unicode surrogate pair, such as an emoji character, is counted as a single character. JavaScript counts it as two characters when you access the .length property of a string.

Some validators determine string lengths as required by the standard, and some do it the JavaScript way, which is faster. Ajv allows you to specify how to determine string lengths, and the default is to comply with the standard.
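The two ways of counting can be seen in plain JavaScript. The codePointLength helper below is a hypothetical name used for illustration; it shows one way to count characters as the standard requires, not the way any particular validator does it.

```javascript
// One emoji character, outside the Basic Multilingual Plane.
const smiley = '\u{1F600}';

// The .length property counts UTF-16 code units.
console.log(smiley.length); // 2

// Hypothetical helper: iterating a string walks code points,
// so a surrogate pair is counted once.
function codePointLength(str) {
  return Array.from(str).length;
}

console.log(codePointLength(smiley)); // 1
console.log(codePointLength('abc'));  // 3
```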

You have already seen the “pattern” keyword in action—it simply requires that the data string matches the regular expression defined according to the same standard that is used in JavaScript. See the example below for schemas and matching regular expressions:
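One detail worth a sketch: a pattern is not implicitly anchored, so it matches if it is found anywhere in the string. The patterns and the matchesPattern helper below are illustrative assumptions.

```javascript
// Hypothetical helper showing how "pattern" relates to a JavaScript RegExp.
function matchesPattern(pattern, data) {
  // "pattern" applies to strings only.
  if (typeof data !== 'string') return true;
  return new RegExp(pattern).test(data);
}

// Unanchored: digits anywhere in the string are enough.
console.log(matchesPattern('[0-9]+', 'abc123'));   // true

// Anchored with ^ and $: the whole string must consist of digits.
console.log(matchesPattern('^[0-9]+$', 'abc123')); // false
console.log(matchesPattern('^[0-9]+$', '123'));    // true
```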

The “format” keyword defines the semantic validation of strings, such as “email”, “date” or “date-time” in the user example. JSON-Schema also defines the formats “uri”, “hostname”, “ipv4”, and “ipv6”. Validators define formats differently, optimizing for validation speed or for correctness. Ajv gives you a choice:

Most validators allow you to define custom formats either as regular expressions or validating functions. We could define a custom format “phone” for our schema to use it in multiple places:

and then the schema for the phone property would be:

Task 1

Create a schema that will require the data to be a date (string) or a year (number) and that a year is bigger than or equal to 1976.

Put your answer in the file part1/task1/date_schema.json and run node part1/task1/validate to check it.

Object Validation

In addition to “properties”, you can see several other keywords in our user example that apply to objects.

The “required” keyword lists properties that must be present in the object for it to be valid. As you remember, the “properties” keyword doesn’t require properties; it only validates them if they are present. “required” complements “properties”, allowing you to define which properties are required and which are optional.

If we had this schema:

then all objects without property foo would be invalid.

Please note that this schema still doesn’t require our data to be an object—all other data types are valid according to it. To require that our data is an object, we have to add the “type” keyword to it.
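A sketch of both variants, assuming the schema lists foo as its only required property. The isValid function is a hypothetical hand-written check covering only “required” and “type”: “object”.

```javascript
// Assumed schemas: "foo" is required, with and without a type constraint.
const requiredOnly = { required: ['foo'] };
const objectWithRequired = { type: 'object', required: ['foo'] };

// Hypothetical stand-in for a validator.
function isValid(schema, data) {
  const isObject =
    typeof data === 'object' && data !== null && !Array.isArray(data);
  if (schema.type === 'object' && !isObject) return false;
  // "required" applies to objects only; other data passes this keyword.
  if (!isObject) return true;
  return schema.required.every((name) =>
    Object.prototype.hasOwnProperty.call(data, name)
  );
}

console.log(isValid(requiredOnly, { foo: 1 }));        // true
console.log(isValid(requiredOnly, {}));                // false
console.log(isValid(requiredOnly, 'a string'));        // true
console.log(isValid(objectWithRequired, 'a string'));  // false
```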

Try the example above in the browser.

The “patternProperties” keyword allows you to define schemas according to which the data property value should be valid if the property name matches the regular expression. It can be combined with the “properties” keyword in the same schema.

The feeds property in the user example should be valid according to this schema:

To be valid, feeds should be an object with properties whose names consist only of Latin letters and whose values are boolean.
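A sketch of such a schema and of what it accepts. The regular expression is an assumption based on the description, and isValidFeeds is a hypothetical hand-written check, not validator code.

```javascript
// Assumed schema for "feeds": Latin-letter names, boolean values,
// and nothing else allowed.
const feedsSchema = {
  type: 'object',
  patternProperties: { '^[A-Za-z]+$': { type: 'boolean' } },
  additionalProperties: false
};

// Hypothetical hand-written check for this one schema.
function isValidFeeds(data) {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return false;
  }
  const patterns = Object.keys(feedsSchema.patternProperties);
  return Object.keys(data).every((name) => {
    const matching = patterns.filter((p) => new RegExp(p).test(name));
    // No pattern matched: the property is "additional" and is prohibited.
    if (matching.length === 0) return false;
    return matching.every(
      (p) => typeof data[name] === feedsSchema.patternProperties[p].type
    );
  });
}

console.log(isValidFeeds({ news: true, sport: false })); // true
console.log(isValidFeeds({ news: 'yes' }));              // false
console.log(isValidFeeds({ 'feed-1': true }));           // false
```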

The “additionalProperties” keyword allows you to either define the schema according to which all other properties (not listed in “properties” and not matching “patternProperties”) should be valid, or to prohibit other properties completely, as we did in the feeds property schema above.

In the following example, “additionalProperties” is used to create a simple schema for hash of integers in a certain range:

The “maxProperties” and “minProperties” keywords allow you to limit the number of properties in the object. In our user example, the schema for the address property is:

This schema requires that the address is an object with required properties street, postcode, city and country, allows two additional properties (“maxProperties” is 6), and requires that all properties are strings.
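One way to write a schema with these requirements is sketched below. How the original schema expresses “all properties are strings” is an assumption here (“additionalProperties” with no “properties” keyword), and isValidAddress is a hypothetical hand-written check.

```javascript
// Assumed shape of the address schema, based on the description.
const addressSchema = {
  type: 'object',
  required: ['street', 'postcode', 'city', 'country'],
  additionalProperties: { type: 'string' },
  maxProperties: 6
};

// Hypothetical hand-written check for this one schema.
function isValidAddress(data) {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return false;
  }
  const names = Object.keys(data);
  if (names.length > addressSchema.maxProperties) return false;
  const hasRequired = addressSchema.required.every((n) => names.includes(n));
  // With no "properties" keyword, every property counts as "additional".
  const allStrings = names.every((n) => typeof data[n] === 'string');
  return hasRequired && allStrings;
}

const address = { street: '1 Main St', postcode: 'AB1 2CD', city: 'Leeds', country: 'UK' };
console.log(isValidAddress(address)); // true
```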

“dependencies” is probably the most complex, the most confusing, and the most rarely used keyword, but it is also very powerful. It allows you to define the requirements that the data should satisfy if it has certain properties.

There are two types of such requirements for the object: to have some other properties (this is called a “property dependency”) or to satisfy some schema (a “schema dependency”).

In our user example, one of the possible schemas that the user connection should be valid against is this:

It requires that the connType property is equal to “relative” (see the “enum” keyword below) and that if the relation property is present, it is a string.

It does not require that relation is present, but the “dependencies” keyword requires that IF the relation property is present, THEN the close property should be present too.

There are no validation rules defined for the close property in our schema, although from the example we can see that it probably must be boolean. One of the ways we could correct this omission is to change the “dependencies” keyword to use “schema dependency”:

You can play with the updated user example in the browser.

Please note that the schema in the “relation” property in the “dependencies” keyword is used to validate the parent object (i.e. connection) and not the value of the relation property in the data.
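The two kinds of dependency can be sketched side by side. The exact contents of the schema dependency are an assumption (here it both requires close and types it as boolean), and the two check functions are hypothetical hand-written stand-ins for a validator.

```javascript
// Property dependency: if "relation" is present, "close" must be present too.
const propertyDependency = { dependencies: { relation: ['close'] } };

// Schema dependency (assumed contents): if "relation" is present, the whole
// connection object must satisfy this schema.
const schemaDependency = {
  dependencies: {
    relation: {
      properties: { close: { type: 'boolean' } },
      required: ['close']
    }
  }
};

function has(obj, name) {
  return Object.prototype.hasOwnProperty.call(obj, name);
}

function checkPropertyDependency(conn) {
  if (!has(conn, 'relation')) return true;
  return propertyDependency.dependencies.relation.every((n) => has(conn, n));
}

function checkSchemaDependency(conn) {
  if (!has(conn, 'relation')) return true;
  // The dependency schema is applied to conn itself, not to conn.relation.
  return has(conn, 'close') && typeof conn.close === 'boolean';
}

const conn = { connType: 'relative', relation: 'father', close: 'yes' };
console.log(checkPropertyDependency(conn)); // true, "close" is present
console.log(checkSchemaDependency(conn));   // false, "close" is not a boolean
```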

Task 2

Your database contains humans and machines. Using only the keywords that I’ve explained so far, create a schema to validate both of them. A sample human object:

A sample machine object:

Note that it should be one schema to validate both humans and machines, not two schemas.

Put your answer in the file part1/task2/human_machine_schema.json and run node part1/task2/validate to check it.

Hints: use the “dependencies” keyword, and look in the file part1/task2/invalid.json to see which objects should be invalid.

Which objects that probably should be invalid too are not in the invalid.json file?

The main takeaway from this task is the fact that the purpose of validation is not only to validate all valid objects as valid. I’ve heard this argument many times: “This schema validates my valid data as valid, therefore it is correct.” This argument is wrong because you don’t need to do much to achieve it—an empty schema will do the job, because it validates any data as valid.

I think that the main purpose of validation is to validate invalid data as invalid, and that’s where all the complexity comes from.

Array Validation

There are several keywords to validate arrays (and they apply to arrays only).

“maxItems” and “minItems” require that the array has not more (or not less) than a certain number of items. In the user example, the schema requires that the number of connections is not more than 150.

The “items” keyword defines a schema (or schemas) according to which the items should be valid. If the value of this keyword is an object (as in the user example), then this object is a schema according to which every item in the array should be valid.

If the value of the “items” keyword is an array, then this array contains schemas according to which the corresponding items should be valid:

The schema in the simple example above requires that the data is an array, with the first item that is an integer and the second that is a string.

What about items after these two? The schema above defines no requirements for other items. They can be defined with the “additionalItems” keyword.

The “additionalItems” keyword only applies to the situation in which the “items” keyword is an array and there are more items in the data than in the “items” keyword. In all other cases (no “items” keyword, it is an object, or there are not more items in the data), the “additionalItems” keyword will be ignored, regardless of its value.

If the “additionalItems” keyword is true, it is simply ignored. If it is false and the data has more items than the “items” keyword, then validation fails:

If the “additionalItems” keyword is an object, then this object is a schema according to which all additional items should be valid:
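The interaction of “items” and “additionalItems” can be sketched with a simplified model. To keep it short, each schema is reduced to a type name, which is an assumption of this sketch; isValidTuple and matchesType are hypothetical helpers, not validator functions.

```javascript
// "items" as an array, reduced to type names: an integer, then a string.
const itemTypes = ['integer', 'string'];

function matchesType(type, value) {
  if (type === 'integer') return Number.isInteger(value);
  return typeof value === type;
}

// additionalItems: undefined or true (ignored), false (no extra items),
// or a type name standing in for a schema.
function isValidTuple(data, additionalItems) {
  if (!Array.isArray(data)) return false;
  return data.every((value, i) => {
    // Positional schemas cover the first items.
    if (i < itemTypes.length) return matchesType(itemTypes[i], value);
    if (additionalItems === false) return false;
    if (typeof additionalItems === 'string') {
      return matchesType(additionalItems, value);
    }
    return true;
  });
}

console.log(isValidTuple([1, 'a']));                // true
console.log(isValidTuple([1, 'a', null]));          // true
console.log(isValidTuple([1, 'a', null], false));   // false
console.log(isValidTuple([1, 'a', 2], 'integer'));  // true
```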

Please experiment with these examples to see how “items” and “additionalItems” work.

The last keyword that applies to arrays is “uniqueItems”. If its value is true, it simply requires that all items in the array are different.

Validating the keyword “uniqueItems” can be computationally expensive, so some validators choose not to implement it or to do so only partially.

Ajv has an option to ignore this keyword:

Task 3

One of the ways to create a date object in JavaScript is to pass from 2 to 7 numbers to the Date constructor:
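For instance, with illustrative argument values (the month is zero-based, so 1 means February):

```javascript
// Two arguments: year and zero-based month.
const yearMonth = new Date(2016, 1);

// Seven arguments: year, month, day, hours, minutes, seconds, milliseconds.
const full = new Date(2016, 1, 3, 10, 30, 15, 500);

console.log(yearMonth.getFullYear(), yearMonth.getMonth());
console.log(full.getDate(), full.getHours(), full.getMilliseconds());
```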

You have an array. Create a schema that will validate that this is a valid list of arguments for the Date constructor.

Put your answer in the file part1/task3/date_args_schema.json and run node part1/task3/validate to check it.

“Enum” Keyword

The “enum” keyword requires that the data is equal to one of several values. It applies to all types of data.

In the user example, it is used to define the gender property inside the personal property as either “male” or “female”. It is also used to define the connType property in user connections.

The “enum” keyword can be used with any types of values, not only strings and numbers, although it is not very common.

It can also be used to require that data is equal to a specific value, as in the user example:

The proposals for the next version (v5) of the JSON-Schema standard include the keyword “constant” to do the same.

Ajv allows you to use “constant” and some other keywords from v5:

Compound Validation Keywords

There are several keywords that allow you to define an advanced logic involving validation against multiple schemas. All the keywords in this section apply to all data types.

Our user example uses the “oneOf” keyword to define requirements for the user connection. The data is valid according to this keyword if it successfully validates against exactly one schema inside the array.

If data is invalid according to all schemas in the “oneOf” keyword or valid according to two or more schemas, then the data is invalid.

Let’s look more closely at our example:

The schema above requires that user connection is either “relative” (connType property), in which case it may have properties relation (string) and close (boolean), or one of types “friend”, “colleague” or “other” (in which case it must not have properties relation and close).

These schemas for user connection are mutually exclusive, because there is no data that can satisfy both of them. So if the connection is valid and has type “relative”, there is no point validating it against the second schema—it will always be invalid. Nevertheless, any validator will always be validating data against both schemas to make sure that it is only valid according to one.

There is another keyword that allows you to avoid it: “anyOf”. This keyword simply requires that data is valid according to some schema in the array (possibly to several schemas).

In cases such as the one above, where schemas are mutually exclusive and no data can be valid according to more than one schema, it is better to use the “anyOf” keyword. It will validate faster in most cases (the exception is when the data is valid only according to the last schema, because then every schema has to be tried anyway).
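The difference in the amount of work can be sketched by modelling each sub-schema as a predicate function and counting how many of them run. Everything here (makeCounter, anyOf, oneOf, the predicates) is a hypothetical model for illustration, not validator code.

```javascript
// Wraps predicates so that every evaluation is counted.
function makeCounter(predicates) {
  let calls = 0;
  const counted = predicates.map((p) => (data) => {
    calls += 1;
    return p(data);
  });
  return { counted, getCalls: () => calls };
}

// Two mutually exclusive "schemas" for a connection.
const isRelative = (conn) => conn.connType === 'relative';
const isOther = (conn) =>
  ['friend', 'colleague', 'other'].includes(conn.connType);

// "anyOf": stop at the first sub-schema that passes.
function anyOf(predicates, data) {
  return predicates.some((p) => p(data));
}

// "oneOf": both sub-schemas are evaluated to count the passes.
function oneOf(predicates, data) {
  return predicates.filter((p) => p(data)).length === 1;
}

const conn = { connType: 'relative' };
const a = makeCounter([isRelative, isOther]);
const o = makeCounter([isRelative, isOther]);

console.log(anyOf(a.counted, conn), a.getCalls()); // true 1
console.log(oneOf(o.counted, conn), o.getCalls()); // true 2
```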

Using “oneOf” in cases where “anyOf” does an equally good job is a very common mistake that negatively affects validation performance.

Our user example would also benefit from replacing “oneOf” with “anyOf”.

There are some cases, though, when we really need the “oneOf” keyword:

The schema above will successfully validate strings that mention oranges or apples, but not both (and there do exist strings that can mention both). If that’s what you need, then you need to use “oneOf”.
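A sketch of such a schema, assuming it combines “type”: “string” with two patterns inside “oneOf”. The isValid function is a hypothetical hand-written check for this one schema.

```javascript
// Assumed schema: a string that mentions apples or oranges, but not both.
const schema = {
  type: 'string',
  oneOf: [{ pattern: 'apple' }, { pattern: 'orange' }]
};

// Hypothetical hand-written check for this one schema.
function isValid(data) {
  if (typeof data !== 'string') return false;
  const passed = schema.oneOf.filter((s) => new RegExp(s.pattern).test(data));
  // "oneOf" requires exactly one passing sub-schema.
  return passed.length === 1;
}

console.log(isValid('I like apples'));      // true
console.log(isValid('apples and oranges')); // false
console.log(isValid('bananas'));            // false
```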

Compared with boolean operators, “anyOf” is like boolean OR and “oneOf” is like XOR (exclusive OR). The fact that JavaScript (and many other languages) doesn’t define a logical operator for exclusive OR shows that it is rarely needed.

There is also the keyword “allOf”. It requires that the data is valid according to all schemas in the array:

The “$ref” keyword allows you to require that data is valid according to the schema in another file (or some part of it). We will be looking at it in the second part of this tutorial.

Another mistake is to put more than absolutely necessary inside schemas in the “oneOf”, “anyOf” and “allOf” keywords. For example, in our user example, we could put inside “anyOf” all requirements that the connection should satisfy.

We also could have unnecessarily complicated the example with apples and oranges:

Another “logical” keyword is “not”. It requires that the data is NOT valid according to the schema that is the value of this keyword.

For example:

The schema above would require that the data is a string that does not contain “apple”.
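A sketch of a schema with that meaning, assuming it pairs “type”: “string” with “not” around a pattern. The isValid function is a hypothetical hand-written check.

```javascript
// Assumed schema: a string that does not contain "apple".
const schema = { type: 'string', not: { pattern: 'apple' } };

// Hypothetical hand-written check for this one schema.
function isValid(data) {
  if (typeof data !== 'string') return false;
  // "not" inverts the result of the inner schema.
  return !new RegExp(schema.not.pattern).test(data);
}

console.log(isValid('orange juice')); // true
console.log(isValid('apple pie'));    // false
```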

In the user example, the “not” keyword is used to prevent some properties from being used in one of the cases in “oneOf”, although they are defined:

The value of the “not” keyword in the example above is an empty schema. An empty schema will validate any value as valid, and the “not” keyword will make it invalid. So the schema validation will fail if the object has the property relation or close. You can achieve the same with the combination of “not” and “required” keywords.

Another use of the “not” keyword is to define the schema that requires that an array contains an item that is valid according to some schema:

The schema above requires that the data is an array and it contains at least one integer item greater than or equal to 5.
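The double negation is easier to follow step by step. The schema below is an assumption reconstructed from the description, and the helper functions are hypothetical names that mirror its nesting from the inside out.

```javascript
// Assumed schema: "not all items fail" means "at least one item passes".
const schema = {
  type: 'array',
  not: { items: { not: { type: 'integer', minimum: 5 } } }
};

// Innermost schema: an integer that is at least 5.
function isIntegerAtLeast5(value) {
  return Number.isInteger(value) && value >= 5;
}

// Inner "not": the item is NOT an integer >= 5.
function itemFails(value) {
  return !isIntegerAtLeast5(value);
}

// "items": every item satisfies the inner "not".
function allItemsFail(arr) {
  return arr.every(itemFails);
}

// Outer "not" plus "type": an array in which not all items fail.
function isValid(data) {
  return Array.isArray(data) && !allItemsFail(data);
}

console.log(isValid([1, 7]));    // true
console.log(isValid([1, 2, 3])); // false
console.log(isValid([]));        // false
```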

V5 proposals include the keyword “contains” to satisfy this requirement.

Task 4

You have a database of users that all match the schema from the user example. Create a schema according to which only users that satisfy all these criteria will be valid:

  • unmarried men younger than 21 or older than 60 years
  • have 5 or fewer connections
  • subscribe to 3 or fewer feeds

Put your answer in the file part1/task4/filter_schema.json and run node part1/task4/validate to check it.

The test data is simplified, so please do not use the “required” keyword in your schema.

Keywords Describing the Schema

Some keywords used in the user example do not directly affect validation, but they describe the schema itself.

The “$schema” keyword defines the URI of the meta-schema for the schema. The schema itself is a JSON document, and it can be validated using JSON-schema. A JSON-schema that defines any JSON-schema is called a meta-schema. The URI for the meta-schema for draft 4 of the JSON-schema standard is http://json-schema.org/draft-04/schema#.

If you extend the standard, it is recommended that you use a different value of the “$schema” property.

“id” is the schema URI. It can be used to refer to the schema (or some part of it) from another schema using the “$ref” keyword—see the second part of the tutorial. Most validators, including Ajv, allow any string as “id”. According to the standard, the schema id should be a valid URI that can be used to download the schema.

You can also use “title” and “description” to describe the schema. They are not used during the validation. Both these keywords can be used on any level inside the schema to describe some parts of it, as is done in the user example.

Final Task

Create an example of a user record that when validated with the example user schema will have 8 or more errors.

Put your answer in the file part1/task5/invalid_user.json and run node part1/task5/validate to check it.

What is still very wrong with our user schema?

What’s Next?

By now you know all the validation keywords defined by the standard, and you should be able to create quite complex schemas. As your schemas grow, you will be reusing some parts of them. Schemas can be structured into multiple parts and even multiple files to avoid repetition. We will be doing this in the second part of the tutorial.

We also will:

  • use a schema to define default values
  • filter additional properties from the data
  • use keywords included in the proposals for version 5 of the JSON-schema standard
  • define new validation keywords
  • compare existing JSON-schema validators

Thanks for reading!


by Evgeny Poberezkin via Envato Tuts+ Code
