Tuesday, June 6, 2017

How to Implement Your Own Data Structure in Python

Python provides full-fledged support for implementing your own data structure using classes and custom operators. In this tutorial you will implement a custom pipeline data structure that can perform arbitrary operations on its data. We will use Python 3.

The Pipeline Data Structure

The pipeline data structure is interesting because it is very flexible. It consists of a list of arbitrary functions that can be applied to a collection of objects and produce a list of results. I will take advantage of Python's extensibility and use the pipe character ("|") to construct the pipeline.

Live Example

Before diving into all the details, let's see a very simple pipeline in action:

What's going on here? Let's break it down step by step. The first element range(5) creates a list of integers [0, 1, 2, 3, 4]. The integers are fed into an empty pipeline designated by Pipeline(). Then a "double" function is added to the pipeline, and finally the cool Ω function terminates the pipeline and causes it to evaluate itself. 

The evaluation consists of taking the input and applying all the functions in the pipeline (in this case just the double function). Finally, we store the result in a variable called x and print it.

Python Classes

Python supports classes and has a very sophisticated object-oriented model including multiple inheritance, mixins, and dynamic overloading. An __init__() function serves as a constructor that creates new instances. Python also supports an advanced meta-programming model, which we will not get into in this article. 

Here is a simple class that has an __init__() constructor that takes an optional argument x (defaults to 5) and stores it in a self.x attribute. It also has a foo() method that returns the self.x attribute multiplied by 3:

Here is how to instantiate it with and without an explicit x argument:

Custom Operators

With Python, you can use custom operators for your classes for nicer syntax. There are special methods known as "dunder" methods. The "dunder" means "double underscore". These methods like "__eq__", "__gt__" and "__or__" allow you to use operators like "==", ">" and "|" with your class instances (objects). Let's see how they work with the A class.

If you try to compare two different instances of A to each other, the result will always be False regardless of the value of x:

This is because Python compares the memory addresses of objects by default. Let's say we want to compare the value of x. We can add a special "__eq__" operator that takes two arguments, "self" and "other", and compares their x attribute:

Let's verify:

Implementing the Pipeline as a Python Class

Now that we've covered the basics of classes and custom operators in Python, let's use it to implement our pipeline. The __init__() constructor takes three arguments: functions, input, and terminals. The "functions" argument is one or more functions. These functions are the stages in the pipeline that operate on the input data. 

The "input" argument is the list of objects that the pipeline will operate on. Each item of the input will be processed by all the pipeline functions. The "terminals" argument is a list of functions, and when one of them is encountered the pipeline evaluates itself and returns the result. The terminals are by default just the print function (in Python 3, "print" is a function). 

Note that inside the constructor, a mysterious "Ω" is added to the terminals. I'll explain that next. 

The Pipeline Constructor

Here is the class definition and the __init__() constructor:

Python 3 fully supports Unicode in identifier names. This means we can use cool symbols like "Ω" for variable and function names. Here, I declared an identity function called "Ω", which serves as a terminal function: Ω = lambda x: x

I could have used the traditional syntax too:

The "__or__" and "__ror__" Operators

Here comes the core of the Pipeline class. In order to use the "|" (pipe symbol), we need to override a couple of operators. The "|" symbol is used by Python for bitwise or of integers. In our case, we want to override it to implement chaining of functions as well as feeding the input at the beginning of the pipeline. Those are two separate operations.

The "__ror__" operator is invoked when the second operand is a Pipeline instance as long as the first operand is not. It considers the first operand as the input and stores it in the self.input attribute, and returns the Pipeline instance back (the self). This allows the chaining of more functions later.

Here is an example where the __ror__() operator would be invoked: 'hello there' | Pipeline()

The "__or__" operator is invoked when the first operand is a Pipeline (even if the second operand is also a Pipeline). It accepts the operand to be a callable function and it asserts that the "func" operand is indeed callable. 

Then, it appends the function to the self.functions attribute and checks if the function is one of the terminal functions. If it is a terminal then the whole pipeline is evaluated and the result is returned. If it's not a terminal, the pipeline itself is returned.

Evaluating the Pipeline

As you add more and more non-terminal functions to the pipeline, nothing happens. The actual evaluation is deferred until the eval() method is called. This can happen either by adding a terminal function to the pipeline or by calling eval() directly. 

The evaluation consists of iterating over all the functions in the pipeline (including the terminal function if there is one) and running them in order on the output of the previous function. The first function in the pipeline receives an input element.

Using Pipeline Effectively

One of the best ways to use a pipeline is to apply it to multiple sets of input. In the following example, a pipeline with no inputs and no terminal functions is defined. It has two functions: the infamous double function we defined earlier and the standard math.floor

Then, we provide it three different inputs. In the inner loop, we add the Ω terminal function when we invoke it to collect the results before printing them:

You could use the print terminal function directly, but then each item will be printed on a different line:

Future Improvements

There are a few improvements that can make the pipeline more useful:

  • Add streaming so it can work on infinite streams of objects (e.g. reading from files or network events).
  • Provide an evaluation mode where the entire input is provided as a single object to avoid the cumbersome workaround of providing a collection of one item.
  • Add various useful pipeline functions.

Conclusion

Python is a very expressive language and is well equipped for designing your own data structure and custom types. The ability to override standard operators is very powerful when the semantics lend themselves to such notation. For example, the pipe symbol ("|") is very natural for a pipeline. 

A lot of Python developers enjoy Python's built-in data structures like tuples, lists, and dictionaries. However, designing and implementing your own data structure can make your system simpler and easier to work with by elevating the level of abstraction and hiding internal details from users. Give it a try.


by Gigi Sayfan via Envato Tuts+ Code

No comments:

Post a Comment