Abstractions are wonderful things. Take a look at Rails: we can achieve huge amounts of functionality with comparatively few lines of code. I don't necessarily need to know a great deal about how my database is implemented in order to be up and running quickly, for example.
The downside of this level of insulation from core functionality is that developers don't learn all the things they perhaps should. Only last week, I was able to fix a slow query by removing an index that someone along the line had added without understanding cardinality or how indexes work. Could you implement a database yourself? Let's dig into another piece of technology that we take for granted!
How Do Indexes Work?
We expect relatively fast lookup times when querying a database. If you were implementing a function to pull every value between two numbers, you might be tempted to check each value to see whether it met the criterion, but this would be prohibitively inefficient for large datasets, as it performs in linear time (O(n)). What you need is a way to store records in a specified order to facilitate quicker access times. A structure like a Binary Search Tree (BST) does exactly this, by ensuring that any given record is greater than all the records in its left subtree but less than all the records in its right subtree. This brings our search time down to a much more respectable O(log n), and is the backbone of a database index.
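To make this concrete, here's a minimal BST sketch in Ruby. The class and method names are my own, purely for illustration:

```ruby
# A minimal Binary Search Tree: each node's key is greater than every
# key in its left subtree and less than every key in its right subtree.
class BSTNode
  attr_accessor :key, :left, :right

  def initialize(key)
    @key = key
  end

  # Insert a key by descending left or right until we find a free slot.
  def insert(new_key)
    if new_key < key
      left ? left.insert(new_key) : (self.left = BSTNode.new(new_key))
    elsif new_key > key
      right ? right.insert(new_key) : (self.right = BSTNode.new(new_key))
    end
  end

  # Each comparison discards half of a balanced tree, giving O(log n).
  def include?(target)
    return true if target == key
    if target < key
      left ? left.include?(target) : false
    else
      right ? right.include?(target) : false
    end
  end
end

root = BSTNode.new(50)
[30, 70, 20, 40, 60, 80].each { |k| root.insert(k) }
root.include?(60) # => true
root.include?(65) # => false
```

Each comparison halves the remaining search space, which is where the O(log n) behaviour comes from (assuming the tree stays reasonably balanced).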
Where a BST lets us down is that we seldom require a single specific record. Rather, we're more likely to want all records matching a particular criterion. A naive solution would be to check every record again, which brings us right back up to linear time, or O(n). What we need is a way to store all records in order, traverse that list, and then stop when a condition is met (in this case, the upper bound of our query). A data structure well suited to this task is the B+ Tree, which brings access times down to O(m + log n), where m is the number of matching records.
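Here's a deliberately simplified sketch of that idea in Ruby. A real B+ Tree stores many keys per page and descends through several levels of internal nodes, but the leaf-level walk is the same; the names here are illustrative:

```ruby
# A simplified view of a B+ Tree range scan: keys live in sorted leaf
# pages linked together, so a range query descends once (O(log n))
# and then walks sibling leaves, collecting the m matches (O(m)).
Leaf = Struct.new(:keys, :next_leaf)

def range_scan(leaves, low, high)
  # Stand-in for descending the internal nodes: binary search for the
  # first leaf whose largest key could cover the lower bound.
  leaf = leaves.bsearch { |l| l.keys.last >= low }
  return [] unless leaf

  results = []
  while leaf
    leaf.keys.each do |k|
      return results if k > high   # stop at the upper bound
      results << k if k >= low
    end
    leaf = leaf.next_leaf          # follow the sibling pointer
  end
  results
end

# Toy "tree": three leaf pages, linked left to right.
l3 = Leaf.new([60, 70, 80], nil)
l2 = Leaf.new([30, 40, 50], l3)
l1 = Leaf.new([10, 20], l2)

range_scan([l1, l2, l3], 25, 65) # => [30, 40, 50, 60]
```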
While researching this article, I was struck by the apparent inefficiency of this approach. Ostensibly, all that's required is fast indexing of an ordered list, a task well suited to the criminally under-used and little-known data structure, the Skip List. It turns out that, although B+ Tree operations typically require more CPU work than the equivalent operations in other data structures, they excel at requiring far fewer transfers from disk to memory, which is a natural bottleneck. This Google Blog post provides some context as to why this is. Additionally, B+ Trees are also worst-case efficient, which is a huge factor when datasets are larger.
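For the curious, here's a toy Skip List in Ruby. The fixed MAX_LEVEL, the names, and the coin-flip promotion rule are my own illustrative choices:

```ruby
# A minimal Skip List: an ordered linked list with randomised "express
# lanes" stacked on top, giving O(log n) expected search time.
class SkipList
  MAX_LEVEL = 4
  Node = Struct.new(:key, :forward)

  def initialize
    @head = Node.new(nil, Array.new(MAX_LEVEL))
  end

  def insert(key)
    # Record, at each lane, the last node before the insertion point.
    update = Array.new(MAX_LEVEL)
    node = @head
    (MAX_LEVEL - 1).downto(0) do |lvl|
      node = node.forward[lvl] while node.forward[lvl] && node.forward[lvl].key < key
      update[lvl] = node
    end
    # Promote the new node to each higher lane with probability 1/2.
    level = 1
    level += 1 while level < MAX_LEVEL && rand < 0.5
    new_node = Node.new(key, Array.new(MAX_LEVEL))
    (0...level).each do |lvl|
      new_node.forward[lvl] = update[lvl].forward[lvl]
      update[lvl].forward[lvl] = new_node
    end
  end

  def include?(key)
    # Start in the top lane and drop down a lane whenever we overshoot.
    node = @head
    (MAX_LEVEL - 1).downto(0) do |lvl|
      node = node.forward[lvl] while node.forward[lvl] && node.forward[lvl].key < key
    end
    node = node.forward[0]
    !!(node && node.key == key)
  end
end

list = SkipList.new
[3, 7, 12, 19, 25].each { |k| list.insert(k) }
list.include?(12) # => true
```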
Why Is Using Too Many Indexes a Bad Idea?
Let's imagine removing a row from our database. This means we'd also need to remove the corresponding node from the B+ Tree. We can't just remove the node, because we're relying on the ordered nature of the data structure, so we need to preserve that order while also ensuring the overall height of the tree stays as low as it can be. To achieve this, our database engine leverages smart insertion and deletion methods, but these take logarithmic (O(log n)) time. Crucially, every field you apply an index to gets its own tree, which has to be rebalanced each time you perform an insert or delete operation.
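In Rails terms, each `add_index` call in a migration like the hypothetical one below creates another tree that the database must maintain on every write to the table:

```ruby
# A hypothetical migration: each index here is a separate B+ Tree that
# must be updated (and rebalanced) on every INSERT or DELETE on users,
# and on any UPDATE that touches an indexed column.
class AddIndexesToUsers < ActiveRecord::Migration[5.0]
  def change
    add_index :users, :email      # speeds up lookups by email...
    add_index :users, :last_name  # ...but every extra index makes
    add_index :users, :created_at # writes to users more expensive
  end
end
```

Each index speeds up certain reads at the cost of slower inserts, updates, and deletes, which is exactly why piling on indexes indiscriminately is a bad idea.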
A field with few possible values (such as a boolean or trilean) is said to have low cardinality. Here's a great IBM article on the pitfalls of indexing low-cardinality fields; the gist of it is that you should generally avoid doing it. Anecdotally, however, I've cut query times by 80% simply by indexing a True/False column, so as always: profile, profile, profile!
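A simple way to start profiling from a Rails console is to ask the database for its query plan and to time the query directly. The `User` model and `active` column here are hypothetical, and the exact plan output depends on your database:

```ruby
# Ask the database how it intends to execute the query. Before adding
# an index you'd typically see a sequential scan; afterwards, an index
# scan. Whether that's actually faster is what profiling tells you.
puts User.where(active: true).explain

# Or measure wall-clock time directly with Ruby's Benchmark module.
require 'benchmark'
puts Benchmark.measure { User.where(active: true).load }
```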