As published in Towards data science
Examining the infinite amount of data created online, namely blogs, webinars, and so on, requires a bloom filter data structure to optimize the run time of some of our validations, and ultimately create value for our customers at Brew.
Beyond the use of bloom filters, to ensure we are able to effectively validate data while continually improving search performance in our database, we needed to integrate query restructuring and function-based indexes into our algorithm.
Note: Because we work with Django and PostgreSQL, all code examples in this blog will use Django ORM as it converts to raw SQL queries, but the principles and takeaways are applicable to any developer who works with an ORM and relational database.
The Brew platform works with an infinite amount of data generated by an endless number of marketing activities, each of which has a unique URL.
Every URL can have several legitimate versions, such as a URL that uses the HTTP or HTTPS scheme, a URL that starts with “www” or one that doesn’t, a URL with a language prefix in the path (e.g “http://example.com/en/page”), a URL that ends with a “HTML” suffix, etc.
Since we can receive any of these variations as a link of a potentially new activity, and since we want to check whether this is a new activity, we had to check if this URL existed and was used one of these versions in our database. To do that , we constructed a query with regular expressions that covered all common variants to enable these searches. Specifically, in Django’s ORM. Querying with regular expressions is simple because it has a native field lookup operator that supports them, such as:
The query above fetches all URLs that have a HTTP or HTTPS scheme and can either start with “www.” or without it. The usage of the iregex operator instead of regex makes the search case-insensitive.
Zooming In On The Problem
Using a regular expression in the query worked great for a while, but as our database grew, we noticed some slowness in part of our processes. We began looking into why this slowness occurs and what is going on in our servers at these times. We noticed that the database’s CPU usage was quite high during some of our searches. This was surprising because, in order to optimize the search, we made sure that the columns that are frequently used in queries are indexed. Yet, a high CPU in the database, while running specific queries, implies that an index is not being used in that query. To validate this assumption, we checked the execution plan of the suspicious queries as the culprit.
Checking the execution plan can be done with the following command in PostgreSQL:
EXPLAIN ANALYZE SELECT * FROM entity_table WHERE link = 'https://www.example.com'
It can also be visualized with a DB administration tool like pgAdmin.
The visual explain plans of the queries we were checking looked like this:
The icon used for an entity_table represents a full table scan which means there is no use of indexes. A full table scan is an O(N) operation and can be a very expansive one when trying to search for a data point in a table with millions of rows.
So why weren’t our indexes used in these queries?
When using the native field lookup for regex in Django ORM, the ORM translates this operator to the regex operator in the database. So, if we run the following search:
In PostgreSQL it will translate to:
WHERE url ~* ‘http(s)?:\/\/(www\.)?example\.com’
The regular expression operator in the database makes the query evaluate the string in each row to see whether it matches the regular expression. This operator can’t use a regular B-tree index (which was the index built for this column) since B-tree indexes are used for equality and simple comparisons, which is not the case in regular expressions.
Solving The Problem
In prior versions of PostgreSQL (before 9.4), there was no index that could be used for queries with regular expressions. As a result, it was necessary to change the query, to make sure that indexes were used and the query’s run time was optimized.
To do that, we had to stop using the regular expression's operator, since it prevented the use of b-tree indexes. However, we still wanted to search for all of the URL variations. To do that, we had to split the query into multiple conditions instead of using one condition with a regular expression. Our previous query turned into this query:
This query in the ORM translates to the following query in the database:
WHERE url LIKE ‘http://www.example.com%' OR
url LIKE 'https://www.example.com%' OR
url LIKE 'http://example.com%' OR
url LIKE 'https://example.com%'
This query now uses the index and even though we have multiple conditions instead of one, it is much more optimized than the previous query, because the index is being used.
Note: even though we are not using the equality operator but rather the LIKE operator, which matches only part of the string, an index can still be used because we anchor the beginning of the string and it can be used as an equality match when searching the index. If we weren’t anchoring the beginning of the string, the index couldn’t be used.
For example, the following query:
WHERE url LIKE ‘%://www.example.com%'
This query will have to run a full table scan, since the beginning of the string is a wildcard, and there is no optimized way to search for this string in a b-tree index when the beginning of the string is not a specific string.
Now our query is optimized and uses the DB index and our run time is significantly reduced. However, this query is not complete since it misses one functionality that we had before. In the previous query, we used the operator iregex, which translated to a case-insensitive regular expression. In the new query, we use the operator startswith, which is case sensitive and doesn’t completely match our use case. To solve this issue, we had two options — one is to use the LOWER or UPPER function operators of the ORM, and the other is to use the istartswith operator which is the case insensitive equivalent of the startswith operator. Learning our lesson, we didn’t want to use an operator without making sure that it doesn’t degrade the query’s run time. Looking into the translation of this query in the ORM to a database query, we saw the following code:
Translates to the query:
WHERE UPPER(url) = UPPER("http://example.com")
Note: In this case, there was no difference between the two optional solutions since both are using the LOWER or the UPPER database function to make the query case insensitive.
Using these functions introduces the same issue experienced previously since the regular b-tree index is not used any time we transform the column that we are searching.
Fortunately, there is an easy solution for that and that is the function-based index.
A function-based index is an index that is created on the results of a function or an expression, instead of on the original value in this column. In our example, we can create a function-based index that is built on the results of the UPPER function on each value in the ‘url’ column in the database. This index will be used any time we use this function on the ‘url’ column in a query and this allows us to optimize the query even when not using the original values in the column.
Most, if not all, relational databases support function-based indexes and they can be used in any scenario in which common queries of the table use a function or some kind of simple expression on the column, e.g rounding values in a float column, fetching only the month part of a date, truncating a date to the beginning of the day or even a sum of two columns. Each of these examples will cancel the use of a regular b-tree index and can cause a degradation in the query run-time. Therefore, if these are common queries on your table, you should consider using a function-based index for the common expressions/functions that are being used and this can significantly optimize the query run time¹.
This experience of run-time degradation due to the wrong usage of the ORM’s operators has reiterated to us how important it is to know our ORM and how it translates our code to queries.
ORMs are a great tool and they add a level of abstraction which allow us to continue using an object-oriented paradigm while interacting with our databases. However, we always need to remember that these abstractions come with a price. Simple queries will always work great with ORMs, but more complex queries require a better understanding of how the ORM translates our code. Knowing how the ORM works can help us avoid major pitfalls, like the one that we have encountered.
Another lesson that can be learned from our experience is how important it is to know how databases perform our queries. Having this knowledge and knowledge of the variety of solutions to optimize a query can help us plan ahead when building our models and prevent optimization problems in advance or easily solve them when we do encounter them.
¹ It is now relatively simple for Django developers to add a function-based index to a model, and it can be declared in the model specification, as of version 3.2. For example:
url = models.URLField(max_length=2100)
name = models.TextField()
indexes = [