Ana Balica

Hi, I'm Ana.

I'm a software developer. I mostly do Python. This blog is about my adventures with code, travel experiences and relevant life events. I try not to take myself too seriously.

Here's what I'm doing now.

Occasionally I give talks.

Please don't take my words for granted, because Internet is full of bad advice, and I might be part of it inadvertently.

Datastore indexes basics

If you’ve been developing apps to host on Google App Engine (GAE) and use Google Cloud Datastore database (further referred to simply as Datastore), then you are probably aware of indexes. Simply put, indexes are what makes reads faster with the caveat of slower writes.

Timestamped

This is a section that I want to include more often to ground any of my explanations into a specific time-dependencies frame.

  • App Engine Python SDK 1.9.40 - 2016-07-15 (examples are in Python, though the concepts are language agnostic and applicable to other GAE environments)
  • Python 2.7 (because App Engine Python SDK doesn’t support Python3, the issue about that has been open since 2008)
  • Djangae v0.9.6 because it’s the easiest way to run Django on GAE and I’ll be using Django’s ORM integration with Datastore instead of ndb (still same concepts apply regardless of who makes the query)

Simple indexes

Say we have the following Django model:

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.CharField(max_length=200)
    published_date = models.DateTimeField()

App Engine will define a single index on each property of an entity (model in this case), except for TextProperty and BlobProperty. Each index has a direction, by default ascending. In our case it means entity Book will have 4 simple ascending indexes (3 explicit properties and implicit pk).

What can we do with only 4 indexes? Well, lots of things: fetch all the books, query books using only equality filters or only inequality filters, sort only by a single property with no additional filters. App Engine docs contain the full list of allowed queries for automatically predefined indexes.

Composite indexes

What if we want make a more sophisticated query:

Book.objects.filter(author='Mark Twain').order_by('-published_date')

In this case Datastore can’t return a result based on the simple indexes. It requires a composite index. All composite indexes go in an index configuration file named index.yaml. When uploaded to production server, App Engine will parse this file and create all the indexes based on this configuration. Already existent indexes will not be recreated.

Generating composite indexes

We wrote a complex query. So far App Engine doesn’t know about it. For the system to be able to recognize a new composite index, we should execute the query on the development server. This will trigger an automatic write to index.yaml of the perfect index in the section bellow this line # AUTOGENERATED.

To be noted: indexes will be not created while running Django tests, because Django test framework will not run the server during tests (it interacts directly with the WSGI interface to produce requests and responses, so dev appserver is not involved).

Furthermore indexes can be manipulated manually, by listing them above # AUTOGENERATED line in index.yaml. We have the power to decide what’s the best index to serve our query.

Require indexes

If we want to be aware of what indexes are getting created, rather then letting the dev appserver do its automatic index creation, we can configure our dev environment to throw NeedIndexError exceptions when it’s missing an index. If running the server using the App Engine command, then we can do:

dev_appserver.py --require_indexes

If using Django’s/Djangae’s management command to run the server, then there’s no easy to do that.

./manage.py runserver --require_indexes  # won't work :(

That won’t work because Djangae isn’t passing command line args to the dev appserver args parser and there’s an open issue about it (if by the time you are reading this, this issue is closed and has a merged patch, then you might be lucky and that command works).

Hacky workaround to just make it work: go to google/appengine/tools/devappserver2/devappserver2.py and replace datastore_require_indexes=options.require_indexes with datastore_require_indexes=True inside DevelopmentServer._create_api_server. Sorry, I just grepped (actually aged) inside Python SDK and picked up the first convenient place to force this setting. Maybe there’s a better way ¯\_(ツ)_/¯

Therefore now if we want to run this query on dev server, Book.objects.filter(author='Mark Twain').order_by('-published_date'), we will get a pretty error saying:

Appstats Console

If this index fits us well, we can copy it to index.yaml. Most important we are aware what index got created and what query triggered it.

Cleaning indexes

Consider that we have pushed this index file to your production server.

indexes:

- kind: library_book
  properties:
  - name: author
  - name: published_date
    direction: desc

After a month or so we get a notice that librarians are not using this feature, so they ask us to remove it. Instead they want to be able to filter by author and sort by book titles in alphabetical order. This request yields another index. Since the old query has been removed, we remove the old index as well and hence end up with this index conf:

indexes:

- kind: library_book
  properties:
  - name: author
  - name: title

Once pushed to production, App Engine will build the new index, but will also keep the old indexes, ignoring the fact that we’ve updated the configuration file. This is intentional, since other versions of your app might still be using this index, so we have to be explicit.

Once we’re sure we don’t need any of the removed indexes, let’s run appcfg.py vacuum_indexes. We should be careful what we vacuum clean, since a missing index will result in a 500 for our users.


There is a lot more to say about indexes and various intricacies related to them. Now that you know the basics, we can delve into more details in my next post Intricacies and optimization of Datastore indexes.