If you’ve been developing apps hosted on Google App Engine (GAE) that use the Google Cloud Datastore database (referred to below simply as Datastore), then you are probably aware of indexes. Simply put, indexes make reads faster at the cost of slower writes.
This is a section that I want to include more often, to anchor my explanations to the versions that were current at the time of writing:
- App Engine Python SDK 1.9.40 - 2016-07-15 (examples are in Python, though the concepts are language agnostic and applicable to other GAE environments)
- Python 2.7 (because the App Engine Python SDK doesn’t support Python 3; the issue about that has been open since 2008)
- Djangae v0.9.6, because it’s the easiest way to run Django on GAE, and I’ll be using Django’s ORM integration with Datastore instead of ndb (the same concepts apply regardless of what makes the query)
Say we have the following Django model:
App Engine will define a single index on each property of an entity (model in this case), except for BlobProperty. Each index has a direction, by default ascending. In our case it means entity Book will have 4 simple ascending indexes (3 explicit properties and implicit
What can we do with only 4 indexes? Well, lots of things: fetch all the books, query books using only equality filters or only inequality filters, sort only by a single property with no additional filters. App Engine docs contain the full list of allowed queries for automatically predefined indexes.
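To make the idea of a built-in single-property index concrete, here is a minimal sketch in plain Python. It is not how Datastore is implemented; it simply assumes an index is a sorted list of (value, key) pairs, and the sample rows and function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample rows: (key, author, title, published_year).
BOOKS = [
    (1, "Mark Twain", "The Adventures of Tom Sawyer", 1876),
    (2, "Jane Austen", "Emma", 1815),
    (3, "Mark Twain", "Adventures of Huckleberry Finn", 1884),
]


def build_index(rows, column):
    """Return an ascending list of (value, key) pairs for one property."""
    return sorted((row[column], row[0]) for row in rows)


def equality_lookup(index, value):
    """Find the keys whose indexed value equals `value` via binary search."""
    lo = bisect_left(index, (value,))
    hi = bisect_right(index, (value, float("inf")))
    return [key for _, key in index[lo:hi]]


author_index = build_index(BOOKS, 1)
mark_twain_keys = equality_lookup(author_index, "Mark Twain")
```

Because the matching entries sit next to each other in the sorted list, an equality filter on one property becomes a single contiguous scan, which is roughly why such queries need nothing beyond the automatic indexes.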
What if we want to make a more sophisticated query, such as Book.objects.filter(author='Mark Twain').order_by('-published_date')?
In this case Datastore can’t return a result based on the simple indexes. It requires a composite index. All composite indexes go in an index configuration file named index.yaml. When this file is uploaded to the production server, App Engine parses it and creates all the indexes it describes. Indexes that already exist will not be recreated.
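Why a composite index helps can be sketched in plain Python. This is only an illustration under simplifying assumptions (an index is modelled as a sorted list of tuples, and the rows and function names are hypothetical), not a description of Datastore internals: sorting by author first and by publication year descending second lets a filter on author plus a sort on date be answered by one contiguous scan.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample rows: (key, author, published_year).
BOOKS = [
    (1, "Mark Twain", 1876),
    (2, "Jane Austen", 1815),
    (3, "Mark Twain", 1884),
    (4, "Mark Twain", 1869),
]


def build_composite_index(rows):
    """Sort by author ascending, then year descending (via negated year)."""
    return sorted((author, -year, key) for key, author, year in rows)


def books_by_author_newest_first(index, author):
    """Answer 'filter by author, newest first' with one contiguous scan."""
    lo = bisect_left(index, (author,))
    hi = bisect_right(index, (author, float("inf")))
    return [key for _, _, key in index[lo:hi]]


composite_index = build_composite_index(BOOKS)
twain_newest_first = books_by_author_newest_first(composite_index, "Mark Twain")
```

Neither single-property index alone yields this ordering without extra sorting work, which is the gap the composite index fills.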
Generating composite indexes
We wrote a complex query. So far App Engine doesn’t know about it. For the system to recognize a new composite index, we should execute the query on the development server. This will trigger an automatic write of the perfect index to index.yaml, in the section below the # AUTOGENERATED line.
Note that indexes will not be created while running Django tests, because the Django test framework does not run the server during tests (it interacts directly with the WSGI interface to produce requests and responses, so the dev appserver is not involved).
Furthermore, indexes can be managed manually by listing them above the # AUTOGENERATED line in index.yaml. We have the power to decide which index best serves our query.
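As a rough sketch of that layout, here is what such a file might look like. The kind name Book and the property names are assumptions based on the query used in this post; under Djangae the kind is normally derived from the model’s table name, so the real name will likely differ.

```yaml
indexes:

# Manually maintained: filter by author, newest books first.
- kind: Book
  properties:
  - name: author
  - name: published_date
    direction: desc

# AUTOGENERATED

# Entries generated by the dev appserver are written below this marker.
```

A property without an explicit direction is indexed in ascending order.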
If we want to be aware of what indexes are getting created, rather than letting the dev appserver do its automatic index creation, we can configure our dev environment to throw NeedIndexError exceptions when an index is missing. If we run the server with the App Engine command, dev_appserver.py, this is done with its --require_indexes flag.
If using Django’s/Djangae’s management command to run the server, then there’s no easy way to do that.
Passing an extra command-line argument to the management command won’t work, because Djangae isn’t passing command-line args to the dev appserver args parser, and there’s an open issue about it (if by the time you are reading this the issue is closed and has a merged patch, then you might be lucky and it works).
Hacky workaround to just make it work: go to google/appengine/tools/devappserver2/devappserver2.py and replace DevelopmentServer._create_api_server. Sorry, I just grepped (actually ag-ed) inside the Python SDK and picked the first convenient place to force this setting. Maybe there’s a better way ¯\_(ツ)_/¯
Now, if we run this query on the dev server, Book.objects.filter(author='Mark Twain').order_by('-published_date'), we will get a pretty error saying:
If this index suits us, we can copy it to index.yaml. Most importantly, we are aware of which index got created and which query triggered it.
Suppose we have pushed this index file to our production server.
After a month or so we get a notice that librarians are not using this feature, so they ask us to remove it. Instead they want to be able to filter by author and sort by book titles in alphabetical order. This request yields another index. Since the old query has been removed, we remove the old index as well and hence end up with this index conf:
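A sketch of what that updated configuration might look like, assuming the kind is named Book and the properties are author and title (the actual kind name under Djangae will likely differ):

```yaml
indexes:

# Filter by author, sort by title in ascending (alphabetical) order.
- kind: Book
  properties:
  - name: author
  - name: title
```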
Once pushed to production, App Engine will build the new index, but will also keep the old indexes, ignoring the fact that we’ve updated the configuration file. This is intentional, since other versions of your app might still be using this index, so we have to be explicit.
Once we’re sure we don’t need any of the removed indexes, let’s run appcfg.py vacuum_indexes. We should be careful about what we vacuum, since a missing index will result in a 500 for our users.
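Conceptually, vacuuming compares what is deployed with what the configuration file lists. The sketch below only mimics that comparison in plain Python so the risk is easier to see; it calls no App Engine API, and the index tuples and function names are hypothetical.

```python
# Each index is modelled as (kind, ((property, direction), ...)).
# These names are illustrative only, not App Engine API calls.
DEPLOYED = [
    ("Book", (("author", "asc"), ("published_date", "desc"))),
    ("Book", (("author", "asc"), ("title", "asc"))),
]
CONFIGURED = [
    ("Book", (("author", "asc"), ("title", "asc"))),
]


def unused_indexes(deployed, configured):
    """Return deployed indexes that no longer appear in the configuration."""
    wanted = set(configured)
    return [index for index in deployed if index not in wanted]


def can_serve(required_index, available):
    """A query whose required index is gone would fail for our users."""
    return required_index in available


stale = unused_indexes(DEPLOYED, CONFIGURED)
remaining = [index for index in DEPLOYED if index not in stale]
```

If any app version still runs the old query, the index in `stale` is exactly the one it would miss after vacuuming, so it is worth checking that list before confirming.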
There is a lot more to say about indexes and various intricacies related to them. Now that you know the basics, we can delve into more details in my next post Intricacies and optimization of Datastore indexes.