Import Madness

Sing, goddess, the rage of George and the ImportError,

and its devastation, which put pains thousandfold upon his programs,

hurled in their multitudes to the house of Hades strong ideas

of systems, but gave their code to be the delicate feasting

of dogs, of all birds, and the will of Guido was accomplished

since that time when first there stood in division of conflict

Brett Cannon’s son the lord of modules: the brilliant import keyword…

Backstory Before Things Get Weird

This post is about how an ImportError led me to a very strange place.

I was writing a simple Python program. It was one of my first attempts at Python 3.

I tried to import some code and got an ImportError. Normally I solve ImportErrors by shuffling files around until
the error goes away. But this time none of my shuffling solved the problem. So I found myself actually reading the
official documentation for the Python import system. Somehow I’d spent over five years writing Python code professionally
without ever reading more than snippets of those particular docs.

What I learned there changed me.

Yes, I answered my simple question, which had something to do with when I should use dots in an import:

# most of the time, don't use dots at all:
from spam import eggs

# if Python has trouble finding spam, and spam.py is in the same directory
# (i.e. the same "package") as the code doing the import:
from .spam import eggs

# when spam.py is in the *enclosing* package, i.e. one level up:
from ..spam import eggs

# SyntaxError! The dots only work with
# the `from a import b` syntax:
import .spam

But more importantly I realized that this whole time I had never really understood what the word module means in Python.

According to the official Python tutorial, a “module” is a file containing Python definitions and statements.

In other words, spam.py is a module.

But it’s not quite that simple. In my running Python program, if I import requests, then what is type(requests)?

It’s module.

That means module is a type of object in a running Python program. And requests in my running program is derived from requests.py, but it’s not the same thing.

So what is the module class in Python and how is babby module formed?

Modules and the Python Import System

Modules are created automatically in Python when you import. It turns out that the import keyword in Python is syntactic sugar for a somewhat more complicated process. When you import requests, Python actually does two things:

1) Calls an internal function: __import__('requests') to create, load, and initialize the requests module object

2) Binds the local variable requests to that module
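In other words, these two snippets do roughly the same thing (a rough sketch; the real machinery also handles dotted names, fromlists, and relative imports):

# what you write:
import requests

# approximately what Python does for you:
requests = __import__('requests')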

And how exactly does __import__() create, load, and initialize a module?

Well, it’s complicated. I’m not going to go into full detail, but there’s a great video where Brett Cannon, the main maintainer of the Python import system, painstakingly walks through the whole shebang.

But in a nutshell, importing in Python has 5 steps:

1. See if the module has already been imported

Python maintains a cache of modules that have already been imported. The cache is a dictionary held at sys.modules.

If you try to import requests, __import__ will first check if there’s a module in sys.modules named “requests”. If there is, Python just gives you the module object in the cache and does not do any more work.
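You can watch the cache at work yourself (a quick sketch; any importable module works in place of requests):

import sys

import requests                      # first time: the full import machinery runs
cached = sys.modules["requests"]     # ...and the module lands in the cache

import requests as requests_again    # second time: just a dictionary lookup
assert requests_again is cached      # the very same object, nothing re-executed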

If the module isn’t cached (usually because it hasn’t been imported yet, but also maybe because someone did something nefarious…) then:

2. Find the source code using sys.path

sys.path is a list in every running Python program that tells the interpreter where it should look for modules when
it’s asked to import them. Here’s an excerpt from my current sys.path:

# the directory our code is running in:
'',
# where my Python executable lives:
'/Users/rogueleaderr/miniconda3/lib/python3.5',
# the place where `pip install` puts stuff:
'/Users/rogueleaderr/miniconda3/lib/python3.5/site-packages'

When I import requests Python goes and looks in those directories for requests.py. If it can’t find it, I’m in for an ImportError. I’d estimate that the large majority of real-life ImportErrors happen because the source code you’re
trying to import isn’t in a directory that’s on sys.path. Move your module or add the directory to sys.path and you’ll have a better day.
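For example, if spam.py lives somewhere Python doesn’t know about, you can just tell it where to look (a sketch; the directory is hypothetical):

import sys

# hypothetical directory where spam.py actually lives
sys.path.append("/Users/rogueleaderr/code/side_projects")

import spam  # now step 2 can find it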

In Python 3, you can do some pretty crazy stuff to tell Python to look in esoteric places for code. But that’s a topic for another day!

3. Make a Module object

Python has a built-in module type (exposed as types.ModuleType). Once __import__ has found your source code, it’ll create a new ModuleType instance and attach your module.py’s source code to it.

Then, the exciting part:

4. Execute the module source code!

__import__ will create a new namespace, i.e. scope, i.e. the __dict__ attribute attached to most Python objects.
And then it will actually exec your code inside of that namespace.

Any variables or functions that get defined during that execution are captured in that namespace. And the namespace is
attached to the newly created module, which is itself then returned into the importing scope.

5. Cache the module inside sys.modules

If we try to import requests again, we’ll get the same module object back. Steps 2-5 will not be repeated.
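We can imitate steps 3 through 5 by hand (a condensed sketch; the real __import__ also sets attributes like __spec__, __loader__, and __file__):

import sys
from types import ModuleType

source = 'eggs = "delicious"'       # stand-in for the contents of spam.py

spam = ModuleType("spam")           # step 3: make a module object
exec(source, spam.__dict__)         # step 4: run the source in the module's namespace
sys.modules["spam"] = spam          # step 5: cache it

from spam import eggs               # found in the cache, no file needed
print(eggs)                         # delicious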

Okay! This is a pretty cool system. It lets us write many pretty Python programs.

But, if we’re feeling demented, it also lets us write some pretty dang awful Python programs.

Where it gets weird

I learned how to fix my immediate import problem. That wasn’t enough.

Gizmo gets wet

With these new import powers in hand, I immediately started thinking about how I could use them for evil, rather than good. Because, as we know:

Good is dumb
(c. Five Finger Tees)

So far, the worst idea I’ve had for how to misuse the Python import system is to
implement a mergesort algorithm using just the import keyword. At first I didn’t know if it was possible. But, spoiler alert, it is!

It doesn’t actually take much code. It just takes the stubbornness to figure out how to subvert a lot of very well-intentioned, normally helpful machinery in the import system.

We can do this. Here’s how:

Remember that when we import a module, Python executes all the source code.

So imagine I start up Python and define a function:

>>> def say_beep():
...     print("beep!.........beep!")
...
>>> say_beep()

This will print out some beeps.

Now imagine instead I write the same lines of code as above into a file called say_beep.py. Then I open my interpreter and run

>>> import say_beep

What happens? The exact same thing: Python prints out some beeps.

If I create a module that contains the same source code as the body of a function then importing the module will produce the same result as calling the function.

Well, what if I need to return something from my function body? Simple:

# make_beeper.py

beeper = lambda: print("say beep")

# main.py

from make_beeper import beeper
beeper()

Anything that gets defined in the module is available in the module’s namespace after it’s imported. So from a import b
is structurally the same as b = f(), if I structure my module correctly.

Okay, what about passing arguments? Well, that gets a bit harder. The trick is that Python source code is just a long string, so we
can modify the source of a module before we import it:

# with_args.py

a = None
b = None
result = a + b

# main.py

src = ""
with open("with_args.py") as f:
    for line in f:
        src += line

a = "10"
b = "21"

src = src.replace("a = None", f"a = {a}")
src = src.replace("b = None", f"b = {b}")

with open("with_args.py", "w") as f:
    f.write(src)

from with_args import result

print(result)  # it's 31!

Now this certainly isn’t pretty. But where we’re going, nothing is pretty. Buckle up!

How to mergesort

Okay…how can we apply these ideas to implement mergesort?

First, let’s quickly review what mergesort is: it’s a recursive sorting algorithm with O(n log n) worst-case computational complexity
(meaning it’s pretty darn good, especially compared to naive sorting algorithms like bubble sort, which have O(n^2) complexity.)

It works by taking a list, splitting it in half, and then splitting the halves in half until we’re left with individual elements.

Then we merge adjacent elements by interleaving them in sorted order. Take a look at this diagram:

picture of mergesort

Or read the Wikipedia article for more details.

Some rules

  1. No built-in sorting functionality. Python’s built-in sort uses a derivative of mergesort,
    so just putting result = sorted(lst) into a module and importing it isn’t very sporting.
  2. No user-defined functions at all.
  3. All the source code has to live inside one module file, which we will fittingly call madness.py

The code

Well, here’s the code: (Walk-through below, if you don’t feel like reading 100 lines of bizarre Python)

"""
# This is the algorithm we'll use:

import sys
import re
import inspect
import os
import importlib
import time

input_list = []
sublist = input_list
is_leaf = len(sublist) < 2
if is_leaf:
    sorted_sublist = sublist
else:
    split_point = len(sublist) // 2
    left_portion = sublist[:split_point]
    right_portion = sublist[split_point:]

    # get a reference to the code we're currently running
    current_module = sys.modules[__name__]

    # get its source code using stdlib's `inspect` library
    module_source = inspect.getsource(current_module)

    # "pass an argument" by modifying the module's source
    new_list_slug = 'input_list = ' + str(left_portion)
    adjusted_source = re.sub(r'^input_list = \[.*\]', new_list_slug, 
                             module_source, flags=re.MULTILINE)

    # make a new module from the modified source
    left_path = "left.py"
    with open(left_path, "w") as f:
        f.write(adjusted_source)

    # invalidate caches; force Python to do the full import again
    importlib.invalidate_caches()
    if "left" in sys.modules:
        del sys.modules['left']

    # "call" the function to "return" a sorted sublist
    from left import sorted_sublist as left_sorted

    # clean up by deleting the new module
    if os.path.isfile(left_path):
        os.remove(left_path)

    new_list_slug = 'input_list = ' + str(right_portion)
    adjusted_source = re.sub(r'^input_list = \[.*\]', new_list_slug, 
                             module_source, flags=re.MULTILINE)
    right_path = "right.py"
    with open(right_path, "w") as f:
        f.write(adjusted_source)

    importlib.invalidate_caches()

    if "right" in sys.modules:
       del sys.modules['right']
    from right import sorted_sublist as right_sorted

    if os.path.isfile(right_path):
        os.remove(right_path)

    # merge
    merged_list = []
    while (left_sorted or right_sorted):
        if not left_sorted:
            bigger = right_sorted.pop()
        elif not right_sorted:
            bigger = left_sorted.pop()
        elif left_sorted[-1] >= right_sorted[-1]:
            bigger = left_sorted.pop()
        else:
            bigger = right_sorted.pop()
        merged_list.append(bigger)
    # there's probably a better way to do this that doesn't
    # require .reverse(), but appending to the head of a
    # list is expensive in Python
    merged_list.reverse()
    sorted_sublist = merged_list

# not entirely sure why we need this line, but things
# don't work without it!
sys.modules[__name__].sorted_sublist = sorted_sublist
"""

import random
import os
import time

random.seed(1001)

list_to_sort = [int(1000*random.random()) for i in range(100)]
print("unsorted: {}".format(list_to_sort))

mergesort = __doc__
adjusted_source = mergesort.replace('input_list = []',
                                    'input_list = {}'.format(list_to_sort))

with open("merge_sort.py", "w") as f:
    f.write(adjusted_source)

from merge_sort import sorted_sublist as sorted_list

os.remove("merge_sort.py")
finished_time = time.time()

print("original sorted: {}".format(sorted(list_to_sort)))
print("import sorted: {}".format(sorted_list))

assert sorted_list == sorted(list_to_sort)

That’s all we need.

Breaking it down

Madness itself

The body of madness.py is compact. All it does is generate a random list of numbers, grab our template implementation of merge sort from its own docstring (how’s that for self-documenting code?), jam in our random list, and kick off the algorithm by running

from merge_sort import sorted_sublist as sorted_list

The mergesort implementation

This is the fun part.

First, here is a “normal” implementation of merge_sort as a function:

def merge_sort(input_list):
    if len(input_list) < 2:  # it's a leaf
        return input_list
    else:
        # split
        split_point = len(input_list) // 2
        left_portion, right_portion = input_list[:split_point], input_list[split_point:]

        # recursion
        left_sorted = merge_sort(left_portion)
        right_sorted = merge_sort(right_portion)

        # merge
        merged_list = []
        while left_sorted or right_sorted:
            if not left_sorted:
                bigger = right_sorted.pop()
            elif not right_sorted:
                bigger = left_sorted.pop()
            elif left_sorted[-1] >= right_sorted[-1]:
                bigger = left_sorted.pop()
            else:
                bigger = right_sorted.pop()
            merged_list.append(bigger)
        merged_list.reverse()
        return merged_list

It has three phases:

  1. Split the list in half
  2. Call merge_sort recursively until the list is split down to individual elements
  3. Merge the sublists we’re working on at this stage into a single sorted sublist by interleaving the elements in sorted order

But since our rule says that we can’t use functions, we need to replace this recursive function with import.

That means replacing this:

left_sorted = merge_sort(left_portion)

With this:

# get a reference to the code we're currently running
current_module = sys.modules[__name__]
# get its source code using stdlib's `inspect` library
module_source = inspect.getsource(current_module)

# "pass an argument" by modifying the module's source
new_list_slug = 'input_list = ' + str(left_portion)
adjusted_source = re.sub(r'^input_list = \[.*\]', new_list_slug, module_source, flags=re.MULTILINE)

# make a new module from the modified source
left_path = "left.py"
with open(left_path, "w") as f:
    f.write(adjusted_source)

# invalidate caches
importlib.invalidate_caches()
if "left" in sys.modules:
    del sys.modules['left']

# "call" the function to "return" a sorted sublist
from left import sorted_sublist as left_sorted

# clean up by deleting the new module
if os.path.isfile(left_path):
    os.remove(left_path)

# not entirely sure why we need this line, but things
# don't work without it! Might be to keep the sorted sublist
# alive once this import goes out of scope?
sys.modules[__name__].sorted_sublist = sorted_sublist

And that’s really it.

We just use the tools we learned about to simulate calling functions with arguments and returning values. And we add a few lines to trick Python into not caching modules and instead doing the full import process when we import a module with the same name as one that’s already been imported. (If our merge sort execution tree has multiple levels, we’re going to have a lot of different left.py’s).

And that’s how you abuse the Python import system to implement mergesort.

Many paths to the top of the mountain, but the view is a singleton.

It’s pretty mindblowing (to me at least) that this approach works at all. But on the other hand, why shouldn’t it?

There’s a pretty neat idea in computer science called the Church-Turing thesis. It states that any effectively computable function can be computed by a universal Turing machine. The thesis is usually trotted out to explain why there’s nothing you can compute with a universal Turing machine that you can’t compute using lambda calculus, and therefore there’s no program you can write in C that you can’t, in principle, write in Lisp.

But here’s a corollary: since you can, if you really want to, implement a Turing tape by writing files to the file system one bit at a time and importing the results, you can use the Python import system to simulate a Turing machine. That implies that,
in principle, any computation that can be performed by a digital computer can be performed (assuming infinite space, time, and patience) using the Python import system.

The only real question is how annoying a computation will be to implement, and in this case Python’s extreme runtime dynamism makes this particular implementation surprisingly easy.

The Python community spends a lot of time advocating for good methodology and “idiomatic” coding styles. They have a good reason: if you’re writing software that’s intended to be used, some methods are almost always better than their alternatives.

But if you’re writing programs to learn, sometimes it’s helpful to remember that there are many different models of computation under the sun. And especially in the era when “deep learning” (i.e. graph-structured computations that simulate differentiable functions) is really starting to shine,
it’s extra important to remember that sometimes taking a completely different (and even wildly “inefficient”) approach to
a computational problem can lead to startling success.

It’s also nice to remember that Python itself started out as (and in a sense still is!) a ridiculously inefficient
and roundabout way to execute C code.

Abstractions really matter. In the words of Alfred North Whitehead,

Civilization advances by extending the number of important operations which we can perform without thinking about them

My “import sort” is certainly not a useful abstraction. But I hope that learning about it will lead you to some good ones!

Nota Bene

In case it’s not obvious, you should never use these techniques in any code that you actually intend to use for anything.

But the general idea of modifying Python source code at import time has at least one useful
(if not necessarily advisable) use case: macros.

Python has a library called macropy that implements Lisp-style syntactic macros in Python
by modifying your source code at import time.

I’ve never actually used macropy, but it’s pretty cool to know that Python makes the simple things easy and the insane things possible.

Finally, as bad as this mergesort implementation is, it allowed me to run a fun little experiment. We know that
mergesort has good computational complexity compared to naive sorting algorithms. But how long does a list have to be before a standard implementation of bubble sort runs slower than my awful import-based implementation of mergesort? It turns out that a list only has to be about 50k items long before “import sort” is faster than bubble sort.

Computational complexity is a powerful thing!

All the code for this post is on Github

A* Interview #19: Glyph Lefkowitz, Creator of The Twisted Networking Framework

According to his LinkedIn bio, today’s guest’s name is “Glyph Not Looking Don’t Contact Lefkowitz” and he created a powerful, event-driven networking library called Twisted. In this episode, he tells us how he got started, what his most painful lessons have been, and how trying to measure the merit of a programmer is a fundamentally flawed exercise.

To learn more about the A* interviews or to find more episodes, look here.

For the impatient or textually inclined, here’s a summary of our conversation.

What are you working on right now? Why are you working on it?

  • Working on “Mimic” for Rackspace.
  • Mimic allows testing of cloud infrastructure provisioning.
  • Saves you a lot of time and money confirming that your deployment will work as expected.
  • Also working on Twisted, where he has created an “egalitarian process” that’s in charge of it. He still contributes when he can.
  • The Twisted project was a really early innovator in being a heavily-process-driven, 100% code review, 100% test coverage, 100% docstring project (started back in 2004).

What are your main non-technical interests? Who’s your favorite musical artist?

How old were you when you started programming? What motivated you to start? How have your motivations changed since then?

  • Very young. Father was a programmer who taught him a bit of APL. Wanted to be a writer and build a text-based RPG like Zork so he taught himself to program…slowly. Spent a couple years building elaborate Hypercard stacks. After a few years, learned about if-statements (and a year later about variables.)
  • Eventually jumped into C++. Went to computer camp, learned Java and LISP, and filled his high school notebook margins with programs.
  • Really loved the interactivity of games; realized that he has attention deficit disorder and computers were the only thing that could hold his attention.

What’s the most painful technical or career lesson you’ve had to learn? How did you learn it?

  • Building a Gmail-before-Gmail product called “Quotient” at a company called Divmod. All the changes they tried to make increased complexity and decreased maintainability, making things move slower and slower. So the most painful lesson was that if you don’t have a process in place that considers the “people factors” you’ll end up with an unmaintainable mess.
  • You have to write stuff down, in a place that people will see it. And you need a process for testing and maintaining quality standards. Need to build that into your project early on.
  • Read Extreme Programming Explained: Embrace Change

What is your favorite programming language? What is your favorite tool (e.g. emacs, Postgres)? Why?

  • Favorite in terms of design – Smalltalk. Few moving pieces, hangs together conceptually. But it has no tools…strange technology from the future and can’t really practically do much with it.
  • Python compromises with reality. But has commandline interactive interpreter. Has a simple C implementation people can hack on. Garbage collection semantics that sort of make sense. The community values readability and there is a great community.
  • Favorite tool is pip. Lets you do the most important thing, which is split up responsibility into different modules. Twisted’s test runner “Trial”. Tmux is great for managing terminal sessions.

How should we measure how “good” a technologist is? What are the key “virtues” of a technologist or of a solution?

  • 10x is a toxic myth.
  • We should stop trying to measure goodness because the idea of “merit” leads to bad things in the community.
  • The only good metrics measure teams and they look at outcomes (how many features did you ship) rather than the product.
  • When evaluating, worry about how people will fit into a team rather than how “good” they are.
  • If there’s one defining characteristic, it’s a desire for consistent self-improvement.

How structured is your problem solving approach (e.g. do you use TDD? Do you always pseudocode before coding)? How much of your problem solving comes from intuition/flashes of insight vs. conscious thought?

  • Used to be all about flashes of insight. Has ADHD “hyper focus”; can stay focused super hard on one task, until he loses it and then can’t focus. He’s a 10x programmer in a short window of time, then nothing for a long time.
  • Then over-corrected to all 100% TDD all the time.
  • Then heard a talk by John Cleese of Monty Python about scriptwriting and the idea of “closed vs open mode” – open is creative, flowing, no fear of failure. Closed is tight, structured, about getting things done. The split should be about 50/50.
  • Don’t bash your head against problems. Try switching methods.

Any sites or projects you want to plug? (Feel free to plug your own business.)

If you could give one piece of advice to yourself at age 15, what would it be?

  • Learn abstract programming concepts earlier.

And that’s all for today

Follow me on Twitter or subscribe to my YouTube channel for future episodes!

A* Interview #16: StackOverflow’s all-time Python champ, Alex Martelli

Read some of Alex’s favorite monosyllabic books, Cod and Salt.

Listen to the Art of the Fugue on LinerNotes.

Check out Alex’s personal website.

And at his suggestion, sign up for Fitocracy and get yourself in shape!

Learn about the A* Series and see more interviews here.


The Idiomatic Guide to Deploying Django in Production

Idiomatic Django Deployment – The Definitely Definitive Guide

By George London and Adam Burns

Wow, this guide is long. Why not skip it?
George and Adam are both available to freelance.

Or, if you’re kind of excited by how long this guide is, consider following George on Twitter or subscribing to his newsletter.

Overview

By the end of this guide, you should have a simple but stable, actually deployed
Django website accessible at a public URL. So anyone in the world will be
able to visit “www.yourapp.com” and see a page that says “Hello World!”

You’ll go through the following steps:

  1. Setting up a host server for your webserver and your database.
  2. Installing and configuring the services your site will need.
  3. Automating deployment of your code.
  4. Learning what to do when things go wrong.
  5. Next time: Going beyond the basics with caching, monitoring etc.

Updated August 9th, 2015 to use Django 1.8.

Why This Guide Is Needed

A few years ago, George taught himself to program in order to
build his first startup, LinerNotes. He
started out expecting that the hardest part would be getting his head
around the sophisticated algorithmic logic of programming. To his
surprise, he’s actually had to do very little difficult algorithmic work.[1] Instead, the hardest part was getting proficient at using the
many different tools in the programmer’s utility belt. From emacs to
gunicorn, building a real project requires dozens of different
tools. Theoretically, one can a priori reason through a red-black
tree but there’s just no way to learn emacs without the reading the
manual. LinerNotes is actually a lot more complicated under the hood
than it looks on the surface and so he had to read quite a lot of
manuals.

The point of this guide is to save you some of that trouble. Sometimes
trouble is good – struggling to design and implement an API builds
programming acumen. But struggling to configure nginx is just a waste
of time. We’ve found many partial guides to Django deployment but
haven’t found any single, recently updated resource that lays out the
simple, Pythonic way of deploying a Django site in
production
. This post will walk you through creating such a set
up. But it won’t introduce you to basic DevOps 101 concepts. See the bottom for a glossary of acronyms and explanatory footnotes (because Github breaks my intra-page links).[2]

Disclaimer: We’re definitely not the most qualified people to write this post. We’re just the only ones dumb enough to try. If you object to anything in this post or get confused or find something broken, help make it better.
Leave a helpful comment (or even better submit a pull request to the Github repo.) The full text of this post is available in the repo and we’ll update this guide as appropriate.

Second disclaimer: If you’re working on a small project, aren’t doing anything unusual or custom with Django, and don’t anticipate needing to handle a large volume of traffic or expanding the scope of your project then you should seriously consider using a PaaS (platform as a service) provider like Heroku or Gondor.io. For a monthly fee, they handle all of the messy configuration (i.e. this guide) for you (as long as your app is structured according to their specifications.) They’re not necessarily easier to get started with than this guide, but they do save you from a lot of down-the-road hassle of administering your own servers (e.g. doing security patches.)

Overview of the Final Architecture

Our example site is just a “hello world” app, but this is going to be the most well-implemented, stable, and scalable
“hello world” application on the whole world wide web. Here’s a diagram of how
your final architecture will look:

Architecture Diagram

Basically, users send HTTP requests to your server, which are intercepted and
routed by the nginx load balancer. Requests for dynamic content will be routed to
your WSGI[3] server (Gunicorn) and requests for static content will be served
directly off the server’s file system. Gunicorn has a few helpers, memcached and celery,
which respectively offer a cache for repetitive tasks and an asynchronous queue
for long-running tasks.

We’ve also got our Postgres database (for all your lovely models) which we run on a
separate EC2 server.[4]

See below for a more detailed description of what each component
actually does.

Set Up Your Host Servers

Set up AWS/EC2

Since this guide is trying to get you to an actual publicly accessible site,
we’re going to go ahead and build our site on the smallest, freest Amazon EC2 instance available, the trusty “micro”. If you don’t want to use
EC2, you can set up a local virtual machine on your laptop using
Vagrant or use your own existing server (you’ll have to tweak my scripts a little). The Docker project has been picking up steam lately but at this point we believe that running Django inside of Docker on EC2 (i.e. running a virtual machine inside a virtual machine) is an unnecessary complication. But don’t be Docker-sad! We will be using Docker to run our deployment tools in an isolated, virtual container on our laptops.

Anyway, we’re going to use EC2 to set up the smallest possible host for our webserver and another
one for our database.

For this tutorial, you’ll need an existing EC2 account. There are many tutorials on setting up an account so I’m not going to walk you through the account setup.

Python has a very nice library called boto for administering AWS
from within code. And another nice tool called Fabric for creating
command-line directives that execute Python code that can itself execute
shell commands on local or remote servers. We’re going to use Fabric
to define all of our administrative operations, from
creating/bootstrapping servers up to pushing code. I’ve read that Chef (which we’ll use below) also has a plugin to launch EC2 servers but I’m going to prefer boto/Fabric because they give us the option of embedding all our “command” logic into Python and editing it directly as needed.

Start off by cloning the Github repo for this project onto your local machine.

git clone git@github.com:rogueleaderr/definitive_guide_to_django_deployment.git
cd definitive_guide_to_django_deployment

The github repo includes a fabfile.py[7] which provides all the
commandline directives we’ll need. But fabfiles are pretty intuitive
to read so try to follow along with what each command is doing.
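To give a flavor of what these directives look like, here’s a heavily stripped-down sketch of a create_instance task. This is an approximation, not the repo’s actual code: the real fabfile also tags the instance, waits for it to boot, and records its address. The AWS_* variables are the ones we’ll set up in deploy/environment in the next step.

# fabfile.py (sketch)
import os

import boto.ec2
from fabric.api import task


@task
def create_instance(name):
    """Launch an EC2 instance, e.g. `fab create_instance:webserver`."""
    conn = boto.ec2.connect_to_region(os.environ["AWS_DEFAULT_REGION"])
    reservation = conn.run_instances(
        os.environ["AWS_AMI_ID"],
        instance_type=os.environ["AWS_INSTANCE_TYPE"],
        key_name=os.environ["AWS_SSH_KEY_NAME"],
        security_groups=[os.environ["AWS_SECURITY_GROUP_NAME"]],
    )
    instance = reservation.instances[0]
    print("launched {} as {}".format(name, instance.id))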

First, we need to define AWS settings. In keeping
with the principles of the Twelve Factor App
we store configuration either in environment variables or in config files which
are not tracked by VCS. You can find your AWS access and secret keys on your
AWS security page.

echo '
AWS_ACCESS_KEY_ID=<your key here>
AWS_SECRET_ACCESS_KEY=<your secret key here>
AWS_DEFAULT_REGION=<your region here e.g. us-east-1>
AWS_SECURITY_GROUP_NAME=<your security group name>
AWS_INSTANCE_TYPE=t1.micro
AWS_AMI_ID=<your AMI id>
AWS_SSH_KEY_NAME=<your ec2 key name>
AWS_SSH_PORT=22' > deploy/environment
chmod 600 deploy/environment

Make sure you fill out the values between angle brackets with your own settings.

These settings will be exported as environment variables in the Docker
container where both Fabric and the AWS CLI will read them. We recommend using
an AMI for a “free-tier” eligible Ubuntu 14.04 LTS image.

If you already have an EC2 SSH key pair that you want to use, make sure you copy it to the deploy folder (otherwise skip this step and we’ll create one for you automatically):

cp -p <path to your ec2 key> deploy/ssh

We’re also going to create a settings file that contains all the configuration for our actual app.

echo '{}' > deploy/settings.json
chmod 600 deploy/settings.json

Now it’s time for Docker. Follow one of the guides for installing Docker and its pre-requisites on your development machine.

Docker runs a “container”, i.e. a lightweight virtual machine running (in this case) Ubuntu in an isolated environment on your development machine. So we can safely install all the tools we need for this guide inside the container without needing to worry about gumming up your development machine (or other things on your machine causing incompatibilities with these tools.)

If you’re on OSX then as of this writing the two ways to start a Docker container are using Boot2Docker or Kitematic. For either of those, you’ll need to open the app inside your Applications folder. This will start (or give you the opportunity to start) the VM that Docker uses to create its container VM’s. It may also push you into a new terminal, so be sure to

cd <your guide directory>

…if you’re not already there.

Now we can build and run the container for our deploy tools:

docker build -t django_deployment .
docker run --env-file=deploy/environment -tiv $(pwd):/project/django_deployment django_deployment /bin/bash

This step may take a while to download everything, so why not watch a PyCon video while you wait?

Once the process finishes, you’ll be at a bash shell in a container with all the tools installed. The
project’s root directory has been mounted inside the container so un-tracked
files like settings will be stored on your workstation but will still be
available to fab and other tools.

If Docker is giving you errors about being unable to connect and you’re on OSX, make sure that you’re running these commands in a terminal tab that was opened by Boot2Docker or Kitematic.

Warning: If you are running boot2docker on OS X you may need to restart
the boot2docker daemon on your host when you move to new networks (e.g. from
home to a coffee shop).

Now we’re going to use a Fabric directive to set up our AWS account[6] by:

  1. Configuring a keypair ssh key that will let us log in to our servers
  2. Setting up a security group that defines access rules to our servers

To run our first Fabric directive and set up our AWS account, go to the directory where our fabfile lives and
do

fab setup_aws_account

Launch Some EC2 Servers

We’re going to launch two Ubuntu 14.04 LTS servers, one for our web host
and one for our database. We’re using Ubuntu because it seems to be
the most popular Linux distro right now, and 14.04 because it’s a Long Term Support (LTS)
version, meaning we have the longest period before it’s officially
deprecated and we’re forced to deal with an OS upgrade. Depending on what AMI you choose in your settings, you may end up with a different version. (If in doubt, choose an i386 ebs-ssd LTS version.)

With boto and Fabric, launching a new instance is very easy:

fab create_instance:webserver
fab create_instance:database

These commands tell Fabric to use boto to create a new “micro”
(i.e. free for the first year) instance on EC2, with the name you
provide. You can also provide a lot more configuration options to this
directive at the command line but the defaults are sensible for now.

It may be a few minutes before new instances are fully started. EC2 reports
them online when the virtual hardware is up but Linux takes some time to boot
after that.

Now you can ssh into a server:

fab ssh:webserver

If you create an instance by mistake, you can terminate it with

fab terminate_instance:webserver

Install and Configure Your Services

We need a project to deploy. Your easiest option is to use the sample project I’ve created. The deployment will clone it onto webservers automatically, but you might want your own local clone so you can follow along:

git clone git@github.com:rogueleaderr/django_deployment_example_project.git

If you don’t do that, then you need to…

Make sure your project is set up correctly:

This guide assumes a standard Django 1.5+ project layout with a few small tweaks:

  • Your settings should consist of three files:

    app
    --settings
        --__init__.py         # tells python how to import your settings
        --base.py             # your default settings
        --local_settings.py   # settings that will be dynamically generated from your settings.json. don't track in git
    

    And your __init__.py should consist of:

    # application_python cookbook expects manage.py in a top level
    # instead of app level dir, so the relative import can fail
    try:
        from .<project_name>.<project_name>.settings.base import *
    except ImportError:
        from <project_name>.settings.base import *
    
    try:
        from local_settings import *
    except ImportError:
        pass
    

    Our bootstrapping process will create a local_settings.py but to develop locally you’ll need to make one manually with your database info etc. (Don’t check it into git.)

  • We’re serving static files with dj-static. To use dj-static, you need a couple project tweaks:

    In base.py, set

    STATIC_ROOT = 'staticfiles'
    STATIC_URL = '/static/'
    

    Modify your wsgi.py:

    from django.core.wsgi import get_wsgi_application
    from dj_static import Cling
    
    application = Cling(get_wsgi_application())
    
  • Your installed apps must contain your project app and also djcelery.

Understand the services

Our stack is made up of a number of services that run
semi-independently:

Gunicorn: Our WSGI webserver. Gunicorn receives HTTP requests forwarded to it from nginx, executes
our Django code to produce a response, and returns the response which nginx transmits back to the client.

Nginx: Our
load balancer (a.k.a. “reverse proxy server”). Nginx takes requests from the open internet and decides
whether they should be passed to Gunicorn, served a static file,
served a “Gunicorn is down” error page, or even blocked (e.g. to prevent denial-of-service
requests.) If you want to spread your requests across several Gunicorn nodes, it just takes a tiny change to your nginx configuration.

Memcached: A simple in-memory key/value caching system. Can save
Gunicorn a lot of effort regenerating rarely-changed pages or objects.

Celery: An async task system for Python. Can take longer-running
bits of code and process them outside of Gunicorn without jamming up
the webserver. Can also be used for “poor man’s” concurrency in Django. (There’s a tiny task sketch just after this list.)

RabbitMQ: A queue/message broker that passes asynchronous tasks
between Gunicorn and Celery.

Supervisor: A process manager that attempts to make sure that all key services stay
alive and are automatically restarted if they die for any reason.

Postgres: The main database server (“cluster” in Postgres
parlance). Contains one or more “logical” databases containing our
application data / model data.
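As promised, here’s roughly what handing work off to Celery looks like (a hypothetical task, not part of the example project; it just shows the shape of the pattern):

# tasks.py (sketch)
from celery import shared_task

@shared_task
def send_welcome_email(user_id):
    # slow work happens here, in a Celery worker, outside the request/response cycle
    pass

# in a view: enqueue the job via RabbitMQ and return immediately
# send_welcome_email.delay(new_user.id)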

Install the services

We could install and configure each service individually, but instead
we’re going to use a “configuration automation” tool called
Chef. Chef lets us write simple Ruby
programs (sorry Python monogamists!) called Cookbooks that automatically
install and configure services.

Chef can be a bit intimidating. It provides an entire Ruby-based
DSL for expressing configuration. And it
also provides a whole system (Chef server) for controlling the
configuration of remote servers (a.k.a. “nodes”) from a central location. The DSL is
unavoidable but we can make things a bit simpler by using “Chef Solo”, a stripped down version of Chef that does away with the whole central server and leaves us with
just a single script that we run on our remote servers to bootstrap our
configuration.

(Hat tip to several authors for blog posts about using Chef for Django[8])

Wait, a complicated ruby tool? Really?

Yes, really. Despite being in Ruby[9], Chef has some great advantages that make it worth learning (at least enough to follow this guide.)

  1. It lets us fully automate our deployment. We only need to edit one configuration file and run two commands and our entire stack configures itself automatically. And if your servers all die you can redeploy from scratch with the same two commands (assuming you backed up your database).
  2. It lets us lock the versions for all of our dependencies. Every package installed by this process has its version explicitly specified. So this guide/process may become dated but it should continue to at least basically work for a long time.
  3. It lets us stand on the shoulders of giants. Opscode (the creators of Chef) and some great OSS people have put a lot of time into creating ready-to-use Chef “cookbooks” for nearly all our needs. Remember, DRW-EWTMOTWWRS (Don’t Re-Invent the Wheel, Especially When the Maker of the Wheel Was Really Smart).

Okay, buckle up. We’re going to need to talk a little about how Chef works. But it’ll be worth it.

At the root, Chef is made up of small Ruby scripts called recipes that express
configuration. Chef declares configuration rather than executing a
series of steps (the way Fabric does). A recipe is supposed to describe all the resources that
are available on a server (rather than just invoking installation
commands.) If a resource is missing when a recipe is run, Chef will
try to figure out how to install that resource. If a configuration file has the wrong information, Chef will fix(/brutally overwrite) it. Recipes are
(supposed to be) idempotent, meaning that if you run a recipe and
then run it again, the second run will have no effect.

But which recipes to run? Chef uses cookbooks that
group together recipes for deploying a specific tool (e.g. “the git
cookbook”). And Chef has a concept called “roles” that let you specify
which cookbooks should be used on a given server. So for example, we
can define a “webserver” role and tell Chef to use the “git”, “nginx”
and “django” cookbooks. Opscode (the makers of Chef) provide a bunch
of pre-packaged and (usually well maintained) cookbooks for common
tools like git. And although Chef cookbooks can get quite complicated, they are just code and so they can be version controlled with git.

Chef, make me a server-wich.

We use a couple of tools, knife solo and Berkshelf (both described below), that simplify working with Chef.

We’re going to have two nodes, a webserver and a database. We’ll have four roles:

  1. base.rb (common configuration that both will need, like apt and git)
  2. application_server.rb (webserver configuration)
  3. database.rb (database configuration)
  4. deploy.rb (to save time, a stripped down role that just deploys app code to the app server)

The role definitions live in chef_files/roles. Now we just need to tell Chef which roles apply to which nodes, and we do that in our chef_files/nodes folder in files named “{node name}_node.json”. If you use names other than “webserver” and “database” for your nodes, you must rename these node files.

Any production Django installation is going to have some sensitive
values (e.g. database passwords). Chef has a construct called data
bags
for isolating and storing sensitive information. And these bags
can even be encrypted so they can be stored in a VCS. Knife solo lets us create a databag and encrypt
it. Fabric will automatically upload our databags to the server where
they’ll be accessible to our Chef solo recipe.

Start by loading the values we need into our settings.json file. Be sure to
update settings anytime you create new servers.

APP_NAME should be the name of the project in your repo, as it would appear
in a Python import path.

echo '{
"id": "config_1",
"POSTGRES_PASS": "<your db password>",
"DEBUG": "False",
"DOMAIN": "<your domain name>",
"APP_NAME": "<name of python package inside your repo: e.g. deployment_example_project>",
"DATABASE_NAME": "<your database name>",
"REPO": "<your github repo name: e.g. django_deployment_example_project>",
"GITHUB_USER": "<your github username: e.g. rogueleaderr>",
"DATABASE_IP": "DB_IP_SLUG",
"EC2_DNS": "WEB_IP_SLUG"
}' \
| sed -e s/DB_IP_SLUG/`cat deploy/fab_hosts/database.txt`/ \
| sed -e s/WEB_IP_SLUG/`cat deploy/fab_hosts/webserver.txt`/ \
> deploy/settings.json

Now we need an encryption key (which we will NOT store in Github):

cd chef_files
openssl rand -base64 512 > data_bag_key
cd ..
# if you aren't using my repo's .gitignore, add the key
echo "chef_files/data_bag_key" >> .gitignore

Now we can use knife-solo to create an encrypted data bag from our settings.json file:

cd chef_files
knife solo data bag create config config_1 --json-file ../deploy/settings.json
cd ..

Our chef_files repo contains a file Berksfile that lists all the cookbooks we are going to install on our server, along with specific versions. Knife solo will install all of these with a tool called Berkshelf, which I honestly assume is named after this. If a cookbook becomes dated, just upgrade the version number in chef_files/Berksfile.

Now we’re going to use Fabric to tell Chef to first bootstrap our database and then bootstrap our webserver. Do:

fab bootstrap:database
fab bootstrap:webserver

This will:

  1. Install Chef
  2. Tell Chef to configure the server

A lot of stuff is going to happen so this may take a while. Don’t worry if the process seems to pause for a while. But if it exits with an error please create an issue on Github describing what went wrong (or better yet, submit a pull request to fix it.)

What is this magic?

Chef actually does so much that you might be reluctant to trust it. You may(/should) want to understand the details of your deployment. Or you may just distrust Ruby-scented magic. So here’s a rough walk-through of everything the bootstrap scripts do.

Database

The database role first installs the essential base packages specified in base.rb, i.e. apt, gcc, etc, and sets up our ubuntu admin user with passwordless sudo.

Then we run our custom database recipe that:

  1. Installs Postgres on the server

  2. Modifies the default postgresql.conf settings to take full advantage of the node’s resources (dynamically calculated using a cookbook called ohai.) Changes the Linux shmmax/shmall parameters as necessary.

  3. Tells Postgres to listen on all interfaces and accept password-authenticated connections from all IPs (which is okay because we use Amazon’s firewall to block any connection to the Postgres node from outside our security group.)

  4. Creates our application database using a password and database name read in from our settings.json file.

  5. Restarts Postgres to pick up the configuration changes.

Webserver

Again, install base packages per base.rb.

Then runs our main application server setup recipe:

  1. Runs Opscode cookbooks to install basic packages, including: git, nginx, python-dev, rabbit-mq, memcached, and postgres-client.

  2. Reads configuration variables from our encrypted data bag (made from settings.json)

  3. Updates Ubuntu and installs the bash-completion package. Creates a .bashrc from a template.

  4. Creates an nginx configuration for our site from a template, loads it into nginx’s configuration folder, and restarts nginx.

  5. Deploys our Django app, which consists of:

  • Create a folder called /srv/<app_name> that will hold our whole deployment

  • Create a folder called /srv/<app_name>/shared which will hold our virtualenv and some key configuration files

  • Download our latest copy of our Github repo to /srv/<app_name>/shared/cached-copy

  • Create a local_settings.py file from a template and include information to connect to the database we created above (details loaded from our data bag.)

  • “Migrate” (i.e. sync) the database with manage.py syncdb. The sync command can be overwritten if you want to use South.

  • Install all our Python packages with pip from requirements/requirements.txt.

  • Run manage.py collectstatic to copy our static files (css, js, images) into a single static folder

  • Install gunicorn and create a configuration file from a template. The config file lives at /srv/<app_name>/shared/gunicorn_config.py.

  • Bind gunicorn to talk over a unix socket named after our app

  • Install celery, along with celery’s built-in helpers celerycam and celerybeat. Create a configuration file from a template. The config lives at /srv/<app_name>/shared/celery_settings.py.

  • Create “supervisor” stubs that tell supervisor to manage our gunicorn and celery processes.

  • Copy the ‘cached-copy’ into a /srv/<app_name>/releases/<sha1_hash_of_release> folder. Then symlink the latest release into /srv/<app_name>/current, which is where the live app ultimately lives.

  6. Creates a “project.settings” file that contains the sensitive variables (e.g. database password) for our Django app.

Hopefully this list makes it a bit more clear why we’re using Chef. You certainly could do each of these steps by hand but it would be much more time consuming and error-prone.

Make it public.

If Chef runs all the way through without error (as it should) you’ll now have a ‘Hello World’ site accessible by opening your browser and visiting the “public DNS” of your site (which you can find from the EC2 management console or by doing cat deploy/fab_hosts/webserver.txt). But you probably don’t want visitors to have to type in “103.32.blah-blah.ec2.blah-blah”. You want them to just visit “myapp.com” and to do that you’ll need to visit your domain registrar (e.g. GoDaddy or Netfirms) and change your A-Record to point to the IP of your webserver (which can also be gotten from the EC2 console or by doing):

ec2-describe-instances | fgrep `cat deploy/fab_hosts/webserver.txt` | cut -f17

Domain registrars vary greatly on how to change the A-record so check your registrar’s instructions.

By default, EC2 provides a “public DNS” listing for instances which looks like ec2-54-166-68-245.compute-1.amazonaws.com

Any time you stop and then start an EC2 instance, the DNS address gets re-assigned. If you’re putting your website on one of these nodes, that’s not ideal
because you would have to update your A-Record every time you need to stop the instance for any reason. (And it’s even worse because A-Records updates can take 24-48 hours to
propagate through the internet so your site may be unreachable for a while.)

To avoid that problem, Amazon lets you associate an “Elastic IP” with a given server. An Elastic IP is a fixed IP address that will
stay with the server even if it’s stopped.

To associate an elastic IP with your webserver, you can do

 fab associate_ip:webserver

Note: AWS has a soft limit of 5 elastic IP’s per account. So if you already have elastic IP’s allocated, you may need to delete or reassign them in your AWS management console.

After you change the IP, you’ll also need to change your ALLOWED_HOSTS Django setting. The fab command updates the host settings in your config files in the local guide directory you’ve been working in,
but to push those changes to your server you’ll need to update your secret databag:

cd chef_files
knife solo data bag create config config_1 --json-file ../deploy/settings.json
cd ..

And re-bootstrap the server to update the settings:

fab bootstrap:webserver

Automatically Deploy Your Code

Well, this is simple. Just commit your repo and do

git push origin master

Then back in the deployment guide folder, do:

fab deploy:webserver

Debugging:

Deploy

  1. If you edit deploy/settings.json, remember to regenerate the chef data bag.
  2. If you terminate and re-launch an instance, remember to update the IP and DNS fields in deploy/settings.json.
  3. The vars defined in deploy/environment are read when the docker container starts. If you edit them, exit and re-run the docker container so they’re re-read.
  4. The cache in chef_files/cookbooks can get outdated. If you’re seeing Chef errors, try deleting the contents of that directory and starting over.
  5. By default Chef will try to roll back failed launches, but that can make it hard to figure out why the launch failed. To disable rollback, add rollback_on_error false to chef_files/site-cookbooks/django_application_server/recipes/default.rb in the same place as its repository and revision options.

Nginx

Nginx is the proxy server that routes HTTP traffic. It has never once gone
down for me. It should start automatically if the EC2 server restarts.

If you need to start/restart nginx, log in to the webserver and do:

sudo service nginx restart

If nginx misbehaves, logs are at:

/var/log/nginx/

If, for some reason, you need to look at the nginx conf it’s at:

sudo emacs /etc/nginx/sites-available/<app_name>.conf

If you need to edit it, avoid making changes to any conf files on the server, instead change:

chef_files/site-cookbooks/django_application_server/templates/default/nginx-conf.erb

And rerun the Chef bootstrap process. It’s idempotent so it won’t change anything else (unless you’ve been tinkering directly with the server in which case your changes will be “reeducated”).

RabbitMQ

Another service that’s started automatically. I have literally never had to interact
directly with it. But it can also be restarted with:

sudo service rabbitmq-server restart

Memcached

Memcached is also a service and starts automatically if the EC2 node restarts. Unless you’ve designed a monstrosity, your site
should also continue to function if it dies (just be slow). Caching issues can sometimes
cause weird page content, so if something seems unusually bizarre try flushing the cache
by restarting memcached:

sudo service memcached restart

Memcached is pretty fire and forget…since it’s in memory it’s theoretically possible it
could fill up and exhaust the memory on the webserver (I don’t have a size cap and I make my TTL’s
very long) but that has never happened so far. If it does, just reset memcached and it
will clear itself out.

To start Gunicorn/Celery:

Gunicorn and the Celery Workers are controlled by Supervisor, which is a Linux process runner/controller. Supervisor will automatically restart them if
they’re terminated abnormally.

The Supervisor configuration is located at:

/etc/supervisor/conf.d/<app_name>.conf

To restart gunicorn and celery together, simply do:

fab restart

To restart manually, you can use a Supervisor utility called supervisorctl that lets you check the status of and
restart processes. So if you need to restart gunicorn or celery, you can do:

sudo supervisorctl restart <app_name>
sudo supervisorctl restart <app_name>-celeryd

Or to check process status, just do

sudo supervisorctl status

Supervisor routes all log output from managed processes to /var/log/supervisor/<process-name-and-long-character-string>. So if your server is behaving badly you can start there.

I keep a GNU screen active in the log directory so
I can get there quickly if I need to. You can get there with

screen -r <my screen>

Postgres

Postgres is a very stable program but can be a bit touchy on a small server under heavy load. It’s probably
the most likely component to give you trouble (and sadly basically your whole site becomes totally
non-operational if it goes down.)

Postgres runs as a service so if you need to restart it (try not to need to do this) you
can do:

sudo service postgresql restart

The disk can also fill (especially if something gets weird with the logging.) To check disk space:

df -h

If a disk is 99% full, find big files using

find / -type f -size +10M -exec ls -l {} \;

EC2 instances (larger than our micro type) all have “instance store” disks on /mnt, so you can move obviously
suspicious files onto the instance store and sort them out later.

If that’s not enough, check the logs for the service at /var/log/postgresql/

Future Crimes of Configuration

This guide has gotten long enough for now, so I’m going to see how it’s received before delving into advanced topics. But here are a few quick suggestions:

Set Up Monitoring

There are a bunch of open and closed source solutions. They’re all a bit more complicated than I’d like to set up. But here’s what I personally use:

Datadog Monitoring

Datadog makes pretty metric dashboards. It will automatically monitor server CPU/memory/etc. and can send an
alert if there's no CPU activity from the webserver or the database (probably meaning the
EC2 servers are down). It can also hook into a custom statsd library and let you emit/graph whatever metrics you want from anywhere in your app. You just have to decorate your code by hand.
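
For example, timing and counting a hot view with the generic Python statsd client looks roughly like this (Datadog ships its own statsd-compatible client, but the pattern is the same; the metric names are made up):

# views.py -- sketch using the generic `statsd` client (pip install statsd)
import statsd
from django.http import HttpResponse

stats = statsd.StatsClient("localhost", 8125)  # the agent's dogstatsd port

@stats.timer("myapp.views.homepage")           # hypothetical timing metric
def homepage(request):
    stats.incr("myapp.views.homepage.hits")    # count every request to this view
    return HttpResponse("hello")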

Notifications / PagerDuty

PagerDuty is a service that will call, text, or email you if something goes wrong with a
server. I've configured it to email/SMS me if anything goes wrong with my site.

Django by default automatically emails error reports, which I:

  1. route to PagerDuty, which automatically opens an “incident” and texts me
  2. also receive directly, with the full details of the error

Occasionally these emails are for non-serious issues, and there's no easy way to
filter them. It can be a bit chatty if you haven't chased down all the random non-critical errors in your app, but it saves you from only finding out 12 hours later that your site was down.
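
If you haven't wired this up before, it's just Django's standard error-email machinery pointed at PagerDuty's email-integration address. A minimal sketch of the relevant settings (both addresses are hypothetical):

# settings.py (sketch) -- with DEBUG off, Django emails unhandled exceptions
# to everyone listed in ADMINS
DEBUG = False

ADMINS = (
    ("Me", "me@example.com"),                        # plain copy of the error
    ("PagerDuty", "myapp@myaccount.pagerduty.com"),  # hypothetical integration address
)

SERVER_EMAIL = "django@example.com"   # the From: address on error emails
EMAIL_HOST = "localhost"              # or your SMTP provider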

Connection pooling

Up to and including Django 1.5, Django opens a new Postgres connection for every request, which costs a ~200ms SSL negotiation each time. Skip that overhead by using a connection pooler like django-postgrespool. You can also put PgBouncer in front of your Postgres server to make sure it doesn't get overwhelmed with incoming connections.

Apparently Django 1.6 also adds built-in persistent connections (the CONN_MAX_AGE database setting), which solves much of this without a separate pooler.
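
Here's a rough sketch of both options in settings.py (the django_postgrespool engine name comes from that project's README, so double-check it against the version you install; CONN_MAX_AGE only exists in Django 1.6+):

# settings.py (sketch)
DATABASES = {
    "default": {
        # Option 1: django-postgrespool wraps psycopg2 in an SQLAlchemy pool
        "ENGINE": "django_postgrespool",
        "NAME": "myapp",
        "USER": "myapp",
        "PASSWORD": "secret",
        "HOST": "my-db-host",          # hypothetical hostname
        # Option 2 (Django 1.6+): use the normal backend and just keep
        # connections open between requests for up to ten minutes:
        # "ENGINE": "django.db.backends.postgresql_psycopg2",
        # "CONN_MAX_AGE": 600,
    }
}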

Cache Settings

A lot of what Django does from request to request is redundant. You can hugely increase responsiveness and decrease server load by caching aggressively. Django has built-in support for caching whole views (but you have to enable and configure the cache yourself). You can also use cache-machine to cache your models and significantly reduce your database load.
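
As a sketch, pointing Django at the memcached instance above and caching one heavyweight view for fifteen minutes looks roughly like this (the view name is illustrative):

# settings.py (sketch)
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}

# views.py
from django.http import HttpResponse
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)                # serve the cached copy for 15 minutes
def expensive_report(request):      # hypothetical, query-heavy view
    return HttpResponse("...lots of expensive database queries...")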

Backup

The nice thing about this Chef setup is that if anything goes wrong with your webserver, it might actually be faster to deploy a new one from scratch and fail over than to try to restore your broken server. But you've still got to back up your database. Nothing can help you with deleted data. Postgres has a number of options, including streaming replication and shipping the write-ahead log (WAL) to S3.
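
This guide doesn't cover a full replication or WAL-shipping setup, but even a dumb nightly pg_dump cron'd up to S3 with boto beats nothing. A sketch (database name, bucket, and credential handling are all hypothetical):

# backup_db.py (sketch) -- a dumb nightly backup run from cron; not a
# substitute for streaming replication or proper WAL shipping
import datetime
import subprocess

import boto

DB_NAME = "myapp"              # hypothetical database name
BUCKET = "myapp-db-backups"    # hypothetical S3 bucket

dump_path = "/mnt/%s-%s.dump" % (DB_NAME, datetime.date.today())
subprocess.check_call(["pg_dump", "-Fc", DB_NAME, "-f", dump_path])

conn = boto.connect_s3()       # reads credentials from ~/.boto or environment variables
bucket = conn.get_bucket(BUCKET)
key = bucket.new_key(dump_path.split("/")[-1])
key.set_contents_from_filename(dump_path)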

South migrations

Just use them. Migrations are also apparently baked into Django as of 1.7.

Gevent

In my experience, by far the biggest cause of slowness in Django is I/O or network requests (e.g. calling an external API to supply some data for a widget.) By default, Python blocks the thread making the call until it's done. Gunicorn gives you “workers” which run as separate processes, but if you have four workers and they all block waiting on long database queries then your whole site will just hang until a worker is free (or the request times out.)

You can make things way faster by using “green threading” via gevent. Green threading is conceptually complicated (and occasionally buggy), but the basic idea is that one OS thread can contain many “green” threads. Only one green thread runs at a time, but if it needs to wait for I/O it cedes control to another green thread. So your server can accommodate way more requests by never blocking the gunicorn workers.
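
Gunicorn can be switched over to gevent workers entirely through its config file, which is itself just Python. A minimal sketch (file name, port, and worker count are illustrative; you'd point the gunicorn command that Supervisor runs at it with -c):

# gunicorn_conf.py (sketch) -- pass it to gunicorn with `-c gunicorn_conf.py`;
# requires the gevent package in your virtualenv
bind = "127.0.0.1:8000"    # nginx proxies to this address (illustrative)
workers = 4                # still a handful of OS processes...
worker_class = "gevent"    # ...but each one multiplexes many green threads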

Wrap up

And there you go. You've got a production-ready Django website. It's reasonably secure, easy to update, cheap to run, fully customizable, and should easily handle the kind of traffic a new site is likely to get. If you need more power, just shut down your EC2 instances and upgrade to a larger instance type. Or get fancy by spinning up more webservers and using the load balancer to route requests among them.

Anyway, thanks for making it this far! If you’ve got any suggestions for how to do anything in this guide better, please leave a comment or a pull request! And if you build any custom Chef code for your own deployment, please consider contributing it back to this guide or to the official application_python cookbook.

And if you enjoy this kind of material, consider following me on Twitter or subscribing to my newsletter.

Notes

[1] And Python has existing libraries that implement nearly any algorithm better than I could anyway.

[2] I’ll
try to be gentle but won’t simplify where doing so would hurt the
quality of the ultimate deployment. If you
don’t know what a load balancer or an SSH key is, you’re going to have
a hard time following along. But Google can help you with that. Don’t worry, I’ll be here when you get back.

[3] You can run Postgres on the same VM, but putting it on a
separate box will avoid resource contention and make your app more scalable. You can also run nginx and celery on their own VMs, which will make your site super scalable. But if you need this guide, you're probably not seeing enough traffic to make that worth the added complexity.

[4] More about WSGI

[6] For development I enjoy VirtualenvWrapper, which makes switching between venvs easy. But it installs venvs by default in a ~/Envs directory in your home folder, and for deployment we want to keep as much as possible inside one main project directory (to make everything easy to find.)

[7] Hat tip to garnaat for
his AWS recipe to setup an account with boto

[8] Hat tip to Martha Kelly for her post on using Fabric/Boto to deploy EC2

[9] Chef/Django posts:

[*] Yes, there are other configuration automation tools. Puppet is widely used, but I find it slightly more confusing and it seems less popular in the Django community. There is also a tool called Salt that's even written in Python. But Salt seems substantially less mature than Chef at this point.

Glossary

AMI

– An “AMI” is an Amazon Machine Image, i.e. a re-loadable snapshot of a configured system.

CLI

– command line interface. Amazon gives us a set of new command line “verbs” to control AWS.

EC2

– Elastic Compute Cloud, Amazon’s virtual server farm.

AWS

– Amazon Web Services, the umbrella of the many individual cloud services Amazon offers

DSL

– Domain specific language. Aka a crazy mangled version of Ruby customized to describe service configuration.

VCS

– Version control system, e.g. git or SVN or mercurial

Bibliography

Randall Degges rants on deployment

Rob Golding on deploying Django

Aqiliq on deploying Django on Docker

How to use Knife-Solo and Knife-Solo_data_bags

Tumblr Eats a Pelican: The Problem With Static Website Generators

Tl;dr: Don’t bother switching. Tumblr is better.

Guys, my blog is failing.

I’ve been blogging (via Tumblr) at http://www.rogueleaderr.com for two years and the Google analytics data speaks for itself:

Except for one random post about subverting Megabus wifi, my posts hardly ever see more than 200 hits (and few of those visitors actually read to the end). I'm grateful for the readers I have, but I'm only getting marginally more amplification than shouting from a literal soapbox.

So I’ve decided to step up my blogging game

I’ve got some ideas I want to spread. And it’s time to find out whether no one else cares about those ideas or whether I’m just doing a bad job of broadcasting.

So I'm taking a few concrete steps:

  1. Creating an email newsletter. (You can sign up here)
  2. Getting serious about promoting my blog posts.
  3. Redesigning my blog.

1 & 2 are topics for a different day. This post is about #3, my decision to switch from trusty Tumblr to a static website generator.

Spoiler: I will be wrecked by “what you know for sure that just ain't so…”

I should have listened to Mark Twain. That’s the main moral of this story. The longer story is that I let bad assumptions waste quite a lot of my time.

If you spent any time on Hacker News et al., you’re probably familiar with the classic “hacker profile page”, as best exemplified by Tom Preston-Werner or Kenneth Reitz or Paul Graham. These inimitably classy and minimalist sites generally contain a short tagline-style bio of the hacker, a list of projects, a list of blog posts, and maybe a contact form or easter egg.

I may not have a tenth the skill of any of those chaps, but I know how to steal a good idea when I see one. I wanted my blog to look like theirs and so I assumed (falsely!) that I needed to build my blog the way they built theirs. That meant a static website generator since, after all, Tom Preston-Werner popularized the very idea of static website generators with his tool Jekyll.

If you’re not familiar, static website generators (hereafter “SWG’s”) let you store all the components of your website (including each individual blog post) as files in a directory (often in a markup language like Markdown). You run a script and presto everything is compiled to interlinked HTML files which can be uploaded and served directly from a webserver or even for free from Github.

Static is the move.

As I started researching SWG's, I quickly realized that there are a lot of options. The most popular seems to be the Ruby-based Jekyll, the original heavyweight. I read Ruby at a 3rd grade level, and SWG's in general seem to require a lot of tweaking, so I was inclined to prefer a Python-based option.

There are plenty of those as well, but I found a glowing post about one called Pelican so, in the absence of any clear compelling differences among the options, I decided Pelican was the way to go. It's pretty simple to set up a basic Pelican site and even to deploy it publicly on Github Pages (probably 30 minutes' work.) But a few minutes later, the problems started.

First, I wanted to create an “About Me” page. That's quite easy, if you already know how to do it. If you don't, the Pelican docs (and the limited tutorials available online) require some pretty close study.

Next, I wanted to make my blog pretty. Pelican ships with some off-the-shelf themes. IMHO, they're all somewhere between ugly and “meh.” So I hopped over to CreativeMarket (with a big discount from AppSumo burning a hole in my pocket) and looked for a theme. There are hundreds of WordPress and Tumblr themes for sale, but nothing for Pelican and only a few for “generic HTML site”. Before accepting the cost of customizing one of those to fit Pelican, I googled and found a generic Svbtle ripoff theme for Pelican.

Sure it’s not responsive, but only 7% of my traffic is mobile. So…goodbye forever iPhone losers!

I need to build a link to the past

Next problem: my blog gets a small but non-trivial amount of inbound search traffic. Simply shutting down the old site would kill all traffic to those archives. So I needed to import my posts from Tumblr. But as I've learned the hard way on my current project LinerNotes, Googlebot ain't too bright. Even if I redirect the rogueleaderr.com domain to my pretty new static site, any change in the URL's of my old posts will at a minimum break any external links to those posts and at worst cause Google to penalize them as both broken links and duplicate content. Plus, cool URI's don't change.

A site deployed behind a webserver like nginx could use URL rewrites to redirect incoming URL’s (assuming the URL slugs could be preserved). But…Github pages doesn’t allow redirects (except perhaps via Javascript; I never got that far.)

And I’m back to Tumblr

While googling for how to import from Tumblr to Pelican, I stumbled upon a Tumblr manual page and had a massive facepalm moment.

It turns out that Tumblr has a feature called “Pages” that lets you include and link to static pages right inside your Tumblog.

All you have to do is press a button and drop in a little HTML and you’ve got your own “About Me” page. In other words, basically all the customization power I was hoping to get from a static site is already available in Tumblr with none of the hassle of learning how to use a generator, managing the migration, and maintaining a separate site.

With about 45 minutes of work and the help of the Tumblr Svbtle theme, I was able to create a site that looked and worked better than my Pelican site and preserved all my lovely links and Tumblr followers.

And it’s even responsive.

Moral of the story

Tumblr is a lot more powerful than I realized (as, probably, is WordPress, which I haven't used). And Tumblr gives you an awful lot of bang for the setup buck. I'm now struggling to think of a use case that SWG's serve and Tumblr doesn't. (Maybe if you want lots of highly customized static pages, or if you want to run a webserver with complicated URL trickery.) And if you're publishing a lot of varied types of content, you'll probably be better served by a full-blown CMS, since SWG's are only marginally less complicated to learn.

I’ve read worries about “what if Tumblr dies off like Posterous” – well, Tumblr allows export of your posts via an API, so you can deal with learning Pelican as soon as Yahoo writes off its $1bln investment.

One good argument I’ve heard is that storing a blog in a directory allows version control. That’s true, but I basically never change old posts. And it’s easy enough to save a copy of the HTML of a Tumblr theme before changing it.

All in all, the benefits of a Static Website Generator don't outweigh the hassles.

Think I’m crazy? Let’s talk it out in the comments.

Did you like this post?

Then upvote it on Hacker News. Follow me on Twitter. Or subscribe to my newsletter.