Scheduled Downloads using AWS EC2 and Docker

Using AWS for automated downloads

This post describes a setup for automated batch retrieval of intraday stock prices from Google Finance, followed by upload to the Amazon Web Services Simple Storage Service (AWS S3), using Docker to deploy the application.

The main focus is on the DevOps aspects: identifying the separate system components and connecting them together. This should make the approach reusable for more general data analysis applications.

System Components

The system consists of the following data flow components: the Python application that downloads the ticker data, the Docker image that packages the application and its dependencies, the EC2 instance that runs the container on a schedule, and the S3 bucket that stores the retrieved data.

In addition there are security components present: an IAM policy and role granting the EC2 instance upload access to S3, and the key pair and security group controlling SSH access to the instance.

Application

The application considered in this post is the first stage of an Extract-Transform-Load (ETL) pipeline for analysis of intraday time series data. It takes a list of tickers from a CSV file, queries the Google Finance API for price and volume data at 1-minute intervals over the previous 10 days for each ticker, and then uploads the retrieved data to S3.

The language used for this task is Python, making use of a number of useful libraries such as boto3 for interacting with AWS services (used in the upload code shown below).
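The download code itself is not reproduced here; as a rough illustration, a minimal sketch of the kind of request involved, assuming the undocumented (and since retired) getprices endpoint and the requests library, might look like:

import requests

def get_intraday_prices(ticker, interval_sec=60, period='10d'):
    """Fetch intraday OHLCV data for one ticker from the Google Finance
    getprices endpoint (parameter names are assumptions based on its
    commonly described usage)."""
    params = {'q': ticker,          # ticker symbol, e.g. 'VOD'
              'i': interval_sec,    # bar interval in seconds
              'p': period,          # look-back period
              'f': 'd,o,h,l,c,v'}   # datetime, open, high, low, close, volume
    resp = requests.get('http://www.google.com/finance/getprices', params=params)
    resp.raise_for_status()
    return resp.text  # CSV-like text to be parsed and saved locally

The returned text is parsed and written to a dated sub-directory of the local data directory, which the upload step described below then copies to S3.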

A common problem with deploying Python applications is packaging the code together with the libraries it depends on, so that the code runs correctly on the deployment host. A number of packaging solutions exist; here Docker has been chosen to construct a packaged application.

Docker

Docker is a tool for constructing software containers, which use OS-level virtualization to run packaged applications in isolated user spaces on the host OS. In practice it can be used as a form of virtual environment construction, where the container holds not only the virtual environment with its set of installed libraries but also the platform used to build such environments, in this case the Python Anaconda distribution.

Application code that runs locally against a set of installed packages is deployed by constructing Docker images that replicate the local environment. These images are built from a Dockerfile specifying a base image, followed by a set of commands that install packages into the new image, set up environment variables, create users, and so on.

For this deployment two Docker images have been created: a base image containing the Anaconda Python distribution, and an application image (mattmcd/pyanalysis) that adds the analysis code.

These images are automatically built by Travis CI whenever a new commit is made to their respective GitHub repositories.

AWS

As ticker data is downloaded from Google Finance it is saved to a local file, and when the application runs locally this data persists.
When the application runs inside a Docker container, however, the container (and its file system) is removed after each run, so the data must be copied to a persistent storage location, in this case S3. The code for this (from mda/finance.py) is shown below:

import os

import boto3

# dataLoc is assumed to be defined elsewhere in mda/finance.py as the local
# download directory (mounted as /home/ubuntu/Work/Data in the container)
def copy_to_s3(download_date=None):
    """Copy downloaded files to S3
    :param download_date: date string naming the sub-directory of downloaded files
    :return: <none> Creates files on S3
    """
    dest_bucket = 'ftse100'
    files = os.listdir(os.path.join(dataLoc, download_date))
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(dest_bucket)
    # Create the destination bucket on the first run
    if dest_bucket not in map(lambda b: b.name, s3.buckets.all()):
        bucket.create(CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'})
    for fname in files:
        bucket.upload_file(os.path.join(dataLoc, download_date, fname),
                           'raw/' + download_date + '/' + fname)
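A hypothetical call, assuming data has been downloaded into a 2015-08-01 sub-directory of dataLoc:

copy_to_s3(download_date='2015-08-01')  # uploads to s3://ftse100/raw/2015-08-01/

Each file in the dated sub-directory is uploaded under the raw/<date>/ key prefix, and the ftse100 bucket is created in eu-west-1 on the first run.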

Policy

When running locally, the application code uses the credentials in the user's $HOME/.aws directory to authenticate with AWS services. While it is possible to bake these credentials into the Docker image (a particularly bad idea for public GitHub repositories), it is better to use the AWS Identity and Access Management (IAM) infrastructure to achieve the same effect. Create a policy called s3_upload that allows upload access to S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:CreateBucket"
            ],
            "Resource": "*"
        }
    ]
}
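The policy can be created through the IAM console; alternatively, a minimal boto3 sketch of the same step (assuming locally configured credentials with IAM permissions) would be:

import json
import boto3

iam = boto3.client('iam')

s3_upload_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:ListAllMyBuckets",
                   "s3:ListBucket", "s3:CreateBucket"],
        "Resource": "*"
    }]
}

response = iam.create_policy(PolicyName='s3_upload',
                             PolicyDocument=json.dumps(s3_upload_policy))
print(response['Policy']['Arn'])  # ARN used when attaching the policy to a role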

Role

Create a Role called DataProcessor and attach the s3_upload policy. This role is applied on launch to the EC2 instance that runs the Docker image.
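As with the policy, this step can be scripted with boto3; a minimal sketch (the account ID placeholder follows the same convention as the Vagrantfile below) is:

import json
import boto3

iam = boto3.client('iam')

# Trust policy allowing EC2 instances to assume the role
ec2_trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(RoleName='DataProcessor',
                AssumeRolePolicyDocument=json.dumps(ec2_trust))
iam.attach_role_policy(RoleName='DataProcessor',
                       PolicyArn='arn:aws:iam::INSERT ACCOUNT ID:policy/s3_upload')

# EC2 applies a role to an instance via an instance profile of the same name
iam.create_instance_profile(InstanceProfileName='DataProcessor')
iam.add_role_to_instance_profile(InstanceProfileName='DataProcessor',
                                 RoleName='DataProcessor')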

Machine

The example Vagrantfile below uses the vagrant-aws plugin to launch an EC2 instance with the DataProcessor role attached, using an Ubuntu 14.04 AMI, a security group that allows SSH access, and the specified private key for authentication.

Vagrant.configure("2") do |config|
  config.vm.box = "aws-box"
  
  config.vm.provider :aws do |aws, override|
    aws.access_key_id = "INSERT KEY ID"
    aws.secret_access_key = "INSERT SECRET KEY"
    aws.keypair_name = "INSERT KEYPAIR NAME"
    aws.elastic_ip = false 
    aws.region = "eu-west-1"
    aws.instance_type = "t2.micro"
    aws.subnet_id = "INSERT SUBNET ID"
    aws.ami = "ami-f95ef58a"
    aws.iam_instance_profile_name = "DataProcessor"
    aws.tags = { 'Name' => 'mda-cron-vm' }
    override.ssh.username = "ubuntu"
    override.ssh.private_key_path = "INSERT PATH TO .pem FILE"
  end

  config.vm.provision "docker",
    images: ["mattmcd/pyanalysis"]
end

Once the machine has started (with vagrant up --provider=aws), the scheduled download job is created by connecting to the machine over SSH, installing Docker, and adding a run_intraday.sh script that retrieves and runs the Docker image, mounting a host data directory into the container:

#!/bin/bash
docker run --rm -v /home/ubuntu/Work/Data:/home/ubuntu/Work/Data mattmcd/pyanalysis

The crontab is then edited to schedule when the job should run; the entry below runs the script at 05:11 every Saturday, which is sufficient given that each run downloads the previous 10 days of data:

# m h  dom mon dow   command
MAILTO=""
11 5 * * SAT /home/ubuntu/run_intraday.sh > /home/ubuntu/cron.log 2>&1

For automated setup, Fabric can be used to install Docker on the EC2 instance, upload the run script and edit the crontab. However, a better option may be to use a dedicated AMI designed for use with the AWS Elastic Container Service (ECS). This will be covered in a future post.
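As an illustration of the Fabric approach, a minimal sketch using the Fabric 1.x API is shown below; the host address, key path and package name are assumptions.

from fabric.api import env, put, run, sudo

env.hosts = ['INSERT EC2 PUBLIC DNS']
env.user = 'ubuntu'
env.key_filename = 'INSERT PATH TO .pem FILE'

def setup_cron_job():
    """Install Docker, upload the run script and register the cron entry."""
    sudo('apt-get update && apt-get install -y docker.io')
    put('run_intraday.sh', '/home/ubuntu/run_intraday.sh', mode=0o755)
    run('(crontab -l 2>/dev/null; '
        'echo "11 5 * * SAT /home/ubuntu/run_intraday.sh '
        '> /home/ubuntu/cron.log 2>&1") | crontab -')

Running fab setup_cron_job from the local machine then performs the manual steps described above.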

Conclusion

This post has covered deployment of Python analysis code to a cloud platform using Docker. The main focus has been on identifying the system components and how they interact. The linked resources should serve as a starting point for anyone intending to do a similar deployment.

About Me

I am Matt McDonnell, currently working as a Data Scientist for a Cambridge (UK) based startup. Previously my main language was MATLAB, which I used in Quantitative Analyst and Quantitative Developer roles at a London based asset management firm, as well as in a Technical Consultant role at MathWorks.