Google Analytics to Redshift

This page provides you with instructions on how to extract data from Google Analytics and load it into Amazon Redshift. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)

Pulling Data Out of Google Analytics

Unfortunately, this process is made exceptionally complicated by the fact that Google’s APIs do not allow you to extract event-level data for popular metrics such as Page Views and Visitors. Unless you are purchasing the pricey professional edition of Google Analytics, most data that is retrieved from GA is “rolled up” into some aggregated format.

What this means is that calls to the Google Analytics Core Reporting API (your means of getting data off of Google’s servers and into your hands to place into redshift) will yield aggregated data rather than raw data.

Sample Google Analytics Response Format

The GA API returns JSON-formatted data. Below is an example of the kind of response you might see when querying this API.


{
  "kind": "analytics#gaData",
  "id": string,
  "selfLink": string,
  "containsSampledData": boolean,
  "query": {
    "start-date": string,
    "end-date": string,
    "ids": string,
    "dimensions": [
      string
    ],
    "metrics": [
      string
    ],
    "samplingLevel": string,
    "sort": [
      string
    ],
    "filters": string,
    "segment": string,
    "start-index": integer,
    "max-results": integer
  },
  "itemsPerPage": integer,
  "totalResults": integer,
  "previousLink": string,
  "nextLink": string,
  "profileInfo": {
    "profileId": string,
    "accountId": string,
    "webPropertyId": string,
    "internalWebPropertyId": string,
    "profileName": string,
    "tableId": string
  },
  "columnHeaders": [
    {
      "name": string,
      "columnType": string,
      "dataType": string
    }
  ],
  "rows": [
    [
      string
    ]
  ],
  "sampleSize": string,
  "sampleSpace": string,
  "totalsForAllResults": [
    {
      metricName: string,
      ...
    }
  ]
}

Preparing Google Analytics Data for Redshift

Once you’ve decided what aggregated data you’d like to include in your Redshift database, you then need to design tables that can hold that data so you can insert it. This means that, for each value in the response, you need to identify a predefined datatype (i.e. INTEGER, DATETIME, etc.) and build a table that can receive them.

Inserting Google Analytics Data into Redshift

Once you have identified all of the columns you will want to insert, you can use the CREATE TABLE statement in Redshift to create a table that can receive all of this data.

With a table built, it may seem like the easiest way to add your data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.

Keeping Data Up-To-Date

So, now what? You’ve built a script that pulls data from GA and loads it into Redshift, but every day the data just keeps on coming!

The key is to build your script in such a way that it can run on a regular basis to capture new data points and also cross-check recent values for changes due the the rolled-up data “settling” over time (which is known to happen in GA in certain circumstances). At a minimum you should put your script on some kind of cron job or scheduler to keep the data flowing.

Other Data Warehouse Options

Redshift is totally awesome, but sometimes you need to start smaller or optimize for different things. In this case, many people choose to get started with Postgres, which is an open source RDBMS that uses nearly identical SQL syntax to Redshift. If you’re interested in seeing the relevant steps for loading this data into Postgres, check out Google Analytics to Postgres

Easier and Faster Alternatives

If all this sounds a bit overwhelming, don’t be alarmed. If you have all the skills necessary to go through this process, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.

Thankfully, products like Stitch were built to solve this problem automatically. With just a few clicks, Stitch starts extracting data from your Google account, structuring it in a way that is optimized for analysis, and inserting that data into your Amazon Redshift data warehouse.