This post was inspired by a great article on Bostata.com called “Client-side instrumentation for under $1 per month. No servers necessary.”
Sounds pretty good, doesn’t it?
In brief, the author, Jake, describes how you can build a data warehouse that collects events from your website, using the Snowplow event tracker on the Amazon Web Services stack:
- CloudFront to record events
- S3 to store raw and processed data
- Lambda to enrich raw data
- Athena to process it and provide access to the data via SQL
The end result is data stored in S3, which is essentially just text storage, and processed on the fly with AWS Athena. So you pretty much own your business data, which is very much in line with the philosophy of this blog.
Plus it should be cheap: there are no always-running servers, and S3, Lambda and Athena are inexpensive, especially for the comparatively small volumes of data that small and medium businesses generate.
Hence I couldn’t miss the opportunity to give this approach a try, with help from my colleague at Magenable.
I will not repeat Jake (you should read the original post), just stress several things that were not so obvious.
Snowplow is a platform for collecting event data. Think of it as your own Google Analytics, or more precisely its tracking part.
The tracker can be either self-hosted or hosted by Snowplow for the public. We used the public one for the experiment, but setting up your own isn’t that hard: you can host it on S3, so there is no need for a real server.
Data enrichment (Lambda)
Without enrichment you get very basic data from your tracker, and you need to clean it to make further processing easier. Hence we need to do data enrichment. In the original Snowplow architecture, enrichment is done by one of 3 applications. They are open source and can be hosted on AWS EC2. However, we tried to do it cheaper and use a serverless Lambda function. Unfortunately the details of this process were omitted in Jake’s post, so that was the hardest part of our experiment.
Lambda supports a number of programming languages, including Python, which we decided to use for our simple enrichment script. It took some time, but eventually the Lambda function was ready, deployed and working. It takes the raw data from one S3 folder, processes it and puts it into another S3 folder.
You can find the function on GitHub – https://github.com/ownyourbusinessdata/snowplow-s3-enrich
It contains a number of files, mainly because we need to work with the MaxMind Lite database of geoIP data.
Another source of enrichment is the CloudFront logs.
At the moment the function covers around 60 attributes, but we have just started.
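To give a rough idea of what such an enrichment step can look like, here is a minimal sketch in Python. It is not the function from the repository above, just an illustration under a few assumptions: the raw files are gzipped CloudFront access logs with a "#Fields:" header, the tracker sends its data in the query string, and the Lambda trigger only fires for the raw folder. The output column names, the handler name and the "enriched/" folder are made up for the example, and only a handful of tracker parameters are mapped.

```python
import gzip
import json
from urllib.parse import parse_qs, unquote, unquote_plus

# A few Snowplow tracker parameters mapped to readable names.
# The output names on the right are illustrative, not a fixed schema.
PARAM_NAMES = {
    "e": "event",
    "url": "page_url",
    "page": "page_title",
    "refr": "page_referrer",
    "aid": "app_id",
    "duid": "domain_userid",
}


def enrich_log(text):
    """Turn the text of one CloudFront access log into a list of event dicts."""
    fields, events = [], []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names declared in the log header
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split("\t")))
        query = row.get("cs-uri-query", "-")
        if query == "-":
            continue  # request without a query string, so not a tracker hit
        params = parse_qs(query)
        event = {name: params[key][0]
                 for key, name in PARAM_NAMES.items() if key in params}
        # Attributes taken from the CloudFront log line itself.
        event["collector_tstamp"] = f"{row.get('date', '')} {row.get('time', '')}"
        event["user_ipaddress"] = row.get("c-ip", "")
        event["useragent"] = unquote(row.get("cs(User-Agent)", ""))
        events.append(event)
    return events


def handler(event, context):
    """Hypothetical Lambda entry point for an S3 'object created' notification."""
    import boto3  # imported here so enrich_log can be tried without AWS

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        events = enrich_log(gzip.decompress(raw).decode("utf-8"))
        body = "\n".join(json.dumps(e) for e in events)
        # Assumes the trigger is limited to the raw folder, otherwise
        # writing here would invoke the function again.
        out_key = "enriched/" + key.split("/")[-1].replace(".gz", ".json")
        s3.put_object(Bucket=bucket, Key=out_key, Body=body.encode("utf-8"))
```

Writing one JSON object per line keeps the output easy to read from Athena. GeoIP lookups and the rest of the attributes would be added to each event inside enrich_log in the same way.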
Plugging in analytic tools
Once the data is collected and Athena is configured, you can access it via SQL and plug in almost any decent analytical tool. We’ve tried AWS’s own QuickSight, Mode Analytics and R, but we will probably cover how it was done in another blog post.
For now, here are a couple of charts quickly produced from the collected data.
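If you just want to pull numbers out with a script rather than a BI tool, a query can also be sent to Athena from Python. The sketch below is only an illustration: the table name, the column names (event, collector_tstamp), the database and the S3 output location are all assumptions that depend on how your own Athena table is defined, and it expects collector_tstamp to be a timestamp column.

```python
import time


def build_daily_pageviews_query(table, days=7):
    """Return SQL that counts page views per day over the last `days` days."""
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name")
    return (
        "SELECT date_trunc('day', collector_tstamp) AS day, "
        "count(*) AS page_views "
        f"FROM {table} "
        "WHERE event = 'pv' "  # 'pv' is the tracker's code for a page view
        f"AND collector_tstamp > current_timestamp - interval '{int(days)}' day "
        "GROUP BY 1 ORDER BY 1"
    )


def run_query(sql, database, output_location):
    """Run a query in Athena and return the raw result rows."""
    import boto3  # imported here so the query builder works without AWS

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Athena runs queries asynchronously, so poll until this one finishes.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

Something like run_query(build_daily_pageviews_query("events"), "snowplow", "s3://my-bucket/athena-results/") would then return the rows, where the table, database and bucket names are placeholders for your own.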