If data is so important, why is it so difficult to access? How I built a data warehouse without any experience

Robin Glen
21 min read · Jul 6, 2022

We are constantly told how important data is and how our companies make data-driven decisions, but in reality this couldn’t be further from the truth.

Innovation is about moving quickly, asking questions and proving hypotheses. Engineering, Product and Business & Marketing (referred to as EPBM from now on) all need access to data and insights to do this, but it’s not always available. If a company’s data is not democratised, answering these questions has to be outsourced to… the datakeepers.

The datakeepers

I posed a hypothetical question to friends and ex-colleagues who also work in EPBM:


Engineering

“If you wanted to know which features on your website had the least conversion and the least engagement, so you could remove them and measure the performance impact, what would you do?”


Product

“If you wanted to know the number of female shoppers from the UK on your website who had five or more items in their basket but who had not logged in in the last month, what would you do?”

Business & Marketing

“If you wanted to know the cost of acquiring new users from TikTok who add an item to their cart but never check out, what would you do?”

All the answers followed a similar pattern:

  • “I would get some of the information from different places then try to piece it together”
  • “I’m not sure if this data is tracked, I would need to ask the tech team”
  • “I would ask someone in the data insights team to produce a report for me”

If teams can’t understand their domain and answer questions themselves, they need to go to someone who can: the datakeepers. This introduces two problems:

  1. The team that needs the insights becomes blocked: they either can’t proceed, or they proceed without knowing if they are doing the right thing.
  2. The data and insights team is pulled away from its own priorities and backlog to service these ad hoc requests.

Both of these outcomes are detrimental to the company and can lead to poor relationships between EPBM and the data and insight teams.

After spending most of my career dependent on the datakeepers, things took a turn when I started a new role. I had previously been an engineer at a company with roughly 1,400 people in the technology department alone, a department with a different domain for every discipline you can imagine. In 2021 I moved to become VP of Engineering at a start-up with seven full-time employees in technology.

Joining a start-up tech team was a world apart from my previous experience. As well as building the consumer-facing product, the tech team also needed to provide business-critical data, and I needed to decide what took priority.

Understanding the problem

I was now on the other side of the datakeepers problem: EPBM needed access to data, and we were suffering from exactly the lack of democratisation I mentioned earlier.

While the product team was experienced when it came to analytics, data was distributed across multiple systems and only some of it was accessible without the tech team. More problematic still, all business-critical reports were generated from the production database.

These reports were CSV files generated by a bespoke backend service that anyone in the business could run at any time. This introduced a few issues:

  1. If a new report was needed, it might take an engineer a week of development time to add it.
  2. If a business user wanted new data added to an existing report, this would again require development time.
  3. The reports generated a lot of load on the production database and could be triggered at any point by a business user, potentially putting the platform’s availability at risk.
  4. As the company accumulated more data, the reports took longer to generate and the service could run out of memory. This required more monitoring, alerting and DevOps support to keep the reports available.

These issues meant that, from a company perspective, we were not using our data effectively and, from a technology perspective, we didn’t necessarily have the people to support both customer and business feature requests.

I now knew what success looked like:

  • Anyone in the company could access all of our data and get insights without help from the technology team.
  • Decommission complex and expensive infrastructure.
  • The technology team would no longer be context switching from their roadmap to answer data and insight related questions.
  • The technology team doesn’t inherit a new system to manage and monitor.

Breaking apart the problem

While the report generation was specific to the company, the main issue I could see was the lack of democratisation. I wanted to understand how I could solve that problem, hoping it would unlock the EPBM teams.

I had zero experience with big data before I started this project; I had only experienced the problem as someone who wanted data insights. So I approached it from that perspective, rather than as an engineer trying to build a data warehouse. I knew what I wanted at the end; I had to break it down into its component parts so I could build it.

A high level diagram to understand what a data warehouse might look like

Data sources

Companies have data stored all over the place:

  • In-house databases
  • Analytics tools
  • CRM systems
  • Marketing tools
  • … The list goes on and on

All of these tools and data sources need to be understood as you will need to get access to them.

Data collection

Once you know all the sources you have to get data from, you then actually need to get it. This might require API access or there might be direct integrations, but either way you will need to fetch the data and push it to your data storage while ensuring you are not duplicating or missing any records.

Data storage

You’re going to need a database designed for big data, a structure and schema to hold all of your company’s information in one place, and a way to manage it all.

Data processing

You might have already normalised some of your data before storing it, but you also need to ensure it’s in a format other tools can use, and to find ways to link records across data sources that were not previously connected.

Data insights

This is the interface for your data and how your company will access it, whether via visualisation tools, APIs or reports.

Designing a solution

There was no business buy-in, budget or people available to work on this project; if I was going to solve the datakeepers problem I would need to do it for next to nothing, in my spare time. I knew that if it was possible it would be a success, but first I needed to build a feature-rich proof of concept.

The company was already part of the AWS start-up program and had used those credits, but GCP offers a similar program. I didn’t have the knowledge or the time to build a data warehouse with open-source software, but maybe I could use these credits, GCP tools and partners to show the business it was worth investing in.

Data sources

From my investigation I found 14 different data sources, and this was at a start-up; larger companies are likely to have data in even more places. To give you some ideas and maybe help your search, these were the sources I looked to integrate:

  • Facebook Ads
  • Snapchat Ads
  • Stripe
  • Two internal Postgres AWS RDS databases
  • Amplitude (analytics)
  • Shopify
  • Apple Search Ads
  • AWS S3
  • Google Play
  • Apple App Store
  • TikTok Ads
  • Google Analytics

Data collection

The data store didn’t need to be updated in real time; it could be updated once or twice a day and still provide great insight for the company. My initial idea was to write a simple Node.js application running inside GCP that would call each data source and push the results to our new data store. This initially sounded straightforward, but some complexities arose:

  • The application would need a separate API connector for each data source, each of which would have to be managed and monitored.
  • It would need to ensure there were no duplicate or missed records.

This quickly put an end to the idea of writing a data collection application. I decided to look for a SaaS product that could offer this functionality while also ticking the “costing nothing” box.

Effortlessly centralize all the data you need so your team can deliver better insights, faster. Start for free.
- FiveTran

FiveTran is a SaaS product offering all of the connectors we needed: pushing to multiple data stores, configurable sync times, delta updates and a whole load of other options. Best of all, it is an official partner, so I could include it in our start-up credits.

Data storage

Now that I could collect the data, I needed somewhere to store it. GCP offers lots of different proprietary databases, and I took inspiration from the CrUX report.

The Chrome User Experience Report provides user experience metrics for how real-world Chrome users experience popular destinations on the web.

The CrUX report uses BigQuery to store huge amounts of performance and UX metrics from Chrome users. It lets developers write complex SQL queries while staying very performant, and you can plug multiple visualisation and reporting tools on top of BigQuery.

This made it an obvious choice for our data storage.

Data processing & data insights

The final piece of the puzzle was how we would process the data and empower the company to access it themselves without knowing SQL.

While I had no real experience with big data or insights, I had read a lot about Looker and it seemed really exciting. I don’t believe in silver bullets, but this looked as close as it gets: it could handle all the data processing and insights, I could set it up alone, and again it was covered by the GCP start-up credits.

Data warehouse architecture

I now had a rough idea of all the components and how I would go about putting them all together.

The final design for the data warehouse

Building the solution

As soon as all the contracts were signed with GCP, I could start with confidence that all of the costs would be covered by our credits.


FiveTran

The first thing I needed to set up was the BigQuery database, but to make things even easier, FiveTran lets you do this from its onboarding flow.

Select an existing destination or create a BigQuery instance
Configure your database

This will, however, create the database under your FiveTran account, but you can easily transfer the billing to your GCP account. Once BigQuery was set up, I could start adding my connectors.

FiveTran has loads of connectors to all different data sources
Setting up the connector

Setting up connectors is relatively similar across sources: you perform some authentication, name the schema for the destination database and set some configuration. The instructions (on the right) are the best I’ve ever seen for this type of onboarding: all up to date and easy to follow. For 90% of the connectors you don’t need any technical knowledge.

Once a connector is set up, you can check its status, logs, schema and usage, and configure things like sync frequency.

Facebook Ads connector overview

I then followed this process for all of our data sources. The whole process only took a couple of hours, and then all of the company’s data was being pushed to BigQuery.

Setting up all the data source connectors


BigQuery

If you created the BigQuery instance using the FiveTran setup, there is not much for you to do here. You will find your connectors have already created tables and schemas and started to populate your database.

Facebook Ads schema inside BigQuery

From here you can check the schemas, preview data and run queries using SQL. You can also export queries directly to Google Data Studio to generate data visualisations or reports.

For a basic setup you will rarely need to look at BigQuery again. As your Looker requirements become more advanced you might need to use Persistent Derived Tables; if this happens, you will need to create a temporary dataset within BigQuery for Looker to write them to.
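To make the idea concrete, here is a sketch of a hypothetical Persistent Derived Table that pre-aggregates daily revenue. The table name matches the orders example used later in this article, and the datagroup name is my own assumption, not code from the real project:

view: daily_revenue {
  derived_table: {
    sql:
      SELECT
        DATE(CAST(created_at AS TIMESTAMP)) AS order_date,
        SUM(total_price) AS revenue
      FROM `fivetran-my_new_warehouse.my-database-table.orders`
      GROUP BY 1 ;;
    # Rebuild when the datagroup triggers; assumes a datagroup
    # named daily_refresh is defined in the model file.
    datagroup_trigger: daily_refresh
  }

  dimension: order_date {
    type: date
    sql: ${TABLE}.order_date ;;
  }

  dimension: revenue {
    type: number
    sql: ${TABLE}.revenue ;;
  }
}

Queries against a view like this scan the small pre-aggregated table rather than the full orders table, which keeps query costs down.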


Looker

Looker is used as both the data processing layer and the interface for the company to access the data. The first thing you have to do is set up the database connection to BigQuery.

Setting up your BigQuery database connection

You are now ready to start a new Looker project.

Setting up a new Looker project

LookML is Looker’s modelling language and is how you will process, map and link your data. This will enable the company to query the data without any SQL knowledge.

Let’s take an extremely basic example. We have three tables in BigQuery:

Very simple tables for an e-commerce platform

Within the Develop section of our new project, we are going to make a simple view for the orders table.

Creating a view

First we will create the dimensions, which are just the attributes within the table.

view: orders {
  sql_table_name: `fivetran-my_new_warehouse.my-database-table.orders` ;;

  dimension: id {
    primary_key: yes
    type: number
    sql: ${TABLE}.id ;;
  }

  dimension: user_id {
    type: number
    sql: ${TABLE}.user_id ;;
  }

  dimension: total_price {
    type: number
    sql: ${TABLE}.total_price ;;
  }

  dimension_group: created_at {
    type: time
    timeframes: [raw, time, date, week, month, quarter, year]
    sql: CAST(${TABLE}.created_at AS TIMESTAMP) ;;
  }
}

References in Looker are done by name: the sql_table_name parameter points to our BigQuery project, dataset and table, while the view itself takes its name from the file we created. The snippet above also shows a few simple dimensions, where you give each one a name, select a type and point it at a field in the table.

Looker can also make this easier for us with “Create view from Table”.

Create a view from a table in BigQuery

This will automatically scaffold a view from the database schema, like the code snippet above, but it will also add something called a measure.

measure: count {
  type: count
  label: "Count"
}

A measure is an aggregation of a dimension. The example above is a count, which could be used to show how many orders are in the orders table. You can also write your own measures:

measure: total_revenue {
  type: sum
  sql: ${total_price} ;;
  value_format_name: gbp
  label: "Total order revenue"
}

measure: average_order_revenue {
  type: average
  sql: ${total_price} ;;
  value_format_name: gbp
  label: "Average order value"
}

These allow our business users to find the total revenue or average order value in GBP from the total_price dimension. This seems pretty basic, but once we start linking other data and adding filters it becomes really powerful. For example, using the created_at dimension group we can answer the question:

  • “Show me the total revenue and average order value per day for the last month”

Next we create views for the users and countries tables using Looker’s “Create view from table”, and then we can move on to models. You create a model in the same way you create a view, but this time there are no shortcuts.

connection: "looker-bigquery"
include: "/views/*.view"

explore: users {
  label: "Users"
}

explore: orders {
  label: "Orders"
}

explore: countries {
  label: "Countries"
}

First we reference our database connection, then we import all the views within the views directory. The model then exposes all of these views to Looker’s Explore interface, which is great, but what if the business wants to know:

  • “Which user has the highest average order value?”
  • “Which registered user country has generated the most revenue?”

We have three tables but none of them are related to each other, so let’s extend the orders explore:

explore: orders {
  label: "Orders"

  join: users {
    # Many orders map to one user.
    relationship: many_to_one
    sql_on: ${users.id} = ${orders.user_id} ;;
  }

  join: countries {
    # Many users share one country.
    relationship: many_to_one
    sql_on: ${countries.id} = ${users.country_code} ;;
  }
}

The orders explore now has joins to both the users and countries views. Something you will need to ensure is that the dimensions used in each join are of compatible types. I know in this instance they are all numbers, so we have now successfully added relationships between three different tables.

Relationships between tables for a simple e-commerce platform

Once published to production, our users will be able to filter, sort and run different queries across these three tables without the need for any code, just the Looker Explore interface.

This is of course a very simple example to explain some of the LookML concepts. I recommend taking this free training course so you can understand what is possible: https://www.cloudskillsboost.google/course_templates/327

I asked the business what they wanted to understand, which included:

  • Number of orders
  • Orders with vouchers
  • Orders for returning customers vs new
  • Average minutes of livestreams watched
  • Livestream engagement and sales
  • Popular products
  • …many, many more

I then created dimensions, measures and models to support all of these in the Looker Explore interface.
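To give a flavour of what this looks like, a question like “orders with vouchers” can be handled by a single filtered measure. The sketch below is hypothetical: it assumes the orders table has a voucher_code column, which was not part of the earlier example schema:

measure: voucher_order_count {
  type: count
  # "-NULL" keeps only orders where voucher_code is populated.
  # voucher_code is an assumed column, not from the example schema.
  filters: [voucher_code: "-NULL"]
  label: "Orders with vouchers"
}

Once deployed, business users see “Orders with vouchers” as just another field in the Explore interface.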

Looker analysing and visualising data
The complexity in Looker is setting up the LookML; once this was in place I was able to give access to our business users. They can now use this data to generate insights using both the Explore interface and dashboards.

Going back to our example e-commerce database, suppose someone wants to know the number of orders per day over a certain date range. They select the dimension Created at and the measure Count, then filter Created at over the required range.

Orders by day report

We can now see a breakdown of the orders per day in the selected timeframe. Just by clicking a few UI elements we were able to generate this report, while in the background Looker created the following SQL query:

-- raw sql results do not include filled-in values for 'orders.created_at'
SELECT
  DATE(CAST(orders.created_at AS TIMESTAMP)) AS orders_created_at,
  COUNT(*) AS orders_count
FROM `fivetran-my_new_warehouse.my-database-table.orders` AS orders
WHERE CAST(orders.created_at AS TIMESTAMP) >= TIMESTAMP('2022-06-07 00:00:00')
  AND CAST(orders.created_at AS TIMESTAMP) < TIMESTAMP('2022-06-14 00:00:00')
GROUP BY 1
ORDER BY 1

Your users might then want to see this visualised; this is as simple as opening the visualisation accordion.

Orders by day visualisation

Next you might want to save this for a later date or add it to a dashboard.

Saving explores

You have three options: saving to an existing dashboard, saving to a new dashboard, or saving as a Look. A Look stores the full report, while a dashboard holds purely the visualisation.

Orders by day on a dashboard

You can see the dashboard has created a date dropdown for us; users can then easily change this filter to different presets or custom values.

Date filters for the dashboard

Finally, you might want the company to get updates of the last week’s orders every Monday morning. You can do this via email, Slack or other integrations directly from the dashboard.

Schedule dashboard reports

This again is just scratching the surface of what’s possible with Looker’s Explore and visualisation layer. I recommend taking this free training course so you can understand what is possible: https://www.cloudskillsboost.google/course_templates/323

Reviewing my solution

We now have a complete data warehouse that contains all of the company’s data; adding new data sources is as simple as a few clicks and writing the required LookML.


FiveTran

Collecting data from multiple data sources and pushing it to your data store.

The good
I love FiveTran: something I worried would take up most of the time for this project took me an afternoon. It handles your connectors, schemas, data deltas, duplicates, schema changes, error handling, sync frequencies and everything else you can imagine for collecting your data.

The bad
It is expensive… or is it? When you look at the cost per month it might make a few eyes water, but is it cheaper than developing it yourself? Against a developer’s salary for writing and managing it, it may well be the cheaper option. Are there cheaper alternatives? Probably. I didn’t do a deep dive into alternatives and their pricing, as the goal was to build a proof of concept and demonstrate the value. If your priority is time to market and ease of use, FiveTran is worth it; otherwise it might be worth investigating other solutions.

If cost or vendor lock-in is a concern for you, do consider it in your appraisal. While not impossible, switching it out at a later date would be an unpleasant task, as your database schema and LookML will be somewhat coupled to FiveTran.

The cost
You can find a breakdown on their website to give you a rough idea of real-world pricing. This project consumed roughly $2,000 of GCP credits per month. This was not at all optimised, however; lots of that data is not being used and could be removed, bringing the price down.

One of the reasons for selecting FiveTran was its ease of use and its inclusion in the GCP credits, so for this project it selected itself. I’m not going to include a breakdown of all the other players in the market; however, if you are researching cheaper alternatives there is an open-source offering, Singer. It is great for the connectors, but it would add some infrastructure and tech management: https://www.singer.io/


BigQuery

Storage for the data warehouse.

The good
BigQuery is well known and one of the industry leaders for this type of big data. It’s very easy to set up and get going, the documentation is good, and the fact that FiveTran can do most of the hard work for you makes it even better.

The bad
I don’t really have many complaints other than that the GCP web UI is a bit clunky. I would also like to better understand the cost breakdown, as billing can be confusing when trying to work out the cost of certain queries.

The cost
Storage is cheap: the BigQuery database holds many terabytes of data and storage never cost more than $100 a month. What gets expensive are the queries. Some of the analytics tables are massive, and while I was learning Looker I accidentally spent $500 in a day running unoptimised queries.

You can lower these costs by learning about Persistent Derived Tables and implementing Caching Policies.
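A caching policy in LookML is expressed as a datagroup. The sketch below is illustrative rather than taken from the project, and the names are my own. It tells Looker to serve cached results for up to 24 hours, so repeated dashboard loads don’t re-run the same BigQuery queries:

datagroup: daily_refresh {
  # The cache is invalidated whenever this query returns
  # a new value, i.e. once a day.
  sql_trigger: SELECT CURRENT_DATE() ;;
  max_cache_age: "24 hours"
}

explore: orders {
  persist_with: daily_refresh
}

The same datagroup can also drive Persistent Derived Table rebuilds via datagroup_trigger.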

There are lots of different databases you can use for a data warehouse. I won’t go into them all, but if you’re going to use Looker it needs to be a SQL database that Looker supports.


Looker

Data processing and visualisation.

The good
Again, I love Looker. Once I got it going I found it incredibly addictive; having the power to pivot over all the company’s data in one place was incredible. Answering complex, never-before-answered questions took me seconds. I automated people’s daily tasks in a couple of hours, freeing them up to focus on bigger things. The Explore interface and dashboards are extremely intuitive once you learn your way around.

The bad
While LookML is not that complex, the documentation is horrible. It’s near impossible to find examples of anything beyond the most basic implementation. The code snippets are OK, but sometimes you want to know the structure of the data they are working with to make them easier to visualise, and this is never documented.

You might get lucky with some forums, but to be honest they are not great and not exactly buzzing with engagement. Something else that might catch you out: forum posts often include a SQL statement in their LookML without saying which database it is for. If you’re not using the same database, this can give you a syntax error because the function is not available; this happened to me a lot with BigQuery.

Luckily our credits included access to professional services from Google, so when I got blocked I could ask them directly, although I did sometimes stay blocked for a week while waiting for our next session. One complex measure I wanted to create needed to sum values from an array of JSON objects inside a table field, and without their help I honestly don’t know how I would have solved it with just the existing Looker documentation.

There is also no real way to learn Looker without having access to Looker. I feel extremely lucky to have undertaken this project because, given the cost and the lack of a free tier, I’m not sure I would ever have got the chance to use it otherwise. It also means there is some technical debt, as I was teaching myself while building the proof of concept.

I would love it if Looker offered personal accounts for learning and for building simple projects of our own on top of it. I think this would really increase the size of the community and improve the documentation.

The cost
Again, Looker is expensive… or is it? Looker costs roughly $5,000 a month, and you need to pay for users on top of that. This seems like a high price, but is it, once you can start getting value out of your data and kicking off this democratisation process? Personally, again, I think it’s worth it.

The professional services were an additional $10,000, which was really worth the money and thankfully also included in the credits.

There are many other business intelligence tools, Tableau for example. I didn’t look too far in this direction, as my architecture was about using tools I didn’t need to manage that were included in the GCP credits.

You could, however, implement Google Data Studio for free if you didn’t want to go down the Looker route, but this would require your users to understand SQL, so it doesn’t solve the democratisation problem.


As I mentioned, I had zero experience with big data when I started the project, and I only have slightly more than that now. I worked on this in my lunch hour and went from nothing to my first Looker dashboard in less than two weeks, and from nothing to the business using it in less than two months. What took most of that time was that I only had access to the professional services for an hour, once a week, for five weeks.

What I did was pretty straightforward, and I don’t pretend to know or really understand this discipline; I would say I don’t even know enough to know what I don’t know. However, my goal was to build a data warehouse and I achieved that, so let’s look back at the success criteria for the project:

Anyone in the company could access all of our data and get insights without help from the technology team.

Looker was handed over to the business and, with some simple training and access to the Google training courses, the team was able to start building their own reports and automating their workflows. The data democratisation process has started, with EPBM able to answer more complex questions and get powerful insights.

Decommission complex and expensive infrastructure.

Services that were generating weekly updates and reports can now be decommissioned and replaced by Looker.

The technology team would no longer be context switching from their roadmap to answer data and insight related questions.

Some of the data has yet to be modelled in LookML. Once this is completed there is no real need for the technology team to be involved; alternatively, someone with Looker experience could be brought in to own this.

The technology team doesn’t inherit a new system to manage and monitor.

All of the tools and services are SaaS products, so the technology team is not needed; however, people will still be needed to manage these services from an IT perspective (users, permissions, etc.). I also recommend someone keep an eye on billing for complex queries and performance optimisations, as well as on FiveTran syncing, in case there are issues.

Against these success criteria the project ticked all the boxes, but what about the cost of the investment? Overall the project is going to use up all of the $100,000 in GCP credits over the course of a year, with a rough breakdown of:

  • FiveTran: $24,000
  • BigQuery: $4,000 (including a buffer for poorly optimised queries)
  • Looker: $62,000
  • Professional services: $10,000

This might look expensive at first, and as the company grows so will the cost, but I know of data and insight teams spending more than this per week. Also consider that this was done by one person, part-time, in less than two months. With someone full-time who knew what they were doing, I think it could realistically have been completed in a week.

So far the investment for the company is essentially zero, and going forward the infrastructure costs the equivalent of one developer’s salary. To get the most out of it, I would also look to hire someone, even part-time, to rewrite my models, model new data and evolve the platform.

Even with this added cost, I think it is more than worth it. This architecture allows data to be democratised while being grown, run and managed by a small team of Looker experts and data scientists who work directly with EPBM. I feel this will encourage innovation, help the company make data-driven decisions and, most importantly, put an end to the concept of the datakeepers.

I have since left the VP of Engineering position. If you are interested in this, or in working with me on other technical solutions, you can reach out to me at gullwing.io.