Snowplow Dimensions
The Snowplow pipeline is built to enable a very clean separation of the following steps in the data processing flow:

- Data collection
- Data enrichment
- Data modeling (aggregation)
- Data analysis

We will cover each of those stages in turn.
At data collection time, we aim to capture all the data required to accurately represent a particular event that has just occurred.
At this stage, the data that is collected should describe the events as they have happened, including as much rich information as possible about:
- The event itself
- The individual/entity that performed the action - that individual or entity is a “context”
- Any “objects” of the action - those objects are also “context”
- The wider context that the event has occurred in
For each of the above we want to collect as much data describing the event and associated contexts as possible.
Often there are opportunities to learn more about an event that has occurred, if we combine the data captured at collection time with third party data sources. To give two simple examples:
- If we capture the IP address of a user who has carried out a particular action, then at analysis time we can infer that user’s geographic location by looking up the IP address in a GeoIP database
- If we know where the user who carried out the action was located geographically, and when the event occurred, we can infer the weather where the user was, provided we have a database of weather conditions over time by geography
Both the above are examples of ‘enrichments’. Enrichments are sometimes referred to as ‘dimension widening’: we’re using 3rd party sources of data to enrich the data we originally collected about the event so that we have more context available for understanding that event, enabling us to perform richer analysis.
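As a sketch, a GeoIP enrichment amounts to a lookup against a database keyed by IP range. Here a toy in-memory table stands in for a real GeoIP database; the table contents and function name are illustrative, not Snowplow’s actual implementation:

```python
import ipaddress

# Toy stand-in for a GeoIP database: (network, location) pairs.
# A real pipeline would query a proper GeoIP database instead.
GEO_TABLE = [
    (ipaddress.ip_network("81.2.69.0/24"), {"country": "GB", "city": "London"}),
    (ipaddress.ip_network("202.12.27.0/24"), {"country": "JP", "city": "Tokyo"}),
]

def geo_enrich(event):
    """Widen an event with geographic fields inferred from its IP address."""
    ip = ipaddress.ip_address(event["user_ipaddress"])
    for network, location in GEO_TABLE:
        if ip in network:
            return {**event, **{"geo_" + k: v for k, v in location.items()}}
    return event  # no match: leave the event unchanged

enriched = geo_enrich({"event": "page_view", "user_ipaddress": "81.2.69.160"})
```

Note that the enrichment only ever adds fields; the originally collected data passes through untouched, which is what makes the collection and enrichment stages cleanly separable.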
Snowplow supports the following enrichments out-of-the-box. We’re working on making our enrichment framework pluggable, so that users and partners can extend the list of enrichments performed as part of the data processing pipeline:
- IP -> Geographic location
- Referrer query string -> source of traffic
- User agent string -> classifying devices, operating systems and browsers
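The referrer enrichment, for example, can be sketched as parsing the referrer URL and classifying it by host, pulling search terms out of the query string where applicable. The host lists and parameter names below are illustrative, not the actual lookup tables Snowplow ships with:

```python
from urllib.parse import urlsplit, parse_qs

# Illustrative mappings of referrer hosts to traffic-source labels.
SEARCH_ENGINES = {"www.google.com": "q", "www.bing.com": "q"}
SOCIAL_SITES = {"www.facebook.com", "twitter.com"}

def classify_referrer(referrer_url):
    """Classify a referrer URL as search, social or other, extracting search terms."""
    parts = urlsplit(referrer_url)
    if parts.hostname in SEARCH_ENGINES:
        param = SEARCH_ENGINES[parts.hostname]
        terms = parse_qs(parts.query).get(param, [""])[0]
        return {"medium": "search", "source": parts.hostname, "terms": terms}
    if parts.hostname in SOCIAL_SITES:
        return {"medium": "social", "source": parts.hostname, "terms": None}
    return {"medium": "other", "source": parts.hostname, "terms": None}

result = classify_referrer("https://www.google.com/search?q=snowplow+analytics")
```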
The data collection and enrichment process outlined above generates a data set that is an ‘event stream’: a long list of packets of data, where each packet represents a single event.
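Concretely, each packet in the stream can be pictured as a flat record: the event itself, the contexts it occurred in, and any fields added by enrichment. The field names below are illustrative rather than Snowplow’s exact schema:

```python
# One illustrative packet in the event stream: a single event plus its contexts.
event = {
    "event": "play_video",                        # the event itself
    "user_id": "u123",                            # the individual who performed the action
    "video_id": "v456",                           # the object of the action
    "collector_tstamp": "2013-03-01T12:00:00Z",   # the wider context: when it occurred
    "geo_country": "GB",                          # added later by enrichment
}
```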
Whilst it is possible to do analysis directly on this event stream, it is very common to:
- Join the event-stream data set with other data sets (e.g. customer data, product data, media data, marketing data, financial data)
- Aggregate the event-level data into smaller data sets that are easier and faster to run analyses against
- Apply ‘business logic’, i.e. definitions, to the data as part of that aggregation step. For example, we might have a particular approach to identifying users (and the events that belong to them) across channels, group each user’s series of events into sessions, or group the stream of actions a particular user performs with a specific object (e.g. a video) into a single line of data
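As an example of such business logic, grouping a user’s events into sessions is often done by splitting the stream wherever the gap between consecutive events exceeds an inactivity threshold. A minimal sketch, assuming a 30-minute timeout and timestamps sorted in ascending order (the threshold and function name are illustrative choices, not a Snowplow default):

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that ends a session (assumed)

def sessionize(timestamps, timeout=SESSION_TIMEOUT):
    """Assign a session index to each event timestamp (assumed sorted ascending)."""
    sessions = []
    current = 0
    for i, ts in enumerate(timestamps):
        if i > 0 and ts - timestamps[i - 1] > timeout:
            current += 1  # inactivity gap exceeded: start a new session
        sessions.append(current)
    return sessions

# Three events, with a 2-hour gap before the last one -> two sessions
labels = sessionize([0, 600, 600 + 7200])
```

Different businesses pick different thresholds, or define sessions by entirely different rules, which is one reason the modeled tables vary so much between companies.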
Examples of aggregated tables include:
- User-level tables. These are generally much smaller than the event-level tables because they only have one line of data for each user tracked by Snowplow. User classification is often carried out as part of the generation of this table: for example, which cohort a user belongs to. In addition, any user-level data from other systems (external to Snowplow), e.g. CRM systems, is typically pulled into these user-level tables.
- Session-level tables. These are also much smaller than the event-level tables, but typically larger than the user-level table, because users often have more than one session. The session-level table is typically where sessions are attributed to specific marketing channels, users are classified based on how far through different funnels they have progressed, and classification by device, operating system and browser takes place.
- Product or media-level tables. It is common for retailers to aggregate over their event-level data to produce tables aggregated by SKUs or products. Similarly, it is common for media companies to aggregate data over articles, videos or audio streams. These tables can be used to conveniently compare the performance of different SKUs / media items / content producers / writers / brands and categories against one another. In Snowplow we typically refer to these types of analytics as merchandise or catalog analytics.
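A product-level table of this kind can be sketched as a group-by over the event stream, producing one output row per SKU. The event and field names here are illustrative:

```python
from collections import defaultdict

def product_table(events):
    """Aggregate event-level data into one row per SKU: view and purchase counts."""
    rows = defaultdict(lambda: {"views": 0, "purchases": 0})
    for e in events:
        if e["event"] == "view_product":
            rows[e["sku"]]["views"] += 1
        elif e["event"] == "purchase":
            rows[e["sku"]]["purchases"] += 1
    return dict(rows)

table = product_table([
    {"event": "view_product", "sku": "sku-1"},
    {"event": "view_product", "sku": "sku-1"},
    {"event": "purchase", "sku": "sku-1"},
    {"event": "view_product", "sku": "sku-2"},
])
```

The resulting table is tiny compared with the event stream, which is exactly why analysts query it instead: comparing SKUs against one another becomes a scan over a handful of rows rather than millions of events.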
The above are all illustrative examples of aggregate tables. In practice, what tables are produced, and the different fields available in each, varies widely between companies in different sectors, and surprisingly even varies within the same vertical. That is because part of putting together these aggregate tables involves implementing business-specific logic, including:
- Joining Snowplow data with 3rd party data sets
We call this process of aggregating ‘data modeling’. At the end of the data modeling process, a clean set of tables is available that makes it easier for analysts to perform analysis on the data - easier because:
- The volume of data to be queried is smaller (because the data is aggregated), making queries return faster
- The basic tasks of defining users, sessions and other core dimensions and metrics has already been performed, so the analyst has a solid foundation for diving directly into the more interesting, valuable parts of the data analysis
Once we have our data modeled into tidy user, session and content-item tables, we are ready to perform analysis on them.
Most companies that use Snowplow will perform analytics using a number of different types of tools:
- It is common to implement a Business Intelligence tool on top of Snowplow data to enable users (particularly non-technical users) to slice and dice (pivot) on the data. For many companies, the BI tool will be the primary way that most users interface with Snowplow data.
- Often a data scientist or data science team will crunch the underlying event-level data to perform more sophisticated analysis, including building predictive models and performing marketing attribution. The data scientist(s) will use one or more specialist tools, e.g. Python or R.