Raw data is valuable data

It's hard to know what information will be useful ahead of time, so try to store it all


Intellectual property has always struck me as a bit of a murky term. What property is actually owned by internet companies, besides employee laptops? In the popular conception it’s often a “brilliant algorithm” developed in isolation by a lone genius. The story goes: Larry and Sergey created PageRank and Google was an inevitability; Mark Zuckerberg created Facemash and that somehow morphed organically into the Facebook data trove.

These narratives are simplistic. The early intellectual property developed by Google and Facebook was mostly a set of creative applications of existing concepts, packaged in user-friendly tools to bootstrap data collection. Over time, the interaction data left behind by early users, combined with the technologies developed to store and analyze that behavior, cemented their dominance by setting an informed product development cycle loose.

Algorithms operate on data, and without that data lose much of their value. A search result page ranking is a shot in the dark without measurement of what links are actually clicked through. A brilliant algorithmic trading approach cannot be back-tested without data on historical performance. A beautiful e-commerce checkout flow is just a series of statistically insignificant opinions when you lack traces of thousands of customers checking out.

Perhaps the popular attribution of intrinsic value to algorithms comes from the fact that engineers build those algorithms, and companies pay those engineers; a direct transfer of value that accountants can easily reason about. Your users, on the other hand, mostly don’t engage in direct monetary transfers with you. The data they produce for you is merely a side effect of their consumption of your service.

It’s no surprise that large tech companies are quick to open-source their algorithms and data-processing software, or to publish academic literature describing the same, and yet (with very few exceptions) rarely publish their data. Not publishing data is perfectly sensible when protecting customer information, but there are innumerable “privacy safe” data sets of aggregate user behavior that will never see the light of day: they are simply too valuable for informing further product development and understanding customer needs.

A famous exception occurred 10 years ago, when movie-watching data was published for enterprising data scientists in the eponymous Netflix Prize, but in retrospect it looks more like a PR stunt (especially since Netflix reportedly never brought the winning algorithm to production). Kaggle has more recently emerged as a repository of interesting data from various companies, but the datasets there are clearly toys, published with an eye toward recruiting.

No wonder there is now so much momentum behind the data rights and customer data ownership discussion happening around the GDPR. Regulators have caught on to the fact that the real value is in the data, and that end users could reach new levels of empowerment if they had greater rights over the data that they produce.

Knowing this, how do you maximize the value of your data as a company?

Creating valuable raw data

A strong data practice is necessary (but not sufficient) to create a valuable technology practice. As a company, how does one create the most valuable data? Methodically store as much atomic user data as possible in a raw, yet still semantically meaningful, form, even if that makes at-a-glance analysis slightly more difficult.

This isn’t a new concept. A few years ago, the lambda architecture was much talked-about in big data discussions. In it, instead of only storing the current state of the system, you store the original state along with all of the mutations applied to it (a kind of historical ledger). To get the current state of the world, you play back all of history. Derivations of that (such as a cache of the current state, or quick-update paths) are only necessary for performance. “How to beat the CAP theorem” is a delightful set of ideas in this space.

Query = f(All data)

Google Docs works this way: a document is just the blank document plus all of the edits applied to it. The same concept is at play in the popular redux state container used by so many frontend developers. You get a lot of value from adopting this pattern.
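To make the idea concrete, here is a minimal sketch in SQL using a hypothetical account_transaction ledger (purely illustrative, not part of any of the systems above): the current balance is never stored directly, only derived by replaying every transaction.

CREATE TABLE account_transaction (
  id INTEGER PRIMARY KEY,
  account_id INTEGER,
  amount NUMERIC,      -- positive for credits, negative for debits
  created_at TIMESTAMP
);

-- Query = f(All data): the "current state" is just a replay of the ledger
SELECT
  account_id,
  SUM(amount) AS current_balance
FROM account_transaction
GROUP BY account_id;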

Let’s explore a practical example:

Tracking logins

Every service with accounts has some notion of logging in. It’s a common desire to identify accounts that have recently logged in (to analyze active users) or to segment out dormant users who haven’t, potentially to try and re-engage them through marketing outreach.

To facilitate answering these questions, one approach would be to have an `account` table or collection that contains a last_login_at timestamp field.

CREATE TABLE account (
  id INTEGER PRIMARY KEY,
  last_login_at TIMESTAMP
)

Then, in the code that handles logging in, you’d have something like:

UPDATE account
SET last_login_at = NOW()
WHERE id = /* account_id */

At a glance, this seems fine. Any analyst with rudimentary SQL knowledge can find your active or inactive users with a basic SELECT statement. But that’s pretty much the only question they can answer. Indeed, every time a user logs in again you overwrite, and therefore lose, the record of their previous login (and consequently burn away value from your company or project).

Storing each login event

When a user logs in, they perform an atomic, semantically meaningful event: something specific has happened out there in the world that we can talk about. What if we embraced that and stored every single login event in a separate table that refers back to accounts?

CREATE TABLE login_event (
  id INTEGER PRIMARY KEY,
  account_id INTEGER REFERENCES account(id),
  created_at TIMESTAMP
)

Then, in your login handling code, you’d have something closer to:

INSERT INTO login_event (account_id, created_at)
VALUES ({account_id}, NOW())

For the query-writer, finding the equivalent of last_login_at becomes a little more complicated due to the need for a JOIN, but it’s not that bad:

SELECT
  DISTINCT ON (login_event.account_id)
  login_event.account_id,
  login_event.created_at AS last_login_at
FROM
  account
INNER JOIN
  login_event
ON account.id = login_event.account_id
ORDER BY login_event.account_id, login_event.created_at DESC

(In practice, you could just create views for your analysts or use dashboarding tools like Looker to hide some of these internals from less SQL-savvy folks).
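For example, a view like the following (hypothetical name) gives analysts back a simple last_login_at lookup without exposing the JOIN logic:

CREATE VIEW account_last_login AS
SELECT
  DISTINCT ON (login_event.account_id)
  login_event.account_id,
  login_event.created_at AS last_login_at
FROM
  account
INNER JOIN
  login_event
ON account.id = login_event.account_id
ORDER BY login_event.account_id, login_event.created_at DESC;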

Flexibility of analysis

The beauty is that now you can retroactively use that same raw data to answer a great many more questions!
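For instance (hypothetical questions, sketched against the login_event table above), the same ledger can be sliced by week or used to find highly engaged accounts:

-- How many logins happened each week?
SELECT
  date_trunc('week', created_at) AS week,
  COUNT(*) AS logins
FROM login_event
GROUP BY week
ORDER BY week;

-- Which accounts logged in at least 5 times in the last 30 days?
SELECT
  account_id,
  COUNT(*) AS login_count
FROM login_event
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY account_id
HAVING COUNT(*) >= 5;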

Disk is so cheap these days (consistently less than $0.05 per GB in 2018) that you really shouldn’t worry about the cost of storage for your valuable data. For particularly performance-sensitive situations you might redundantly store the latest value in a cache or directly on account for fast lookup, but premature optimization should be avoided.
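If last-login lookups ever do become a hot path, one sketch of that optimization is to keep last_login_at around purely as a cache of the ledger and update it in the same transaction that records the event:

BEGIN;

-- the ledger remains the source of truth
INSERT INTO login_event (account_id, created_at)
VALUES ({account_id}, NOW());

-- the denormalized column is just a cache for fast lookups
UPDATE account
SET last_login_at = NOW()
WHERE id = {account_id};

COMMIT;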

Evolution over time

Storing the raw events separately also allows you to later extend the information collected about logins without bloating the main account schema. Add a field recording whether the login event happened on mobile or desktop, and immediately you can start using that as a dimension to further break down the above analyses.
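A sketch of that kind of extension (hypothetical column name):

ALTER TABLE login_event ADD COLUMN device TEXT;  -- e.g. 'mobile' or 'desktop'

-- Break the last week of logins down by device
SELECT
  device,
  COUNT(*) AS logins
FROM login_event
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY device;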

That said, if you don’t record an event, or an attribute of an event, at the time of its occurrence, you will not be able to analyze it for the past. If we didn’t capture whether an event happened on desktop or mobile and only added that column to login_event later, we would only be able to see that information from that point forward.

For that reason, it’s best to log as much as possible when adding new functionality, even if you do not have an analysis in mind. Avoid the feeling of hopelessness that comes with the realization that customers have been doing something for months without any record of that to look back on.

Raw, but not too raw

An extreme approach would be to just log every single user event and every single server request, and leave that as your analysis solution. But that might be a bit too far in the direction of rawness (even though you should have this level of logging to allow for debugging of complex bugs and as a backup for backfilling data). Analysis becomes tedious, and the meaning of the data becomes coupled to the specific technical implementation of a given piece of user behavior.

Say you were using HTTP request logs to determine when people were logging in. All is well until you decide to do something like change the login URL from /accounts/login to /users/login.

If you were storing the raw login_event, your new API endpoint would just call the same code and you’d have continuity out of the box.

But if you only had the raw server logs, you’re in a bind. You could go back and rewrite all of the old request log messages to the new path. But that doesn’t make sense: those requests did go to the old path at the time, and rewriting a historical ledger of what happened is not a good idea. You’d probably end up with a complicated bunch of ad-hoc comments and code littered around in the form of:

if (event.timestamp < v2_rollout_timestamp) {
  // Before v2 accounts were at /accounts
  do_old_way()
} else {
  // After v2 accounts are at /users
  do_new_way()
}

There is literally NO WAY that this will stand the test of time, as institutional knowledge fades away and newer people reuse analysis snippets without awareness of these nuances. You might be thinking as you write the code “oh I’ll just remember this”, but human memory is quite fallible, particularly since schemata inevitably evolve over time: v2 is almost definitely not the final version.

To avoid spreading technical debt organization-wide and ensure your data remains consistent and valuable, it’s best to try and store data in as raw of a form as possible while still preserving semantic meaning that’s distinct from the specific technical implementation.

Don’t (only) outsource data

The explosion of SaaS tools for data storage, processing, and visualization makes it tempting to depend on third-party vendors for data. But just throwing in a JavaScript snippet for Google Analytics or Mixpanel and claiming that you have page views covered outsources some of the value that you could be capturing.

At Better Mortgage we proactively stored each page view ourselves (in addition to using easy-to-use third-party tools). We didn’t know exactly what that data would be useful for at the beginning. But by storing atomic, semantic page-view events, we were later able to repurpose them for myriad analyses, such as understanding how many different staff members were viewing different pages, or how particular types of customers were navigating around the portal. If we had depended solely on Google Analytics, we’d have had aggregate information across our pages, but we would not have been able to reprocess the events as our understanding of what was important or useful to analyze grew over time.
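As an illustration (a hypothetical schema, not necessarily what we ran in production), first-party page view capture can be as simple as another append-only event table:

CREATE TABLE page_view (
  id BIGSERIAL PRIMARY KEY,
  account_id INTEGER REFERENCES account(id),  -- NULL for anonymous visitors
  path TEXT,
  referrer TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);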

You should never come across the answer “oh, we don’t log that” when someone asks a question about your users. The answer might be “hmm, that will be complicated to get”, but that is far better than having nothing, and it will make whatever you’re working on so much more valuable.


Thanks to Blake Chasen for reviewing a draft of this post.