Connecting data retrospectively

The problem

These days it’s easy to find the answer to many of the questions we have by doing a quick Google search, asking Alexa or querying a database. For example, you could probably find an answer to all of these questions almost immediately:

  • What are all flights into LAX today?
  • What were daily high temperatures in Sydney for the last 12 months?
  • Who are all the people in your organization’s sales team?

However, if we made these questions slightly more complex they become significantly harder to answer:

  • What are all the flights into LAX today, which originated from an airport in South America?
  • Excluding public holidays and weekends, what were daily high temperatures in Sydney for the last 12 months?
  • Who are all the people in your organization’s sales team that haven’t yet completed their privacy training?

You can definitely get answers to all these questions, but it’s unlikely it could done immediately unless the data happens to already be collated and ready to use. Instead, you would probably have to do something like this:

  1. Find all the datasets required to answer the question.
  2. Get all the datasets into a common format (such as a spreadsheet).
  3. Join the datasets together (maybe using some VLOOKUP tomfoolery).
  4. Query the combined dataset to answer to the original question.

Most of your time would be spent just getting the data together, not actually answering the question you had. Worse still, this would be a one off solution to the problem. What if you need to ask the same question again a month later? And what if someone else has a similar question? Would they be able to make use of the work you've already done or would they need to repeat exactly the same steps themselves?

We hope to eliminate these problems by ensuring data is always available in the format you need it and already connected to the related data you care about.

The solution

ColonyDB helps you keep your data together so it's always easy to answer complex problems. Let's consider this questions from above:

Who are all the people in your organization’s sales team that haven’t yet completed their privacy training?

The root cause of this problem probably began a long time ago when someone started keeping track of all the employees in the organization:

At some point someone might've also started keeping a list members of sales team members:

Separate to both of those efforts, someone could have started keeping track of people who had completed the privacy training:

Despite the fact all 3 lists were made without any intention of using them together, it's simple to reconcile them ColonyDB using a feature called "canonical relationships":

Once the canonical relationships have been set up we can use another ColonyDB feature called "rules" to find the answer our original question:

The result can be saved and all the references back to to source data kept intact so others can reuse our work answer their own questions.