Achieving high performance with large amounts of data is a challenge. It requires great technical innovation and knowledge as well as clean code. However, they all have a limit. Often, thinking outside of the box and using basics in programming and mathematics leads to the simplest and most elegant solutions.
Imagine working with great number of people data with a common unique identifier (social security number, national identification number, etc.). Your goal is to record a number of parameters for each person (tall, wears hats, plays football, smokes, etc.) and later use to extract knowledge through statistics. Standard industry approach will be to use the unique identifier and store each parameter in a separate column as bit (true/false) in order to save space in the database. This is a good approach. However if you need to find all people that are tall, wear hats and smoke you must have three different indexes (which will take additional storage) and send queries to all of them. It works well for a country of short non-smokers or low population. As parameters and the population grow, one query could take tens of minutes to hours.
This is where new technologies stop working and new approaches start to. Let us consider using only one column instead of 50. Ridiculous, how would you store billions of records? Use simple Bit logic – record each characteristic of type bigint.
The basic idea is that all characteristics are assigned a value, equal to the powers of 2 from 0 on –> 1,2,4,8 and so on. What is great about this approach is that a sum of those values is unique. The sum of 1, 2, and 8 is 11 and could not be achieved with any other powers of 2 (since different bits of the number are equal to 1 –> hence the term bit logic). We assign numbers (powers of 2) to the different occurrences of the searched observation IDs and make a sum of the observation values for each person, found with an “in-clause”. Then if we have an “and” condition (all tags should be present) we select the people, whose tag value sums are equal to the sum of the bit values of all sent tags, which is pre-calculated. If we are looking for “any” of the parameters, we just select the people with sums, greater than 0.
A simple and elegant solution, which uses only one column and has shown that a 5-minute query can turn into a 2 second one.
ScaleFocus Performance Optimization Team