The Physical Properties of Data (Part 1)
We often treat data as a purely digital or virtual thing: an ineffable signal that lives somewhere within (and perhaps between) our computers, our devices, and the cloud (AKA other people's computers). However, despite the seemingly clean abstraction, data is fundamentally physical; it is rooted in real-world material components whose physical characteristics and constraints shape and control everything that we can and cannot do with our data. Storage, access, and manipulation all require a physical operation in the real world: energy must be applied, states must change, and signals must be emitted, propagated, and received.
In a way, a much better metaphor for data than a cloud is grains of sand in an hourglass, with the end user at the bottom, waiting for a sequence of data operations to finally complete.
This post is part 1 of an exploration of the physical properties of data from an operator's perspective, and how they influence performance, cost, and operational limits.
1. Introducing the six physical properties of data (this post)
2. Compositions and limits - how individual properties compose to form the limits and determine the parameters in the Universal Scalability Law
3. Case studies and examples
Property 1: Inertness
Just like the sand in the hourglass (and like sand in general), data is inert. It doesn't do anything on its own. You cannot read it, write it, or move it without first applying energy to it. In other words, data is a pay-to-play element; it has an activation cost.
This inherent inertness means that every operation on data, whether it's reading a file, writing a new entry to a database, or transferring data between servers, requires an input of energy.
For example, moving data involves at least two physical operations: one to read it from its original location, and another to write it to the new location. And if any intermediate components are involved, the operation will be more complex, e.g. with multiple read/write operations to buffering stages in between, as well as the signal propagation between them.
Every interaction with data, no matter how small, requires energy, and the larger the amount of data involved in an operation such as a search query or an update, the larger the amount of energy required to complete it.
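To make the cost model concrete, here is a minimal back-of-the-envelope sketch in Python. The per-byte energy figures and the number of intermediate buffers are made-up placeholders rather than measurements of any real hardware; the point is only that the cost scales with both the size of the data and the number of hops it passes through.

# Back-of-the-envelope sketch: tallying the physical operations behind a data
# move. The nanojoule-per-byte figures below are hypothetical placeholders.
def move_cost_nj(block_bytes, intermediate_buffers=2,
                 read_nj_per_byte=0.5, write_nj_per_byte=1.0):
    """Rough energy estimate (in nanojoules) for moving one block of data."""
    hops = 1 + intermediate_buffers              # source -> buffer(s) -> destination
    per_hop = block_bytes * (read_nj_per_byte + write_nj_per_byte)
    return hops * per_hop                        # every hop is a read plus a write

# Moving 1 MiB through two intermediate buffers:
print(f"{move_cost_nj(1 << 20):,.0f} nJ")
# The absolute number matters far less than how it scales with size and hop count.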
Property 2: Perfect Reproducibility
One of the most powerful attributes of digital data is its perfect reproducibility. Given sufficient space, you can make an exact copy of any data without any degradation. This ability to replicate data means that digital information can exist in many copies at different locations simultaneously, whether across multiple computers or even in static storage media such as CDs and flash drives.
[Image: Making perfect copies]
The flip side, however, is that the more copies there are, the more energy is required to complete an update, since each copy must be updated separately. Managing these copies consistently is one of the major logistical challenges that modern distributed data systems must tackle.
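As a minimal illustration (the replica counts and the unit cost per write below are arbitrary assumptions), the work behind a single logical update grows linearly with the number of copies that must be kept in sync:

# One logical update fans out into one physical write per copy.
def update_work(logical_updates, copies, cost_per_write=1):
    return logical_updates * copies * cost_per_write

for copies in (1, 3, 5):
    print(f"{copies} copies -> {update_work(1000, copies)} physical writes")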
Property 3: Finite Compressibility
Data is compressible, but only to a certain extent. In his work on the source coding theorem, Claude Shannon showed that there is a mathematical limit to how much we can compress data without losing information. This limit, the Shannon entropy of the data, governs the minimum space required to store it losslessly.
For practical purposes, this means that while we can reduce the footprint of data using compression algorithms, there is a hard limit to how small we can make it, and consequently, to how much information can be stored in a given amount of storage volume.
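A small sketch using only the Python standard library makes the limit tangible: random bytes already sit at the entropy ceiling and refuse to shrink, while a repetitive, skewed stream carries far less information per byte and compresses dramatically. The inputs are synthetic examples, and the order-0 byte entropy computed here is just a convenient estimate.

import math, os, zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical (order-0) Shannon entropy of the byte distribution."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

random_data = os.urandom(64 * 1024)      # ~8 bits/byte: essentially incompressible
skewed_data = b"AAAABBBCCD" * 6554       # repetitive and skewed: lots of slack

for label, data in (("random", random_data), ("skewed", skewed_data)):
    h = entropy_bits_per_byte(data)
    out = len(zlib.compress(data, 9))
    print(f"{label}: {h:.2f} bits/byte, {len(data)} -> {out} bytes after zlib")
# No codec can push the random stream below its entropy; that is the hard floor.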
By combining Property 1 (Inertness) and Property 3 (Finite Compressibility), you can determine the minimum amount of energy required to operate on a given amount of information. In reality we often use more energy than the minimum, but knowing the lower bound is a great start when designing systems to operate over the long term.
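Continuing the sketch above, a crude lower bound follows directly: take the number of bits the information actually requires and multiply by a per-bit operation cost. The cost figure here is a hypothetical device-specific number chosen purely for illustration.

def min_energy_joules(information_bits, joules_per_bit_op=1e-12):
    # You cannot touch fewer bits than the information requires, and each bit
    # touched costs at least one physical operation.
    return information_bits * joules_per_bit_op

# e.g. ~1 GB of incompressible data at an assumed 1 pJ per bit operation:
print(min_energy_joules(8e9), "J")   # 0.008 J as an illustrative floor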
Property 4: Gravity
Data tends to gravitate toward other data with which it is frequently accessed. Because of the energy cost of accessing data and the latency of moving it across space, data that is regularly read or written together tends, over time, to migrate closer together in physical space, whether within the same storage medium, across servers, or even between data centers.
An example of this from the database world is the clustered table index, which stores the table rows in the order of the index keys, ensuring that rows which are close together in the index are also close together in storage. This proximity reduces the number of disk page reads required to fetch related rows, optimizing performance by reducing both the energy and the time needed to access those records.
While moving commonly accessed data closer together makes it cheaper to access as a set, this also means that operating on a data set that isn't stored in close proximity will take longer and require more energy.
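A toy model shows the effect. The page size and row positions below are arbitrary illustrative values; the only thing that matters is how many distinct pages must be touched to fetch the same 100 related rows.

ROWS_PER_PAGE = 50

def pages_touched(row_positions, rows_per_page=ROWS_PER_PAGE):
    """Distinct disk pages that must be read to fetch the given rows."""
    return len({pos // rows_per_page for pos in row_positions})

clustered = range(100)                # rows stored contiguously, as a clustered index would
scattered = range(0, 100 * 500, 500)  # the same 100 rows spread across the table

print("clustered:", pages_touched(clustered), "page reads")   # 2
print("scattered:", pages_touched(scattered), "page reads")   # 100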
Property 5: Geometry - the Meaning is Structural
[Image: Fractal geometry. Source: Robert Webb's Stella software, http://www.software3d.com/Stella.php]
The meaning of data doesn't lie in the individual ones and zeros, but in their structure and relationships. Since each bit is fungible (any one is identical to any other one, and any zero is identical to any other zero), it is how a set of bits is organized that gives data its meaning. This geometrical nature of data is essential to how we interpret it and extract information.
A rather contrived example is the difference between
0b1100 (12)
and
0b0011 (3)
The same numbers of 1s and 0s, arranged in a different order, represent different values.
Not only is the geometry important, but it also has far-reaching implications for the energy and time it takes to work with that data.
For example, choosing an endianness that matches the host architecture can let your program avoid byte swaps and redundant register reads, saving both time and energy when operating over large data sets, which plays an important role in latency-sensitive systems.
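A quick sketch with Python's struct module shows what is at stake: the same 32-bit value laid out in the two byte orders, next to the host's native order. Data stored in the native order can be loaded as-is, while the other order forces a byte swap on every value read.

import struct, sys

value = 0x01020304
little = struct.pack("<I", value)   # least-significant byte first
big    = struct.pack(">I", value)   # most-significant byte first

print(little.hex())   # 04030201
print(big.hex())      # 01020304
print(sys.byteorder)  # the host's native order, e.g. 'little' on x86 and most ARM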
[Image attribution: Aeroid, CC BY-SA 4.0, via Wikimedia Commons]
Another example where geometry provides a powerful efficiency boost is vectors and columnar storage: by storing records of the same data type in columns or vectors, we can achieve a large degree of compression by eliminating type metadata from the individual records and storing it only once in the column or vector header. The tricky part is that while this approach works well for fixed-width data types, which can be stored and accessed efficiently using offset reads and writes, it is much harder to employ for variable-width data types, often requiring more complex logic and weakening the efficiency gains.
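Here is a minimal sketch of the fixed-width case using Python's array module: the type is recorded once for the whole column, and any element can be located with a simple offset calculation instead of parsing per-row metadata. The column contents are invented example values.

import sys
from array import array

ages = array("i", [34, 28, 51, 19, 42])   # one type tag ("i") for the entire column

k = 3
raw = ages.tobytes()                      # the column as one contiguous block
offset = k * ages.itemsize                # fixed width => element k sits at k * itemsize
element = int.from_bytes(raw[offset:offset + ages.itemsize], sys.byteorder, signed=True)

print(element)    # 19
print(ages[k])    # 19, the same value via normal indexing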
However, the geometry property cuts both ways: if the bits describing the structure or meaning of a column (e.g. its type metadata) are damaged or corrupted, the data stored in that column can become uninterpretable. Likewise, the more the data is compressed, the higher the likelihood that a small number of changed bits corrupts it irrecoverably.
All in all, this makes geometry or structure an important aspect to consider when designing data systems and processes.
Property 6: The Fundamental Scaling Limit of Data
When we combine the volume of data with the properties of finite compressibility and inertness, we get a new property: a fundamental scaling limit. The physical limits on how much data we can store in a given system, how fast we can access it or move it, and how much energy is available to perform these operations combine to create an upper bound on the total amount of information that is practically accessible to a user of the system.
No matter how efficiently we code and optimize our applications, this limit means that even though the raw storage capacity of a system might be larger, the pragmatics of a business (i.e. the constraints of the reality in which it exists and operates) and the physical limits of the system may mean that only a fraction of that capacity can be converted into business value.
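One way to picture this property is as a minimum over several physical budgets; whichever one runs out first becomes the binding limit. All figures in the sketch below are made-up examples, not measurements.

def accessible_bytes(storage_bytes, bandwidth_bytes_per_s, time_budget_s,
                     energy_budget_j, joules_per_byte):
    return min(
        storage_bytes,                           # what physically fits
        bandwidth_bytes_per_s * time_budget_s,   # what can be moved in the time window
        energy_budget_j / joules_per_byte,       # what the energy budget allows
    )

print(accessible_bytes(
    storage_bytes=100e12,          # 100 TB of raw capacity
    bandwidth_bytes_per_s=2e9,     # 2 GB/s effective throughput
    time_budget_s=3600,            # a one-hour window
    energy_budget_j=5e6,           # hypothetical energy budget
    joules_per_byte=1e-6,          # hypothetical cost per byte handled
))  # -> 5e12 bytes: here the energy budget, not the disks, is the binding limit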
Limits
These six properties form the parameters that define an upper bound on how much data a system can grow to handle before… well, before it just can't grow anymore: the point where adding more system resources causes a loss of capacity rather than a gain.
In his Universal Scalability Law (USL), Neil Gunther describes four phases of scaling, which can be labelled as follows (a short numerical sketch of the law follows below):
1. Roughly Linear - early scaling, no/low contention
2. Slowing down - increasing contention reduces efficiency
3. Plateau - the cost of contention rises to the point of "eating up" the value added by new resources
4. Negative returns - where the operational cost of new resources actually diminishes the overall system capacity
Source: Neil Gunther, http://www.perfdynamics.com/Manifesto/USLscalability.html#tth_sEc1.1
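For reference, the USL can be written as C(N) = N / (1 + σ(N - 1) + κN(N - 1)), where σ captures contention and κ captures coherency (crosstalk) costs. The short sketch below traces the four phases numerically; the σ and κ values are purely illustrative, not fitted to any real system.

def usl_capacity(n, sigma=0.05, kappa=0.001):
    """Relative capacity C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

for n in (1, 8, 32, 64, 128, 256):
    print(f"N={n:>3}  C(N)={usl_capacity(n):6.1f}")
# Capacity climbs at first, flattens as contention bites, and eventually falls
# as the coherency term dominates: the four phases described above.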
This limiting dynamic is important for any data system in the real world. What the system can do today, and what it can grow to do in the future, are both determined by these properties and the law of scaling. There is a data size, a workload capacity, and a minimum latency that your current system design can never surpass, no matter how much time, engineering effort, or money you sink into it.
There's a real advantage in knowing these limits upfront: as your competitors' systems fall off performance cliffs, and their engineering teams get lost in development plateaus, you can cruise on by, strategically deploying your resources to take your business to the next level.
What's next?
While this first part focused on describing each of the properties in isolation, the next part will cover how they compose to form the performance bottlenecks and ceilings that our systems can never escape. The third and final part of the series will look at how these limits play out in real-world examples.
Stay tuned for the second chapter in this 3-part series!