The world today is a complex, non-uniform tapestry of regulations, compliance laws, and privacy restrictions designed to protect people and societies across different cultures. Multi-national enterprises spend enormous amounts of time and money legally navigating this mesh of rules to do business globally. Analyzing data for competitive business insight often requires accessing sensitive data across international borders and across state lines within a nation, while complying with each country’s differing regulations. Sensitive data must be protected, traced, and tracked to keep the information it contains safe.
Data Lineage involves the curation of the original data, the original truth. It covers the data’s origins, what happens to the data, where it is moved or copied, how many times it is moved or copied, and finally the confirmation that the data, in all its locations and forms, has been expunged. Multiply this requirement by petabytes of data spread globally within an enterprise and the challenge can seem daunting. Successful enterprises meet these requirements, but doing so can be costly.
Just as an enterprise in a regulated industry must prove that it has persisted certain types of sensitive data, it must also prove that, once deletion is permitted, this data no longer exists anywhere. If data that has been deleted from an enterprise’s records later surfaces in any form or location, it can still be recalled and used in litigation for or against the enterprise, whether or not the enterprise was aware that the rogue data existed. The sequence of the creation, use, retention, and eventual erasure of data is called Data Life Cycle Management, and in some industries this life cycle can span decades.
The burden of managing data, tracking its movement, replication, and locations, and bearing all the associated costs can strain an enterprise’s ability to conduct its business. The sticky issue is the “use” phase of the data’s life cycle. How do you make data useful, accessible, and analyzable across a vast web of multi-national regulations without losing track of it? The answer is simpler than expected: leave the data in place, where it is safe and controllable, as a single, manageable, and immutable version of the truth. Leave the original as the original.
Analyzing data-in-place sounds like an easy solution to this industry problem. Unfortunately, this is not the first time the approach has been tried, and simply accessing the original data remotely from wherever it persists is not sufficient. The problem is network latency. The race between network latency and data size has been a back-and-forth struggle throughout the history of computer networking. Even as the world gets smaller, network latency can make remote data too slow to reach for efficient analysis by high-performance analytical database engines such as Teradata’s Vantage and VantageCloud Lake systems and Teradata’s ClearScape Analytics.
Network latency comes in three primary flavors: latency caused by distance, latency caused by congestion, and latency caused by the network design itself, whether intentional or accidental. A combination of these flavors in the same network compounds the issue. Any of the three can make analytic access to data too slow to be useful, reducing the usable throughput needed to gain insight from critical data to a level that is outright intolerable.
IT’s instinctive solution is to place the data near the processing engines, where it is needed. This means copying data to local storage so that the engines can access it at local speeds. As described above, this creates another set of issues. Keeping track of where all these copies are located, knowing when use of the data is complete, and expunging the data from every location can be difficult and costly. This includes tracking down any locally backed-up copies, off-site media copies, and local disaster recovery replicas in those remote locations.
The simplest solution is usually the best and most practical one: leave the original data in place. This is possible today by combining Teradata’s Vantage analytics database systems with Vcinity technology, which mitigates the effect of latency in Wide Area Networks. The benefit of Vcinity technology grows as latency or data size grows, increasing throughput by more than seven times compared with the same WAN on its own.
This is based on tests that ran queries from Teradata Vantage against Native Object Store (NOS) storage containers, exposed as external foreign tables, with Vcinity technology accelerating the WAN connection at latencies of up to 230 milliseconds. That is the raw network equivalent of running analytical queries between New York and Singapore.
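To give a rough sense of why latency at this scale matters, here is a minimal Python sketch of a general networking rule of thumb: a single window-limited stream can deliver at most one window of data per round trip. It is not a description of the test setup or of how Vcinity technology works. The 64 KiB window (a classic TCP window without window scaling) and the 1 Gbit/s link speed are illustrative assumptions, and both function names are hypothetical; only the 230 ms figure comes from the text above.

```python
def window_limited_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-stream throughput in Mbit/s.

    At most one window of data can be in flight per round trip,
    so throughput <= window / RTT regardless of link speed.
    """
    rtt_seconds = rtt_ms / 1000.0
    return (window_bytes * 8) / rtt_seconds / 1e6


def bandwidth_delay_product_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of this speed full."""
    rtt_seconds = rtt_ms / 1000.0
    return link_mbps * 1e6 / 8 * rtt_seconds


if __name__ == "__main__":
    RTT_MS = 230.0            # latency figure quoted in the article
    WINDOW_BYTES = 64 * 1024  # assumption: 64 KiB window, no window scaling
    LINK_MBPS = 1000.0        # assumption: hypothetical 1 Gbit/s WAN link

    bound = window_limited_throughput_mbps(WINDOW_BYTES, RTT_MS)
    in_flight = bandwidth_delay_product_bytes(LINK_MBPS, RTT_MS)

    print(f"Window-limited bound: {bound:.2f} Mbit/s")
    print(f"Share of the link used: {bound / LINK_MBPS:.2%}")
    print(f"Data in flight needed to fill the link: {in_flight / 1e6:.2f} MB")
```

Under these assumptions the bound works out to roughly 2.28 Mbit/s, a small fraction of one percent of the assumed link, and about 28.75 MB would need to be in flight to keep the link full. Real connections use larger windows and parallel streams, so actual figures will differ; the point is only that long round trips, rather than raw bandwidth, tend to cap usable throughput.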
Imagine being able to use as much as 95 percent of your WAN connection to analyze data where it is stored, rather than copying and staging the data closer to your analytic engines. That is global analytics with data-in-place, at scale. It frees IT to solve bigger problems instead of keeping track of where sensitive data has been copied, assuming the enterprise is even allowed to copy it, and lets IT manage and control data where it needs to. It may also reduce some regulatory requirements, since some regulations permit the transient use of data in another country but not its persistence there.
The powerful combination of Teradata VantageCloud Lake and Vcinity technology is well suited to on-premises, private, hybrid, public, and multi-cloud solutions where long network latency might keep an enterprise from fully leveraging access to its sensitive data. The cost savings and reduced management overhead of curating important data could also play a role in architecting data access methodologies.