Amazon Web Services
Use AWS infrastructure with Teradata Vantage
Harness the Full Potential of Your Data Lake
Democratizing data access across the enterprise
Data lake architecture can be difficult to navigate at first because it lacks the familiar tabular structure of data warehouses. Early data lakes were often left for data scientists to operate and maintain. However, the advent of data preparation and management tools that allow self-service access to data lakes changed this. Now, enterprise staff who aren't experts in data science or data engineering can still take advantage of data lakes' vast possibilities for analytics insights, in industries ranging from healthcare to finance. Also, easier data access encourages data sharing.
Enabling a new approach to data integration
While the data management field hasn't moved entirely away from tools that use traditional extract, transform, and load (ETL) methods, mitigating data transformation costs has become a major priority. Data lake solutions, which typically represent the newer extract, load, and transform (ELT) approach, offer greater scalability and flexibility than strictly ETL-based systems at a lower cost. Organizations employing the most leading-edge best practices for data management are rebalancing hundreds of data integration jobs across the data lake, data warehouse, and ETL servers, as each has its own capabilities and economics.
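The ELT pattern described above can be sketched in a few lines of Python. This is a hedged illustration, not any vendor's implementation: the file name, record fields, and the `transform_on_read` step are all hypothetical, but they show the key idea that data lands in the lake unmodified and transformation is deferred to read time.

```python
import json
import tempfile
from pathlib import Path

# ELT sketch: the "lake" is just a directory; records are loaded verbatim.
lake = Path(tempfile.mkdtemp())

def load_raw(records, name):
    """Load step: write source records to the lake untouched, no schema enforced."""
    (lake / name).write_text("\n".join(json.dumps(r) for r in records))

def transform_on_read(name):
    """Transform step: applied only when the data is queried, not at ingest."""
    rows = [json.loads(line) for line in (lake / name).read_text().splitlines()]
    return [{"id": r["id"], "amount_usd": round(r["amount_cents"] / 100, 2)}
            for r in rows if "amount_cents" in r]

# Malformed or unexpected records are still loaded; they are simply skipped on read.
load_raw([{"id": 1, "amount_cents": 1999}, {"id": 2, "note": "no amount"}], "orders.jsonl")
print(transform_on_read("orders.jsonl"))  # [{'id': 1, 'amount_usd': 19.99}]
```

Because the raw records are preserved, a later, different transformation can be run over the same files, which is the flexibility advantage ELT holds over transform-before-load pipelines.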
Strengthening—and simplifying—corporate memory retention
Archiving long-untouched data to a data lake can benefit an enterprise's data warehouse by freeing storage space. Until the data lake design pattern came along, there was nowhere to put "colder" data for occasional access except the high-performing data warehouse—or truly archaic, offline backup systems such as tape storage. With the aid of virtual query tools and data lakes, users can easily access cold data—in conjunction with the warm and hot data in the data warehouse—through a single query.
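A minimal sketch of that single-query idea, with sqlite standing in for the warehouse and a CSV file standing in for the lake (the table, file, and column names here are hypothetical): a small federation layer unions warm warehouse rows and cold lake rows at query time, so the caller never needs to know where each row lives.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hot data: recent orders held in the "warehouse" (sqlite stands in for it here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", [(101, 25.0), (102, 40.0)])

# Cold data: archived orders sitting in the "lake" as a plain CSV file.
lake_file = Path(tempfile.mkdtemp()) / "orders_2019.csv"
lake_file.write_text("order_id,amount\n1,10.0\n2,15.0\n")

def query_all_orders():
    """One logical query spanning warm warehouse rows and cold lake rows."""
    hot = warehouse.execute("SELECT order_id, amount FROM orders").fetchall()
    with lake_file.open() as f:
        cold = [(int(r["order_id"]), float(r["amount"])) for r in csv.DictReader(f)]
    return sorted(hot + cold)

print(query_all_orders())  # [(1, 10.0), (2, 15.0), (101, 25.0), (102, 40.0)]
```

Commercial virtual query tools do this federation with far more sophistication (pushdown, caching, unified SQL dialects), but the division of labor—cheap storage for cold data, fast storage for hot data, one query surface over both—is the same.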
Embracing new forms of analytics
Technologies ranging from Hadoop to Spark Streaming—along with the cloud era as a whole—have given rise to new forms of analytics. Data lakes allow these new forms of business analytics to be efficiently processed at scale, using graphing methods and machine learning algorithms.
Finding insights in non-curated data
Prior to the big data trend, data integration normalized and aggregated critical enterprise information using a standardized repository system—such as a database or data warehouse—and the structure of this methodology allowed analysts to derive the value from key data. But this alone is no longer enough to manage all of the data across any enterprise.
Attempting to structure what was once called "dark data" and homogenize it with a uniform format can actually undermine its value. By contrast, data lakes are an ideal environment for data scientists and analysts to mine this information in its original format for insights. The unstructured, unprocessed raw data is more malleable than its structured counterpart, making it especially valuable for machine learning projects.
Cluster proliferation
Self-service business intelligence (BI) tools and similar resources that make data lakes more accessible can be a double-edged sword: Their ease of use might lead some users to believe that lakes can be set up ad hoc in the cloud. Although that's technically true, when it becomes too common it's known as cluster proliferation. Cluster proliferation easily leads to redundancy, inconsistency, synchronization problems, and difficulty reconciling any two lakes. In other words, it has the potential to be just as bad as data siloing.
Lack of end-user adoption
Conversely, if your organization doesn't have those self-service tools, fewer users will reap the data lake's advantages. Non-experts may assume that getting answers from a data lake requires advanced coding skills, which isn't true as long as these users have the right complementary solutions.
Limited commercial off-the-shelf tools
Many vendors of data lake solutions claim that their products are compatible either with Hadoop or cloud object storage tools like Amazon S3 and Microsoft Azure Blob. However, a significant number of these offerings lack deep integration capabilities and thus don't provide the democratized access to data that lakes are supposed to facilitate. Moreover, a great deal of these products were built to work with data warehouses rather than data lakes, limiting their ability to maximize value from unstructured data.
Conflicting objectives for data access
In all aspects of data management—including oversight of data lakes—it's critical to strike the proper balance between keeping strict security measures in place and facilitating agile access. One can't be seen as more important than the other, but because data lakes themselves don't necessarily have native security features, it's not uncommon to see organizations err on the side of greater caution. Stakeholders should align on best practices that address both sides of this issue. Zero-trust security tools are a possible solution, as they are extremely protective against unauthorized access but can be programmed to allow unfettered data lake access to authorized users.
Cost and resource concerns
Enterprises must essentially choose between adopting a managed data lake solution or building their own from scratch with open-source tools such as Hadoop and its derivatives. With the former, organizations tie themselves to vendors that can raise subscription fees at any time. The latter requires considerable time and technical expertise to set up and maintain—and will likely be fairly expensive to pull off.
Management and governance issues
For organizations with data teams that have used traditional databases and data warehouses until now, managing a data lake can be difficult at first. Data scientists, engineers, and analysts must work together to carefully manage data partitioning, metadata tagging, data integrity, and infrastructure upgrades to support scalability. Some believe the data lakehouse framework to be an effective workaround for these management issues, but this may not always be the case: See the Frequently Asked Questions section for more information.
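The partitioning and metadata-tagging work mentioned above can be sketched roughly as follows. This is an assumption-laden toy, not any platform's API: the partition key (`date=`), the tag fields, and the in-memory catalog are all hypothetical stand-ins for what real catalog services provide. The point is that files are laid out under partition-key paths and a catalog records tags, so queries can find the files they need from metadata alone.

```python
import json
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())
catalog = {}  # metadata catalog: file path -> tags describing that file

def write_partition(dataset, date, rows, tags):
    """Lay files out under Hive-style partition paths and tag them in the catalog."""
    path = lake / dataset / f"date={date}" / "part-0000.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(json.dumps(r) for r in rows))
    catalog[str(path)] = {"dataset": dataset, "date": date, **tags}

def find_partitions(dataset, **filters):
    """Answer 'which files do I need?' from metadata alone, without reading data."""
    return [p for p, meta in catalog.items()
            if meta["dataset"] == dataset
            and all(meta.get(k) == v for k, v in filters.items())]

write_partition("clicks", "2024-01-01", [{"user": "a"}], {"pii": False})
write_partition("clicks", "2024-01-02", [{"user": "b"}], {"pii": True})
print(len(find_partitions("clicks", pii=True)))  # 1
```

Without this kind of discipline—consistent partition keys and a catalog kept in sync with the files—every query degrades into a full scan of the lake, which is one concrete way a data lake turns into a data swamp.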
Proper data governance is also essential for an effective data lake. This may require the use of programmatic administration tools due to the data lake's sheer volume, and these aren't always a part of organizations' existing data governance frameworks. But without that method of governance, a data lake can quickly become a data swamp—difficult to access and near-impossible to navigate.
Making the most of a data lake isn't about adopting it as a be-all, end-all approach. Instead, data teams should consider it another valuable part of their enterprises' data ecosystems, using the data lake alongside the data warehouse and leveraging each for its key strengths.
Additionally, the design pattern of the data lake is far more important than the technology upon which it's built. Hadoop isn't a prerequisite, nor is any other single cloud data lake platform. Multiple technologies can be used in tandem to form the data lake. Meanwhile, a proper data lake design pattern offers a framework for workloads and data management expectations that will guide successful implementation.
VantageCloud, formerly Vantage in the Cloud, offers two deployment options on AWS: Teradata VantageCloud Lake and VantageCloud Enterprise. VantageCloud Lake’s next-generation, cloud-native architecture enables your teams to experiment and innovate while saving money and maintaining governance. With VantageCloud Enterprise, you can leverage the same industry-leading analytics and fast, secure data access for managing your enterprise-level needs.