Not long ago, the data world was still debating how to create centralized data storage that would maximize data availability and enable advanced analytics. Bloggers argued against the data lake and in favor of well-organized databases, while the open source community united around the Hadoop ecosystem and big data technology developed rapidly. This article revisits some of the assumptions that promoted the data lake and examines how well they have held up.
Hypothesis 1: “Data storage is very expensive, so building your own Hadoop data lake looks economically attractive.”
In hindsight, how has this assumption held up?
To be sure, the TCO per GB of storage in Hadoop can be as little as 5% of that of a traditional RDBMS. However, even the most experienced enterprises soon learned how difficult it is to operate an enterprise cluster. Constant updates to the open source software, scarce skills for managing the environment, and the relative immaturity of the ecosystem all created unmanageable technical failures and dependencies. Moreover, Hadoop replicates data three times by default, and administrators then need snapshots and replicas to work around Hadoop’s limited support for in-place updates. 1 TB of RDBMS data can easily become 50 TB in the lake. There goes all the money saved.
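The storage-amplification arithmetic can be sketched as a back-of-the-envelope calculation. The replication factor of 3 matches the HDFS default; the number of snapshot copies is an assumption chosen for illustration, and real deployments that also keep derived and intermediate data sets can climb toward the 50x figure above:

```python
# Back-of-the-envelope sketch of Hadoop-era storage amplification.
# replication=3 is the HDFS default; snapshot_copies is an illustrative
# assumption, not a vendor figure.

def effective_footprint_tb(raw_tb, replication=3, snapshot_copies=4):
    """Raw data times the HDFS replication factor, plus extra full
    copies kept as snapshots/replicas to work around limited support
    for in-place updates (each snapshot copy is itself replicated)."""
    replicated = raw_tb * replication
    snapshots = raw_tb * snapshot_copies * replication
    return replicated + snapshots

# 1 TB of source data, 3x replication, 4 snapshot copies:
# 3 + 12 = 15 TB on disk -- before any derived data sets are counted.
print(effective_footprint_tb(1))
```

Even with these conservative assumptions, the per-GB price advantage of commodity storage erodes quickly.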
Emerging reality: the cloud and the cloud data warehouse
Amazon, Microsoft, and Google have been eager to fill these productivity gaps with managed, cloud-based environments that simplify administration and let data scientists become productive faster. A consumption-based pricing model then replaced the capital cost of on-premises Hadoop, which made people reluctant to simply dump every large data set into one central environment; instead, they load data as needed for analysis. The effect has been a shift from large on-premises data lakes to small, purpose-built, cloud-based data ponds. Furthermore, the new cloud warehouses make these data easy to access and query through SQL-based tools, further unlocking the value of data for non-technical consumers.
Hypothesis 2: “Big data is too big to move. Move the data once, then bring the compute to the data.”
In hindsight, how has this assumption held up?
A key assumption behind the data lake was that limits on network and processing speed meant we could not repeatedly move large data sets, such as log files, for analysis; better to land them once in the cluster and analyze them there. Hadoop was also batch oriented, which made processing these types of data on demand impractical. In fact, improvements in data replication and streaming, together with huge gains in network bandwidth, have made this assumption far less true than it once seemed.
Emerging reality: data virtualization and streaming
Technological improvements mean that enterprises can now choose how to access data. Perhaps they want to offload queries from transactional systems to a cloud environment; data replication and streaming now make that simple. Or perhaps the transactional system is built for high-performance queries; in that case, data virtualization can make the data available on demand. Enterprises can therefore choose to make data available to DataOps processes as needed, which means it is no longer always necessary to physically centralize all enterprise data in one location.
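A minimal sketch of the data-virtualization idea: answer one federated query across two independent stores, fetching rows on demand instead of bulk-copying either data set into a central lake first. Two SQLite in-memory databases stand in for the source systems here; the table and column names are illustrative assumptions:

```python
import sqlite3

# Two independent "source systems", simulated as separate
# in-memory SQLite databases attached to one connection.
con = sqlite3.connect(":memory:")                    # customer master
con.execute("ATTACH DATABASE ':memory:' AS sales")   # order system

con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
con.executemany("INSERT INTO sales.orders VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])

# One SQL query spanning both sources -- neither data set is copied.
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN sales.orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 150.0), ('Globex', 75.0)]
```

Real virtualization layers add query pushdown, caching, and security on top, but the consumer-facing contract is the same: one query surface over physically distributed data.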
Hypothesis 3: “The data lake’s schema-on-read model will replace the data warehouse’s schema-on-write model.”
In hindsight, how has this assumption held up?
People were tired of the time IT teams spent writing ETL into the data warehouse and were eager to simply let data scientists work on the raw data themselves. Two main sticking points emerged. First, data scientists often could not easily find the data they were looking for. Second, once they had the data, analytics leaders quickly discovered that their ETL had merely been replaced by data wrangling tools, because the data still needed cleanup, such as standardization and foreign-key matching, before data science could begin.
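The cleanup in question is mundane but unavoidable. A minimal sketch of the two steps named above, standardizing raw values and matching foreign keys against a reference table, where the field names and normalization rules are illustrative assumptions:

```python
# Illustrative cleanup of raw order records before analysis.
# Keys are standardized (trimmed, upper-cased) and then matched
# against a customer reference table; unmatched rows are set aside.

customers = {"C001": "Acme Corp", "C002": "Globex Inc"}

raw_orders = [
    {"customer_id": " c001 ", "amount": "100.50"},
    {"customer_id": "C002",   "amount": "75"},
    {"customer_id": "C999",   "amount": "20"},   # no matching customer
]

def clean(order):
    """Standardize the foreign key and coerce the amount to a number."""
    key = order["customer_id"].strip().upper()
    return {"customer_id": key, "amount": float(order["amount"])}

cleaned  = [clean(o) for o in raw_orders]
matched  = [o for o in cleaned if o["customer_id"] in customers]
orphaned = [o for o in cleaned if o["customer_id"] not in customers]

print(len(matched), len(orphaned))  # 2 1
```

Whether this logic lives in an ETL job or a data scientist’s notebook, the work itself does not disappear, which is the article’s point.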
Emerging reality: data catalogs and DataOps
Intelligent data catalogs have become the key to finding the required data. Enterprises are now trying to give workplace users the same simple, Google-like search experience they enjoy at home for finding and accessing data, regardless of where that data is physically stored. DataOps processes have also emerged: a way to build domain-based data sets that, with careful planning and management, deliver maximum analytical productivity. Data scientists should therefore be able to easily find, and trust, the data they use to uncover new insights, and the combination of thoughtful technology and process should let data and analytics pipelines run fast enough to support these discoveries, enabling near-real-time analysis.
At Qlik, these are the key emerging realities to focus on when pursuing a modern data analytics architecture:
- Cloud-based application and analytics architectures
- The re-emergence of data warehouse/RDBMS structures in the cloud to maximize value (think Snowflake)
- Data streaming to reduce the latency of critical data
- Data virtualization to avoid replicating data until it is needed
- Data catalogs to carefully inventory and manage access to enterprise data
- DataOps processes that create rapid time to market for data and analytics pipelines
Qlik’s vision is a data-literate world, where everyone can use data to improve decision-making and solve their most challenging problems. Only Qlik offers end-to-end, real-time data integration and analytics solutions that help organizations access all their data and turn it into value. As an official Chinese partner of Qlik, Huidu provides Chinese Qlik users with product licensing and implementation, customized analytics solutions, technical training, and other services, aiming to help every Qlik user in Chinese enterprises explore the value of data and build an analytics culture.