New prospects for big data science: four trends you need to know


Since 2012, almost everyone (at least in the Internet world) has been talking about big data, as if it were embarrassing to hold a conversation without mentioning it. Since 2016, big data systems have gradually moved into deployment in enterprises and the hype has faded, giving way to a period of vigorous application development. A number of landmark IPOs representing mature technologies have also appeared in capital markets at home and abroad. In the blink of an eye, the bubble that big data went through a few years ago has unmistakably shifted to artificial intelligence. In the past year, AI has arguably gone through a “Big Bang” of public awareness comparable to that of big data in its day. More recently, the spotlight has moved to blockchain, which has to some extent become a source of anxiety for the industry.

However, no matter how the technology hot spots change, one thing is clear: as the industry settles into real-world use, the big data ecosystem is becoming more and more specialized. Today, I’d like to talk about some new changes and trends in the field of big data.

1. Data governance and security

In terms of development trends, this one deserves to come first.

Over the years, data has been accumulating rapidly in enterprises, and the Internet of Things (IoT) is accelerating its generation.

For many enterprises, the big data solution is to build a data lake on open source technologies such as Apache Hadoop, that is, a data management platform for the entire enterprise that stores all of its data in native format. The data lake is meant to eliminate information silos by providing a single repository that the whole organization can use for business analysis, data mining and other applications. Once a data lake exists, people tend to treat it as an all-purpose, omnipotent store for big data. Click stream data, IoT data, log data and so on are all expected to flow into the lake, while the difficulties of actually handling that data are ignored.

However, unless you know what is in the data lake and can access the appropriate data for analysis, the lake is of little use no matter how big it is. In the end, many organizations find that their data lakes are underperforming resources: people don’t know what is stored in them, how to access it, or how to gain insight from it.

Finding what you want while also managing permissions is not easy. Beyond the data lake itself, another theme of governance is giving everyone convenient access to reliable data in a secure and auditable manner.

Therefore, if corporate data assets are to be managed and used well, data governance, like a company’s top-level policies and charter, needs to be taken seriously and backed by concrete strategies and processes. The ultimate goal is to improve data management, ensure data quality and foster open sharing of data. Data governance is also an organic combination of decision-making, roles and operating processes, with people held accountable for the data assets.
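To make the idea of "findable, permissioned and auditable" concrete, here is a minimal sketch in Python. It is not a real governance product: the class names `DatasetEntry` and `DataCatalog`, the role names and the lake paths are all hypothetical, and the catalog lives in memory purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set


@dataclass
class DatasetEntry:
    """Hypothetical catalog record describing one dataset in the lake."""
    name: str
    location: str             # illustrative path, not a real system
    description: str
    allowed_roles: Set[str]   # roles permitted to read the dataset


@dataclass
class DataCatalog:
    """Toy catalog: searchable metadata, role checks and an audit trail."""
    entries: Dict[str, DatasetEntry] = field(default_factory=dict)
    audit_log: List[dict] = field(default_factory=list)

    def register(self, entry: DatasetEntry) -> None:
        self.entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return sorted names of datasets whose name or description mentions the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self.entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        )

    def request_access(self, user: str, role: str, dataset: str) -> Optional[str]:
        """Return the dataset location if the role is allowed; record every attempt."""
        entry = self.entries.get(dataset)
        granted = entry is not None and role in entry.allowed_roles
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "dataset": dataset,
            "granted": granted,
        })
        return entry.location if granted else None


# Example usage with made-up datasets and roles.
catalog = DataCatalog()
catalog.register(DatasetEntry(
    "clickstream_raw", "/lake/raw/clickstream",
    "Website click stream events", {"analyst", "engineer"}))
catalog.register(DatasetEntry(
    "iot_sensor_raw", "/lake/raw/iot",
    "IoT sensor readings", {"engineer"}))

print(catalog.search("iot"))
print(catalog.request_access("alice", "analyst", "iot_sensor_raw"))
print(catalog.request_access("bob", "engineer", "iot_sensor_raw"))
```

The point of the design is that discovery (`search`), permission checks and the audit record all go through one place, so every access attempt, granted or not, leaves a trace.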

2. Development of collaborative data workbenches

In most large enterprises, the adoption of big data starts with a few independent projects, often driven by individuals: a small Hadoop cluster here, an analysis tool there, a simple business model to run, and then the realization that some new positions (data scientist, chief data officer) are needed.

Today, business scenarios are richer, heterogeneity is more pronounced, and a wide variety of tools are in use throughout the enterprise. Within the company, the centralized “data science department” is gradually giving way to a more decentralized organization, because a centralized department increasingly becomes a bottleneck and is also more likely to waste resources.

Data scientists, data engineers and data analysts are increasingly embedded in different business units. For the platform, the requirement is therefore obvious: make everything work together, because success with big data depends on building a pipeline of technology, people and processes.

As a result, new kinds of collaboration platforms (such as Jupyter) are emerging rapidly and leading the development of what is called DataOps (by analogy with DevOps).

3. Data science automation

Data scientists are still in hot demand in the market, yet we rarely meet them in practice. Even the top 1000 companies are troubled by their inability to recruit more “data scientists”. In some organizations, the data science department is turning from an enabler into a bottleneck.

At the same time, the popularity of AI and the spread of self-service tools make it easier for data engineers, and even data analysts with limited data science skills, to perform basic operations that until recently were the domain of data scientists. With the help of automation tools, a large share of big data work, especially the simple and tedious parts, will be handled by data engineers and data analysts without involving data scientists with deep technical skills. Even so, data scientists have no need to be “scared” for the moment.

For the foreseeable future, self-service tools and automated models will “augment” data scientists rather than eliminate them, freeing them to focus on tasks that require judgment, creativity, social skills or vertical industry knowledge, and so to better live up to the title of scientist.

4. The rise of big data administrators

The big data administrator (BDA) is named by analogy with the database administrator (DBA). Although the two abbreviations differ only in the order of their letters, the roles themselves are far apart. A very clear trend is that enterprises will need this new role. Everyone is familiar with the DBA, but the data administrator of the big data era is a very different job.

The big data administrator sits between the data user and the data engineer. To succeed, data administrators must not only maintain the big data system but also understand the meaning of the data and master some of the technologies applied to it.

Data administrators need to be clear about the types of data analysis that must be performed across the organization, which data sets are suitable for that work, and how to transform data from its original state into the form and format that data users need. They should use systems such as self-service data platforms to speed up the end-to-end process by which data users access the underlying data sets, without making countless copies of the data.
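One way to picture "transforming data for users without making countless copies" is a lazy view over the raw records. The following Python sketch is only an illustration under stated assumptions: the function names `raw_records` and `sensor_view`, the JSON-lines layout and the field names (`device_id`, `temperature`, `fw`) are all hypothetical, and a short in-memory list stands in for files in the lake.

```python
import json
from typing import Iterable, Iterator


def raw_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse raw JSON lines lazily, one record at a time, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def sensor_view(lines: Iterable[str], min_temperature: float) -> Iterator[dict]:
    """Yield only the fields a data user needs, for readings at or above the threshold.

    Nothing is written back to storage: the view is computed on demand
    from the raw data each time it is iterated.
    """
    for record in raw_records(lines):
        if record["temperature"] >= min_temperature:
            yield {"device": record["device_id"], "temp_c": record["temperature"]}


# Made-up raw data standing in for a file in the lake.
RAW_LINES = [
    '{"device_id": "a1", "temperature": 21.5, "fw": "1.0"}',
    '{"device_id": "b2", "temperature": 35.0, "fw": "1.1"}',
    '',
]

for row in sensor_view(RAW_LINES, min_temperature=30.0):
    print(row)
```

Because the view is a generator, each data user can define the shape they need over the same raw data set, and no second copy of the data has to be stored and kept in sync.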


These four areas are the new requirements that data science faces as it develops in practice. Whoever achieves good results in them will take a leading position in the era of big data.
