Eight trends in big data analytics


Loconzolo, deputy director of data engineering at Intuit, has already jumped into the data lake with both feet. Dean Abbott, chief data scientist at Smarter Remarketer, has taken the express lane to the cloud. Both agree that the leading edge of big data and analytics is a moving target, one that includes data lakes for storing raw data and cloud computing. Although these technologies are still immature, waiting is not an option.

“The reality is that these tools are just emerging, and the platforms aren’t yet mature enough for businesses to rely on,” says Loconzolo. Still, big data, analytics and related disciplines are developing so rapidly that enterprises must struggle to keep up or risk being left behind. “In the past it took about ten years for an emerging technology to mature; now solutions appear in months or even weeks,” he adds. So which emerging technologies deserve attention, and what are researchers focusing on? Computerworld asked IT leaders, consultants and industry analysts to list the major trends they see.

1. Big data analytics in the cloud

Hadoop is a framework and set of tools for processing very large data sets. It was once confined to clusters of physical machines, but that has changed. Brian Hopkins, an analyst at Forrester Research, says a growing number of technologies now process data in the cloud, for example Amazon’s Redshift hosted BI data warehouse, Google’s BigQuery data analytics service, IBM’s Bluemix cloud platform and Amazon’s Kinesis data processing service. He expects the big data deployments of the future to combine on-premises and cloud data.

Smarter Remarketer is a SaaS provider of retail analytics, market segmentation and marketing services. It recently moved its in-house Hadoop and MongoDB database infrastructure to Amazon Redshift, a cloud-based data warehouse. The Indianapolis-based company collects online and in-store sales data, customer information and real-time behavioral data, then analyzes it all to help retailers make targeted decisions about consumers, some of them in real time.

Abbott said Redshift saves the company money because of its powerful summary reporting on structured data; it scales well and is relatively easy to use. And it is always cheaper to scale virtual appliances than to buy physical ones.

By contrast, Intuit, based in Mountain View, California, has approached cloud analytics more cautiously, because it needs a secure, stable and controllable environment. So far the financial software company keeps its data within its own private Intuit Analytics Cloud. “We are working with Amazon and Cloudera on a highly available, secure analytic cloud that can span both the public and private worlds, but no one has solved this problem yet,” said Loconzolo. Still, for a company like Intuit that sells cloud products, the move into cloud technology is inevitable. “We will eventually reach a point where keeping data in a private cloud is wasteful.”

2. Hadoop: the new enterprise data operating system

Hopkins says distributed analytics frameworks such as MapReduce are gradually evolving into distributed resource managers that are turning Hadoop into a general-purpose data operating system. “With these systems, you can perform many different kinds of data manipulations and analytics.”

What does this mean for enterprises? SQL, MapReduce, in-memory processing, stream processing, graph analytics and other kinds of work can all run on Hadoop, and more and more enterprises will treat Hadoop as an enterprise data hub. As Hopkins puts it: “Hadoop can handle all kinds of data processing work, so it will gradually become a general-purpose data processing system.”
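The MapReduce model mentioned above can be sketched in a few lines of plain Python. This is only a toy illustration of the map, shuffle and reduce phases on a word-count problem, not any actual Hadoop API; the input lines are invented.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key, as the framework would between phases.
    counts = {}
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        # Reduce: sum the counts emitted for each word.
        counts[word] = sum(n for _, n in group)
    return counts

lines = ["big data on Hadoop", "big data in the cloud"]
print(reduce_phase(map_phase(lines)))
```

In a real cluster, the map and reduce functions run in parallel across many machines and the framework handles the shuffle; the logic per record is this simple.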

Intuit has already started building on its Hadoop foundation. “Our strategy is to use the Hadoop Distributed File System, which works closely with MapReduce and Hadoop, as a long-term platform to enable all kinds of interactions with people and products,” said Loconzolo.

3. Big data lakes

Chris Curran, chief technologist at PwC in the United States, says traditional database theory holds that you design the data set first and then enter the data. A “data lake,” also called an “enterprise data lake” or “enterprise data hub,” turns that concept on its head. “Now we collect the data first and store it in a Hadoop repository; we don’t have to design a data model in advance.” The data lake not only gives people tools to analyze the data, but also tells them clearly what data is there. Curran adds that as people use Hadoop, they build up their understanding of the data incrementally; it is an organic, incrementally grown large-scale database. The trade-off is that it demands highly skilled users.
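The “collect first, model later” idea Curran describes is often called schema-on-read, and can be illustrated with a tiny sketch. The raw JSON records below are invented; the point is that records of different shapes land in the lake as-is, and the schema is discovered only when someone reads the data.

```python
import json

# Raw records land in the lake exactly as collected, with no upfront model.
raw_lake = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "region": "west"}',  # a new field appears later
    '{"sku": "x-1", "price": 9.99}',                 # a different record shape entirely
]

def discover_schema(lake):
    # Schema on read: only when reading do we ask what fields actually exist.
    fields = {}
    for line in lake:
        record = json.loads(line)
        for key, value in record.items():
            fields.setdefault(key, type(value).__name__)
    return fields

print(discover_schema(raw_lake))
```

A traditional warehouse would have rejected the third record at load time; in a lake it simply sits there until an analyst decides whether it matters.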

According to Loconzolo, Intuit has its own data lake, which holds clickstream data as well as enterprise and third-party data, all of it part of the Intuit Analytics Cloud; the key is to make the tools surrounding the lake usable by people. Loconzolo also notes that one concern with building data lakes in Hadoop is that the platform is not really enterprise-ready out of the box. “We also want capabilities that traditional enterprise databases have had for decades, such as monitoring access control, encryption, security, and the ability to trace data from source to destination.”

4. More predictive analytics

Hopkins says that with big data, analysts not only have more data to work with, but also more computing power to handle records with huge numbers of different attributes.

“Traditional machine learning does its analysis on a sample drawn from a big data set. Now we have the ability to process very large numbers of records with very large numbers of attributes per record, and that increases predictability,” he said.

The combination of big data and computing power also lets analysts mine new behavioral data accumulated over the course of a day, such as the websites people visit or the places they go. Hopkins calls this “sparse data,” because to find the data you are interested in you must filter out a great deal that is irrelevant. “Trying to use traditional machine-learning algorithms against this kind of data was computationally impossible. Now we can bring cheap computing power to the problem; you formulate problems completely differently when speed and memory cease to be critical issues. Now you can easily find which data are the easiest to analyze. The game has changed.”

“What we’re most interested in is how to do both real-time analytics and predictive modeling inside the same Hadoop core,” says Loconzolo. The biggest obstacle has been speed: Hadoop can take up to 20 times longer than more established technologies, so Intuit is also trying Apache Spark, a large-scale data processing engine, and its companion Spark SQL query tool. “Spark gives us fast interactive queries as well as graph services and streaming capabilities. It keeps the data inside Hadoop and processes it very well,” said Loconzolo.

5. SQL on Hadoop: faster, better

If you are a capable coder and mathematician, you can drop data into Hadoop and run whatever analysis you want. That’s the promise, and it’s also the problem, according to Mark Beyer, an analyst at Gartner. “I need someone to put the data into a format and language I’m familiar with, and that is why we need SQL on Hadoop.” Tools that support SQL-like querying let business users who already know SQL apply the same techniques to Hadoop data. Hopkins believes SQL on Hadoop opens the enterprise door to Hadoop, because with it companies no longer need to invest in high-end data scientists and business analysts who can write scripts in Java, JavaScript and Python, investments that used to be unavoidable.

These tools are not new. Apache Hive has offered a SQL-like query language for Hadoop for some time, and commercial alternatives from Cloudera, Pivotal Software and other vendors deliver even higher performance and keep getting faster. The technology also suits “iterative analytics,” in which an analyst asks one question, gets an answer, and then asks the next. That kind of work used to require building a data warehouse. “SQL on Hadoop isn’t going to replace data warehouses, at least not in the short term, but for some kinds of analysis it offers an alternative to more costly software and appliances,” Hopkins said.
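The iterative-analytics workflow described above can be sketched with Python’s built-in sqlite3 module. SQLite stands in here for a SQL-on-Hadoop engine such as Hive; the point is the question-answer-question loop, not the engine, and the table and figures are invented.

```python
import sqlite3

# A stand-in data store; with SQL on Hadoop the same queries would
# run against data sitting in HDFS instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("west", "widget", 120.0),
    ("west", "gadget", 80.0),
    ("east", "widget", 250.0),
])

# Question 1: which region sells the most?
top_region = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC LIMIT 1").fetchone()

# Question 2 follows from the first answer: what drives that region?
breakdown = conn.execute(
    "SELECT product, SUM(amount) FROM sales WHERE region = ? GROUP BY product",
    (top_region[0],)).fetchall()

print(top_region)   # ('east', 250.0)
print(breakdown)
```

Each answer shapes the next query, which is exactly the interactive pattern that slow batch jobs made impractical and faster SQL engines enable.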

6. NoSQL (Not Only SQL): faster, better

Curran said that alongside traditional SQL-based databases, NoSQL databases designed for specific kinds of analysis are now very popular and will likely keep gaining ground. He roughly estimates there are 15 to 20 open-source NoSQL databases, each with its own specialization. For example, ArangoDB, a product with graph-analysis capability, offers a faster, more direct way to analyze the network of relationships between customers and salespeople.

Curran also said that open-source NoSQL databases have been around for some time, but they are still gaining momentum because of the kinds of analysis people need. One PwC client in an emerging market has placed sensors on store counters to monitor which products are there, how long customers handle them and how long shoppers stand in front of particular counters. “Those sensors are producing streams of data that will grow exponentially. NoSQL is a direction for the future because it can analyze data for specific purposes, and it is high-performance and lightweight.”
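The relationship-network analysis Curran attributes to graph databases can be illustrated with a toy sketch. This only conveys the idea behind a graph store like ArangoDB, not its actual API; the names and edges are invented.

```python
# Who sold to whom, stored as adjacency lists: the data is already
# shaped like the question we want to ask of it.
sold_to = {
    "sam": ["alice", "bob"],
    "sue": ["bob", "carol"],
}

def customers_of(salesperson):
    return set(sold_to.get(salesperson, []))

def shared_customers(a, b):
    # A relationship query that takes self-joins in row-oriented SQL
    # but is a direct set operation when the data is a graph.
    return customers_of(a) & customers_of(b)

print(shared_customers("sam", "sue"))  # {'bob'}
```

A real graph database generalizes this to traversals many hops deep, which is where the “faster, more direct” advantage over relational joins shows up.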

7. Deep learning

Hopkins believes that deep learning, a set of machine-learning techniques based on neural networks, is still evolving but shows great potential for solving business problems. “Deep learning enables computers to recognize items of interest in large quantities of unstructured and binary data, and to deduce relationships, without needing specific models or programming instructions.”

For example, one deep learning algorithm that examined data from Wikipedia learned on its own that California and Texas are both states in the U.S. “We no longer have to program a model to understand the concepts of state and country, and that is one of the differences between older machine learning and emerging deep learning.”

Hopkins also said: “Big data will use advanced analytic techniques such as deep learning to deal with diverse, largely unstructured text. We are only now beginning to understand how to think about and work with these problems.” For example, deep learning can be used to recognize many kinds of data, such as shapes, colors and objects in video, and even cats in images, as Google’s neural network famously did. “The cognitive engagement and advanced analytics this technology portends will be a trend of the future.”
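The core idea behind the neural networks mentioned above, layers of simple units combining to represent relationships no single unit can, fits in a tiny sketch. The weights here are set by hand purely for illustration (real deep learning learns them from data, and uses smooth activations rather than a hard threshold); the network computes XOR, a classic function that no single linear unit can represent.

```python
def step(z):
    # A hard threshold stands in for the smooth activations real networks use.
    return 1 if z > 0 else 0

def unit(inputs, weights, bias):
    # One neuron: weighted sum of inputs plus bias, passed through the activation.
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_net(x1, x2):
    # Hidden layer: one unit computes OR, the other AND.
    h_or = unit([x1, x2], [1, 1], -0.5)
    h_and = unit([x1, x2], [1, 1], -1.5)
    # Output unit combines them: OR and not AND, i.e. XOR.
    return unit([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))
```

Stacking many such layers, with weights learned rather than hand-set, is what lets deep networks pick out cats in images or states in text.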

8. In-memory analytics

Beyer said that using in-memory databases to speed up analytic processing is increasingly popular and, used properly, highly beneficial. In fact, many enterprises are already using HTAP (hybrid transaction/analytical processing), which performs transactions and analytics in the same in-memory database. But Beyer also said HTAP has been overhyped, and many companies overuse it. For systems where users need to see the same data in the same way many times a day, and the data does not change much, in-memory processing is a waste of money.
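The HTAP idea, transactional writes and analytical reads against the same in-memory store with no ETL step in between, can be sketched with SQLite’s in-memory mode. SQLite is only a stand-in for a real HTAP engine, and the table and order amounts are invented.

```python
import sqlite3

# One in-memory store serves both workloads.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# Transactional side: record orders as they happen,
# committed atomically via the connection's context manager.
with db:
    db.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, 25.0), (3, 15.0)])

# Analytical side: aggregate the very same data immediately,
# with no export to a separate warehouse first.
total, avg = db.execute("SELECT SUM(amount), AVG(amount) FROM orders").fetchone()
print(total, avg)
```

Beyer’s caveat maps directly onto this sketch: the analytics can only see what has been written into this one store, so data living elsewhere still has to be integrated in first.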

HTAP lets people analyze faster, but all the transactions must live in the same database. The problem, Beyer notes, is that most analytic work today aggregates data from many different places into one database. “If you want to use HTAP for all your analytics, all the data has to sit in one place. We still need to integrate diverse data.”

Moreover, introducing an in-memory database means there is yet another product to manage, maintain, integrate and balance.

For Intuit, which already uses Spark, the urge to adopt an in-memory database is not so strong. “If we can solve 70% of our problems with Spark and 100% with an in-memory database, we’ll go with the former,” said Loconzolo. “So we are also weighing whether to shut down our in-memory systems right away.”

Moving forward

With so many emerging trends in big data and analytics, IT organizations need to create the conditions that let analysts and data scientists show what they can do. “We need to evaluate and integrate these technologies so they can be applied to the business,” Curran said.


“IT managers and executives can’t use immaturity as an excuse to stop experimenting,” says Beyer. At first, only a few specialized analysts and data scientists need to experiment. Then those advanced users and IT should jointly decide whether to roll the new resources out to the rest of the organization. IT shouldn’t rein in motivated analysts; on the contrary, Beyer thinks, it should work more closely with them.
