Characteristics of Big Data in the Internet of Things and the Industrial Internet


As data communication costs fall rapidly and sensing technologies and smart devices proliferate, everything from fitness bracelets, shared mobility services, smart electricity meters, and environmental monitoring equipment to elevators, CNC machine tools, excavators, and industrial production lines continuously generates large volumes of real-time data and sends it to the cloud. This data is a valuable asset for society and for enterprises. It lets enterprises monitor the operation of their business or equipment in real time, generate reports along many dimensions, and produce forecasts and early warnings through big data analysis and machine learning, which in turn helps society and enterprises make evidence-based decisions, cut costs, and create new value.

Gartner reports that the number of connected devices exceeded 14.2 billion in 2019 and is expected to reach 25 billion in 2021. Devices on this scale generate massive amounts of data. Compared with the data of the familiar Internet, however, Internet of Things (IoT) data has distinct characteristics, which this article analyzes below.

  • Data is time-series and must be timestamped: connected devices generate data continuously, either at a set interval or when triggered by external events. Each data point is produced at a specific point in time, and that timestamp is essential for computation and analysis, so it must be recorded.
  • Data is structured: the massive data from web crawlers, microblogs, and WeChat is unstructured and may be text, pictures, video, and so on. The data generated by IoT devices, by contrast, is usually structured and numerical. For example, the current and voltage collected by a smart meter can each be represented as a standard 4-byte floating-point number.
  • Data is rarely updated: the data generated by connected devices is machine log data, which generally must not be modified and has no need to be. Scenarios in which the original collected data has to be changed are rare. In a typical information system or Internet application, by contrast, records can be modified or deleted.
  • Data sources are unique: the data collected by one IoT device is completely independent of the data collected by any other. A device’s data can only be generated by that device, not by a person or by other devices. In other words, each device’s data has exactly one producer.
  • Writes dominate reads, unlike in Internet applications: in an Internet application, a record is often written once and read many times. A microblog post or a WeChat official account article, for example, is written once and may be read by millions of people. IoT data is different. It is generally read automatically by calculation and analysis programs, and only a limited number of times. People look at the raw data themselves only in scenarios such as accident analysis.
  • Users care about trends over time: every single bank record, microblog post, or WeChat message matters to its user. With IoT data, however, the value changes little from one data point to the next, usually gradually, so people care more about the trend over a period, such as the past five minutes or the past hour, than about the value at a specific moment.
  • Data has a retention period: collected data usually has a time-based retention policy, such as keeping it for one day, one week, one month, one year, or even longer. To save storage space, the system should delete expired data automatically.
  • Queries and analysis usually target a time range and a group of devices: calculations on IoT data are normally restricted to a specified time range rather than a single point in time or the whole history. They also often cover only a subset of IoT devices chosen by some analysis dimension, such as the devices in a geographical area, of a given model, from a given batch, or from a given manufacturer.
  • Real-time analysis and computation are often required in addition to storage and queries: for most Internet big data applications, offline analysis matters more, and where real-time analysis exists, its latency requirements are modest. User profiles, for example, can be built once enough user behavior data has accumulated, and building them a day earlier or later makes little difference to the result. IoT applications, however, often have strict real-time computation requirements, because alarms must be raised immediately from the computed results in order to avoid accidents.
  • Traffic is steady and predictable: given the number of IoT devices and their data collection frequency, the required bandwidth and traffic can be estimated fairly accurately, as can the volume of new data generated each day. This is unlike e-commerce, where traffic on Taobao, Tmall, Jingdong, and similar sites increases dozens of times during the Double 11 shopping festival, and unlike the 12306 rail ticketing website, whose traffic grows dozens of times during the Spring Festival travel season.
  • Data processing has special requirements: compared with typical Internet data, IoT data calls for different processing. For example, if you want the value of a quantity collected by a device at a specific time but the sensor did not actually sample at that moment, you usually need to interpolate. Many scenarios also require complex mathematical functions to be computed over the collected quantities.
  • Data volume is huge: take smart meters as an example. A smart meter collects data every 15 minutes and so automatically generates 96 records a day. With nearly 500 million smart meters in the country, smart meters alone produce nearly 50 billion records every day. A connected car collects data every 10 to 15 seconds and sends it to the cloud, so a single car can easily generate 1,000 records a day. If all of China’s 200 million vehicles were connected, they would generate 200 billion records a day. Within five years, the data generated by IoT devices is expected to account for more than 90% of the world’s total data.
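
The volume figures quoted for smart meters and connected cars can be reproduced with simple arithmetic, which also illustrates why IoT traffic is predictable. The Python sketch below uses only the device counts and sampling rates given in the list above. The 16-byte record size is an assumption made for illustration (an 8-byte timestamp plus two 4-byte floats for current and voltage), and the function names are hypothetical.

```python
# Back-of-the-envelope daily record counts, using the figures quoted in the text.
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_day(interval_seconds: int) -> int:
    """Records one device emits per day when sampling at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def fleet_records_per_day(devices: int, per_device: int) -> int:
    """Total records per day for a fleet of identical devices."""
    return devices * per_device


# A smart meter samples every 15 minutes.
meter_records = records_per_day(15 * 60)

# Roughly 500 million meters, as stated in the text.
meter_fleet = fleet_records_per_day(500_000_000, meter_records)

# 200 million cars at about 1,000 records per car per day, as stated in the text.
car_fleet = fleet_records_per_day(200_000_000, 1_000)

# ASSUMPTION (not from the text): 8-byte timestamp + two 4-byte floats per record.
BYTES_PER_RECORD = 16
meter_bytes_per_day = meter_fleet * BYTES_PER_RECORD

print(f"records per meter per day: {meter_records}")
print(f"meter fleet records per day: {meter_fleet:,}")
print(f"car fleet records per day: {car_fleet:,}")
print(f"meter fleet raw bytes per day: {meter_bytes_per_day:,}")
```

The meter fleet works out to 48 billion records a day, which matches the "nearly 50 billion" figure. Under the assumed record size, that is on the order of 768 GB of raw data per day before any compression or indexing.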
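
Two of the access patterns described in the list above, interpolating a value at a time the sensor did not sample and averaging over a time range for a group of devices, can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular database implements them. The in-memory layout, the device names, the sample values, and the function names `value_at` and `window_mean` are all hypothetical, and linear interpolation is only one of several possible interpolation methods.

```python
from bisect import bisect_left
from statistics import mean

# Hypothetical layout: each device maps to (unix_seconds, value) samples sorted by time.
Series = list[tuple[int, float]]


def value_at(series: Series, ts: int) -> float:
    """Return the value at ts, linearly interpolating between neighboring samples."""
    times = [t for t, _ in series]
    i = bisect_left(times, ts)
    if i < len(series) and times[i] == ts:
        return series[i][1]  # the sensor sampled exactly at ts
    if i == 0 or i == len(series):
        raise ValueError("timestamp outside the recorded range")
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (ts - t0) / (t1 - t0)


def window_mean(data: dict[str, Series], devices: list[str], start: int, end: int) -> float:
    """Average every sample in [start, end) from the selected group of devices."""
    values = [v for d in devices for t, v in data[d] if start <= t < end]
    if not values:
        raise ValueError("no samples in the requested window")
    return mean(values)


# Illustrative voltage readings taken every 15 minutes (900 seconds); made-up values.
data = {
    "meter-a": [(0, 220.0), (900, 224.0), (1800, 222.0)],
    "meter-b": [(0, 230.0), (900, 226.0), (1800, 228.0)],
}

# Value halfway between the first two samples of meter-a.
print(value_at(data["meter-a"], 450))

# Mean over the first 30 minutes for a two-device group.
print(window_mean(data, ["meter-a", "meter-b"], 0, 1800))
```

Because each device has a single producer and its samples arrive in time order, keeping one sorted series per device makes both lookups simple: a binary search finds the neighbors for interpolation, and a range filter selects the window.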

IoT and industrial Internet data is streaming data, much like a video stream. The value of a single data point is very low, and even losing a short stretch of data affects neither the conclusions of an analysis nor the normal operation of the system. Yet this seemingly simple data poses a new technical challenge: because the number of records is so large, real-time writes become a bottleneck and queries and analysis become very slow. Traditional relational databases, NoSQL databases, and stream computing engines do not take full advantage of the characteristics of IoT data, so the performance gains they can offer are very limited. They can cope only by relying on clustering and by investing more computing and storage resources, which sharply raises the system’s operation and maintenance costs.

Faced with this fast-growing IoT data market, a number of companies focusing on time-series data processing have emerged in recent years. One is InfluxData in the United States, which has raised more than $130 million and whose product InfluxDB holds a considerable share of the IT operations monitoring market. OSIsoft, a long-established real-time database company in the industrial control field, received a US$1.2 billion investment from SoftBank in May 2017 and hopes to become the database leader in the emerging IoT field. The open-source community is also very active, with projects such as OpenTSDB, which is built on HBase. In China, Alibaba, Baidu, and Huawei all have products based on OpenTSDB.

Founded in 2017, Beijing Taosi Data Technology Co., Ltd. is optimistic about this market. Without relying on any third-party or open-source software, and after absorbing the strengths of many traditional relational databases, NoSQL databases, stream computing engines, message queues, and other software, it independently developed TDengine, a complete big data processing engine for time-series data. TDengine’s performance is far superior to that of InfluxDB, and it is simple to install, deploy, and maintain. Because it offers a SQL interface, the learning cost is almost zero, and it is expected to become a dark horse in the time-series data processing market.