Why is a general big data architecture not suitable for Internet of Things data processing?


To cope with ever-growing Internet data, many tools have emerged, the best known being the Hadoop ecosystem. Besides familiar Hadoop components such as HDFS, MapReduce, HBase, and Hive, a general big data platform usually also employs Kafka or another message queue, Redis or another cache, and Flink or another real-time stream-processing engine, while MongoDB, Cassandra, or other NoSQL databases are chosen for storage. Such a typical big data platform handles classic Internet workloads well, such as user profiling and public opinion analysis.

Naturally, when the Internet of Things, the Internet of Vehicles, and the Industrial Internet emerged, people reached for the same general big data platforms to process their data. Today's popular IoT and connected-vehicle platforms are, almost without exception, built on this architecture. The approach does work. But how well does it work? It has significant shortcomings, mainly in the following respects.

  • Low development efficiency: because it is not a single piece of software, at least four modules must be integrated, and many of them expose neither standard POSIX nor SQL interfaces. Each has its own development tools, languages, and configuration, which carries a learning cost. Because data flows from one module to another, data consistency is easily broken. These modules are mostly open-source software and inevitably contain bugs; even with forums and community support, a single tricky problem can cost engineers a great deal of time. In general, a strong team is needed just to assemble these modules smoothly, which demands considerable manpower.
  • Low operating efficiency: the existing open-source software was designed mainly for unstructured Internet data, whereas the data collected by IoT devices are time-series and structured. Using unstructured-data technology to process structured data consumes far more resources, in both storage and computation. Take a smart meter that collects two quantities, current and voltage. If HBase or another KV (key-value) database is used for storage, the row key is typically the meter's ID plus static tag values; the key of each collected value then consists of the row key, column family, column qualifier, timestamp, and key type, followed by the value itself. Stored this way, the data carry a large overhead and waste storage space. Computation suffers too: to compute the average voltage over a period, the voltage values must first be parsed out of the KV structure into an array before the calculation can run, and the overhead of parsing the KV structure significantly reduces efficiency. The biggest advantage of KV storage is that it is schemaless: no data structure needs to be defined before writing, and records can take any form. That is very attractive for Internet applications that change almost daily, but far less so for the IoT or connected vehicles, because the schema of device-generated data is generally fixed; even when it changes, it changes rarely, since the corresponding configuration or firmware must be updated.
  • High operation and maintenance cost: each module, whether Kafka, HBase, HDFS, or Redis, has its own management console and must be administered separately. In a traditional information system, a DBA only had to learn to manage MySQL or Oracle; now a DBA must learn to manage, configure, and tune many modules, a much heavier workload. With so many modules, fault localization also becomes harder. For example, if a user finds that a piece of collected data is missing, was it lost in Kafka, HBase, Spark, or the application? It often takes a long time correlating the logs of each module to find the cause. And the more modules there are, the lower the overall stability of the system.
  • Slow time to market and low profit: because of low R&D efficiency and high operation and maintenance costs, products take longer to reach the market, costing enterprises business opportunities. Moreover, these open-source projects are all still evolving, and keeping up with the latest versions takes sustained manpower. Except for the leading Internet companies, the manpower a small or mid-sized company spends on its big data platform generally far exceeds what the products or services of a specialist vendor would cost.
  • Too heavy for on-premises deployment at small data volumes: in IoT and connected-vehicle scenarios, production and operations data are sensitive, so many systems are still deployed on premises. The data volume per deployment varies enormously, from hundreds of connected devices to tens of millions. For small-volume scenarios, the general big data stack is far too bloated, and the return does not justify the investment. Some platform vendors therefore maintain two solutions: one for big data scenarios and one for small ones, where MySQL or another single database does everything. But that in turn raises R&D and maintenance costs.
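The storage overhead described in the low-operating-efficiency point above can be made concrete with a back-of-the-envelope sketch. The cell layout below is an illustrative assumption modeled loosely on an HBase-style KV cell (row key, column family, qualifier, timestamp, type per value), not any product's exact on-disk format, and the device ID and tag values are made up:

```python
import struct

def kv_cell_size(row_key: bytes, family: bytes, qualifier: bytes,
                 value: bytes) -> int:
    """Approximate size of one KV cell: the full key is repeated per value."""
    timestamp = 8   # 64-bit timestamp stored in every cell
    key_type = 1    # cell-type byte
    return len(row_key) + len(family) + len(qualifier) + timestamp + key_type + len(value)

# One smart-meter reading: current and voltage, each stored as its own cell.
row_key = b"meter-000123|region=north"   # hypothetical device ID plus static tag
kv_bytes = (
    kv_cell_size(row_key, b"d", b"current", struct.pack(">f", 5.2)) +
    kv_cell_size(row_key, b"d", b"voltage", struct.pack(">f", 220.1))
)

# The same reading as one fixed-schema row: timestamp plus two floats.
struct_bytes = len(struct.pack(">qff", 1_700_000_000_000, 5.2, 220.1))

print(kv_bytes, struct_bytes)  # → 92 16
```

Even in this toy sketch the KV encoding is several times larger than the fixed-schema row, because the row key, family, qualifier, and timestamp are repeated for every single collected value.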

A general big data platform has the problems above; is there a good way to solve them? To answer that, we need a detailed analysis of IoT scenarios. Careful study shows that the data generated by machines, equipment, and sensors are all time-series, and much of it also carries location information. These data have distinct characteristics:

  1. The data is time-series and always time-stamped.
  2. The data is structured.
  3. The data is rarely updated or deleted.
  4. Each data stream has a single source.
  5. Compared with Internet applications, data is written far more often than it is read.
  6. Users care about trends over a period of time, not the value at a single point in time.
  7. The data has a retention period.
  8. Queries and analytics are always scoped by time period, and often by geographic area.
  9. Beyond storage and query, various statistical and real-time computations are frequently needed.
  10. The data flow is steady and predictable.
  11. Interpolation and other special calculations are often needed.
  12. The data volume is huge; a single day's collection can exceed 10 billion records.
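Characteristic 11, interpolation, is worth a concrete illustration: sensors rarely sample at exactly the instant a query asks about, so the engine must estimate in-between values. A minimal linear-interpolation sketch, with made-up sample values:

```python
def interpolate(t: float, t0: float, v0: float, t1: float, v1: float) -> float:
    """Linearly interpolate the value at time t, given samples (t0, v0) and (t1, v1)."""
    if t1 == t0:
        return v0
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Voltage sampled at t=0s (220.0 V) and t=10s (225.0 V);
# estimate the value at t=4s.
print(interpolate(4.0, 0.0, 220.0, 10.0, 225.0))  # → 222.0
```

A time-series engine that understands timestamps natively can perform this kind of calculation inside the database, instead of forcing the application to pull raw points out of a KV store and compute externally.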

If we make full use of the above characteristics, we can develop a big data platform specifically optimized for IoT scenarios. Such a platform would have the following properties:

  1. It exploits the characteristics of IoT data through targeted technical optimizations, greatly improving data-insertion and query performance and reducing hardware or cloud-service costs.
  2. It scales horizontally: as data volume grows, capacity is added simply by adding servers.
  3. It has a single management console, is easy to maintain, and ideally requires zero management.
  4. It is open, offering the industry-standard SQL interface and providing Python, R, or other development interfaces, so that machine learning, AI algorithms, and other applications can be integrated easily.

TDengine, from TAOS Data, is a full-stack big data processing engine developed by making full use of these 12 characteristics of IoT data. It has the properties listed above and aims to remedy the shortcomings of general big data platforms in processing IoT data. By TAOS Data's design reasoning, using TDengine should greatly simplify the architecture of an IoT big data platform, shorten the R&D cycle, and reduce platform operating costs.

TDengine is now open source on GitHub. You are welcome to download it and try it out; if you have any questions, raise them on GitHub, where dedicated R&D staff will answer them.