A data scheme for the Internet of Things

Time: 2021-10-19

1、 Foreword

In science fiction movies and CG you often see a commander standing in front of a huge data screen, directing operations. The big screen shows various indicators and summary data, rendered as all kinds of charts.
From the product perspective, the choice of indicators and summary items determines the value of the data screen. Of course, you can also support user-defined indicator management, but that is not the focus here.
From the front-end perspective, the top priority is how to render large amounts of data into various charts with the least system resources. Technically, Alibaba's AntV can be used. That is also not the focus here.
From the back-end perspective, the key question in this scenario is how to guarantee high performance for data requests (low latency, high throughput, high frequency, etc.) while still serving the content those requests need. That is the focus of this article.

2、 Background

As an Internet of Things company, we need to display data on large screens for multiple power companies. Each data screen must show the status of hundreds of wind turbines and their monitoring indicators, including tilt value, vibration waveform, structural stress, and so on. Vibration is high-frequency acquisition, at thousands of Hz; even the tilt sensor used to calculate the tilt value samples at least 10 times per second. The monitoring screen requires real-time performance: the product requirement is real-time synchronization where possible, and even the slowest indicator must guarantee a refresh interval of 5 seconds. Most importantly, each monitoring indicator must support viewing both historical and real-time data.

Taking vibration data as an example: a vibration record in the database is about 100 bytes, a vibration sensor produces 1,000 records per second, a wind turbine carries 4 vibration sensors, and a power company currently has at most about 100 turbines. As a result, one power company generates 400,000 vibration records per second, roughly 40 MB. Vibration is the sensor type with the most records, though not the one with the largest bandwidth footprint; audio and video occupy more bandwidth.

Of course, this is the figure without any processing, and it looks terrible: the front end cannot handle it at all. Vibration alone would require rendering 400,000 data points per second, which would kill front-end development. With historical data on top of that, it is simply unimaginable.

Never mind whether it can be rendered; I have to wonder whether the database and the bandwidth can even afford it. After all, there are other companies, other scenarios, and other sensors.

3、 Data writing

1. Data cleaning

First, clean out obviously non-conforming data, for example tilt value jumps caused by human touch, or vibration fluctuations caused by human operation. The proportion filtered out in this step is very low; it is mainly for noise reduction.
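A minimal sketch of what such a cleaning step might look like, assuming a simple jump-threshold filter; the function name and threshold are illustrative, not the actual rules we used:

```python
# Minimal sketch of a jump filter for tilt readings (illustrative only).
# The max_step threshold is a hypothetical value, not the real configuration.
def drop_jumps(readings, max_step=2.0):
    """Drop readings that jump more than max_step from the previous kept value."""
    cleaned = []
    last = None
    for value in readings:
        if last is not None and abs(value - last) > max_step:
            continue  # likely caused by human touch / operation noise
        cleaned.append(value)
        last = value
    return cleaned

print(drop_jumps([0.10, 0.11, 5.30, 0.12, 0.13]))  # the 5.30 spike is filtered out
```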

2. Edge calculation

A given indicator of a wind turbine is often calculated from multiple sensors. For example, the tilt value is usually computed from a group of (at least three) sensors.
Therefore, we parse the protocol and calculate the initial indicators at the edge gateway on the turbine.
This step often collapses several raw records into a single indicator record, so the filtering effect is quite good.

PS: edge computing is also used for immediate alarms and machine shutdown at the edge gateway.
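As a rough illustration of edge-side aggregation (not the actual protocol logic), here is a sketch where one acquisition cycle of a three-sensor group is collapsed into a single tilt record; `compute_tilt` is a hypothetical stand-in for the real calculation:

```python
# Sketch of edge-side aggregation: three raw sensor readings become one tilt record.
from statistics import mean

def compute_tilt(readings):
    # Placeholder: the real formula depends on sensor geometry and the device protocol.
    return mean(readings)

def to_index_record(sensor_group, timestamp):
    """Collapse one acquisition cycle of a sensor group into a single index record."""
    return {
        "ts": timestamp,
        "indicator": "tilt",
        "value": compute_tilt([s["raw"] for s in sensor_group]),
    }

group = [{"id": 1, "raw": 0.12}, {"id": 2, "raw": 0.15}, {"id": 3, "raw": 0.11}]
print(to_index_record(group, 1634601600))
```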

3. Sample upload

The sensors produce a great deal of raw data. If you are not careful, the hard disk will explode, and so will the bandwidth.
In fact, we usually do not need all of that raw data. After all, a wind turbine that is about to collapse will not look perfectly normal one second and collapse the next.
Therefore, we sample the raw data before uploading, and the acquisition frequency of each sensor also has to be determined.
After communication and negotiation with the business and algorithm teams, the sampling rate of the tilt sensor was set at 5%, and that of the stress sensor at 1%.

So, what is the difference and relationship between a sensor's acquisition frequency and the upload sampling rate?
The former determines how often data is collected and possibly written to disk. The latter determines what fraction of the collected data is extracted and uploaded to the IoT platform.
Then why not simply reduce the acquisition frequency?
On the one hand, some sensors have upper and lower limits on their acquisition frequency, which may not match the upload requirements. On the other hand, keeping raw data at a reasonable frequency makes it easier to dig further once a problem has been found.
Therefore, our approach at the time was to write the raw collected data sequentially to the local disk, then sample it, save the samples to the local database, and upload them.
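A minimal sketch of this write path, assuming simple random sampling; the sink functions are hypothetical placeholders for the local disk writer and the upload client:

```python
import random

# Sampling rates from the text: 5% for tilt, 1% for stress.
SAMPLE_RATES = {"tilt": 0.05, "stress": 0.01}

def handle_reading(indicator, record, save_raw, upload):
    """Always persist the raw record locally; upload only a sampled subset."""
    save_raw(record)                        # sequential write to local disk
    if random.random() < SAMPLE_RATES.get(indicator, 1.0):
        upload(record)                      # send to the IoT platform

# Hypothetical sinks for illustration.
raw_log, uplink = [], []
for i in range(1000):
    handle_reading("tilt", {"seq": i, "value": 0.1}, raw_log.append, uplink.append)
print(len(raw_log), len(uplink))            # 1000 raw records, roughly 50 uploaded
```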

4. Eigenvalue extraction

In some vertical scenarios, we need to compute eigenvalues (such as mean and variance) of certain indicators, and draw conclusions about the target entity through algorithmic calculation.
For example, from the average tilt value over ten minutes I can know the current tilt status of a turbine, and from the variance I can judge how stable the tilt data is.
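A small sketch of this eigenvalue step, using Python's statistics module for the mean and variance over a window of tilt values (the window contents are made up for illustration):

```python
from statistics import mean, pvariance

def window_features(values):
    """Mean and variance of an indicator over a window (e.g. ten minutes of tilt values)."""
    return {"mean": mean(values), "variance": pvariance(values)}

tilt_window = [0.12, 0.13, 0.11, 0.14, 0.12]   # hypothetical ten-minute sample
print(window_features(tilt_window))
# The mean reflects the current tilt status; the variance reflects its stability.
```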

5. Hierarchical collection

After the above series of operations, the data has already been filtered considerably. Is there any further way to save resources (power consumption, storage, bandwidth, etc.)?
As mentioned above, even a turbine that is about to collapse will not collapse all at once. Rome wasn't built in a day. Ahem, I digress. In other words, most of the monitoring data we collect day to day is effectively useless. So, to improve resource utilization, after communicating with the business and algorithm teams I proposed the concept of acquisition levels: a turbine's sensors normally stay in a low-frequency acquisition state, and only enter the full-power state when an indicator looks suspicious.
For example, the vibration sensor cannot reduce its frequency and still sample continuously (the vibration information is hidden in continuous high-frequency data), so its acquisition cost is the highest; it is therefore set to collect continuously for 10 minutes in a random time slot every day. The tilt sensor collects data for 1 minute every hour. When a suspicious condition is detected, such as the tilt value exceeding the target threshold, the sensor enters the full-power state. If no suspicious condition appears after continuous monitoring for 1 hour, it returns to the low-frequency state.
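A sketch of the two-state acquisition controller described above; the threshold check and the one-hour calm-down window come from the text, while the class and method names are illustrative:

```python
import time

class AcquisitionController:
    """Two-level acquisition state machine: LOW_FREQ by default, FULL_POWER on suspicion."""
    CALM_DOWN_SECONDS = 3600          # one hour without suspicion -> back to low frequency

    def __init__(self, threshold):
        self.state = "LOW_FREQ"
        self.threshold = threshold     # e.g. a tilt threshold
        self.last_suspicious_at = None

    def on_reading(self, value, now=None):
        now = time.time() if now is None else now
        if value > self.threshold:
            self.state = "FULL_POWER"
            self.last_suspicious_at = now
        elif (self.state == "FULL_POWER"
              and now - self.last_suspicious_at >= self.CALM_DOWN_SECONDS):
            self.state = "LOW_FREQ"
        return self.state

ctl = AcquisitionController(threshold=5.0)
print(ctl.on_reading(6.2, now=0))        # FULL_POWER
print(ctl.on_reading(1.0, now=1800))     # still FULL_POWER (within the hour)
print(ctl.on_reading(1.0, now=3700))     # LOW_FREQ again
```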

In fact there are only two acquisition states here, low frequency and full power, but the scheme can be extended to multiple acquisition levels, and the triggering conditions for state transitions can be refined further.

Through hierarchical acquisition, the waste of system resources is greatly reduced while the system goals are still met, so why not.

4、 Data storage

For data storage, keep the characteristics of IoT data in mind: huge volume, ordered by time, and the more recent the data, the higher the access frequency. In terms of the scheme, lean towards heterogeneous storage, i.e. keeping the same data in different stores for different purposes.

1. Storage mode

Simply put, an indicator record on the cloud platform may appear in four types of storage: cache, database, statistics table, and archive table.

a. Cache

Because freshly inserted data is often displayed on monitoring screens, web pages and other front ends, its access frequency is very high. In particular, various algorithms frequently need to scan this new data.
First is the design of the cache key. Our design at the time was binding relationship (company / scenario / target / indicator) + timestamp. Such queries are quite fast.
Second is the expiration time. Ours was set between 1 hour and 1 day, decided by the access frequency and storage footprint of the corresponding indicator.
Finally, consistency. Since these records are immutable once written, there is no consistency problem.
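A minimal sketch of this cache access path, assuming the redis-py client and a local Redis instance; the key format and TTL default illustrate the "binding relationship + timestamp" design rather than the exact production code:

```python
import json
import redis  # assumes the redis-py client; any Redis client works the same way

r = redis.Redis(host="localhost", port=6379, db=0)

def cache_key(binding_id, ts):
    """Key = binding relationship (company/scenario/target/indicator) + timestamp."""
    return f"idx:{binding_id}:{ts}"

def cache_record(binding_id, ts, value, ttl_seconds=3600):
    # TTL between one hour and one day, chosen per indicator by access
    # frequency and storage budget; 3600 here is just a placeholder default.
    r.set(cache_key(binding_id, ts), json.dumps({"v": value}), ex=ttl_seconds)

def read_record(binding_id, ts):
    raw = r.get(cache_key(binding_id, ts))
    return json.loads(raw) if raw else None
```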

However, one thing to note about caching is that the indicator cache should be isolated from other business data. Our design at the time placed them in different Redis clusters.
As for clustering, availability and persistence, we will not expand on them here.

b. Database

On the one hand, the database persists the indicator records; on the other hand, it serves queries for data older than one day.
There is really not much to say about this part; everyone is familiar with MySQL.
It involves field design, index design (especially composite indexes and covering indexes), and database/table sharding (routing rule design). Although master-slave replication usually comes before sharding, read/write splitting was not really necessary in our scenario: writes far outnumber reads, and most of the read pressure falls on the cache. Our master-slave replication mainly served sharding and availability.

Field design

In terms of field design, split tables according to the normal forms as much as possible. This is not an e-commerce scenario, where tables tend to be as wide as possible; the two scenarios are completely different. For example, a record might contain ID, creation time, company, scenario, target, indicator and indicator value. By adding an indicator binding table, however, each record only needs to store ID, creation time, binding ID and indicator value. This saves a lot of storage and greatly improves query performance (one data page holds more records, which reduces overall IO cost).
In addition, if you really need to keep fields like company in the record, it is better to store them as a company code.
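A sketch of the narrow record plus binding table split, with illustrative field names; in practice these would be MySQL tables rather than Python classes:

```python
from dataclasses import dataclass

# A narrow record row plus a binding table, instead of repeating company/scenario/
# target/indicator text on every row. Field names are illustrative.

@dataclass
class IndicatorBinding:          # small dimension table, rarely written
    binding_id: int
    company_code: int            # company stored as a code, not a name
    scenario: str
    target: str
    indicator: str

@dataclass
class IndicatorRecord:           # huge fact table, written constantly
    record_id: int
    created_at: int              # epoch seconds
    binding_id: int              # reference into IndicatorBinding
    value: float

bindings = {1: IndicatorBinding(1, 1001, "wind_farm", "turbine_07", "tilt")}
row = IndicatorRecord(record_id=42, created_at=1634601600, binding_id=1, value=0.13)
print(bindings[row.binding_id].indicator, row.value)
```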

c. Statistical table

When querying data, we often want to view the past month or the past year and observe the trend.
Such data cannot be pulled from the database on every view; the server would crash.
Therefore, daily, weekly, monthly and yearly statistics tables are built in the early hours of each day, and multiple result sets can even be pre-generated by combining search conditions of other dimensions.
There are three ways to realize it (a sketch of the first approach follows the list):

  1. The application pulls data from the standby (replica) database, computes the statistics, and writes the results into the statistics table.
  2. The standby database runs scheduled tasks that compute the statistics and write the results into the statistics table.
  3. Use big data technology, such as MapReduce-style systems like ODPS, to periodically pull and aggregate the data and write the results into the statistics table.
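A sketch of the first approach, assuming hypothetical `fetch_records` and `write_stat` data-access functions; the aggregated columns are illustrative:

```python
from statistics import mean
from datetime import date, timedelta

def build_daily_stats(fetch_records, write_stat, day=None):
    """Option 1: the application pulls yesterday's records from the replica,
    aggregates per binding, and writes one row per binding into the stats table.
    fetch_records and write_stat are stand-ins for real data-access calls."""
    day = day or (date.today() - timedelta(days=1))
    by_binding = {}
    for rec in fetch_records(day):                     # e.g. all records created on `day`
        by_binding.setdefault(rec["binding_id"], []).append(rec["value"])
    for binding_id, values in by_binding.items():
        write_stat({
            "stat_day": day.isoformat(),
            "binding_id": binding_id,
            "avg_value": mean(values),
            "max_value": max(values),
            "min_value": min(values),
            "sample_count": len(values),
        })
```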

d. Archive table

For data older than a year, the query volume is very low; even when there are queries, they are usually aggregated statistics such as averages. However, as time goes on, more and more data crosses the one-year mark. Some systems therefore simply refuse detailed queries on such data. Another option is to put it in an archive database, which does not need great performance as long as it can be queried.
Our plan at the time was to move the data into a new database in units of one year. That is, at the start of each new year, the data of previous years is written into a new database as archive tables. The real-time database retains at most the latest 13 months of data.

Of course, archive tables can also be placed in a BigTable-style store such as HBase, especially once they reach a certain volume. Get-by-rowkey and scan-by-rowkey access patterns fit the archive table well. As for full table scans on HBase, forget it…

5、 Data query

In fact, many of the ideas behind data query were already mentioned in the data storage section.
Here we analyse them from two query dimensions.

1. Detailed data query

Detailed data query means querying the data records themselves, rather than statistics.

a. Short term data

Short-term data, such as data within one day, can be fetched directly from the cache.
Of course, the expiration time of some data types can be adjusted according to the actual situation.

b. Medium-term data

Medium-term data, such as data within a year, can be fetched directly from the database.
Similarly, the medium-term boundary can be adjusted, for example to half a year, according to the actual situation.

c. Long term data

Long-term data, such as data older than one year, can be fetched from the archive database (which can be implemented, for example, as a separate cold-data MySQL instance).

d. Summary

For detailed data queries, you must make sure every time range can be handled effectively.
What if a query range runs from half a year ago to a year and a half ago? One option is to query it entirely from the archive database. Another is to decompose the range, for example into half a year ago to one year ago (database) and one year ago to a year and a half ago (archive database). The former is simpler to implement; the latter gives a better user experience. Which to choose depends on the business needs.
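A sketch of the range-decomposition idea, assuming a 13-month boundary between the real-time database and the archive database (matching the retention mentioned earlier); in practice the boundary and routing would come from configuration:

```python
from datetime import datetime, timedelta

def split_query_range(start, end, now=None):
    """Split a query window at the real-time/archive boundary (13 months here);
    returns (store, start, end) tuples to be queried separately."""
    now = now or datetime.now()
    boundary = now - timedelta(days=13 * 30)
    if end <= boundary:
        return [("archive", start, end)]
    if start >= boundary:
        return [("realtime", start, end)]
    return [("archive", start, boundary), ("realtime", boundary, end)]

now = datetime(2021, 10, 19)
print(split_query_range(now - timedelta(days=540), now - timedelta(days=180), now))
```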

2. Statistical data query

Here I present several of the schemes we used at the time, along with their advantages and disadvantages.

a. Scheme I

Scheme: the application server pulls all the data in the target range directly, then samples and averages it.
Example: for 100,000 records in the target range, the application server pulls all 100,000 records, samples 10,000 of them, then averages every ten, finally obtaining 1,000 data points for the front end.
Advantages: simple to implement.
Disadvantages: as the amount of target data grows, query time grows linearly, i.e. an order of magnitude more data means an order of magnitude more query time.
PS: in daily life and work, growth is inevitable, so eliminate exponential complexity, avoid linear complexity, and pursue logarithmic complexity. In particular, many techniques can turn linear into logarithmic, such as process abstraction.
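Scheme I in miniature (the sampling parameters are illustrative):

```python
def downsample(values, keep_every=10, bucket=10):
    """Scheme I in miniature: keep every Nth record, then average each bucket."""
    sampled = values[::keep_every]                       # 100,000 -> 10,000
    return [sum(sampled[i:i + bucket]) / len(sampled[i:i + bucket])
            for i in range(0, len(sampled), bucket)]     # 10,000 -> 1,000

points = downsample(list(range(100_000)))
print(len(points))   # 1000
```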

b. Scheme II

Scheme: the application server obtains the min_id of the data at the start time and the max_id at the end time, computes count = max_id - min_id and the step pace = count / 100 (100 being the number of data points returned to the front end). From min_id and pace it derives the set of target IDs, then fetches the result set from MySQL.
Example: from the start and end time of the target range we get min_id = 2000000 and max_id = 3000000, hence count = 1000000 and pace = 10000. The target ID set is therefore 2000000, 2010000, 2020000, …, 2990000. The corresponding result set is then fetched from MySQL.
Advantages: implementation is not complex and only involves application code, and query performance does not degrade as the data volume grows.
Disadvantages: query conditions other than time cannot be added; it cannot handle discontinuously collected data (for example, if six days out of a seven-day range were not collected, this scheme cannot satisfy the query).
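A sketch of the ID-stepping calculation in Scheme II; the SQL shown in the comment is only indicative of the follow-up query:

```python
def stepped_ids(min_id, max_id, points=100):
    """Scheme II: derive ~`points` evenly spaced IDs between min_id and max_id."""
    pace = max(1, (max_id - min_id) // points)
    return list(range(min_id, max_id, pace))[:points]

ids = stepped_ids(2_000_000, 3_000_000)
print(ids[:3], ids[-1], len(ids))    # [2000000, 2010000, 2020000] 2990000 100
# The result set is then fetched with something like:
#   SELECT * FROM indicator_record WHERE id IN (<ids>);
```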

c. Scheme III

Scheme: on the basis of Scheme II, Scheme III adds filtering on other conditions (somewhat like taking the results from a MySQL secondary index and going back to the table to verify the remaining conditions).
Example: from the start and end time of the target range we get min_id = 2000000 and max_id = 3000000, hence count = 1000000 and pace = 10000; the target ID set is 2000000, 2010000, 2020000, …, 2990000. The corresponding result set is fetched from MySQL and then filtered by the other conditions.
Advantages: implementation is not complex and only involves application code; query performance does not degrade as the data volume grows; query conditions other than time can be added.
Disadvantages: the number of results cannot be guaranteed and may even be 0 (this can be mitigated with a second query, adjusting min_id and pace); it still cannot handle discontinuously collected data (for example, if six days out of a seven-day range were not collected, this scheme cannot satisfy the query).

d. Scheme IV

Scheme: on the basis of Scheme II, replace the IDs with time ranges.
From the start time min_time and end time max_time of the target range, compute the time span time_range = max_time - min_time and the time step time_pace = time_range / 100 (100 being the number of data points returned to the front end). From min_time and time_pace, derive the set of target time ranges: min_time ~ min_time + time_pace, min_time + time_pace ~ min_time + time_pace * 2, …
Example: omitted
Advantages: implementation is not complex and only involves application code, and query performance does not degrade as the data volume grows.
Disadvantages: query conditions other than time cannot be added; performance is relatively low.
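A sketch of the time-range stepping in Scheme IV, with made-up epoch timestamps:

```python
def stepped_time_ranges(min_time, max_time, points=100):
    """Scheme IV: slice [min_time, max_time) into `points` equal sub-ranges;
    each sub-range is then queried (or aggregated) separately."""
    time_pace = (max_time - min_time) / points
    return [(min_time + i * time_pace, min_time + (i + 1) * time_pace)
            for i in range(points)]

ranges = stepped_time_ranges(1_600_000_000, 1_600_086_400)   # one day, in epoch seconds
print(len(ranges), ranges[0], ranges[-1])
```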

e. Scheme V

Scheme: build a statistics table, and every early morning compute feature values such as the previous day's averages for each indicator and for multiple combinations of search conditions, to serve later queries over ranges like "the past month" or "the past year".
Advantages: common queries return results quickly.
Disadvantages: query conditions that were not pre-aggregated cannot be served at all; computing the statistics table consumes considerable resources; and because the statistics table is bound to fixed time granularities, the number of data points at the presentation layer may be inconsistent (sometimes many points, sometimes very sparse, though this can be greatly improved by automatically raising or lowering the data granularity).

6、 Summary

To sum up, the core of this data architecture is hot/cold isolation and hierarchical processing.
Before designing the data architecture, you need to understand the data. For example, in an IoT scenario writes far outnumber reads, and all writes are inserts, so there is no consistency problem; the focus of the data solution shifts accordingly.
In fact, the approach is the same as for business design, application architecture and other scenarios. Everyone has heard of management leverage, the 80/20 rule, and "putting good steel on the blade's edge". It all boils down to one sentence:
Allocate resources according to the input-output ratio, so as to achieve the highest overall output with limited resources.