What are the advantages of PPIO distributed storage in data distribution?

Time:2019-11-21

PPIO is a decentralized storage and distribution platform that helps developers make data storage cheaper, faster, and more private. The official website is https://pp.io. PPIO is not only a storage platform but also a distribution platform. We have written many articles about PPIO’s storage technology; this article focuses on its distribution technology.

What is data distribution

Distribution means delivering the same data quickly to many people while maintaining a good delivery experience. These people may be spread across many locations within a region (perhaps a whole country). Common distribution scenarios include static web pages, large file downloads, large image viewing, video on demand, and live streaming. Some other business scenarios, such as multi-party video calls and video conferences, are in essence two-way distribution.

Key technology and scenario application of data distribution

#1. CDN & P2P

Traditional distribution relies on a CDN (content delivery network), a distribution network built on top of the existing network. Its basic principle is to push data from the origin server to the server closest to the user, so that the user fetches data directly from that nearby server and gets the best experience. Relying on edge servers deployed in many locations, and on the load balancing, content distribution, and scheduling modules of a central platform, a CDN lets users fetch content from nearby, which reduces network congestion and improves response speed and cache hit rate. The key technologies of a CDN are content storage and content distribution.

Distribution is the oldest application of P2P technology. P2P networks, from the earliest Napster to the later eDonkey and BitTorrent, all deliver the same content to many people. The more people use the same content, the more nodes upload it, and the faster downloads become. This is a typical distribution scenario.

Although both are distribution technologies, P2P and CDN are implemented differently. Every distribution node in a CDN is a server, and the CDN forms a tree structure that distributes data level by level. A P2P network is different: every client can upload. While a client downloads data, it also uploads data to other clients. If every client follows this logic, the result is an ecosystem of "all for one and one for all".

The advantages of P2P over CDN are as follows:

  1. P2P downloads from multiple points at once, so it can make full use of the client’s own bandwidth and download faster. The gain is largest when the server (including the CDN node) is far from the client in network distance.
  2. With multi-point download, download or read jitter on a single node does not cause the overall download speed to fluctuate.
  3. It saves bandwidth for the resource publisher.

The disadvantages of P2P compared with CDN are as follows:

  1. P2P is complex to implement, while CDN is simple. P2P services cannot adapt to business changes as quickly as CDN services.
  2. P2P has a cold start problem: it takes some time to find other high-quality nodes. For any service that requires a quick start, P2P is less convenient than CDN.
  3. For network operators, P2P is less controllable than CDN. Operators look for ways to restrict P2P, which degrades the P2P user experience.

P2P and CDN are not mutually exclusive. P2SP technology combines the two: clients can download data from CDN nodes, from the P2P network, or from both. Services built with P2SP are also known as PCDN services.
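The P2SP idea can be sketched as a piece scheduler that splits one download between a CDN node and several peers in proportion to their measured throughput. This is only an illustration: the function name `assign_pieces`, the source labels, and the throughput figures are assumptions made for the example, not part of PPIO.

```python
def assign_pieces(num_pieces, throughput):
    """Split piece indices among sources in proportion to throughput.

    throughput: dict mapping a source name (CDN node or peer) to its
    measured speed in any consistent unit. Hypothetical helper.
    """
    total = sum(throughput.values())
    if total <= 0:
        raise ValueError("at least one source must have positive throughput")
    # Ideal (fractional) share of pieces for each source.
    shares = {s: num_pieces * t / total for s, t in throughput.items()}
    counts = {s: int(share) for s, share in shares.items()}
    # Hand out the pieces lost to rounding, largest remainder first.
    leftover = num_pieces - sum(counts.values())
    by_remainder = sorted(shares, key=lambda s: shares[s] - counts[s], reverse=True)
    for s in by_remainder[:leftover]:
        counts[s] += 1
    # Turn the counts into contiguous ranges of piece indices.
    plan, start = {}, 0
    for s in throughput:
        plan[s] = list(range(start, start + counts[s]))
        start += counts[s]
    return plan


# Example: one CDN node and two peers with assumed throughputs.
plan = assign_pieces(10, {"cdn": 5.0, "peer_a": 3.0, "peer_b": 2.0})
print(plan)
```

A real client would re-measure throughput continuously and reassign pieces when a source slows down; the static split here only shows the basic idea.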

#2. Video is the dominant distribution application

Streaming media is the heaviest user of distribution. Video on demand such as YouTube and Netflix, live video such as Hulu, and short video such as TikTok are all distribution scenarios. According to a report from October 2018, video accounts for about 58% of downstream Internet traffic. PPIO will therefore put a lot of effort into video quality of service (QoS) in its distribution technology.

#3. Distribution itself is inseparable from storage

Storage and distribution are both, in essence, about reading and using data, so they cannot be separated. When a piece of data is stored in the PPIO network and only one person reads and uses it, that is storage; when many people read or use it, that is distribution. The two scenarios simply have different designs and different service quality requirements.

PPIO technology design for distribution scenarios

The core team of PPIO previously built PPTV, once the largest Chinese P2SP video platform, with 450 million users. With nearly ten years of P2P video product experience, our team has accumulated rich technical, product, and operational practice in P2P and distribution projects. We understand the diverse and sometimes tricky requirements of distribution scenarios well. This experience lets us design a technical framework for distribution products that meets real needs. The following is PPIO’s technical design for distribution scenarios.

#1. Overlay network

PPIO supports an overlay network. Each storage node (miner) takes the storage nodes with the fastest physical connections to it as its neighbors. During data transmission and information exchange, it makes full use of these neighboring nodes, which greatly improves network efficiency.
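Neighbor selection of this kind can be sketched as picking the nodes with the lowest measured round-trip time. The article does not say which metric PPIO uses, so round-trip time, the function name `choose_neighbors`, and the sample values are all assumptions.

```python
import heapq


def choose_neighbors(rtt_ms, k=3):
    """Return the k node IDs with the lowest measured round-trip time.

    rtt_ms: dict mapping node ID -> round-trip time in milliseconds.
    """
    return heapq.nsmallest(k, rtt_ms, key=rtt_ms.get)


# Assumed measurements from one storage node to four candidates.
rtt_ms = {"node_a": 12.0, "node_b": 85.0, "node_c": 7.5, "node_d": 40.0}
neighbors = choose_neighbors(rtt_ms, k=2)
print(neighbors)
```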

#2. Optimization of streaming media transmission

As mentioned earlier, streaming media is the most important distribution application, so streaming support and quality of service (QoS) matter a great deal. PPIO implements a dedicated data-driven download algorithm for streaming media to ensure smooth playback of real-time streams.
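The article does not describe PPIO’s algorithm in detail. As a generic illustration of how a streaming-oriented downloader can differ from a plain file downloader, the sketch below fetches pieces near the playback position in order and, further ahead, fetches the rarest piece first. The function `next_piece` and the `urgent_window` parameter are hypothetical.

```python
def next_piece(playhead, have, availability, urgent_window=4):
    """Pick the next piece to request for streaming playback.

    playhead: index of the piece currently being played.
    have: set of piece indices already downloaded.
    availability: dict mapping piece index -> number of peers holding it.
    """
    missing = [p for p in availability if p >= playhead and p not in have]
    if not missing:
        return None
    urgent = [p for p in missing if p < playhead + urgent_window]
    if urgent:
        # Pieces close to the playhead are fetched in playback order.
        return min(urgent)
    # Further ahead, fetch the rarest piece first (ties: earliest piece).
    return min(missing, key=lambda p: (availability[p], p))


# Assumed swarm state: seven pieces with differing availability.
availability = {0: 5, 1: 5, 2: 5, 3: 5, 4: 3, 5: 1, 6: 2}
print(next_piece(0, {0, 1}, availability))        # an urgent piece
print(next_piece(0, {0, 1, 2, 3}, availability))  # a rare piece further ahead
```

The urgent window trades swarm health for playback smoothness: a larger window reduces stalls but spreads rare pieces more slowly.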

#3. P4P technology support

P2P generates a lot of cross-ISP traffic. Generally speaking, traffic inside an ISP’s network incurs no extra charge, but traffic between operators is billed by volume. Is there a way to keep the advantages of P2P while reducing cross-ISP traffic? That is what P4P technology does.

P4P stands for Proactive network Provider Participation for P2P. It keeps more traffic within the same Internet service provider (ISP), which reduces the load and operating cost of the backbone network and thereby improves P2P transfer performance. Unlike P2P’s random peer selection, P4P uses network topology data to select peers effectively and improve routing efficiency.
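A much simplified sketch of topology-aware peer selection follows. Real P4P relies on topology and cost information supplied by the ISP; here a plain ISP label per peer stands in for that information, and the function name `rank_peers` and the sample peers are assumptions.

```python
def rank_peers(peers, my_isp, limit=3):
    """Order candidate peers so that same-ISP peers come first.

    peers: list of (peer_id, isp, rtt_ms) tuples.
    Within each group (same ISP, other ISP) lower latency wins.
    """
    ranked = sorted(peers, key=lambda p: (p[1] != my_isp, p[2]))
    return [peer_id for peer_id, _, _ in ranked[:limit]]


# Assumed candidates: two peers inside our ISP and two outside it.
peers = [
    ("p1", "isp_b", 10),
    ("p2", "isp_a", 30),
    ("p3", "isp_a", 20),
    ("p4", "isp_b", 5),
]
print(rank_peers(peers, my_isp="isp_a"))
```

Note that the fastest peer overall ("p4") is ranked after both same-ISP peers, which is exactly the trade that reduces cross-ISP traffic.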

While working on PPTV, the PPIO team gained rich experience in dealing with operators and developed its own way of reducing cross-ISP traffic. Before P4P emerged, operators had been trying to limit the use of P2P.

#4. Adaptive scheduling of hot content

PPIO supports P2P-CDN. In P2P-CDN, adaptive scheduling of popular content is very important and is a key means of improving quality of service (QoS). When a file becomes popular in the network, the system automatically triggers the scheduling mechanism so that more storage nodes store the file. This improves the user experience and also increases the earnings of more storage nodes. Conversely, when a popular file cools down, the system adaptively reduces the number of storage nodes that store it. The result is a dynamic balance. PPIO has put a lot of effort into its popular content scheduling algorithms.
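The dynamic balance described above can be sketched as a replica target that follows demand. PPIO’s actual scheduling algorithm is not given in this article, so the per-node capacity, the replica bounds, and the function names below are assumptions chosen for illustration.

```python
import math


def target_replicas(requests_per_hour, per_node_capacity=100,
                    min_replicas=3, max_replicas=50):
    """Number of storage nodes that should hold a file at its current demand."""
    needed = math.ceil(requests_per_hour / per_node_capacity)
    return max(min_replicas, min(max_replicas, needed))


def rebalance(current_replicas, requests_per_hour):
    """Return how many replicas to add (positive) or drop (negative)."""
    return target_replicas(requests_per_hour) - current_replicas


print(rebalance(3, 1200))   # file is heating up: add replicas
print(rebalance(12, 200))   # file has cooled down: drop replicas
```

In practice the change would be applied gradually to avoid oscillation when demand hovers around a threshold.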

#5. Manual preheating mechanism

In addition to adaptive scheduling of hot content, PPIO also provides a manual preheating mechanism. What scenarios does it suit?

For example, if many people watched the previous episode of a TV series, it is a fair prediction that many will watch the next one. So when the publisher releases a new episode, it can push that episode to more miners in advance. Then, when viewers start watching the new episode, enough storage nodes are already serving it. This makes full use of the P2P network and greatly improves the viewing experience from the moment the content is released. There are many similar scenarios: any hot content that people can predict can be preheated to improve the experience during cold start.

The content publisher can pay a fee to specify the content to be preheated, along with the region, ISP, and time period for preheating. The price varies with the region, ISP, and time period. Preheating in PPIO follows essentially the same principle as decentralized storage: miners do not know whether the content will really become hot, so they charge a fee to hedge the risk. The difference is that preheating uses full copies, while storage mainly uses erasure codes. The reason is explained later in this article.

#6. Consideration of P2P Live Broadcasting

PPIO considers not only on-demand streaming but also live streaming. In essence, live streaming is the distribution of a continuous series of small files. Their life cycle is short, and they are no longer used after a while. At the same time, they must be distributed very efficiently, reaching as many nodes as possible very quickly. The overall architecture for live streaming is the same as PPIO’s streaming media system; only the file segmentation and the download algorithm differ.

There are two types of live streaming. The first is high-latency live streaming, used mainly for events, news, and the like. A single channel may have a very large audience, but viewers are not very sensitive to delay. The second is low-latency live streaming, used mainly for host-driven shows. It involves interaction with the host and therefore requires very low latency, generally within 5 seconds from the moment an action happens to the moment it appears on the viewer’s screen, but the audience is usually small.

For these two live scenarios, PPIO uses a "push first, pull second, compensate third" scheme that handles both consistently. Only the parameters differ between the two modes. PPIO’s founding team started out in P2P live streaming and built PPTV, once the world’s largest P2P live streaming platform, so its experience in live streaming is also very rich.

#7. Design of ppio pcdn

PCDN, or P2P-accelerated CDN, uses P2P technology and the bandwidth and disk resources of a large number of tenant nodes to accelerate CDN distribution. PPIO is designed to support PCDN and provides a DApp development interface, so developers can easily use the PCDN interface to speed up their content services.

Content stored by an application is first published on the publishing source node, which can keep serving downloads as long as it stays online. However, as more users download from the same source node, its bandwidth is used up and each user’s download speed drops. With PCDN, a large number of tenant nodes in the network store the same content and serve downloads of it. Users can then download from multiple nodes, which greatly improves the experience.

There are two ways to realize PCDN in the PPIO network:

  1. Use the heat-based scheduling of popular content described in #4 above. PPIO can schedule content according to its popularity: when a piece of hot content is detected, other tenants proactively start serving it, which increases the number of copies in the network. The more copies there are, the better P2P download works for end users.
  2. Specify the number of PCDN directed cache copies when the source node publishes content. DApp developers can force caching of a piece of content through the PCDN API according to their own needs. They can set which network and geographic areas to cache in (as a country, ISP, state, city quadruple) and how many copies to keep. PPIO then looks for tenant nodes within these areas to store the data and serve downloads. Because these tenants are designated through the API, the source node pays them the corresponding space-time storage fee, scheduling fee, and space-time certification fee. In this case, the data flow driven by PCDN in the network is shown in the figure above.

What are the differences in PPIO’s design for distribution and storage

PPIO is positioned for both storage and distribution. How do the two differ technically? The main points are as follows.

#1. Full copy and erasure code

Distribution and storage have different ultimate goals. Distribution focuses on delivering content quickly. In typical distribution scenarios the source node is online, so data loss is not a concern: even if a storage node accidentally loses the data, it can be fetched again from the source node. So for distribution we choose the full copy approach.

Storage is different. Above all, storage must ensure that data is not lost, that is, it must minimize the data loss rate. With full copies, a large number of copies would be needed to reach eleven nines of durability (99.999999999%). With erasure coding, the redundant space needed to reach eleven nines is much smaller, which makes it the most practical way to improve durability.
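A back-of-the-envelope comparison makes the difference concrete. The sketch below assumes that each node independently loses its data with probability `p` over some period and that no repair happens in between; the value of `p` and the choice of a 10-fragment erasure code are illustrative assumptions, not PPIO parameters.

```python
from math import comb


def replication_loss(p, copies):
    """Probability that every one of `copies` full copies is lost."""
    return p ** copies


def erasure_loss(p, k, n):
    """Loss probability for data split into n fragments, any k of which rebuild it.

    Data is lost when more than n - k fragments are lost.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(n - k + 1, n + 1))


def min_copies(p, target):
    """Smallest number of full copies whose loss probability is <= target."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    copies = 1
    while replication_loss(p, copies) > target:
        copies += 1
    return copies


def min_fragments(p, k, target):
    """Smallest n such that a k-of-n erasure code has loss probability <= target."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    n = k
    while erasure_loss(p, k, n) > target:
        n += 1
    return n


p = 0.01        # assumed chance that a node loses its data in one period
target = 1e-11  # eleven nines of durability
copies = min_copies(p, target)
n = min_fragments(p, 10, target)
print(f"full copy: {copies} copies, {copies:.1f}x storage")
print(f"erasure code: 10-of-{n}, {n / 10:.1f}x storage")
```

Under these assumptions, full copies need 6x the original space while a 10-of-17 erasure code needs 1.7x. The exact figures change with `p` and with the repair policy, but the gap in redundancy is the point.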

Put simply, distribution pursues high speed, so it mainly uses full copies; storage pursues a low data loss rate, so it mainly uses erasure coding.

#2. Memory cache

There is another big difference between distribution and storage: distribution often has a strong head effect, while storage does not.

The head effect of distribution, also known as the 80/20 rule, means that 20% of the content receives 80% of the traffic. Closer study may show that the 80/20 rule applies again within that top 20%. So in distribution we usually divide content into head, middle, and tail: traffic is highly concentrated in the head, lower in the middle, and very scattered in the tail. From a cost perspective, head and upper-middle content suits a memory cache, middle and lower-middle content suits SSDs and other high-speed storage media, and tail content suits mechanical hard disks.
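The tiering idea can be sketched by ranking content by request count and assigning a storage medium according to how much of the total traffic is already covered by more popular items. The thresholds (80% and 95%), the function name `assign_tiers`, and the sample traffic are assumptions for illustration.

```python
def assign_tiers(traffic, memory_share=0.8, ssd_share=0.95):
    """Map each content ID to a storage tier by cumulative traffic share.

    traffic: dict mapping content ID -> request count.
    Items are ranked from most to least requested. An item goes to memory
    while the more popular items before it cover less than `memory_share`
    of all traffic, to SSD while they cover less than `ssd_share`,
    and to HDD otherwise.
    """
    total = sum(traffic.values())
    tiers, covered = {}, 0
    for cid in sorted(traffic, key=traffic.get, reverse=True):
        share_before = covered / total if total else 1.0
        if share_before < memory_share:
            tiers[cid] = "memory"
        elif share_before < ssd_share:
            tiers[cid] = "ssd"
        else:
            tiers[cid] = "hdd"
        covered += traffic[cid]
    return tiers


# Assumed request counts for five pieces of content.
traffic = {"a": 700, "b": 150, "c": 90, "d": 40, "e": 20}
tiers = assign_tiers(traffic)
print(tiers)
```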

Storage has no head effect. It is all tail content, because few people store the same content. Storage can still be divided into hot, warm, and cold storage. Hot storage is data that is read often after being written. Warm storage is data that is rarely read after being written and may never be read, such as old data on a private cloud drive. Cold storage is data that is essentially never used after being written and, even when it is needed, does not have to be available promptly, such as surveillance data.

In the PPIO network, hot storage mainly uses full copies and erasure codes together. Full copies ensure suitably fast transfers, and erasure codes reduce the data loss rate to a very low level. Storage nodes that carry hot storage are advised to use SSDs or other high-speed disks. Warm and cold storage use erasure codes only; because reads are infrequent, mechanical hard disks are recommended to keep costs as low as possible.

So storage nodes that want to maximize their earnings should match their machine configuration to the services they provide.

PPIO pays more attention to distribution scenarios

Compared with other decentralized storage blockchain projects such as Filecoin and Storj, PPIO pays more attention to distribution scenarios, while the other projects focus on storage scenarios. Here is a simple comparison table to analyze the three storage chains and give the comparison information.

To sum up, these are the advantages of PPIO in the field of data distribution. If you want to know more, you are welcome to join our developer community!