When CloudQuery encounters big data

Time:2021-11-30

“There is an unimaginably huge amount of digital information in the world, and it is growing extremely fast. The impact of this flood of information is being felt everywhere, from business to science, from government to the arts. Scientists and computer engineers have coined a new term for the phenomenon: ‘big data’.”
-- Kenneth Cukier, “Data, Data Everywhere”

“Mankind is moving from the IT era to the DT (data technology) era.” In the DT era, people can collect more data than ever before. Data is transforming our lives and driving the growth of the big data industry, but its rapid growth has also brought severe data processing problems.

In the era of big data, traditional software can no longer process and mine the information held in such large volumes of data. The turning point was Google’s “troika” of papers. Around 2004, Google successively published papers on the distributed file system GFS (Google File System), the distributed computing framework MapReduce and the NoSQL database Bigtable. These three papers laid the cornerstone of big data technology.

As big data technologies continued to develop, a big data ecosystem gradually formed around open source. Because programming directly against MapReduce is cumbersome, Facebook contributed Hive and its HiveQL syntax, which greatly simplified data analysis and data mining. Engines built for searching data content, such as Elasticsearch and Splunk, also took the stage; they are mainly used for real-time processing and analysis of massive data.

As a data control platform, CloudQuery plans to support all types of data sources on its roadmap. The 1.4 iteration will add Hive and Elasticsearch, the two data sources most requested by users.

Hive

Speaking of Hive, we have to mention Hadoop. Hadoop is best seen as a complement to existing database systems. It gives users practically unlimited space for data storage, and it is good at storing arbitrary, semi-structured or even unstructured data. It lets users store data now and decide how to use it later, and it is optimized for large file storage, batch access and streaming access.

This makes batch data analysis simple and fast, but users also need to access the final data once the analysis is done. That requirement calls for random access rather than batch access. In database terms, the two modes correspond to a full table scan and an index lookup, respectively.

Hive is a data warehouse framework built on Hadoop. It grew out of Facebook’s need to manage and (machine) learn from the massive volume of social network data it produces every day. Hive was designed so that analysts who are proficient in SQL but have relatively weak Java programming skills could run queries on the large-scale data sets Facebook stores in HDFS. Today, Hive is a successful Apache project, and many organizations use it as a general-purpose, scalable data processing platform.

As one of the mainstream query engines on Hadoop, Hive supports using SQL to read, write and manage large-scale data sets. When connecting to Hive data sources, CloudQuery first considers query performance on large data volumes, and limits the amount of data returned each time to what can be displayed in the current viewport. Secondly, in big data systems and data warehouses, data is usually stored in wide tables to make analysis easier, so the result view can be switched between display modes, including a list format and a single format. The list format provides a batch preview of the data, while the single format displays the details of one wide-table row with its columns laid out as a list of fields.
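The two ideas in the paragraph above, returning a bounded number of rows at a time and switching between a list view and a single-record view, can be sketched as follows. This is only an illustration, not CloudQuery's implementation: the function names and the `wide_events` table are hypothetical, and an in-memory SQLite database stands in for Hive so the sketch runs anywhere. Any driver that follows Python's DB-API cursor interface could be used in the same way.

```python
import sqlite3
from typing import Any, Dict, List, Tuple


def fetch_page(cursor, page_size: int) -> Tuple[List[str], List[tuple]]:
    """Fetch at most page_size rows from an already-executed DB-API cursor."""
    columns = [desc[0] for desc in cursor.description]
    rows = cursor.fetchmany(page_size)
    return columns, rows


def as_list_view(columns: List[str], rows: List[tuple]) -> List[Dict[str, Any]]:
    """List format: one dict per row, suitable for a batch preview grid."""
    return [dict(zip(columns, row)) for row in rows]


def as_single_view(columns: List[str], row: tuple) -> List[Tuple[str, Any]]:
    """Single format: one wide row turned into (column, value) pairs."""
    return list(zip(columns, row))


def make_demo_connection() -> sqlite3.Connection:
    """Hypothetical demo data; SQLite is only a stand-in for a Hive table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wide_events (id INTEGER, user_name TEXT, city TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO wide_events VALUES (?, ?, ?, ?)",
        [(i, f"user{i}", "Hangzhou", i * 1.5) for i in range(1, 6)],
    )
    return conn


if __name__ == "__main__":
    cur = make_demo_connection().execute("SELECT * FROM wide_events ORDER BY id")
    cols, page = fetch_page(cur, page_size=2)
    print(as_list_view(cols, page))       # batch preview of the first page
    print(as_single_view(cols, page[0]))  # details of one wide row
```

Calling `fetch_page` again on the same cursor returns the next page, so the client never holds more than one page of a large result set in memory.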

Early versions of Hive only supported querying and loading data, but later versions added insert, update, delete and a streaming API. Therefore, CloudQuery not only covers data operations and permission control, but also takes the database’s native operating characteristics into account and adds support for a variety of APIs. Synchronization supports both partitioned and bucketed tables. A partitioned table stores each partition under its own storage path, which produces multiple data files. A bucketed table divides the data into several parts that are easier to manage.
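The difference between partitioning and bucketing can be illustrated with a small sketch. It assumes the `column=value` directory naming that Hive uses for partitions. The bucket function here uses CRC32 purely as a deterministic stand-in, since Hive applies its own hash function. All names (`partition_path`, `bucket_for`, `layout`, the sample paths) are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def partition_path(table_root: str, partition: Dict[str, Any]) -> str:
    """Each partition gets its own directory, named column=value."""
    parts = [f"{col}={val}" for col, val in partition.items()]
    return "/".join([table_root.rstrip("/")] + parts)


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket by hashing (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def layout(
    rows: List[Dict[str, Any]],
    table_root: str,
    partition_col: str,
    bucket_col: str,
    num_buckets: int,
) -> Dict[Tuple[str, int], List[Dict[str, Any]]]:
    """Group rows by (partition directory, bucket number)."""
    files: Dict[Tuple[str, int], List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        directory = partition_path(table_root, {partition_col: row[partition_col]})
        bucket = bucket_for(str(row[bucket_col]), num_buckets)
        files[(directory, bucket)].append(row)
    return dict(files)


if __name__ == "__main__":
    sample = [
        {"dt": "2021-11-29", "user_id": "u1", "amount": 10},
        {"dt": "2021-11-30", "user_id": "u2", "amount": 20},
        {"dt": "2021-11-30", "user_id": "u3", "amount": 30},
    ]
    for (directory, bucket), members in layout(
        sample, "/warehouse/orders", "dt", "user_id", 4
    ).items():
        print(directory, "bucket", bucket, "->", len(members), "row(s)")
```

Partitioning decides which directory a row lands in, so a query that filters on the partition column can skip whole directories. Bucketing then splits the rows inside a partition into a fixed number of parts by hash.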

Elasticsearch

Unlike Hive, Elasticsearch is a search engine built for searching data content. As a standalone search server, Elasticsearch provides very convenient search functions. Users do not need to care about the details of the underlying Lucene library at all. They can add, delete, modify and query index data using only a standard RESTful API over HTTP. Data is sent and returned in JSON format, which makes it easy to understand and express domain data in a document-oriented and object-oriented way.
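As a rough illustration of the HTTP + JSON interface described above, the sketch below builds (but does not send) the requests for adding, reading, updating, deleting and searching documents. The paths follow the typeless document API of Elasticsearch 7.x and later; the base URL, index name and document fields are assumptions made up for the example.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local, unsecured node


def es_request(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Request:
    """Build an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def index_doc(index: str, doc_id: str, doc: Dict[str, Any]) -> Request:
    return es_request("PUT", f"/{index}/_doc/{doc_id}", doc)  # add or replace


def get_doc(index: str, doc_id: str) -> Request:
    return es_request("GET", f"/{index}/_doc/{doc_id}")  # read by id


def update_doc(index: str, doc_id: str, partial: Dict[str, Any]) -> Request:
    return es_request("POST", f"/{index}/_update/{doc_id}", {"doc": partial})


def delete_doc(index: str, doc_id: str) -> Request:
    return es_request("DELETE", f"/{index}/_doc/{doc_id}")


def search(index: str, field: str, text: str) -> Request:
    # A simple full-text "match" query on one field
    return es_request("POST", f"/{index}/_search", {"query": {"match": {field: text}}})


if __name__ == "__main__":
    req = index_doc("articles", "1", {"title": "When CloudQuery encounters big data"})
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To actually send one of these requests, it could be passed to `urllib.request.urlopen`; a secured cluster would additionally need authentication headers, which are omitted here.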

At the same time, Elasticsearch implements a distributed Lucene directory based on shards and replicas. Borrowing the idea of MapReduce, Elasticsearch uses a simple strategy of distributing search requests to the shards and merging their results, which neatly addresses the problems of massive indexes and distributed high availability.
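The distribute-then-merge strategy can be sketched in a few lines. This is a toy model rather than Elasticsearch's actual algorithm: each shard is a plain dictionary, the score is a simple term count instead of Lucene's relevance scoring, and replicas are not modeled. The function names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Shard = Dict[str, str]  # doc_id -> text


def search_shard(shard: Shard, term: str, k: int) -> List[Tuple[int, str]]:
    """Local phase: each shard returns only its own top-k (score, doc_id) hits."""
    needle = term.lower()
    scored = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(needle)  # toy score: term frequency
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def scatter_gather(shards: List[Shard], term: str, k: int) -> List[Tuple[str, int]]:
    """Send the query to every shard, then merge the partial results."""
    partials = [search_shard(shard, term, k) for shard in shards]  # distribute
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))  # merge
    return [(doc_id, score) for score, doc_id in merged]


if __name__ == "__main__":
    demo_shards = [
        {"a": "big data big", "b": "data"},
        {"c": "big big big", "d": "none"},
    ]
    print(scatter_gather(demo_shards, "big", 2))
```

Because every shard returns at most `k` hits, the coordinating step merges a small number of candidates no matter how large the whole index is.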

Today, Elasticsearch is the leading product in the search engine market. In the DB-Engines ranking of search engines, Elasticsearch holds first place, well ahead of the runner-up.


As mentioned above, the main difference between ES and the mainstream databases on the market is that it did not start out as a database at all; it first came to public attention as a search engine. Later, as the underlying technologies matured and spread, full-text retrieval, data analysis and distributed technology were combined to form the ES we know today. As a result, it offers both distributed scalability and fast queries.

Elasticsearch is document oriented, which means that it can store an entire object or document. Given the particular nature of the data stored in ES, CloudQuery organizes it into indexes and documents, and can index, search, sort and filter documents. This way of understanding data is completely different from the traditional two-dimensional table, and it is one of the reasons why Elasticsearch can perform complex full-text searches.

For presentation, we chose the most widely used format, JSON. Real data varies a great deal, and few objects in an application are just a simple list of keys and values. More often they are complex data structures that contain dates, geographical locations, sub-objects and even arrays. Although almost every language has modules that convert arbitrary data structures into JSON, each language handles the details differently, so CloudQuery also covers compatibility across languages and object types, giving priority to serialization and deserialization for mainstream languages and objects.
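To show what such a complex document looks like in practice, here is a minimal sketch of serializing and deserializing an object that contains a date, a geographical location, a sub-object and an array. The document fields and helper names are hypothetical. Python's standard `json` module cannot encode dates by itself, which is exactly the kind of per-language detail mentioned above, so the sketch converts them to ISO 8601 strings.

```python
import json
from datetime import date, datetime
from typing import Any, Dict


def to_jsonable(value: Any) -> Any:
    """Fallback used by json.dumps for types it cannot encode natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()  # dates become ISO 8601 strings
    if isinstance(value, set):
        return sorted(value)      # sets become sorted arrays
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def serialize(doc: Dict[str, Any]) -> str:
    return json.dumps(doc, default=to_jsonable, ensure_ascii=False, sort_keys=True)


def deserialize(text: str) -> Dict[str, Any]:
    return json.loads(text)


if __name__ == "__main__":
    document = {
        "title": "When CloudQuery encounters big data",
        "published": datetime(2021, 11, 30, 8, 0),
        "location": {"lat": 30.27, "lon": 120.15},       # geographical location
        "author": {"name": "editor", "team": "docs"},    # sub-object
        "tags": {"hive", "elasticsearch"},               # becomes an array
    }
    text = serialize(document)
    print(text)
    print(deserialize(text)["published"])
```

Note that the round trip is not symmetric: after `deserialize`, the date comes back as a string and the set as a list, so a client has to know which fields to convert back.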

The acceleration of “new infrastructure” has created favorable conditions and huge development opportunities for the digital economy. The market will further embrace cloud, big data and business intelligence. The accelerated integration of cloud, data and intelligence will help enterprises maximize the value of their data and efficiently complete the transformation to, and rollout of, industrial intelligence.

Official website address: https://cloudquery.club/
