How Uber's Data Team Optimizes Its Big Data Architecture


If you've ever used Uber, you know how easy it is to use: you call a car with one tap, the car comes to you, and payment completes automatically — the whole process flows. Behind this simple experience, however, sits a complex big data infrastructure built on systems such as Hadoop and Spark.

Uber occupies an enviable position at the crossroads of the physical and digital worlds, one that sends hundreds of thousands of drivers into cities every day. On the surface, this looks like a relatively simple data problem. But as Aaron Schildkrout, Uber's head of data, put it: the simplicity of the business plan gives Uber a huge opportunity to optimize its services with data.

"It's essentially a data problem," Schildkrout said recently in a presentation Uber gave with Databricks. "Because the business is so simple, we want to automate the car experience. To some extent, we are trying to provide intelligent, automated, real-time service to riders and drivers all over the world, and to support that service at scale."

Whether it's surge pricing at peak hours, helping drivers avoid accidents, or finding the most profitable locations for drivers, all of Uber's computing services depend on data. These data problems are a genuine blend of mathematics and global destination prediction. "It makes the data here very exciting, and it drives us to attack these problems with Spark," he said.


Uber’s big data

In the talk at Databricks, Uber engineers described — apparently for the first time in public — some of the challenges the company faces in scaling its applications and meeting demand.

Vinoth Chandar, head of Uber's data architecture, said Spark has already become an "essential tool" for the company.
Under the old architecture, Uber relied on Kafka data streams to ship large volumes of log data to Amazon S3, then used EMR to process it. The processed data was then imported from EMR into a relational database that internal users and city managers could query.

Chandar said: "The original Celery + Python ETL architecture actually worked quite well, but Uber hit bottlenecks when it tried to scale." As the company expanded into more and more cities, data volumes grew with it, and the existing system ran into a series of problems, particularly in the batch upload of data.

Uber needs to guarantee trip data, one of its most important data sets: hundreds of real, money-changing records whose accuracy affects downstream users and applications. "This system was not originally designed for multiple data centers," Chandar said. "We needed a series of workarounds to consolidate the data into a single data center."

The solution evolved into a Spark-based streaming IO architecture that replaced the previous Celery/Python ETL architecture. The new system decouples raw data ingestion from the relational warehouse table model. "You can get the data onto HDFS and then rely on tools like Spark to handle large-scale data processing," Chandar said.

So instead of aggregating trip data from multiple distributed data centers into a relational model, the company's new architecture uses Kafka to stream real-time event logs from local data centers into a centralized Hadoop cluster. The system then uses Spark SQL to convert the unstructured JSON into more structured Parquet files, which can be analyzed with SQL via Hive.
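The article does not show Uber's actual job, but the core JSON-to-columnar step can be sketched in plain Python (in the real pipeline this would be a Spark SQL job reading from Kafka-fed HDFS and writing Parquet; the field names below are hypothetical):

```python
import json

# Hypothetical raw events as they might arrive from Kafka: semi-structured
# JSON where nested fields may be missing on some records.
raw_events = [
    '{"trip_id": 1, "city": "SF", "fare": {"amount": 12.5, "currency": "USD"}}',
    '{"trip_id": 2, "city": "NYC"}',  # no fare recorded yet
]

# Target flat schema, analogous to the Parquet columns Hive would query.
COLUMNS = ["trip_id", "city", "fare_amount", "fare_currency"]

def to_row(line):
    """Flatten one JSON event into a fixed-schema row (None for absent fields)."""
    ev = json.loads(line)
    fare = ev.get("fare") or {}
    return {
        "trip_id": ev.get("trip_id"),
        "city": ev.get("city"),
        "fare_amount": fare.get("amount"),
        "fare_currency": fare.get("currency"),
    }

rows = [to_row(e) for e in raw_events]
```

The key design point is the same as in Uber's pipeline: every record comes out with the same columns, so downstream SQL consumers never encounter a shape they don't expect.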

"This solved a series of additional problems we had encountered, and we are now at a point where Spark and Spark Streaming keep the system stable over the long term," he said. "We also plan to use Spark jobs, Hive, machine learning and all the other interesting components to fully unlock Spark's potential, from ingesting raw data to accessing it."


Paricon and Komondor

After Chandar gave an overview of Uber's adoption of Spark, two other Uber engineers, Kelvin Chu and Reza Shiftehfar, provided more details on Paricon and Komondor — the two core projects in Uber's move to Spark.


Although unstructured data is easy to handle, Uber ultimately needs its data pipelines to produce structured data, because the schema "contract" that structured data establishes between data producers and data consumers effectively prevents "data breakage."

That's where Paricon enters the picture, Chu said. The tool consists of four Spark-based tasks: transfer, infer, transform, and verify. "So anyone who wants to change the data schema has to go through the system and use the tools we provide to make the change. The system then runs multiple validations and tests to make sure the change causes no problems."
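The talk names the four stages but not their internals. A minimal sketch of how such a pipeline could fit together — with all logic beyond the stage names invented for illustration — might look like:

```python
# Hypothetical sketch of a four-stage pipeline in the spirit of Paricon's
# transfer / infer / transform / verify tasks. Everything beyond the stage
# names is an assumption for illustration.

def transfer(source):
    """Stage 1: pull raw records from the source (here, an in-memory list)."""
    return list(source)

def infer(records):
    """Stage 2: infer a schema -- the union of keys seen across records."""
    schema = set()
    for r in records:
        schema.update(r)
    return sorted(schema)

def transform(records, schema):
    """Stage 3: conform every record to the schema, filling gaps with None."""
    return [{k: r.get(k) for k in schema} for r in records]

def verify(records, schema):
    """Stage 4: validate that each record honors the schema contract."""
    return all(sorted(r) == schema for r in records)

raw = [{"trip_id": 1, "city": "SF"}, {"trip_id": 2}]
schema = infer(transfer(raw))
conformed = transform(raw, schema)
```

The verify stage is what enforces the producer/consumer "contract" described above: a schema change that breaks conformance is caught before it reaches downstream users.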

One of Paricon's highlights is so-called column pruning. "We have many wide tables, but we usually don't use all the columns each time, so pruning effectively saves system IO," Chu said. "Paricon can also handle some data stitching. Some of Uber's data files are large, but most are smaller than an HDFS block, so our company stitches these small files together to align with the HDFS block size and avoid IO inefficiency. In addition, Spark's schema-merging feature helps us process Uber data in an intuitive, simplified way with Paricon's workflow tools."
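The stitching idea — packing many sub-block files into outputs that approach the HDFS block size — can be illustrated with a simple greedy grouping (the 128 MB block size and file sizes below are illustrative assumptions, not figures from the talk):

```python
# Illustrative sketch of "stitching": greedily pack small files into groups
# that approach the HDFS block size, so each output file fills a block
# instead of wasting IO on many tiny files. Sizes are in MB.

HDFS_BLOCK_MB = 128

def stitch(file_sizes_mb, block_mb=HDFS_BLOCK_MB):
    """Greedy grouping of files, largest first, up to the block size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > block_mb:
            groups.append(current)          # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 30 MB files become 3 stitched outputs instead of 10 tiny ones.
groups = stitch([30] * 10)
```

Fewer, block-aligned files mean fewer NameNode entries and fewer seeks per scan — which is exactly the IO saving Chu describes.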

Meanwhile, Shiftehfar provided architecture-level details on Komondor, the data ingestion service built on Spark Streaming. Komondor "prepares" the raw data: unstructured data flows from Kafka into HDFS, where it is ready to be consumed by downstream applications.


Before Komondor, each individual application had to ensure data accuracy on its own — including fetching the upstream data it was processing — and back data up when necessary. Now Komondor handles this more or less automatically; when users need to load data, Spark Streaming makes it much easier.
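One reason a centralized ingestion service can relieve applications of this burden is idempotent, checkpointed loading: replayed events after a restart don't produce duplicates. The talk doesn't detail Komondor's internals, so the following is purely an illustrative sketch of that pattern:

```python
# Hypothetical sketch of idempotent micro-batch ingestion. Event ids,
# payloads, and the in-memory "store" (standing in for HDFS) are all
# invented for illustration.

class Ingestor:
    def __init__(self):
        self.store = {}    # stands in for files landed on HDFS
        self.seen = set()  # checkpoint of already-ingested event ids

    def ingest(self, batch):
        """Load a micro-batch, skipping events already ingested."""
        loaded = 0
        for event_id, payload in batch:
            if event_id in self.seen:
                continue   # replayed event after a restart: safe to skip
            self.store[event_id] = payload
            self.seen.add(event_id)
            loaded += 1
        return loaded

ing = Ingestor()
first = ing.ingest([(1, "trip-a"), (2, "trip-b")])
replay = ing.ingest([(2, "trip-b"), (3, "trip-c")])  # overlapping batch
```

Because deduplication happens once at the ingestion layer, downstream consumers can treat everything on HDFS as clean rather than re-verifying upstream data themselves.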


To handle millions of events and requests every day, Uber is investing heavily in Spark and plans to leverage more of the Spark stack, including machine learning with the MLlib library and graph computation with GraphX. For more details, watch the full video of the talk below.

Spark and Spark Streaming at Uber



Author: Alex Woodie. Translator: Harry Zhu.
For a better reading experience, you can visit the translation page directly:
Following Sharism, all text and images published online are under a CC license; please retain the author's information when sharing. Where source code is involved, please credit the GitHub repository of Harry Zhu (WeChat: harryzhustudio).
For commercial use, please contact the author.