If you’ve ever used Uber, you know how easy it is to operate: you request a car with one tap, the car comes to you, and payment completes automatically. The whole experience flows smoothly. Behind this simple process, however, sits complex big data infrastructure such as Hadoop and Spark.
Uber occupies an enviable place at the crossroads of the physical and the digital world, coordinating hundreds of thousands of drivers who move through cities every day. On one level this is a relatively simple data problem. But, as Aaron Schildkrout, head of Uber’s data team, says, the simplicity of the business plan gives Uber a huge opportunity to use data to optimize its services.
“It’s essentially a data problem,” Schildkrout said in a recent talk transcript published by Uber and Databricks. “Because the goal is so clear, we want to automate the ride experience. To some extent, we are trying to provide intelligent, automated, real-time service for passengers and drivers around the world, and to support that service at scale.”
Whether it is Uber’s surge pricing at peak hours, helping drivers avoid accidents, or finding the most profitable positions for drivers, all of these services rely on data. These data problems are a genuine blend of mathematics and global destination prediction. “This makes the data here very exciting, and drives us to use Spark to solve these problems,” he said.
Uber’s approach to big data
In the Databricks presentation, Uber engineers described (apparently for the first time) some of the challenges the company faced in scaling its applications and meeting demand.
Vinoth Chandar, head of Uber’s data architecture, put it simply: Spark is “a must-have.”
In the old architecture, Uber relied on Kafka data streams to ship large volumes of log data to Amazon S3, then used EMR to process the data. The results were then loaded from EMR into a relational database for use by internal users and city operations managers.
“The original Celery + Python ETL architecture worked well, but Uber hit bottlenecks when it wanted to scale up,” Chandar said. As Uber expanded into more and more cities, data volumes kept growing, and the existing system ran into a series of problems, especially in the batch process for uploading data.
Uber needs to guarantee the accuracy of trip data, one of its most important data sets, on which hundreds of downstream users and applications depend. “This system was not designed for multiple data centers,” Chandar said. “We needed a series of merging steps to bring the data into a single data center.”
The solution that evolved was a Spark-based streaming I/O architecture to replace the previous Celery/Python ETL architecture. The new system decouples the raw data from the relational data warehouse table model. “You can land data on HDFS and then rely on tools like Spark to handle large-scale data processing,” Chandar said.
Instead, trip data is aggregated from multiple distributed data centers into a relational model. The company’s new architecture uses Kafka to stream real-time logs from local data centers and load them into a centralized Hadoop cluster. The system then uses Spark SQL to convert the unstructured JSON into more structured Parquet files, which Hive can use for SQL analysis.
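The JSON-to-Parquet conversion step can be sketched in plain Python. This is a toy stand-in for the Spark SQL job described above: the field names (`trip_id`, `city`, `fare`) are assumptions, not Uber’s actual schema, and the real pipeline writes Parquet files rather than Python dicts.

```python
import json

# Assumed trip-log schema; Uber's real field names are not public.
TRIP_SCHEMA = {"trip_id": str, "city": str, "fare": float}

def structure_record(raw_line):
    """Parse one raw JSON log line and project it onto a fixed schema,
    coercing each field to its declared type. This mimics what the
    Spark SQL job does before writing Parquet for Hive: extra
    unstructured fields are dropped, and types are normalized."""
    obj = json.loads(raw_line)
    return {field: cast(obj[field]) for field, cast in TRIP_SCHEMA.items()}

# A raw log line with a stray unstructured field and a string-typed fare.
raw = '{"trip_id": "t-1", "city": "SF", "fare": "12.50", "debug_blob": "x"}'
record = structure_record(raw)
# record == {"trip_id": "t-1", "city": "SF", "fare": 12.5}
```

The point of the fixed schema is that downstream Hive queries can rely on column names and types being stable, regardless of what extra debris appears in the raw logs.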
“This solved a series of additional problems we had encountered, and we are now at a point where Spark and Spark Streaming keep the system stable over long periods,” he said. “We also plan to use Spark jobs, Hive, machine learning, and all the other interesting components to fully unlock Spark’s potential, from ingesting to accessing the raw data.”
Paricon and Komondor
After Chandar gave an overview of Uber’s adoption of Spark, two other Uber engineers, Kelvin Chu and Reza Shiftehfar, provided more detail on Paricon and Komondor, the two core projects of Uber’s move to Spark.
Although unstructured data is easy to handle, Uber still needs to produce structured data through its data pipelines, because the schema acts as a “contract” between data producers and data consumers that effectively guards against “data breakage.”
That is where Paricon comes into the picture, Chu said. The tool consists of four Spark-based tasks: transfer, infer, transform, and validate. “So whoever wants to change the data structure comes into the system and has to use the tools we provide to make the change. The system then runs multiple tests to make sure nothing breaks.”
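A minimal sketch of what the “infer” and “validate” stages might do, assuming the contract is simply a mapping of field names to types. The function names and logic here are illustrative assumptions, not Uber’s actual Paricon code.

```python
def infer_schema(records):
    """Infer a simple field -> type schema from sample records
    (a toy version of Paricon's 'infer' stage)."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, type(value))
    return schema

def validate(records, schema):
    """Enforce the producer/consumer 'contract': every record must
    carry exactly the schema's fields, with the expected types
    (a toy version of Paricon's 'validate' stage)."""
    for rec in records:
        if set(rec) != set(schema):
            return False
        if any(not isinstance(rec[f], t) for f, t in schema.items()):
            return False
    return True

schema = infer_schema([{"trip_id": "t-1", "fare": 12.5}])
ok = validate([{"trip_id": "t-2", "fare": 9.0}], schema)   # conforms
bad = validate([{"trip_id": "t-3"}], schema)               # missing "fare"
```

In this framing, a producer who wants to change the schema must re-run inference and get all consumers’ validation tests to pass, which is the “contract” the article describes.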
One of the highlights of Paricon is so-called “column pruning.” “We have many wide tables, but we usually don’t use all the columns at once, so pruning effectively saves the system I/O,” he said. Paricon can also handle some “data stitching” work. Some Uber data files are very large, but most are smaller than an HDFS block, so the company stitches these small files together to align with the HDFS block size and avoid scattered I/O operations. In addition, Spark’s schema-merging capabilities help Uber process its data in an intuitive, simplified way within the Paricon workflow.
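The “stitching” idea, packing many small files together so each output stays within one HDFS block, can be sketched as a first-fit-decreasing bin-packing pass. The block size and file sizes below are illustrative numbers, not Uber’s configuration.

```python
HDFS_BLOCK = 128 * 1024 * 1024  # 128 MB, a common HDFS block size

def stitch(file_sizes, block=HDFS_BLOCK):
    """Group small files so each group's total size fits in one HDFS
    block: for each file (largest first), place it into the first
    group with room, else start a new group. Fewer output files means
    fewer scattered I/O operations on read."""
    groups = []
    for size in sorted(file_sizes, reverse=True):
        for g in groups:
            if sum(g) + size <= block:
                g.append(size)
                break
        else:
            groups.append([size])
    return groups

MB = 1024 * 1024
groups = stitch([100 * MB, 60 * MB, 50 * MB, 20 * MB])
# Four small files collapse into two block-aligned groups:
# [100 MB, 20 MB] and [60 MB, 50 MB].
```

Real stitching would concatenate the file contents; the sketch only shows the grouping decision that keeps outputs near the block size.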
Meanwhile, Shiftehfar provided architecture-level details of Komondor, the Spark Streaming-based data ingestion service. In Komondor, raw unstructured data is the ingredient to be “cooked”: it flows from Kafka into HDFS, where it sits ready to be consumed by downstream applications.
Before Komondor, each individual application was responsible for ensuring data accuracy on its own (including de-duplicating the upstream data it processed) and for back-filling data when necessary. Komondor now handles this more or less automatically; when users need to reload data, Spark Streaming makes the job much easier.
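The centralized-ingestion idea, one service tracking what has already been loaded so that every application does not have to re-implement de-duplication and back-fill, might be sketched like this. The class name and the per-topic offset bookkeeping are illustrative assumptions about how such a service could work, not Komondor’s real design.

```python
class Ingestor:
    """Toy Komondor-like ingester: remembers the last offset loaded
    per Kafka topic, so replays and back-fills automatically skip
    data that has already landed in the store (a stand-in for HDFS)."""

    def __init__(self):
        self.checkpoints = {}  # topic -> count of messages ingested
        self.store = []        # stands in for files on HDFS

    def ingest(self, topic, messages):
        start = self.checkpoints.get(topic, 0)
        new = messages[start:]           # skip already-loaded messages
        self.store.extend(new)
        self.checkpoints[topic] = start + len(new)
        return len(new)                  # how many messages were new

ing = Ingestor()
first = ing.ingest("trips", ["m0", "m1", "m2"])   # all 3 are new
replay = ing.ingest("trips", ["m0", "m1", "m2"])  # replay: 0 new, deduped
```

The benefit described in the article falls out of this shape: because the checkpoint lives in one place, no downstream application needs its own deduplication logic.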
To handle millions of events and requests every day, Uber is investing heavily in Spark and intends to leverage more of the Spark stack, including machine learning with the MLlib library and graph computation with GraphX. For more details, watch the full video of the talk below.
By Alex Woodie. Translator: Harry Zhu
For a better reading experience, you can visit the translation directly: https://segmentfault.com/a/1190000005174590
In the spirit of Sharism, all text and pictures published on the Internet here are under a CC license. Please keep the author’s information and credit the author’s column: https://segmentfault.com/blog/harryprince. If source code is involved, please credit the GitHub address: https://github.com/harryprince. WeChat: harryzhustudio
For commercial use, please contact the author.