In 10 hours, I shortened the running time of a Spark script from 15 hours to 12 minutes!


On Monday I was stuck on a problem and wrote an article, “How to fetch a specific row from Spark’s DataFrame”, in which I conjectured several possible solutions.

I didn’t expect to face this problem again so soon. I’ll use an example that even children can understand to describe what I’m doing.

A simple, vivid example

Suppose a primary school has several classes, and within each class the children must be ranked by height and the ranking recorded.

The problem is that the whole school has only one ruler for measuring height, and because the children are unruly (along with various other factors), measuring, sorting and recording heights must all be done inside one classroom. Classes that have not been measured yet wait on the playground.

The most troublesome part for the teachers is getting the children into the classroom. Measuring, recording and sorting take only a few minutes; it is just the act of bringing the children into the classroom that costs the teachers a herculean effort and is particularly time-consuming.

The good news is that organizing 100 classes into the classroom at the same time takes about as long as organizing a single class. So, generally speaking, teachers simply call all the students into the classroom at once.

But I face a difficult situation. On my playground there are 2200 classes with 160000 children in each class. My classroom is big too, but it certainly can’t hold 2200 × 160000 ≈ 350 million children.

So I thought: I’ll process one class at a time, which is the most intuitive approach and the easiest to manage.

“Come on, class one, come into the classroom!” Calling them in took more than ten minutes… Measuring, ranking and recording took a few dozen seconds… “OK! Class one, out you go! Class two, come in!”

And so on, back and forth. By the time the 2200th class was done, nearly a month had passed.

A little voice inside me spoke up: why not just call them all in at once? After all, the premise was already stated: “organizing one class into the classroom takes about the same time as organizing 100 classes into the classroom at the same time.”

It made sense. So that is what I spent this morning doing: making the classroom bigger.

I invited people from the Land Bureau, engineers and construction teams, and tried various methods. Every time I did my best to build a classroom that could hold 500 million people, it collapsed for one reason or another.

Alas! By my calculations it should be buildable in theory!

I wasn’t willing to give up, so I kept trying, over and over, and a few more hours went by.

Then another little voice spoke up: don’t rebuild the classroom. Just divide the children into groups and call several classes into the classroom at a time!

It made sense, but changing part of the original management logic took real time, and debugging took a lot of time as well.

I initially set 100 classes as one group to enter the classroom together:

  • Originally I had to “call the children into the classroom” 2200 times (once per class)
  • Now I only have to “call the children into the classroom” 22 times. Look, that is 100 times fewer calls!
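The number of “call the children into the classroom” operations is just a ceiling division. Here is a minimal sketch in Python; the function name `calls_needed` is made up for illustration and is not from my actual script.

```python
import math


def calls_needed(num_classes: int, group_size: int) -> int:
    """Number of 'call the children into the classroom' operations
    when group_size classes enter together (hypothetical helper)."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return math.ceil(num_classes / group_size)


one_by_one = calls_needed(2200, 1)   # every class enters alone
grouped = calls_needed(2200, 100)    # 100 classes enter together
print(one_by_one, grouped, one_by_one // grouped)
```

Fewer calls only means a proportional speed-up if each call costs about the same regardless of group size, which is exactly the premise of the story.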

Comparative Interpretation

The above is actually a simplified version of what I do, in which:

  • The “classroom” is the computer’s “memory”. Data has to be brought into memory before it can be sorted
  • “Entering the classroom” is the computer’s “IO operation”. Memory is expensive: typical computers have 8 GB or 16 GB, while hard disks are relatively cheap, at 256 GB, 512 GB or even a few TB. So data generally lives on the hard disk and is read into memory when it is needed. This reading process is the “IO operation”
  • Compared with computation, an “IO operation” is quite time-consuming

The following is an excerpt from my work log (anonymized version):

At first, I called each class into the classroom on its own, which was very time-consuming.

The run started at 9:30 on the morning of July 19 and ended at 0:23 after midnight on July 20. There are 2200 columns, each with 160000 values, and every column has to be sorted. IO is also involved, and the whole run took about 15 hours. The time consists of the IO time plus the processing time for each column:

$$\alpha \times time_{\text{IO}} \times column + \beta \times row \log_2 {row}$$

Here the computation time (such as sorting) is negligible compared with IO, so the time can be written as

$$\alpha \times time_{\text{IO}} \times column$$
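To see what this IO-only model predicts, here is a rough back-of-the-envelope sketch in Python. It assumes, as in the classroom story, that one IO call costs about the same whether it brings in 1 column or 100; the names are illustrative and not taken from my script.

```python
TOTAL_SECONDS_ONE_BY_ONE = 15 * 3600  # measured: about 15 hours, one column per IO call
NUM_COLUMNS = 2200


def estimated_seconds(num_io_calls: int, seconds_per_call: float) -> float:
    """IO-only cost model: total time = number of IO calls * time per call."""
    return num_io_calls * seconds_per_call


# ASSUMPTION: the cost of one IO call does not depend on how many columns it reads.
seconds_per_call = TOTAL_SECONDS_ONE_BY_ONE / NUM_COLUMNS
batched_seconds = estimated_seconds(22, seconds_per_call)  # 2200 columns / 100 per call

print(round(seconds_per_call, 1))      # seconds per IO call
print(round(batched_seconds / 60, 1))  # predicted minutes with 22 calls
```

Under this assumption each call costs about 24.5 seconds, and 22 calls come to about 9 minutes, the same order of magnitude as the 12 minutes in the title.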

So I thought, “call all the classes into the classroom at once”. After all:

  • My machine has 8 GB of memory
  • The data takes up 4 GB at most
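As a rough sanity check of that size estimate, here is a small Python sketch. It assumes every value is stored as an 8-byte number, which is a guess made for illustration; the log does not record the actual data type.

```python
NUM_COLUMNS = 2200
ROWS_PER_COLUMN = 160_000
BYTES_PER_VALUE = 8  # ASSUMPTION: each value is an 8-byte number (e.g. a double)


def raw_size_bytes(columns: int, rows: int, bytes_per_value: int) -> int:
    """Size of the bare values only, with no per-object or container overhead."""
    return columns * rows * bytes_per_value


total_values = NUM_COLUMNS * ROWS_PER_COLUMN
raw_bytes = raw_size_bytes(NUM_COLUMNS, ROWS_PER_COLUMN, BYTES_PER_VALUE)

print(total_values)                     # number of values
print(round(raw_bytes / 1024 ** 3, 2))  # raw size in GiB
```

That is 352 million values and roughly 2.6 GiB of raw numbers, which fits under 4 GB on paper. The in-memory representation inside the JVM can be considerably larger than the raw values, so “it fits in theory” does not guarantee that it fits in practice.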

I started to “expand the classroom” and tried a lot of things: the .conf configuration file, spark-shell, spark-env.cmd, the JVM option -Xmx4g, and so on. I wrestled with these settings and operations all morning without results.

I think my attempts did have some effect, because the original error was no longer reported and the collect step could now complete (the children could get into the classroom, which they couldn’t before). But as soon as any operation followed (after collect finished, it would hang for a long time and fail to return the Array it should have returned), the JVM heap blew up. After some further adjustments the heap no longer blew up, but then I hit “GC overhead limit exceeded”, a garbage collection problem.

So I had reason to suspect that performance was limited by the hardware.

So I thought: “divide the children into several groups and call several classes into the classroom at a time”.

There were many bugs along the way. In the end I chose to call in 100 classes at a time, and the whole run took about 12 minutes.
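The grouped approach can be sketched roughly as follows. This is a plain-Python simulation, not my actual Spark script: `fake_load` is a hypothetical stand-in for the expensive step (in the real job, presumably a Spark collect over the selected columns), and it simply counts how many IO calls are made.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sort_in_batches(
    columns: List[str],
    load: Callable[[List[str]], Dict[str, List[float]]],
    batch_size: int = 100,
) -> Dict[str, List[float]]:
    """Load `batch_size` columns per IO call, then sort each column in memory."""
    result: Dict[str, List[float]] = {}
    for group in chunked(columns, batch_size):
        data = load(group)                     # one expensive IO call per group
        for name in group:
            result[name] = sorted(data[name])  # cheap in-memory work
    return result


io_calls = 0


def fake_load(names: List[str]) -> Dict[str, List[float]]:
    """Hypothetical data source: returns dummy values and counts the IO calls."""
    global io_calls
    io_calls += 1
    return {name: [3.0, 1.0, 2.0] for name in names}


columns = [f"col_{i}" for i in range(2200)]
ranked = sort_in_batches(columns, fake_load, batch_size=100)
print(io_calls, len(ranked), ranked["col_0"])
```

The batch size is the knob: it has to be small enough that one group fits in memory, and large enough that the number of IO calls stays low.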

End of tuning.

Overall, the idea itself was not difficult. It took so much time mainly because:

  • I was unwilling to accept that an idea was not feasible and kept forcing it
  • I lacked experience

Ah! If only the code that took 15 hours had been written by someone else, rather than by me a month ago, then tuning it down to 12 minutes would make me look really impressive. Just kidding. I hope the code you write is great, so that we can all save some time for rest.

Well, time to go to bed; tomorrow there is more work to do for the “children”. And there is another job at another school (it’s a little hard when the boss assigns two jobs in parallel).

I’m Xiaopai, a graduate student at Tianjin University. My WeChat is piperlhj. If you also work with Spark, be sure to add me on WeChat. I really need experts I can pester with questions.

Don’t forget to watch~