“Do you like travelling?”
This simple question usually gets a positive response, maybe even an extra adventure story or two. Generally speaking, traveling is a great way to experience new cultures and broaden your horizons.
But ask "Do you enjoy the process of searching for flight tickets?" and I'm sure the reaction will be far less enthusiastic.
So let Python solve the problem for you! The author of this article, Fábio Neves, a senior business data analyst, will walk you through building a web scraping project to help us find the best price!
The specific approach is to search flight prices for a specific destination over a flexible date range (up to 3 days before and after the date you choose).
The search results are saved to an Excel file, and an email with quick statistics is sent to you. Obviously, the ultimate goal is to help us find the best price!
If you really want to try it, you can run this script on a server (a simple Raspberry Pi will do; it's a card computer only the size of a credit card, with computing power comparable to a smartphone, and more than enough for tinkering) once or twice a day. It will send you the search results by email. I suggest saving the Excel file to Dropbox cloud storage so you can access it anytime, anywhere.
Note: Dropbox is a cloud storage service, similar to Baidu Cloud.
I still haven't found any mistakenly under-priced tickets (error fares) myself, but I think it's possible!
The script searches over a "flexible date range" to find all flights up to 3 days before and after your preferred date. Although it only works on one origin/destination pair at a time, you can easily adapt it to run multiple trip pairs in each cycle. You might even end up finding some error fares... fantastic!
When I first started web scraping, I wasn't particularly interested in it. I wanted to do more with predictive modeling, financial analysis, and some sentiment analysis, but it turned out that figuring out how to build my first web crawler was fascinating. As I kept learning, I realized that web scraping is at the heart of how the Internet works.
Yes... just like Larry and Sergey, enjoy the jacuzzi after launching the crawler!
You might think that's a very bold claim, but what if I told you that Google was born out of a crawler Larry and Sergey wrote in Java and Python? Google crawls the entire Internet, trying to provide the best answer to your question. There are a great many applications for web crawlers, and even if you prefer other topics in data science, you still need some scraping skills to get the data you want.
Python can save you
The first challenge was choosing which platform to scrape. It wasn't easy, but I finally chose Kayak. Before deciding, I tried Momondo, Skyscanner, Expedia, and others, but the CAPTCHAs on those sites were truly maddening. After several attempts at selecting traffic lights, crosswalks, and bicycles, I concluded that Kayak is currently the best choice, even though it too will throw up a security check if you load too many pages in a short time.
I set the bot to query the site at 4-to-6-hour intervals, so there shouldn't be any problems. There may be the occasional hiccup here and there, but if you run into a CAPTCHA, solve it manually, restart the bot after confirming, and then wait a few hours for it to reset. Feel free to adapt this code to other platforms as well, and you're welcome to share your applications in the comments section!
If you are new to web scraping, or don't understand why some websites put up all kinds of barriers against it, please read and understand Google's guidance on "web scraping etiquette" before writing your first line of crawler code. If you start scraping the web like a lunatic right away, you'll probably get blocked much sooner than you think.
Web scraping etiquette:
Please fasten your seat belt
After the Chrome tab opens, we will define a few functions to use inside the loop. The general idea of the overall structure is as follows:
- One function launches the bot and declares the city and dates we want to search.
- This function gets the first batch of search results sorted by "best" flights, then clicks "Load more results".
- Another function scrapes the entire page and returns a DataFrame.
- Steps 2 and 3 are repeated for the "cheapest" and "fastest" sort orders.
- The final price results (cheapest and average) are emailed to you, and the three sorted datasets (by price, by duration, by overall best) are saved as an Excel file.
- All the previous steps are repeated in a loop, running every x hours.
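The six steps above can be sketched as a control loop. Note that `scrape_sort` and `run_cycle` are hypothetical stand-ins for the real functions built later in the post:

```python
def scrape_sort(sort_key):
    # Hypothetical stand-in for the real scraping code in steps 2-4:
    # one run of "load results, click load-more, parse the page".
    return {"sort": sort_key, "flights": []}

def run_cycle():
    # Steps 2-4: collect the three sorted result sets
    # (overall best, cheapest by price, fastest by duration).
    results = {key: scrape_sort(key) for key in ("best", "price", "duration")}
    # Step 5 would save the three datasets to Excel and email a summary here.
    return results

# Step 6: repeat everything every x hours (here x = 4). Left as a
# comment so the sketch can be run without looping forever:
#
# from time import sleep
# while True:
#     run_cycle()
#     sleep(4 * 60 * 60)
```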
OK. Every Selenium project starts with a webdriver. I use ChromeDriver, but there are of course other options; PhantomJS and Firefox are also popular. Once the webdriver is downloaded, put it in a folder. The first lines of code will automatically open a blank Chrome tab.
Please note that I'm not opening up a new world here, nor proposing some pioneering innovation. There are more advanced ways to find cheap tickets, but I hope this post shares something simple and practical with you!
These are the packages I use for the entire project. I will use randint to make the bot pause for a random number of seconds between searches; this is a must-have for any bot. If you run the code above, a Chrome window will open as the entry point for the bot's searches.
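As a minimal sketch of that random pause, using only the standard library (the third-party packages the project also needs are noted in a comment):

```python
from time import sleep
from random import randint

# Also used later in the project (third-party, install with pip):
#   selenium -- drives the Chrome window
#   pandas   -- holds the scraped results as a DataFrame

def polite_pause(low=5, high=10):
    """Sleep a random number of seconds between searches so the bot
    does not hammer the site."""
    delay = randint(low, high)
    sleep(delay)
    return delay
```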
So let's quickly test it: open http://kayak.com in the new tab. Choose the city and dates you want to fly. When selecting dates, be sure to choose "± 3 days". I've written the code around that range, so if you only want to search a specific date, you will need to make some adjustments. I'll try to point out all of the values to change throughout the text.
Click the search button and grab the link from the address bar. This is the link I need to use below. I define the variable kayak as this URL and call the webdriver's get method. Your search results should then appear.
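The URL you copy from the address bar can also be rebuilt programmatically. The pattern below is an assumption based on how Kayak's flexible-date search links looked; verify it against the link in your own address bar before relying on it:

```python
def build_kayak_url(origin, destination, start_date, end_date):
    # Assumed URL pattern from the address bar after a flexible-date
    # search: "-flexible" gives the +/- 3 day range, and
    # "sort=bestflight_a" asks for the "best" ordering.
    return ('https://www.kayak.com/flights/'
            f'{origin}-{destination}/'
            f'{start_date}-flexible/{end_date}-flexible?sort=bestflight_a')

# Usage (needs selenium + ChromeDriver, so shown as comments here):
#
# from selenium import webdriver
# driver = webdriver.Chrome()
# kayak = build_kayak_url('LIS', 'NYC', '2025-06-01', '2025-06-15')
# driver.get(kayak)
```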
When the get command is used too many times in a short period, the site will pop up a CAPTCHA check. You can solve the CAPTCHA manually and continue testing the script until the next one appears. From my tests, the first search run always seems fine, so if you want to use this code, keep long execution intervals between runs and you can avoid the problem. You don't really need these prices updated every 10 minutes, do you?!
So far, we've opened a browser window and loaded the URL. Next, I'll use XPath or CSS selectors to grab other information such as prices. I used to use only XPath; at the time I didn't think CSS selectors were necessary, but now it seems better to use them in combination. You can copy an element's XPath directly from your browser, and you'll find it does locate the element, but the readability is poor. I gradually realized that relying only on copied XPath makes it hard to get the page elements you want: sometimes, the more specific the path, the more brittle it is.
Next, let's use Python to select the lowest-fare page element. The highlighted part of the code above is the XPath selector. On the web page, you can right-click anywhere and choose "Inspect" to find it. Try it: right-click where you want to see the code, and "Inspect" it.
To illustrate the shortcomings of the XPath I mentioned earlier, please compare the following differences:
```
# This is what the copy method would return. Right-click the highlighted
# row on the right and select "Copy > Copy XPath"
//*[@id="wtKI-price_aTab"]/div/div/div/div/div/span/span

# This is what I used to define the "Cheapest" button
cheap_results = '//a[@data-code = "price"]'
```
In the code above, the simplicity of the second approach is clear. It searches for an a element whose data-code attribute equals "price". The first approach searches for an element with id "wtKI-price_aTab", nested inside five divs and two spans. It works for this page, but the pitfall is that the id changes the next time the page loads: the "wtKI" value is generated dynamically on each load, so the code will eventually break. It's well worth studying what an XPath expression actually represents.
That said, directly copying the XPath this way is perfectly usable for pages that aren't very complex or changeable.
Building on the code above, what if I want to collect all the matching results into a list? Very simple: all the results live in the CSS object resultWrapper, so I just need to write a for loop like the code below to get them all. Once you grasp this idea, you can basically understand the code in the figure below. That is, first select the outermost page element (here, the site's resultWrapper), then find a way (such as XPath) to extract the information, and finally store it in a readable object (in this case, first in flight_containers and then in flights_list).
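A sketch of that loop, assuming the resultWrapper class name from the text and the Selenium 3 style find_elements_by_css_selector call (newer Selenium versions use driver.find_elements(By.CSS_SELECTOR, ...) instead):

```python
def collect_flight_text(driver):
    """Collect the visible text of every result card on the page.

    Sketch: each result sits in a div with the (assumed) class
    resultWrapper; we keep each card's text for later parsing.
    """
    flights_list = []
    flight_containers = driver.find_elements_by_css_selector('div.resultWrapper')
    for flight in flight_containers:
        flights_list.append(flight.text)
    return flights_list
```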
I printed the first three results in full; they contain all the useful information we need, but we still need a better way of extracting it, so next we parse these elements separately.
Start crawling data!
The simplest function is the one that loads more results, so let's start there. I want to get as many flights as possible without triggering a security check, so every time a page loads, I click the "Load more results" button. Note that I wrapped the click in a try statement, because sometimes the button doesn't exist.
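A minimal version of that function might look like this; the moreButton class name is an assumption (check the real class with Inspect), and it uses the Selenium 3 find_element_by_xpath call (newer: driver.find_element(By.XPATH, ...)):

```python
def load_more(driver):
    """Click 'Load more results' if the button is present.

    The try statement matters: on the last page the button does not
    exist and find_element raises, which we swallow.
    """
    try:
        more_button = driver.find_element_by_xpath('//a[@class = "moreButton"]')
        more_button.click()
        print('sending new page request')
    except Exception:
        # No button (or it is not clickable) -- nothing more to load.
        pass
```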
Okay, that preamble ran a little long (sorry, I digress easily). Now we're going to define the function that actually scrapes the data.
I've parsed most of the elements on the page. Sometimes the returned flight list contains two legs. I crudely split it into two sets of variables, such as section_a_list and section_b_list. The function still returns a DataFrame named flights_df, which we can then sort, slice, or merge as needed.
Variable names containing a refer to the first leg of the trip, and names containing b to the second. Let's move on to the next function.
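The full parsing function is long, so here is just the splitting idea in isolation, assuming the two legs of each flight alternate in the scraped list. The helper name split_sections is mine, not from the original script, which applies the same idea field by field:

```python
def split_sections(values):
    """Split one flat per-leg list into first-leg (a) and second-leg (b)
    lists, assuming outbound and return entries alternate."""
    section_a_list = values[::2]   # first leg of each trip
    section_b_list = values[1::2]  # second leg of each trip
    return section_a_list, section_b_list

# The real function gathers many such fields and then builds the
# DataFrame it returns, roughly:
#
# import pandas as pd
# flights_df = pd.DataFrame({'duration_a': a_durations,
#                            'duration_b': b_durations})
```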
Don't worry, the real substance is coming!
So far, we have a function to load more results and a function to parse them. You might think that's enough: you could scrape pages by hand with just these. But as I mentioned earlier, our goal is to email the results to ourselves, along with some other information. Take a look at the function start_kayak below; everything happens inside it.
It requires us to define the cities and dates of the flights to query. It opens the URL in the kayak variable, with the results sorted by "best". After the first scrape, it grabs the price-matrix dataset at the top of the page, which is used to compute the average and lowest price; these are then emailed along with Kayak's price forecast (top left of the page). Searching on a single fixed date can cause errors, because a single-date page has no price matrix at the top.
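The quick statistics themselves are simple. A sketch, assuming the matrix prices have already been parsed from the page into plain numbers:

```python
def summarize_matrix(matrix_prices):
    """Quick statistics for the email: the average and the minimum of
    the +/- 3 day price matrix at the top of the results page."""
    matrix_avg = sum(matrix_prices) / len(matrix_prices)
    matrix_min = min(matrix_prices)
    return matrix_avg, matrix_min
```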
I tested this with an Outlook account (http://hotmail.com). I haven't tried Gmail or other providers, but I think they should work; other ways of sending email are also covered in the book mentioned earlier. If you have a Hotmail account, just replace the email details in the code and it will work.
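A hedged sketch of the email step using Python's standard smtplib. The host smtp-mail.outlook.com on port 587 is the usual Outlook/Hotmail SMTP endpoint, and all addresses and passwords below are placeholders:

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message(sender, recipient, subject, body):
    """Assemble the summary email; all content comes from the caller."""
    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg.attach(MIMEText(body, 'plain'))
    return msg

# Sending -- left as comments to avoid a live network call:
#
# msg = build_message('you@hotmail.com', 'you@hotmail.com',
#                     'Flight deal', 'Cheapest: 450, average: 487.5')
# server = smtplib.SMTP('smtp-mail.outlook.com', 587)
# server.starttls()
# server.login('you@hotmail.com', 'your_password')
# server.send_message(msg)
# server.quit()
```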
If you want to understand what some part of the script does, copy that part out and test it on its own; only then will you thoroughly understand it.
Run the code
Of course, we can also put the functions we wrote into a loop to keep them running. We write four input prompts covering the departure and arrival cities and the start and end dates (using input). During testing we don't want to type in these four variables every time, so we can hard-code them instead, as shown in the four commented lines.
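One way to support both modes, with the hard-coded values playing the role of the four commented lines (the sample cities and dates below are placeholders, not from the original script):

```python
def get_trip_params(interactive=True):
    """Return (city_from, city_to, date_start, date_end).

    interactive=True prompts the user; interactive=False returns
    hard-coded test values so you don't retype them on every run.
    """
    if interactive:
        city_from = input('From which city? ')
        city_to = input('Where to? ')
        date_start = input('Search around which departure date? (YYYY-MM-DD) ')
        date_end = input('Return when? (YYYY-MM-DD) ')
    else:
        # Placeholder test values -- edit to taste.
        city_from, city_to = 'LIS', 'SIN'
        date_start, date_end = '2025-08-15', '2025-09-15'
    return city_from, city_to, date_start, date_end

# In the main loop:
# city_from, city_to, date_start, date_end = get_trip_params()
```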
Congratulations, we're done! There is actually plenty that could still be improved. For example, we could use Twilio to send a text message instead of an email. You could also use a VPN or other stealthier means to scrape from multiple servers at once. Then there's the CAPTCHA problem: CAPTCHAs pop up from time to time, but there are ways around them too. If you have a solid foundation, I think you can try adding these features. You might even want to send the Excel file as an email attachment.
Compiled by: Gao Yan, Xiong Yan, Hu Yao, Jiang Baoshang
Reprinted from Big Data Digest