Crawling urban bus stops with Python

Time: 2022-5-25

Page analysis

https://guiyang.8684.cn/line1

Figure: bus routes
Figure: bus stops

Crawler

We use requests and BeautifulSoup to fetch and parse the route pages. Once we have the bus route names, we call the Gaode API to obtain the longitude and latitude of the stops on each route, and use pandas to turn the returned JSON into a table. I recommend writing the code in an object-oriented style.

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup


class BusStop:
    """Collect bus route names from 8684.cn and look up each route with the Gaode API."""

    def __init__(self):
        self.url = 'https://guiyang.8684.cn/line{}'
        # Page numbers line1 .. line16
        self.starnum = list(range(1, 17))
        self.payload = {}
        self.headers = {
            'Cookie': 'JSESSIONID=48304F9E8D55A9F2F8ACC14B7EC5A02D'}

    def get_location(self, line):
        """Call the Gaode API and return the first matching bus line as a one-row DataFrame."""
        # You can apply for this key yourself
        url_api = ('https://restapi.amap.com/v3/bus/linename?s=rsv3&extensions=all'
                   '&key=559bdffe35eec8c8f4dae959451d705c&output=json&city=Guiyang'
                   '&offset=2&keywords={}&platform=JS').format(line)
        res = requests.get(url_api).text
        # print(res) can be used to check whether the response contains the data you need
        rt = json.loads(res)
        dicts = rt['buslines'][0]
        # Return a DataFrame with one row for this bus line
        df = pd.DataFrame([dicts])
        return df

    def get_line(self):
        """Return the names of all bus routes listed on the route pages."""
        lines = []
        for start in self.starnum:
            # Construct the URL of the page
            url = self.url.format(start)
            res = requests.request(
                "GET", url, headers=self.headers, data=self.payload)
            soup = BeautifulSoup(res.text, "lxml")
            div = soup.find('div', class_='list clearfix')
            for item in div.find_all('a'):
                lines.append(item.text)  # bus route name inside the <a> tag
        return lines


if __name__ == '__main__':
    bus_stop = BusStop()
    lines = bus_stop.get_line()
    # Output the routes
    print('There are {} bus routes'.format(len(lines)))
    print(lines)
    # Exception handling: some routes cannot be resolved by the Gaode API
    stop_df = pd.DataFrame([])
    error_lines = []
    for line in lines:
        try:
            df = bus_stop.get_location(line)
            stop_df = pd.concat([stop_df, df], axis=0)
        except Exception:
            error_lines.append(line)

    # Output the routes that raised an exception
    print('{} bus routes raised an exception'.format(len(error_lines)))
    print(error_lines)

    # Output the shape of the table, then save it
    print(stop_df.shape)
    stop_df.to_csv('bus_stop.csv', encoding='gbk', index=False)

Figure: crawler output
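
The request URL in get_location is easier to read and harder to mistype when the query string is built from a dictionary. The following is a minimal sketch rather than the code used in this article: build_linename_url and first_busline are hypothetical helper names, YOUR_KEY is a placeholder, and the sample payload is made up to mimic the buslines list that the crawler reads.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are copied from the URL used in get_location.
API_URL = "https://restapi.amap.com/v3/bus/linename"


def build_linename_url(keyword, key, city="Guiyang"):
    """Build the route-lookup URL, letting urlencode escape the values."""
    params = {
        "s": "rsv3",
        "extensions": "all",
        "key": key,
        "output": "json",
        "city": city,
        "offset": 2,
        "keywords": keyword,
        "platform": "JS",
    }
    return API_URL + "?" + urlencode(params)


def first_busline(payload):
    """Return the first entry of payload['buslines'], or raise ValueError if there is none."""
    buslines = payload.get("buslines") or []
    if not buslines:
        raise ValueError("no bus line in response")
    return buslines[0]


# Made-up payload for illustration only; a real response carries more fields.
sample = {"buslines": [{"id": "L1", "busstops": [{"name": "Stop A"}]}]}
url = build_linename_url("1路", "YOUR_KEY")
line = first_busline(sample)
print(url)
print(line["id"])
```

Raising ValueError for an empty buslines list gives the try/except loop a clear reason to add the route to error_lines.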

Data cleaning

Let's look at the result first. The busstops column needs cleaning. The general idea is: split into columns -> unpivot -> split into columns. I will present two methods: Excel Power Query (PQ) and Python.
Figure: data before cleaning

Figure: data after cleaning
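
As a rough illustration of the cleaning idea described above (split, unpivot, split again), the same reshaping can be written in plain Python: each route holds a list of stops, and each stop is a dictionary that becomes one output row. This is only a sketch under assumptions: flatten_stops is a hypothetical helper, the sample data is made up, and the stop fields name and sequence are assumed rather than taken from the article's data.

```python
def flatten_stops(routes):
    """Turn a list of route records into one flat row per (route, stop) pair."""
    rows = []
    for route in routes:
        # A route without stops contributes no rows, which mirrors dropping empty rows.
        for stop in route.get("busstops") or []:
            row = {"line_id": route["id"]}  # key that links the row back to its route
            row.update(stop)                # expand the stop dictionary into columns
            rows.append(row)
    return rows


# Made-up sample data; real records come from the crawler's busstops column.
routes = [
    {"id": "L1", "busstops": [{"name": "Stop A", "sequence": "1"},
                              {"name": "Stop B", "sequence": "2"}]},
    {"id": "L2", "busstops": []},
    {"id": "L3", "busstops": [{"name": "Stop C", "sequence": "1"}]},
]

rows = flatten_stops(routes)
for row in rows:
    print(row)
```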

Excel PQ data cleaning

This method relies entirely on PQ and is done purely through the interface. It is not difficult, so let's just look at the process. The core steps are the same as above.
Figure: PQ steps

Python data cleaning

## We only need the id column and the busstops column
data = stop_df[['id', 'busstops']]
data.head()

Figure: data before cleaning

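## Split the list of dictionaries held in busstops
df_pol = data.copy()
### Set the index column
df_pol.set_index('id', inplace=True)
### Expand the list so that each stop gets its own column
df_pol = df_pol['busstops'].apply(pd.Series)
df_pol.head()

Figure: result of splitting into columns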

## Unpivot
### Release the index
df_pol.reset_index(inplace=True)
### Unpivot operation
df_pol_ps = df_pol.melt(id_vars=['id'], value_name='busstops')
df_pol_ps.head()

Figure: result of unpivoting

## Delete empty rows
df_pol_ps.dropna(inplace=True, axis=0)
df_pol_ps.shape

Figure: after deleting empty rows

## Split into columns
df_parse = df_pol_ps['busstops'].apply(pd.Series)
### Set line_id (df_parse must exist before the column is added)
df_parse['line_id'] = df_pol_ps['id']
df_parse

Figure: result after processing

There is a lot of splitting to do here, but it does not take long, so I will go through it quickly.
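
One piece of the remaining splitting is presumably turning a coordinate string into numeric longitude and latitude columns. Below is a minimal pandas sketch, assuming a location column that holds text such as "106.70,26.57" with longitude first. The column name, the sample values and the new names lng and lat are assumptions, not taken from the article's data.

```python
import pandas as pd

# Made-up rows standing in for the parsed stop table.
stops = pd.DataFrame({
    "line_id": ["L1", "L1", "L2"],
    "name": ["Stop A", "Stop B", "Stop C"],
    "location": ["106.70,26.57", "106.71,26.58", "106.72,26.59"],
})

# Split "lng,lat" into two columns and convert the text to floats.
lng_lat = stops["location"].str.split(",", expand=True).astype(float)
stops["lng"] = lng_lat[0]
stops["lat"] = lng_lat[1]

print(stops[["line_id", "name", "lng", "lat"]])
```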

## Write the file
df_parse.to_excel('distribution of bus stops in Guiyang.xlsx', index=False)

QGIS coordinate correction

I won't go over the basic operation of QGIS. Note that QGIS handles CSV better, so I recommend importing the file into QGIS in CSV format.
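
For reference, a delimited text file for QGIS only needs a header row plus one numeric column each for longitude and latitude. The sketch below uses Python's csv module and writes to an in-memory buffer. The column names and values are made up for illustration.

```python
import csv
import io

# Made-up rows; in practice these would be the cleaned stop records.
rows = [
    {"line_id": "L1", "name": "Stop A", "lng": 106.70, "lat": 26.57},
    {"line_id": "L1", "name": "Stop B", "lng": 106.71, "lat": 26.58},
]

# Swap the buffer for open("stops.csv", "w", newline="", encoding="utf-8") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["line_id", "name", "lng", "lat"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```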

Import CSV file

Figure: importing the CSV into QGIS

Coordinate correction

As I have said many times before, the coordinates from the Gaode map are GCJ-02 coordinates, and we need to convert them to WGS 1984 coordinates. For this we use the GeoHey plug-in in QGIS.

Figure: GeoHey coordinate correction

Figure: result overlaid on the road network

As the coordinate correction shows, the difference before and after is quite large.
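
For readers who want to see roughly what such a correction does, here is a sketch of a widely circulated approximate GCJ-02 to WGS 84 conversion. It is not the GeoHey plug-in's code, the function names are hypothetical, and the result is an approximation rather than an exact inverse, so the plug-in remains the method used in this article.

```python
import math

A = 6378245.0                  # semi-major axis used by the approximation
EE = 0.00669342162296594323    # eccentricity squared used by the approximation


def _transform_lat(x, y):
    ret = -100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y + 0.2 * math.sqrt(abs(x))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return ret


def _transform_lng(x, y):
    ret = 300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y + 0.1 * math.sqrt(abs(x))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return ret


def gcj02_offset(lng, lat):
    """Return the approximate (dlng, dlat) offset in degrees at the given point."""
    dlat = _transform_lat(lng - 105.0, lat - 35.0)
    dlng = _transform_lng(lng - 105.0, lat - 35.0)
    radlat = lat / 180.0 * math.pi
    magic = 1 - EE * math.sin(radlat) ** 2
    sqrtmagic = math.sqrt(magic)
    dlat = (dlat * 180.0) / ((A * (1 - EE)) / (magic * sqrtmagic) * math.pi)
    dlng = (dlng * 180.0) / (A / sqrtmagic * math.cos(radlat) * math.pi)
    return dlng, dlat


def gcj02_to_wgs84(lng, lat):
    """Approximate inverse: subtract the offset evaluated at the GCJ-02 point."""
    dlng, dlat = gcj02_offset(lng, lat)
    return lng - dlng, lat - dlat


# Illustrative point near Guiyang (made-up coordinates, not a real stop).
print(gcj02_to_wgs84(106.71, 26.58))
```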

Summary

In general, I still recommend writing code in an object-oriented style, and exception handling is essential. The problem I ran into this time is that the Gaode API raises errors for some bus routes, so the exception handling cannot be left out.

From a data-processing perspective, PQ beats Python in speed and convenience, and I highly recommend it for data cleaning. I sometimes present several processing methods; PQ looks complex, but it is actually the simplest. Another point is that indexes in Python are troublesome. This time I wanted to keep bus_stop_id and line_id consistent so that the bus stop table and the bus route table can be joined, which is really just a foreign key join in SQL. As a result, the Python cleaning involves a lot of index operations, which is not nearly as complex in PQ. Speaking of indexes, I would like to thank my SQL teacher: I remember her explaining indexes and constraints in SQL as if it were yesterday.

You can apply for a Gaode key yourself; the key may be subject to a usage limit. Next, I will upload the code to Gitee. Code management is very important, and I am still learning it myself.

I would like to thank my junior schoolmate for this small project and Cui Gong for the encouragement. To be honest, I have been very busy recently and did not feel like writing an article. I would also like to thank another junior schoolmate I know, who is really excellent. Finally, I hope you all have a good and happy life in the last month of 2021, and I hope we all have a bright future.

One more pitfall: I suggest writing articles on Jianshu. If you write locally, uploading pictures can be a problem.