Python disk commemorative coin series 2: identification verification code 01

Time:2020-10-22

Last time, the appointment window for the Mount Tai coin had already closed and no other commemorative coin had been issued recently, so the appointment interface I showed was not the one we want. Today I found the real appointment interface, although it is a stripped-down version.

Appointment interface

The page is opened from the official account.

You can see the required information including:

  • Full name
  • Certificate type (ID card by default)
  • ID number
  • Phone number
  • Image captcha
  • SMS verification code
  • Exchange outlet (stripped down, no choice offered)
  • Appointment quantity (removed entirely, not shown)
  • Appointment date (cannot be entered)

To automate the process, fixed details such as name, ID number, and mobile phone number are usually saved in a configuration file in advance and read by the program. This will be covered in more detail in the later hands-on stage. I have not yet found a good solution for the SMS verification code, but that is not the topic of this article, so I will not go into it here.
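A minimal sketch of the configuration-file idea, assuming a JSON file. The function name `load_applicant` and the keys `name`, `id_number`, and `phone` are hypothetical placeholders, not names taken from the author's repository.

```python
import json
from pathlib import Path

# Hypothetical keys; the real form may need different or additional fields.
REQUIRED_KEYS = ("name", "id_number", "phone")


def load_applicant(path):
    """Read the fixed appointment details from a JSON configuration file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Fail early if a value is absent or empty, rather than at submit time.
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing config values: " + ", ".join(missing))
    return {key: str(config[key]) for key in REQUIRED_KEYS}
```

The file itself would be a small JSON object with those three keys. Validating it at load time means a typo in the file surfaces before the appointment window opens.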

Next, let's look at the focus of this article: the image captcha.

If you wait until the appointment opens and then check and fill in each field by hand, a popular coin's quota may well be gone by the time you finish. Here we plan to use a neural network, that is, deep learning, to solve this problem. (In fact, OCR would also be a good solution, because this captcha is not complicated; it is at least much simpler than the 12306 captcha.)

Analyze data sources

Since we have decided to use deep learning, the first task is to collect enough data. After all, neural networks learn from large amounts of data.

To get the data, we need our crawler-writing skills. First, let's look at how the captcha is generated:

  1. Press F12 in the browser to open the developer tools.
  2. Select the small arrow (the element picker).
  3. Find the captcha on the page and click it.
  4. Look at the source window.

The corresponding code block has a src attribute, and anyone familiar with HTML will recognize that this is where the captcha comes from. Copy it, as shown below:

https://eapply.abchina.com/coin/Helper/ValidCode.ashx?0.5805915363836303
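The manual lookup in the developer tools can also be done in code. The sketch below uses Python's standard `html.parser` to pull `src` attributes out of `<img>` tags and resolve them against the page URL. The HTML snippet, the `id` value, and the page URL `index.html` are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_urls(html, page_url):
    """Return absolute URLs for all images found in the given HTML."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical markup resembling a captcha image tag.
sample_html = '<div><img id="captcha" src="Helper/ValidCode.ashx?0.58"></div>'
sample_page = "https://eapply.abchina.com/coin/index.html"
print(find_image_urls(sample_html, sample_page))
```

Resolving with `urljoin` matters because a `src` attribute is often relative to the page rather than a full URL.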

Open the link in an incognito window to check whether cookies are required for the captcha to display.

The result shows that the image still loads in the incognito window. But so far we have seen only one picture. Is this a fixed link that always returns the same image? Let's refresh.

The captcha changed!

From this we learn that the image captcha is generated by a fixed link that dynamically returns a different captcha to the front end on each request. Perfect: we do not have to worry about our data source.
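The refresh experiment can be repeated in code by requesting the same link several times and comparing hashes of the returned bytes. This is only a sketch: `fetch` is a hypothetical callable that takes a URL and returns bytes (for example `lambda u: requests.get(u).content`), and `fake_fetch` is a stand-in so the logic can be tried without touching the network.

```python
import hashlib


def count_distinct_responses(fetch, url, attempts=5):
    """Fetch url several times and count how many distinct bodies come back."""
    digests = set()
    for _ in range(attempts):
        body = fetch(url)
        digests.add(hashlib.sha256(body).hexdigest())
    return len(digests)


# Stand-in fetcher that returns a different body on every call,
# mimicking a captcha endpoint; swap in a real HTTP call to test the site.
_counter = iter(range(1000))


def fake_fetch(url):
    return "image-{}".format(next(_counter)).encode()


print(count_distinct_responses(fake_fetch, "https://example.invalid/captcha"))
```

A result greater than 1 means the link serves different images on repeated requests, which is what the manual refresh showed.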

Data collection

The image captcha is simply a picture, so in essence we are crawling a web page image. This article covers only the key steps; for the detailed procedure, see this official account's earlier article on crawling web page images (link here).

I use the requests library to fetch the image and then write it to a local file in binary mode:

import requests

# image_url is the captcha link; image_index numbers the saved files
data = requests.get(image_url).content
with open('./{}.jpg'.format(image_index), 'wb') as f:
    f.write(data)
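To collect many images, the snippet above can be wrapped in a loop. The following is a sketch under assumptions, not the author's code: the function names, the output directory argument, the one-second default pause, and the injectable `fetch` parameter are illustrative choices. It also assumes the number after `?` in the captcha link is just a random cache-busting value, as its form suggests.

```python
import os
import random
import time

CAPTCHA_URL = "https://eapply.abchina.com/coin/Helper/ValidCode.ashx"


def fetch_bytes(url):
    """Download url and return the raw response body."""
    # Imported here so the loop below can be tried with a fake fetcher
    # even when the requests package is not installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def download_captchas(count, out_dir, fetch=fetch_bytes, delay=1.0):
    """Save `count` captcha images as 0.jpg, 1.jpg, ... inside out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for image_index in range(count):
        # Assumption: the query string is a random value, as in the browser.
        url = "{}?{}".format(CAPTCHA_URL, random.random())
        data = fetch(url)
        path = os.path.join(out_dir, "{}.jpg".format(image_index))
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
        time.sleep(delay)  # pause between requests to go easy on the server
    return paths
```

Calling `download_captchas(1000, "./captchas")` would then fetch the whole set. Passing a fake `fetch` and `delay=0` lets the file-writing logic be checked offline.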

So far, we can save a captcha to our own machine. The next question is how many images we need. With too few, the model is unlikely to reach its best accuracy; with too many, training takes too long. The captcha is not very complex: 10 digits plus 26 English letters give 36 classes in total. At 100 training samples per class, that is 3600 characters, and since each captcha contains four characters, it works out to 3600 / 4 = 900 captcha images. For a captcha this simple I settle on 800 training images, which is about 89 samples per class on average. After training, the model also needs data for testing, so we set aside another 200 captcha images for that in advance, which means collecting 1000 captcha images in total.
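The sizing estimate can be checked with a few lines of arithmetic. The sketch assumes all 36 classes appear equally often, which the source does not confirm. It shows that 3600 characters at four per image corresponds to 900 images, and that 800 training images give roughly 89 characters per class.

```python
import math

NUM_CLASSES = 10 + 26      # digits plus English letters
SAMPLES_PER_CLASS = 100    # per-class target from the text
CHARS_PER_CAPTCHA = 4
TRAIN_IMAGES = 800
TEST_IMAGES = 200

chars_needed = NUM_CLASSES * SAMPLES_PER_CLASS
images_for_target = math.ceil(chars_needed / CHARS_PER_CAPTCHA)


def avg_samples_per_class(train_images):
    """Average characters per class, assuming classes are equally likely."""
    return train_images * CHARS_PER_CAPTCHA / NUM_CLASSES


print(chars_needed)                                    # 3600
print(images_for_target)                               # 900
print(round(avg_samples_per_class(TRAIN_IMAGES), 1))   # 88.9
print(TRAIN_IMAGES + TEST_IMAGES)                      # 1000
```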

Using the core code above, I quickly collected the 1000 captcha images.

With so many captchas in hand, I am itching to dig into them, but it is getting late, so the actual processing will wait until the next issue.

All the source code of this series will be put in this GitHub repository. If necessary, you can refer to it. If there is any problem, please point it out. Thank you!

Preview of next issue: captcha image preprocessing


First issue: Python disk commemorative coin series 1: Introduction
