Scraping meme images with Python and naming them using Baidu AI

Time: 2022-5-11
Contents
  • 1、 Applying for a key on the Baidu AI open platform
  • 2、 Scraping meme images from Baidu Tieba (Post Bar)
  • 3、 Using Baidu AIP

This article first scrapes meme images (expression packs) from the web, then uses Baidu AI to recognize the caption text on each image and renames the file with that text. When you want to send a meme, you no longer need to open the images one by one; you can pick one directly by its file name.

1、 Applying for a key on the Baidu AI open platform

This example uses the Baidu AI API for character recognition, so you first need to apply for the corresponding API permission. The steps are as follows:

Enter ai.baidu.com in the address bar of a web browser (such as Chrome or Firefox) to open the official Baidu Cloud AI website, then click the Console button in the upper right corner of the page.

On the login page of the Baidu Cloud AI website, enter your Baidu account and password. If you do not have an account, click the Register now link to sign up.

After logging in, you arrive at the console page of the Baidu Cloud AI website. Click Product service in the left navigation to expand the list. At the bottom right of the list, under artificial intelligence, select image recognition, or choose Character recognition directly.

This opens the Overview page for image recognition. To use the Baidu Cloud AI API, you first need to apply for permission, and before applying you need to create your own application, so click the Create application button.

On the Create application page, enter the application name, select the application type, and select the interfaces. Note: you can select more interfaces than this example needs. Selecting every interface you might use later lets you reuse the same application when developing other examples. After selecting the interfaces, for the character recognition package name select the "not needed" option, enter the application description, and click the Create now button.

When creation is complete, click the Return to application list button. The application list page shows the application you created together with the AppID, API Key and Secret Key that Baidu Cloud assigned to it automatically. These values are different for every application, so save them for use during development.
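Since these three values are secrets, one option is to keep them out of the source file. The following is a minimal sketch, not part of the original article: the environment variable names are hypothetical, and the dictionary keys appId, apiKey and secretKey are chosen on the assumption that they match the constructor arguments of the baidu-aip client.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    'appId': 'BAIDU_APP_ID',
    'apiKey': 'BAIDU_API_KEY',
    'secretKey': 'BAIDU_SECRET_KEY',
}


def load_baidu_config(environ=os.environ):
    """Collect the three credentials, failing early if one is missing."""
    config = {}
    for key, env_name in ENV_NAMES.items():
        value = environ.get(env_name)
        if not value:
            raise RuntimeError('missing environment variable: ' + env_name)
        config[key] = value
    return config
```

The returned dictionary could then be passed to the client in the same way as a hard-coded config dictionary.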

2、 Scraping meme images from Baidu Tieba (Post Bar)

This example uses a Baidu Tieba (Post Bar) thread that contains some self-made memes: https://tieba.baidu.com/p/5522091060
The goal is to scrape every picture in it. The steps are as follows:

Capture the network traffic in the browser and check whether the returned data matches the page elements, that is, whether the response already contains the desired data rather than having it loaded afterwards by JavaScript. Copy the link of the first picture and search for it in the responses in the Network tab.

For the first page, the capture shows no sign of data being loaded dynamically through Ajax.

Click through to the second page, and an Ajax request appears in the capture.

Searching for the URL of the first picture on that page locates it in this request as well.

The request carries three parameters, and pn is presumably the page number. Use Postman or write code to simulate the request, remembering to set the Host and X-Requested-With headers, and check whether pn=1 returns the data of the first page. If it does, the data of every page can be obtained through this interface.
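The check described above can be sketched with the standard library alone. This is only an illustration: it builds the request without sending it, the helper name build_page_request is hypothetical, and the parameter names pn, ajax and t are the ones used in the full listing further down.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request

TIEZI_URL = "https://tieba.baidu.com/p/5522091060"


def build_page_request(page, now=None):
    """Build (but do not send) the Ajax-style request for one page."""
    params = {
        'pn': page,   # assumed to be the page number
        'ajax': '1',
        't': int(now if now is not None else time.time()),  # timestamp
    }
    req = Request(TIEZI_URL + '?' + urlencode(params))
    req.add_header('Host', 'tieba.baidu.com')
    req.add_header('X-Requested-With', 'XMLHttpRequest')
    return req


req = build_page_request(1, now=0)
print(req.full_url)
```

Sending the request (for example with urllib.request.urlopen) and comparing the response with the first page in the browser would confirm or refute the guess about pn.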

First load the post to find the number of the last page, then loop over all pages, parse the data, extract the picture URLs and write them to a file, and finally download the pictures with multiple threads. The full code is as follows.

#Grab all the pictures in a post of Baidu Post Bar
import requests
import time
import threading
import queue
from bs4 import BeautifulSoup
import chardet
import os

tiezi_url = "https://tieba.baidu.com/p/5522091060"
headers = {
    'Host': 'tieba.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KH'
                  'TML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
}
pic_save_dir = 'tiezi_pic/'
if not os.path.exists(pic_save_dir):  # create the folder if it does not exist
    os.makedirs(pic_save_dir)

pic_urls_file = 'tiezi_pic_urls.txt'
download_q = queue.Queue()  # download queue


#Get pages
def get_page_count():
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5)
        if resp is not None:
            resp.encoding = chardet.detect(resp.content)['encoding']
            html = resp.text
            soup = BeautifulSoup(html, 'lxml')
            a_s = soup.find("ul", attrs={'class': 'l_posts_num'}).findAll("a")
            for a in a_s:
                if a.get_text() == '尾页':  # '尾页' is the link text of the "last page" button
                    return a['href'].split('=')[1]
    except Exception as e:
        print(str(e))


#Download thread
class PicSpider(threading.Thread):
    def __init__(self, t_name, func):
        self.func = func
        threading.Thread.__init__(self, name=t_name)

    def run(self):
        self.func()


#Get the URL of all pictures on each page
def get_pics(count):
    params = {
        'pn': count,
        'ajax': '1',
        't': int(time.time())
    }
    try:
        resp = requests.get(tiezi_url, headers=headers, timeout=5, params=params)
        if resp is not None:
            resp.encoding = chardet.detect(resp.content)['encoding']
            html = resp.text
            soup = BeautifulSoup(html, 'lxml')
            imgs = soup.findAll('img', attrs={'class': 'BDE_Image'})
            for img in imgs:
                print(img['src'])
                with open(pic_urls_file, 'a') as fout:
                    fout.write(img['src'])
                    fout.write('\n')
            return None
    except Exception:
        pass


#Method run by each download thread
def down_pics():
    while True:
        try:
            data = download_q.get_nowait()  # non-blocking: an empty queue ends the thread
        except queue.Empty:
            break
        download_pic(data)
        download_q.task_done()


#Download a single picture
def download_pic(img_url):
    img_url = img_url.strip()  # drop the trailing newline read from the link file
    try:
        resp = requests.get(img_url, headers=headers, timeout=10)
        if resp.status_code == 200:
            print("download picture:" + img_url)
            pic_name = img_url.split("/")[-1]
            with open(os.path.join(pic_save_dir, pic_name), "wb") as f:
                f.write(resp.content)

    except Exception as e:
        print(e)


if __name__ == '__main__':
    print("Checking whether the link file exists:")
    if not os.path.exists(pic_urls_file):
        print("It does not exist, start parsing the post...")
        page_count = get_page_count()
        if page_count is not None:
            headers['X-Requested-With'] = 'XMLHttpRequest'
            for page in range(1, int(page_count) + 1):
                get_pics(page)
        print("Links have been parsed!")
        headers.pop('X-Requested-With', None)  # no KeyError if the header was never set
    else:
        print("It exists")
    print("Start downloading pictures...")
    headers['Host'] = 'imgsa.baidu.com'  # switch Host to the image server
    with open(pic_urls_file, "r") as fo:
        pic_list = fo.readlines()

    threads = []
    for pic in pic_list:
        download_q.put(pic)
    for i in range(min(8, len(pic_list))):  # a small pool instead of one thread per picture
        t = PicSpider(t_name='thread' + str(i), func=down_pics)
        t.daemon = True
        t.start()
        threads.append(t)
    download_q.join()
    for t in threads:
        t.join()
    print("Picture download completed")

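Operation results: running the script writes the picture links to tiezi_pic_urls.txt and saves the downloaded pictures in the tiezi_pic/ folder.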
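As an alternative to the hand-written thread class and queue, the same fan-out can be sketched with the standard library's concurrent.futures. This is not the author's method, only a possible simplification; download_all and its fetch parameter are hypothetical names, and fetch stands for any function that downloads one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a small thread pool.

    Returns a dict mapping each URL to True (success) or False (fetch raised).
    """
    def safe_fetch(url):
        try:
            fetch(url)
            return True
        except Exception:
            return False

    urls = [u.strip() for u in urls if u.strip()]  # drop newlines and blank lines
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))
```

With the listing above, the lines read from tiezi_pic_urls.txt could be passed as urls and download_pic as fetch. The pool takes care of starting and joining the threads, so no explicit queue is needed.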

Next, OCR character recognition is used to extract the text from each meme and name the image after it. You can then search the folder by keyword and quickly find the image you need. Google's OCR engine, Tesseract, is not well suited to pictures like these, which are large with small text: its recognition rate is too low, or it recognizes nothing at all. Baidu Cloud OCR is a better fit here. It automatically locates the text in the picture and extracts all of it.

3、 Using Baidu AIP

After applying for the Baidu AI application key, install the Baidu AIP SDK on your local system with the following command:


pip install baidu-aip 

First, recognize a single picture to see how well it works:

from aip import AipOcr

#Create a new AipOcr object; the keys match the constructor arguments appId, apiKey, secretKey
config = {
    'appId': 'fill in your own appid',
    'apiKey': 'fill in your own apikey',
    'secretKey': 'fill in your own secret key'
}
client = AipOcr(**config)


#Recognize the text in the picture
def img_to_str(image_path):
    #Read picture
    with open(image_path, 'rb') as fp:
        image = fp.read()

    #Call general character recognition, and the picture parameter is local picture
    result = client.basicGeneral(image)
    #Join the recognized lines; return an empty string if nothing was recognized
    if 'words_result' in result:
        return '\n'.join([w['words'] for w in result['words_result']])
    return ''


if __name__ == '__main__':
    print(img_to_str('tiezi_pic/5c0ddb1e4134970aebd593e29ecad1c8a5865dbd.jpg'))

Run the program. Baidu AI returns JSON data, which the SDK hands back as a dictionary object with three keys: log_id, words_result_num and words_result. Here words_result_num is the number of recognized text lines and words_result is a list. Each list item is a dictionary whose words key holds one piece of recognized text. For the sample picture, the returned dictionary and the joined text look like this:

{'words_result': [{'words':'o. o'}, {'words':'6226-16:59'}, {'words':'despair JPG'}],'log_id': 1393611954748129280, 'words_result_num': 3}
o。o
6226-16:59
Despair jpg
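To make the structure concrete, here is a small sketch that walks a result dictionary of this shape without calling the API. The helper name words_from_result is hypothetical, and the sample values are copied from the output shown above.

```python
def words_from_result(result):
    """Return the recognized lines from an OCR result dictionary."""
    # A response without 'words_result' simply yields an empty list.
    return [item['words'] for item in result.get('words_result', [])]


sample = {
    'words_result': [{'words': 'o. o'}, {'words': '6226-16:59'}, {'words': 'despair JPG'}],
    'log_id': 1393611954748129280,
    'words_result_num': 3,
}
lines = words_from_result(sample)
print('\n'.join(lines))
```

Using .get with a default keeps the helper from raising when the key is absent, which is the same case the 'words_result' in result check guards against.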

Each picture may contain a lot of text, such as the date in a watermark, and some special symbols are recognized incorrectly, so only Chinese characters and letters are kept. A picture may also contain several pieces of Chinese text at once. In this example, the entry with the most Chinese characters is selected, and the Chinese characters and letters in it are used to name the file. The complete example code is as follows:

#Identify pictures and text, and name pictures and text in batches

import os
from aip import AipOcr
import re
import datetime

#Create a new AipOcr object; the keys match the constructor arguments appId, apiKey, secretKey
config = {
    'appId': 'fill in your own appid',
    'apiKey': 'fill in your own apikey',
    'secretKey': 'fill in your own secret key'
}
client = AipOcr(**config)

pic_dir = r"tiezi_pic/"


#Read picture
def get_file_content(file_path):
    with open(file_path, 'rb') as fp:
        return fp.read()


#Recognize the text in the picture
def img_to_str(image_path):
    image = get_file_content(image_path)
    #Call general character recognition, and the picture parameter is local picture
    result = client.basicGeneral(image)
    #Result splicing return
    words_list = []
    if 'words_result' in result:
        if len(result['words_result']) > 0:
            for w in result['words_result']:
                words_list.append(w['words'])
            file_name = get_longest_str(words_list)
            print(file_name)
            file_dir_name = pic_dir + str(file_name).replace("/", "") + '.jpg'
            if os.path.exists(file_dir_name):  # handle the problem of duplicate file names
                sec = datetime.datetime.now().microsecond  # microsecond part of the current time
                file_dir_name = pic_dir + str(file_name).replace("/", "") + str(sec) + '.jpg'
            try:
                os.rename(image_path, file_dir_name)
            except Exception:
                print("Rename failed:", image_path, "=>", file_name)


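#Picks the string with the most Chinese characters and keeps its Chinese characters and letters
def get_longest_str(str_list):
    pat = re.compile(r'[\u4e00-\u9fa5A-Za-z]+')
    longest = max(str_list, key=hanzi_len)  # avoid shadowing the built-in str
    result = pat.findall(longest)
    return ''.join(result)


#Counts the Chinese characters in a string
def hanzi_len(item):
    pat = re.compile(r'[\u4e00-\u9fa5]')
    count = 0  # avoid shadowing the built-in sum
    for ch in item:
        if pat.search(ch):
            count += 1
    return count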


#Traverse all pictures in a folder
def query_picture(dir_path):
    pic_path_list = []
    for filename in os.listdir(dir_path):
        pic_path_list.append(dir_path + filename)
    return pic_path_list


if __name__ == '__main__':
    pic_list = query_picture(pic_dir)
    if len(pic_list) > 0:
        for i in pic_list:
            img_to_str(i)

Run the program. It prints the text chosen for each picture and renames the files in tiezi_pic/ accordingly.
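The listing above resolves name clashes by appending the microsecond part of the current time. A counter-based variant is sketched below as one possible alternative, not as the article's approach. The helper names safe_stem and unique_path and the fallback name 'unnamed' are hypothetical.

```python
import os
import re
import tempfile

KEEP = re.compile(r'[\u4e00-\u9fa5A-Za-z]+')  # Chinese characters and letters


def safe_stem(text, fallback='unnamed'):
    """Keep only Chinese characters and letters; fall back if nothing is left."""
    return ''.join(KEEP.findall(text)) or fallback


def unique_path(directory, stem, ext='.jpg'):
    """Return a path in directory that does not exist yet, adding _1, _2, ... if needed."""
    candidate = os.path.join(directory, stem + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, '{}_{}{}'.format(stem, counter, ext))
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    first = unique_path(tmp, safe_stem('despair JPG'))
    open(first, 'wb').close()  # pretend an image was already renamed to this name
    second = unique_path(tmp, safe_stem('despair JPG'))
    print(os.path.basename(first), os.path.basename(second))
```

The fallback also covers the case where the recognized text contains no Chinese characters or letters at all, which would otherwise produce an empty file name.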
