Extracting site addresses from search results with Python regular expressions


Regular expressions are not unique to Python. Recently I needed to export all the site addresses in a set of Google search results, so I thought of using Python regular expressions to extract them.

Several problems need to be solved:

1. Get the text of the search results

To get more addresses, I used Google’s advanced search feature, which displays 100 results per page.

Once the results are displayed, view the page source and save it as a text file. That file is the search result text.

2. Analyze how to extract site information

First of all, you need to analyze the obtained page to see how to extract the site information.

I used the profiler in IE8's built-in developer tools (press F12 to open them) to check how the content I care about is marked up.

The figure above shows that each site address I need sits inside a <cite></cite> tag, so can a regular expression extract the text inside it?
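As a quick check of that idea, here is a minimal sketch that runs a simple pattern over a made-up fragment. The markup in `sample_html` is hypothetical and only imitates the shape described above; a real results page will contain much more around each `<cite>` tag.

```python
import re

# Hypothetical fragment imitating a saved results page; the real markup may differ.
sample_html = (
    '<div><h3>Example one</h3><cite>www.example.com/docs</cite></div>'
    '<div><h3>Example two</h3><cite>blog.example.org</cite></div>'
)

# Capture whatever sits between an opening and a closing cite tag.
# ".+?" is non-greedy, so each match stops at the nearest </cite>.
cite_texts = re.findall(r'<cite>(.+?)</cite>', sample_html)
print(cite_texts)  # ['www.example.com/docs', 'blog.example.org']
```

So on markup of this shape, the answer is yes: one capturing group between the two tags is enough.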

3. Write a regular expression to get the site address

The next step is to write the expression. I wrote it in Python 3.2, which makes this easy.
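One way to read the pattern used in this step is to spell it out with `re.VERBOSE`, which lets each part carry a comment. This is only a sketch for explanation; the sample string is made up, and the name `verbose_pattern` is mine.

```python
import re

# The same pattern, split across lines so each part can be commented.
verbose_pattern = re.compile(r'''
    <cite>        # literal opening tag
    (             # start of the captured group
      [^<>/]      # first character: anything except <, > or /
      .+?         # then one or more characters, as few as possible
    )             # end of the captured group
    </cite>       # literal closing tag
''', re.VERBOSE)

# Made-up input: two filled cite tags with an empty one between them.
sample = '<cite>www.example.com/a</cite><cite></cite><cite>example.net</cite>'

matches = verbose_pattern.findall(sample)
print(matches)  # ['www.example.com/a', 'example.net']
```

The first-character class is what skips the empty `<cite></cite>` pair: the character right after the opening tag is `<`, so no match starts there.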

The code is as follows. First save the search results page as e:/t3.txt, then run the following code:

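import re

# Capture the text between <cite> and </cite>; the first character must not be <, > or /
p = re.compile(r'<cite>([^<>/].+?)</cite>')
# Read the saved search results page; "with" closes the file afterwards
with open("e:/t3.txt", encoding='utf-8') as f:
    content = f.read()
# Print one site address per line
print("\n".join(p.findall(content)))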

Running the script on the saved page gives the following:

You can check the screenshot of the run to see whether all the site addresses were obtained.
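Checking by eye gets tedious with 100 results per page, so a small helper can count the addresses and reduce them to distinct hosts. The sketch below is an assumption of mine, not part of the original script: `unique_hosts` is a hypothetical name, and the `found` list stands in for the output of `p.findall(content)`.

```python
from urllib.parse import urlsplit

def unique_hosts(addresses):
    """Return the host part of each address, without duplicates, in first-seen order."""
    seen = []
    for address in addresses:
        # The cite text may omit the scheme; "//" makes urlsplit read the start as a host.
        if "://" not in address:
            address = "//" + address
        host = urlsplit(address).netloc.lower()
        if host and host not in seen:
            seen.append(host)
    return seen

# Stand-in for the list returned by p.findall(content).
found = ["www.example.com/docs", "https://www.example.com/faq", "blog.example.org"]
hosts = unique_hosts(found)
print(len(found), "addresses,", len(hosts), "distinct hosts")
print("\n".join(hosts))
```

Comparing the address count with the number of results shown on the page is a quick way to spot entries the pattern missed.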