Regular expressions are not unique to Python. Recently, I needed to export all the site addresses from a set of Google search results, so I decided to use Python regular expressions to extract them.
Several problems need to be solved:
1. Get the text of the search results
To get more addresses at once, I used Google’s advanced search feature, which can display 100 results per page.
Once the results are displayed, view the page source and save it as a text file. That file is the search result text.
2. Analyze how to extract site information
First, look at the saved page to work out where the site information lives.
I used IE8’s built-in developer tools (press F12 to open them) to inspect the markup around the content I care about.
The inspection shows that each site address I need sits inside a <cite></cite> tag, so can a regular expression extract the text inside it?
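To make the idea concrete, here is a minimal sketch run against a made-up fragment of result markup. The sample HTML, the site names, and the variable names are assumptions for illustration only; they are not actual Google output.

```python
import re

# Hypothetical fragment of a saved results page (not real Google markup).
sample = (
    '<div class="r"><a href="http://www.example.com/">Example</a>'
    '<cite>www.example.com/</cite></div>'
    '<div class="r"><a href="http://docs.example.org/a">Docs</a>'
    '<cite>docs.example.org/a</cite></div>'
)

# Capture whatever sits between <cite> and the nearest </cite>.
cite_pattern = re.compile(r'<cite>(.+?)</cite>')
sites = cite_pattern.findall(sample)
print(sites)
```

The lazy quantifier `.+?` matters here: a greedy `.+` would run from the first <cite> to the last </cite> on the line and return one long string instead of one entry per site.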
3. Write a regular expression to get the site address
The next step is to write the expression. I wrote it in Python 3.2, which makes this easy.
First, save the search results page as e:/t3.txt, then run the following code:
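import re

# Capture the text between <cite> and </cite>
p = re.compile(r'<cite>([^<>\/].+?)</cite>')
# Read the saved search results page
with open("e:/t3.txt", encoding='utf-8') as f:
    content = f.read()
# Print one site address per line
print("\n".join(p.findall(content)))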
The result of running it is as follows:
Check the output to confirm that all the site addresses have been extracted.
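If the output contains leftover markup or repeated entries, a small cleanup pass may help. The sketch below is one possible approach, not part of the original script: the function name `extract_sites` is hypothetical, and it assumes that matched keywords may appear wrapped in inline tags such as <b> inside the <cite> element.

```python
import html
import re

# Same pattern as in the script above.
CITE_PATTERN = re.compile(r'<cite>([^<>\/].+?)</cite>')
# Matches any inline tag left inside a captured address, e.g. <b> or </b>.
TAG_PATTERN = re.compile(r'<[^>]+>')


def extract_sites(page_text):
    """Return cleaned site addresses, without duplicates, in page order."""
    seen = set()
    sites = []
    for raw in CITE_PATTERN.findall(page_text):
        # Drop nested tags, decode entities such as &amp;, trim whitespace.
        site = html.unescape(TAG_PATTERN.sub('', raw)).strip()
        if site and site not in seen:
            seen.add(site)
            sites.append(site)
    return sites


# Made-up input for illustration only.
demo = (
    '<cite>www.<b>python</b>.org/</cite>'
    '<cite>docs.python.org/3/</cite>'
    '<cite>www.<b>python</b>.org/</cite>'
)
print("\n".join(extract_sites(demo)))
```

Keeping the first occurrence of each address preserves the order in which the results appeared on the page, which a plain `set` would not.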