Write a web crawler

Use the C language to write a web crawler that fetches the information you are interested in from a website and grabs everything you need. The example below is built on the CSpider crawler framework.
#include <stdio.h>
#include <cspider/spider.h>   /* CSpider framework header; the install path may vary */

/*
 * Custom parsing function; d is the fetched HTML page string.
 */
void p(cspider_t *cspider, char *d) {
    char *get[100];
    /* parse the HTML with XPath */
    int size = xpath(d, "//body/div[@class='wrap']/div[@class='sort-column area']/div[@class='bd cfix']/ul[@class='st-list cfix']/li/strong/a", get, 100);
    int i;
    for (i = 0; i < size; i++) {
        /* hand each extracted movie title over for persistence */
        saveString(cspider, get[i]);
    }
}

/*
 * Data persistence function: further processes the data passed in by
 * the saveString() calls in the parsing function above.
 */
void s(void *str) {
    char *get = (char *)str;
    printf("%s\n", get);
    return;
}

int main() {
    /* initialize the spider */
    cspider_t *spider = init_cspider();
    char *agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0";
    //char *cookie = "bid=s3/yuH5Jd/I; ll=108288; viewed=1130500_24708145_6433169_4843567_1767120_5318823_1899158_1271597; __utma=30149280.927537245.1446813674.1446983217.1449139583.4; __utmz=30149280.1449139583.4.4.utmcsr=accounts.douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/login; ps=y; [email protected]; dbcl2=58742090:QgZ2PSLiDLQ; ck=T9Wn; push_noty_num=0; push_doumail_num=7; ap=1; __utmb=30149280.0.10.1449139583; __utmc=30149280";

    /* set the URL of the page to crawl */
    cs_setopt_url(spider, "so.tv.sohu.com/list_p1100_p20_p3_u5185_u5730_p40_p5_p6_p77_p80_p9_2d1_p101_p11.html");
    /* set the user agent */
    cs_setopt_useragent(spider, agent);
    //cs_setopt_cookie(spider, cookie);
    /* pass in pointers to the parsing and data persistence functions */
    cs_setopt_process(spider, p);
    cs_setopt_save(spider, s);
    /* set the number of download threads and persistence threads */
    cs_setopt_threadnum(spider, DOWNLOAD, 2);
    cs_setopt_threadnum(spider, SAVE, 2);
    //FILE *fp = fopen("log", "wb+");
    //cs_setopt_logfile(spider, fp);
    /* start the crawler */
    return cs_run(spider);
}
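Assuming the CSpider library and its dependencies (for example libcurl) are installed system-wide, the example can usually be built with something along the lines of gcc crawler.c -o crawler -lcspider; the exact header path and link flags depend on how the library was installed.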

Crawler optimization
A crawler program is generally divided into a data acquisition module, a data parsing module, and an anti-crawling strategy module. If these three modules are well optimized, the crawler can run stably and continuously.
1. Acquisition module
Generally speaking, the target server offers several kinds of interfaces, including page URLs, an app interface, or a data API. Developers should test these and choose the most suitable collection interface and method based on how difficult the data is to collect, the daily data volume required, and the anti-crawling frequency limits of the target server.
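As a concrete illustration of the acquisition module, the sketch below fetches one page over HTTP with libcurl. This is not part of the CSpider example above; the buffer type, the on_chunk callback, and the fetch_page helper are names introduced here purely for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Growable buffer that accumulates the response body. */
struct buffer { char *data; size_t len; };

/* Append each chunk libcurl delivers to the buffer. */
static size_t on_chunk(char *ptr, size_t size, size_t nmemb, void *userdata) {
    struct buffer *buf = (struct buffer *)userdata;
    size_t n = size * nmemb;
    char *grown = realloc(buf->data, buf->len + n + 1);
    if (!grown) return 0;               /* returning less than n signals an error to libcurl */
    buf->data = grown;
    memcpy(buf->data + buf->len, ptr, n);
    buf->len += n;
    buf->data[buf->len] = '\0';
    return n;
}

/* Fetch one URL and return the body as a malloc'ed string (NULL on failure). */
char *fetch_page(const char *url) {
    struct buffer buf = { NULL, 0 };
    CURL *curl = curl_easy_init();
    if (!curl) return NULL;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (crawler example)");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_chunk);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buf);
    CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    if (rc != CURLE_OK) { free(buf.data); return NULL; }
    return buf.data;
}

The same routine can then be pointed at whichever interface (page URL, app endpoint, or data API) testing shows to be the most reliable for the required daily volume.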
2. Data parsing module

Because network collection is subject to many uncertainties, the parsing stage should handle exceptions carefully and, where the requirements call for it, support checkpointing and restarting from the last processed position, so that the program neither exits abnormally nor misses or duplicates collected data.
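A minimal sketch of the checkpoint-and-restart idea, assuming a simple page-numbered crawl; load_checkpoint and save_checkpoint are hypothetical helpers introduced here, not part of CSpider.

#include <stdio.h>

/* Read the last completed page number from a checkpoint file (0 if none). */
int load_checkpoint(const char *path) {
    int page = 0;
    FILE *fp = fopen(path, "r");
    if (fp) {
        if (fscanf(fp, "%d", &page) != 1) page = 0;
        fclose(fp);
    }
    return page;
}

/* Record the page that has just been parsed and persisted successfully. */
int save_checkpoint(const char *path, int page) {
    FILE *fp = fopen(path, "w");
    if (!fp) return -1;
    fprintf(fp, "%d\n", page);
    fclose(fp);
    return 0;
}

At startup the crawler resumes from the stored page number, and the checkpoint is written only after a page has been both parsed and persisted, so a crash costs at most one re-fetched page rather than missing or duplicated data.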
3. Anti-crawling strategy module
Analyze the target server's anti-crawling strategy, control the request frequency, and when necessary handle CAPTCHAs and encrypted data. At the same time, use high-quality proxies, ideally crawler-oriented proxy products with dedicated resources, stable networking, high concurrency, and low latency, so that the target server does not impose anti-crawling restrictions or raise alerts.
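A minimal sketch of request throttling and proxy routing with libcurl, assuming a fixed delay and a placeholder proxy address; both values are illustrative and response handling is omitted for brevity.

#include <unistd.h>
#include <curl/curl.h>

#define REQUEST_INTERVAL_SEC 2            /* minimum delay between two requests (illustrative) */
#define PROXY_URL "http://127.0.0.1:8080" /* placeholder; replace with a real proxy */

/* Fetch one URL through a proxy, then pause to keep the request rate low. */
void fetch_politely(CURL *curl, const char *url) {
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_PROXY, PROXY_URL);
    /* a realistic browser user agent lowers the chance of being flagged */
    curl_easy_setopt(curl, CURLOPT_USERAGENT,
                     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) "
                     "Gecko/20100101 Firefox/42.0");
    curl_easy_perform(curl);              /* response handling omitted */
    sleep(REQUEST_INTERVAL_SEC);          /* throttle before the next request */
}

In practice the delay can be randomized and the proxy rotated per request to stay below the target server's rate limits.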
By using the above optimization strategies, the crawler program can run stably for a long time.
