Programming a web crawler in C


Write a web crawler



Use C to write a web crawler that fetches the information you are interested in from a website and grabs everything you need. The example below is built on the CSpider crawler library.


#include <stdio.h>
#include <cspider/spider.h>

/*
 * Custom parsing function; d is the fetched HTML page as a string.
 */
void p(cspider_t *cspider, char *d) {
    char *get[100];
    /* extract the movie links with an XPath query */
    int size = xpath(d, "//body/div[@class='wrap']/div[@class='sort-column area']/div[@class='bd cfix']/ul[@class='st-list cfix']/li/strong/a", get, 100);
    int i;
    for (i = 0; i < size; i++) {
        /* hand each extracted movie name over for persistence */
        saveString(cspider, get[i]);
    }
}

/*
 * Data persistence function; it receives the data passed in by the
 * saveString() calls in the parsing function above.
 */
void s(void *str) {
    char *get = (char *)str;
    printf("%s\n", get);
    return;
}

int main() {
    /* initialize the spider */
    cspider_t *spider = init_cspider();
    char *agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0";
    //char *cookie = "bid=s3/yuH5Jd/I; ll=108288; viewed=1130500_24708145_6433169_4843567_1767120_5318823_1899158_1271597; __utma=30149280.927537245.1446813674.1446983217.1449139583.4; |utmccn=(referral)|utmcmd=referral|utmcct=/login; ps=y; [email protected]; dbcl2=58742090:QgZ2PSLiDLQ; ck=T9Wn; push_noty_num=0; push_doumail_num=7; ap=1; __utmb=30149280.0.10.1449139583; __utmc=30149280";
    /* set the url of the page to crawl */
    cs_setopt_url(spider, ".com/list_p1100_p20_p3_u5185_u5730_p40_p5_p6_p77_p80_p9_2d1_p101_p11.html");
    /* set the user agent */
    cs_setopt_useragent(spider, agent);
    //cs_setopt_cookie(spider, cookie);
    /* pass in the pointers to the parsing and persistence functions */
    cs_setopt_process(spider, p);
    cs_setopt_save(spider, s);
    /* set the number of download and save threads */
    cs_setopt_threadnum(spider, DOWNLOAD, 2);
    cs_setopt_threadnum(spider, SAVE, 2);
    //FILE *fp = fopen("log", "wb+");
    //cs_setopt_logfile(spider, fp);
    /* start the crawler */
    return cs_run(spider);
}



Crawler optimization

A crawler program is generally divided into three modules: data acquisition, data parsing, and anti-crawling strategy. If all three are optimized, the crawler can run stably and continuously.

1. Acquisition module

Generally speaking, the target server provides several kinds of interfaces: page URLs, an app, or a data API. Developers need to test these against the difficulty of collection, the required daily data volume, and the target server's anti-crawling rate limits, and then choose a suitable collection interface and method.

2. Data parsing module



Since network collection is subject to all kinds of uncertainty, the parsing code should handle exceptions carefully and, where the requirements call for it, support restarting from the last processed position, so that the program neither exits abnormally nor misses or duplicates collected data.

3. Anti climbing strategy module

Analyze the target server's anti-crawling strategy, control the frequency of crawler requests, and where necessary handle CAPTCHAs and encrypted data. At the same time, use high-quality proxies: look for proxy products with dedicated bandwidth, stable networks, high concurrency, and low latency, so that the target server does not trigger anti-crawling restrictions or alerts.

With the above optimization strategies in place, the crawler program can run stably over the long term.



