A hands-on guide to crawling Youku movie information, Part 2


In the previous chapter we crawled a single Youku page. As a brief review: with the HtmlAgilityPack library, a crawl is divided into three steps

  • Crawler steps

    • Load page
    • Parse data
    • Save data

This article builds on the previous one. Its main functions are as follows
1. Crawl the movie category list
2. Loop over the categories and crawl the movie information of each one page by page
3. Save the crawled data to the database

1、 Crawl movie category list


In Chrome, press F12 to open the developer tools, locate the category element on the page, and copy its XPath. The data we need is the category code and the category name of each movie category.

Rule analysis:
The XPath is "//*[@id='filterPanel']/div/ul/li/a"
The category code is part of the a tag's href attribute, so we cut it out of the path
The category name is the a tag's InnerText, so we read it directly
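
The category code rule can be tried in isolation before any page is loaded. The following is a minimal sketch, not the article's own code: the class name CategoryCodeDemo, the helper ExtractCode, and the sample href are hypothetical, and it assumes the code is the file name between the last '/' and the last '.' of the link.

```csharp
using System;

public static class CategoryCodeDemo
{
    // Hypothetical helper: returns the text between the last '/' and the last '.'.
    public static string ExtractCode(string href)
    {
        var slash = href.LastIndexOf('/');
        var dot = href.LastIndexOf('.');
        // No extension after the last slash: return everything after the slash.
        if (dot <= slash) return href.Substring(slash + 1);
        return href.Substring(slash + 1, dot - slash - 1);
    }

    public static void Main()
    {
        // Hypothetical href, shaped like the page rule used in this article.
        Console.WriteLine(ExtractCode("//list.youku.com/category/show/c_96.html"));
    }
}
```

Compared with a bare Substring call, the extra check keeps the helper from throwing when a link has no file extension.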

Code examples

        // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;

        // Category list page
        private static readonly string _url = "http://list.youku.com/category/video/c_0.html";

        /// <summary>
        /// Gets all the categories
        /// </summary>
        public static List<VideoType> GetVideoTypes()
        {
            // Load the web page
            var web = new HtmlWeb();
            var doc = web.Load(_url);

            // Parse the content: select every category link
            var allTypes = doc.DocumentNode.SelectNodes("//*[@id='filterPanel']/div/ul/li/a").ToList();

            // Skip the first entry ("All") in the category list
            var typeResults = allTypes.Skip(1).ToList();

            var reList = new List<VideoType>();
            foreach (var node in typeResults)
            {
                var href = node.Attributes["href"].Value;
                reList.Add(new VideoType
                {
                    // The code is the file name between the last '/' and the last '.'
                    Code = href.Substring(href.LastIndexOf("/") + 1, href.LastIndexOf(".") - href.LastIndexOf("/") - 1),
                    Name = node.InnerText
                });
            }

            return reList;
        }

2、 Crawl the total number of pages per category

code is the movie category code
Page rule: $"http://list.youku.com/category/show/{code}.html"
Crawl according to the page rule:

        /// <summary>
        /// Gets the total number of pages in the current category
        /// </summary>
        public static int GetPageCountByCode(string code)
        {
            var web = new HtmlWeb();
            var doc = web.Load($"http://list.youku.com/category/show/{code}.html");

            // Pager items (CssSelect is assumed to come from ScrapySharp.Extensions)
            var pageList = doc.DocumentNode.CssSelect(".yk-pages li").ToList();
            // Guard: with fewer than two pager items, assume a single page
            if (pageList.Count < 2) return 1;
            // The second-to-last item holds the last page number
            // (the final item is presumably the "next page" link)
            var lastSecond = pageList[pageList.Count - 2];
            return Convert.ToInt32(lastSecond.InnerText);
        }
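
The "second-to-last pager item" rule can be checked without touching the network. The sketch below is illustrative only: PagerDemo, ParsePageCount, and the sample labels are hypothetical, and it assumes the last pager item is a "next" link while the one before it is the highest page number.

```csharp
using System;
using System.Collections.Generic;

public static class PagerDemo
{
    // Given the text of each pager item in order, returns the total page count.
    // Falls back to 1 when the pager is missing or the label is not a number.
    public static int ParsePageCount(IReadOnlyList<string> pagerItems)
    {
        if (pagerItems == null || pagerItems.Count < 2) return 1;
        var candidate = pagerItems[pagerItems.Count - 2].Trim();
        return int.TryParse(candidate, out var pages) && pages > 0 ? pages : 1;
    }

    public static void Main()
    {
        // Hypothetical pager labels; the real page may use different text.
        var items = new List<string> { "Previous", "1", "2", "3", "...", "30", "Next" };
        Console.WriteLine(ParsePageCount(items));
    }
}
```

Using int.TryParse instead of Convert.ToInt32 means an unexpected label yields a single page rather than an exception, which may or may not be the behavior you want for a long crawl.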

3、 Get the content of each movie category by page number

According to the paging rule, each page of a category has its own address, where code is the category code and pageIndex is the page number
Page rule: http://list.youku.com/categor…{code}_s_1_d_1_p_{pageIndex}.html
Crawl according to the page rule:

        /// <summary>
        /// Gets the content of one page of the current category
        /// </summary>
        public static List<VideoContent> GetContentsByCode(string code, int pageIndex)
        {
            var web = new HtmlWeb();
            var doc = web.Load($"http://list.youku.com/category/show/{code}_s_1_d_1_p_{pageIndex}.html");

            var returnLi = new List<VideoContent>();
            // Each .yk-col4 block is one movie entry
            var contents = doc.DocumentNode.CssSelect(".yk-col4").ToList();

            foreach (var node in contents)
            {
                // The title link provides both the title text and the detail URL
                var titleLink = node.CssSelect(".info-list .title a").FirstOrDefault();
                returnLi.Add(new VideoContent
                {
                    PageIndex = pageIndex.ToString(),
                    Code = code,
                    Title = titleLink?.InnerText,
                    Hits = node.CssSelect(".info-list li").LastOrDefault()?.InnerText,
                    // ?. on the attribute avoids a NullReferenceException when it is missing
                    Href = titleLink?.Attributes["href"]?.Value,
                    ImgHref = node.CssSelect(".p-thumb img").FirstOrDefault()?.Attributes["src"]?.Value
                });
            }

            return returnLi;
        }
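
To see which addresses the paging rule produces for a category, the URLs can be generated on their own. This is a small sketch under stated assumptions: PageUrlDemo, BuildPageUrl, and BuildPageUrls are hypothetical names, the category code "c_96" is only a sample value, and the URL template is the one passed to web.Load in this section.

```csharp
using System;
using System.Collections.Generic;

public static class PageUrlDemo
{
    // Builds the address of one page of a category (page numbers start at 1).
    public static string BuildPageUrl(string code, int pageIndex)
    {
        if (pageIndex < 1) throw new ArgumentOutOfRangeException(nameof(pageIndex));
        return $"http://list.youku.com/category/show/{code}_s_1_d_1_p_{pageIndex}.html";
    }

    // Lazily yields the address of every page from 1 to pageCount.
    public static IEnumerable<string> BuildPageUrls(string code, int pageCount)
    {
        for (var i = 1; i <= pageCount; i++) yield return BuildPageUrl(code, i);
    }

    public static void Main()
    {
        // "c_96" is a sample code; real codes come from the category list.
        foreach (var url in BuildPageUrls("c_96", 3)) Console.WriteLine(url);
    }
}
```

Keeping URL construction in one place makes it easier to adjust the crawler if the site changes its paging scheme.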

4、 Test crawl results

        /// <summary>
        /// Prints the crawl results
        /// </summary>
        public static void PrintContent()
        {
            var count = 0;
            foreach (var node in GetVideoTypes())
            {
                var resultLi = new List<VideoContent>();
                // Get the total number of pages in the current category
                var pageCount = GetPageCountByCode(node.Code);
                // Walk through every page and collect its content
                for (var i = 1; i <= pageCount; i++)
                {
                    resultLi.AddRange(GetContentsByCode(node.Code, i));
                }
                Console.WriteLine($"Code {node.Code}: {pageCount} pages, {resultLi.Count} items");
                count += resultLi.Count;
            }

            Console.WriteLine($"Total number of items: {count}");
        }
