Crawlers, no longer a mystery

Time: 2021-1-14

1. Using the third-party class library HtmlAgilityPack

Official website: https://html-agility-pack.net/?z=codeplex

//Get HTML information from a file
var doc = new HtmlDocument();
doc.Load(filePath);

//Get HTML information from a string
var doc = new HtmlDocument();
doc.LoadHtml(html);

//Get HTML information from web address
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

We will use the last approach here. On the HtmlWeb object we can also set cookies, headers, and other request information to handle the needs of specific websites, such as logging in.
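For instance, here is a minimal sketch using HtmlAgilityPack's PreRequest hook; the URL, user agent, and cookie values below are placeholder assumptions, not values from any real site:

using System.Net;
using HtmlAgilityPack;

var web = new HtmlWeb();
// PreRequest runs before each request is sent; return true to proceed.
web.PreRequest = request =>
{
    request.UserAgent = "Mozilla/5.0"; // present ourselves as a browser
    // Hypothetical session cookie for a site that requires login
    request.CookieContainer = new CookieContainer();
    request.CookieContainer.Add(new Cookie("session", "your-session-id", "/", "example.com"));
    return true;
};
var doc = web.Load("https://example.com/");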

1.2 Explanation of usage

If you view the source code of a web page, you will see that the page is just a string. What a crawler does is find the information we want in that pile of strings and pick it out.
The previous filtering method was regular expressions (too cumbersome to write).
HtmlAgilityPack supports extracting the information we need through XPath.

1.2.1 Where to find the XPath?

Right-click the web page and choose Inspect; in the developer tools, right-click the target element and choose Copy → Copy XPath.

Through XPath, you can accurately get all the information of the elements you want.

1.2.2 How to get the information of a selected HTML element?

Get the selected element:

var web = new HtmlWeb();
var doc = web.Load(url);
var htmlnode = doc?.DocumentNode?.SelectSingleNode("/html/body/header");

Get the element's information:

htmlnode.InnerText;  // the element's text content
htmlnode.InnerHtml;  // the element's inner HTML
//Get a value by attribute name
htmlnode?.GetAttributeValue("src", "not found");
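To go beyond a single element, you can select a collection of nodes and read an attribute from each one. A small sketch; the //img XPath and url variable here are just illustrations:

var web = new HtmlWeb();
var doc = web.Load(url);
// SelectNodes returns null when nothing matches, so guard before iterating
var imgNodes = doc.DocumentNode.SelectNodes("//img");
if (imgNodes != null)
{
    foreach (var img in imgNodes)
    {
        Console.WriteLine(img.GetAttributeValue("src", "not found"));
    }
}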

2. A self-encapsulated class library

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

/// <summary>
/// Helper class for downloading HTML
/// </summary>
public static class LoadHtmlHelper
{
    /// <summary>
    /// Download a page from a URL address
    /// </summary>
    /// <param name="url">URL address</param>
    /// <returns>The parsed HTML document</returns>
    public static async ValueTask<HtmlDocument> LoadHtmlFromUrlAsync(string url)
    {
        HtmlWeb web = new HtmlWeb();
        return await web.LoadFromWebAsync(url);
    }

    /// <summary>
    /// Extension method: get a single node
    /// </summary>
    /// <param name="htmlDocument">Document object</param>
    /// <param name="xPath">XPath path</param>
    /// <returns>The matching node, or null</returns>
    public static HtmlNode GetSingleNode(this HtmlDocument htmlDocument, string xPath)
    {
        return htmlDocument?.DocumentNode?.SelectSingleNode(xPath);
    }

    /// <summary>
    /// Extension method: get multiple nodes
    /// </summary>
    /// <param name="htmlDocument">Document object</param>
    /// <param name="xPath">XPath path</param>
    /// <returns>The matching nodes, or null</returns>
    public static HtmlNodeCollection GetNodes(this HtmlDocument htmlDocument, string xPath)
    {
        return htmlDocument?.DocumentNode?.SelectNodes(xPath);
    }

    /// <summary>
    /// Extension method: get multiple nodes
    /// </summary>
    /// <param name="htmlNode">Node object</param>
    /// <param name="xPath">XPath path</param>
    /// <returns>The matching nodes, or null</returns>
    public static HtmlNodeCollection GetNodes(this HtmlNode htmlNode, string xPath)
    {
        return htmlNode?.SelectNodes(xPath);
    }

    /// <summary>
    /// Extension method: get a single node
    /// </summary>
    /// <param name="htmlNode">Node object</param>
    /// <param name="xPath">XPath path</param>
    /// <returns>The matching node, or null</returns>
    public static HtmlNode GetSingleNode(this HtmlNode htmlNode, string xPath)
    {
        return htmlNode?.SelectSingleNode(xPath);
    }

    /// <summary>
    /// Download a picture
    /// </summary>
    /// <param name="url">Picture address</param>
    /// <param name="filePath">File path to save to</param>
    /// <returns>Whether the file was written</returns>
    public static async ValueTask<bool> DownloadImg(string url, string filePath)
    {
        // Note: in a real application, reuse a single HttpClient instance
        // instead of creating a new one per call.
        HttpClient httpClient = new HttpClient();
        try
        {
            var bytes = await httpClient.GetByteArrayAsync(url);
            using (FileStream fs = File.Create(filePath))
            {
                fs.Write(bytes, 0, bytes.Length);
            }
            return File.Exists(filePath);
        }
        catch (Exception ex)
        {
            throw new Exception("Exception while downloading picture", ex);
        }
    }
}
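A minimal usage sketch of the helper class above, assuming it is called from an async method; the URL, XPath expressions, and save folder are placeholders:

// Inside an async method:
var doc = await LoadHtmlHelper.LoadHtmlFromUrlAsync("https://example.com/");
var title = doc.GetSingleNode("//title")?.InnerText;
var imgs = doc.GetNodes("//img");
if (imgs != null)
{
    foreach (var img in imgs)
    {
        var src = img.GetAttributeValue("src", "");
        if (!string.IsNullOrEmpty(src))
        {
            // Assumes the "imgs" folder already exists
            await LoadHtmlHelper.DownloadImg(src, Path.Combine("imgs", Path.GetFileName(src)));
        }
    }
}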

3. A crawler case I wrote, which crawls the website https://www.meitu131.com/

The data storage layer is not implemented. I was too lazy to write it, so that part is up to you. For now my data is temporarily stored in files.
GitHub address: https://github.com/ZhangQueque/quewaner.Crawler.git
