1. Using the third-party class library HtmlAgilityPack
Official website: https://html-agility-pack.net/?z=codeplex
// Load HTML from a file
var doc = new HtmlDocument();
doc.Load(filePath);
// Load HTML from a string
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Load HTML from a web address
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
We'll use the last approach here:
var web = new HtmlWeb();
var doc = web.Load(url);
On the HtmlWeb object we can also set cookies, headers, and other request details to handle site-specific needs such as logging in, as in the sketch below.
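For example, here is a minimal sketch of attaching a header and a cookie through HtmlWeb's PreRequest callback; the header value, cookie name, and cookie value are placeholders, not from the original post:

using System.Net;
using HtmlAgilityPack;

var web = new HtmlWeb();
// PreRequest fires before each request; return true to let the request proceed
web.PreRequest += request =>
{
    request.Headers.Add("Accept-Language", "en-US");  // placeholder header
    var cookies = new CookieContainer();
    cookies.Add(new Cookie("session", "your-session-id", "/", "html-agility-pack.net")); // placeholder cookie
    request.CookieContainer = cookies;
    return true;
};
var doc = web.Load("http://html-agility-pack.net/");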
1.2 Explanation of usage
If you look at a page's source, you'll see that a web page is just one long string. What a crawler does is find the information we want in that pile of text and pick it out.
The old way of filtering was regular expressions, which are too cumbersome to write.
HtmlAgilityPack instead lets us extract the information we need with XPath.
1.2.1 Where to find the XPath?
Right-click the element in the browser and choose Inspect; in the developer tools, right-click the highlighted node and choose Copy → Copy XPath.
With that XPath you can accurately reach exactly the elements whose information you want.
1.2.2 Getting the information of the selected HTML element
Get the selected element:
var web = new HtmlWeb();
var doc = web.Load(url);
var htmlnode = doc?.DocumentNode?.SelectSingleNode("/html/body/header");
Get the element's information:
var text = htmlnode.InnerText;   // text content of the node
var html = htmlnode.InnerHtml;   // HTML markup inside the node
// Read an attribute value, with a fallback when the attribute is missing
var src = htmlnode?.GetAttributeValue("src", "not found");
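Putting these calls together, here is a small sketch that lists every link on a page; the URL and the //a[@href] XPath are illustrative assumptions, not from the original post:

using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("http://html-agility-pack.net/");   // sample page
// SelectNodes returns null when nothing matches, so guard before iterating
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (var link in links)
    {
        // InnerText is the link text; GetAttributeValue reads href with a fallback
        Console.WriteLine($"{link.InnerText.Trim()} -> {link.GetAttributeValue("href", "not found")}");
    }
}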
2. Self-encapsulated class library
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

/// <summary>
/// Helper class for downloading HTML
/// </summary>
public static class LoadHtmlHelper
{
    /// <summary>
    /// Download a page from a URL address
    /// </summary>
    /// <param name="url">URL address</param>
    /// <returns>The loaded document</returns>
    public static async ValueTask<HtmlDocument> LoadHtmlFromUrlAsync(string url)
    {
        HtmlWeb web = new HtmlWeb();
        return await web.LoadFromWebAsync(url);
    }

    /// <summary>
    /// Extension method: get a single node
    /// </summary>
    /// <param name="htmlDocument">Document object</param>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The matching node, or null</returns>
    public static HtmlNode GetSingleNode(this HtmlDocument htmlDocument, string xPath)
    {
        return htmlDocument?.DocumentNode?.SelectSingleNode(xPath);
    }

    /// <summary>
    /// Extension method: get multiple nodes
    /// </summary>
    /// <param name="htmlDocument">Document object</param>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The matching nodes, or null</returns>
    public static HtmlNodeCollection GetNodes(this HtmlDocument htmlDocument, string xPath)
    {
        return htmlDocument?.DocumentNode?.SelectNodes(xPath);
    }

    /// <summary>
    /// Extension method: get multiple nodes
    /// </summary>
    /// <param name="htmlNode">Node object</param>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The matching nodes, or null</returns>
    public static HtmlNodeCollection GetNodes(this HtmlNode htmlNode, string xPath)
    {
        return htmlNode?.SelectNodes(xPath);
    }

    /// <summary>
    /// Extension method: get a single node
    /// </summary>
    /// <param name="htmlNode">Node object</param>
    /// <param name="xPath">XPath expression</param>
    /// <returns>The matching node, or null</returns>
    public static HtmlNode GetSingleNode(this HtmlNode htmlNode, string xPath)
    {
        return htmlNode?.SelectSingleNode(xPath);
    }

    /// <summary>
    /// Download an image
    /// </summary>
    /// <param name="url">Image address</param>
    /// <param name="filePath">Target file path</param>
    /// <returns>True when the file was written</returns>
    public static async ValueTask<bool> DownloadImg(string url, string filePath)
    {
        HttpClient httpClient = new HttpClient();
        try
        {
            var bytes = await httpClient.GetByteArrayAsync(url);
            using (FileStream fs = File.Create(filePath))
            {
                fs.Write(bytes, 0, bytes.Length);
            }
            return File.Exists(filePath);
        }
        catch (Exception ex)
        {
            throw new Exception("Image download failed", ex);
        }
    }
}
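A quick usage sketch of the helper class; the page URL, XPath, and output file name are placeholders I chose for illustration:

using System;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        // Load the page through the helper
        var doc = await LoadHtmlHelper.LoadHtmlFromUrlAsync("http://html-agility-pack.net/");

        // Grab every image node via the extension method
        var imgs = doc.GetNodes("//img");
        if (imgs != null && imgs.Count > 0)
        {
            var src = imgs[0].GetAttributeValue("src", "not found");
            Console.WriteLine($"First image: {src}");
            // Assumes src is an absolute URL; a relative path would need resolving first
            await LoadHtmlHelper.DownloadImg(src, "first-image.jpg"); // placeholder file name
        }
    }
}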
3. A crawler case I wrote, crawling the site https://www.meitu131.com/
The data storage layer is not implemented (I was too lazy to write it; that part is up to you). For now the crawled data is simply saved to files.
GitHub address: https://github.com/ZhangQueque/quewaner.Crawler.git