.NET Core: scrape website articles on a schedule and email them to your mailbox

Time:2020-2-10

Preface

Hello everyone, this is Xiaochen. I haven't updated my blog in a long time, so today I'm bringing you something practical: a small tool that grabs the Cnblogs homepage every 5 minutes and emails the results to your mailbox at 9:00 the next morning. For example, when I arrive at the company at 9:00 on February 14, 2018, I receive an email containing the articles that appeared on the Cnblogs homepage on February 13, 2018. I wrote this tool because I have always had the habit of reading blogs, but recently, for various reasons, I sometimes go days without reading any, and missing a good article in the meantime makes me sad. So I built a tool that archives the homepage and emails it to me every day, and I never have to worry about missing a good article again. Why only grab the homepage? Because the quality of articles on the Cnblogs homepage is relatively high.

Get ready

For a continuously running tool, logging is essential. I use NLog for logging; it has a very good log archiving feature. HTTP requests may fail due to network problems, so I use Polly to retry them. Parsing the web pages is done with HtmlAgilityPack, which requires some knowledge of XPath. The components are listed in detail below:

Component        Purpose                        GitHub
NLog             Logging                        https://github.com/NLog/NLog
Polly            Retrying failed HTTP requests  https://github.com/App-vNext/Polly
HtmlAgilityPack  Web page parsing               https://github.com/zzzprojects/html-agility-pack
MailKit          Sending mail                   https://github.com/jstedfast/MailKit

For components you are not familiar with, visit their GitHub pages for more information.

Reference articles

//www.jb51.net/article/112595.htm

Fetching & parsing the Cnblogs homepage

I use HttpWebRequest to make HTTP requests. Here is my simple wrapper class:


using System;
using System.IO;
using System.Net;
using System.Text;

namespace CnBlogSubscribeTool
{
 /// <summary>
 /// Simple Http Request Class
 /// .NET Framework >= 4.0
 /// Author:stulzq
 /// CreatedTime:2017-12-12 15:54:47
 /// </summary>
 public class HttpUtil
 {
 static HttpUtil()
 {
  //Set the connection limit. The default limit is 2.
  ServicePointManager.DefaultConnectionLimit = 1024;
 }

 /// <summary>
 /// Default Timeout 20s
 /// </summary>
 public static int DefaultTimeout = 20000;

 /// <summary>
 /// Is Auto Redirect
 /// </summary>
 public static bool DefalutAllowAutoRedirect = true;

 /// <summary>
 /// Default Encoding
 /// </summary>
 public static Encoding DefaultEncoding = Encoding.UTF8;

 /// <summary>
 /// Default UserAgent
 /// </summary>
 public static string DefaultUserAgent =
  "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
  ;

 /// <summary>
 /// Default Referer
 /// </summary>
 public static string DefaultReferer = "";

 /// <summary>
 /// httpget request
 /// </summary>
 /// <param name="url">Internet Address</param>
 /// <returns>string</returns>
 public static string GetString(string url)
 {
  var stream = GetStream(url);
  string result;
  using (StreamReader sr = new StreamReader(stream))
  {
  result = sr.ReadToEnd();
  }
  return result;

 }

 /// <summary>
 /// httppost request
 /// </summary>
 /// <param name="url">Internet Address</param>
 /// <param name="postData">Post request data</param>
 /// <returns>string</returns>
 public static string PostString(string url, string postData)
 {
  var stream = PostStream(url, postData);
  string result;
  using (StreamReader sr = new StreamReader(stream))
  {
  result = sr.ReadToEnd();
  }
  return result;

 }

 /// <summary>
 /// Create Response
 /// </summary>
 /// <param name="url"></param>
 /// <param name="post">Is post Request</param>
 /// <param name="postData">Post request data</param>
 /// <returns></returns>
 public static WebResponse CreateResponse(string url, bool post, string postData = "")
 {
  var httpWebRequest = WebRequest.CreateHttp(url);
  httpWebRequest.Timeout = DefaultTimeout;
  httpWebRequest.AllowAutoRedirect = DefalutAllowAutoRedirect;
  httpWebRequest.UserAgent = DefaultUserAgent;
  httpWebRequest.Referer = DefaultReferer;
  if (post)
  {

  var data = DefaultEncoding.GetBytes(postData);
  httpWebRequest.Method = "POST";
  httpWebRequest.ContentType = "application/x-www-form-urlencoded;charset=utf-8";
  httpWebRequest.ContentLength = data.Length;
  using (var stream = httpWebRequest.GetRequestStream())
  {
   stream.Write(data, 0, data.Length);
  }
  }

  try
  {
  var response = httpWebRequest.GetResponse();
  return response;
  }
  catch (Exception e)
  {
  throw new Exception(string.Format("Request error,url:{0},IsPost:{1},Data:{2},Message:{3}", url, post, postData, e.Message), e);
  }
 }

 /// <summary>
 /// http get request
 /// </summary>
 /// <param name="url"></param>
 /// <returns>Response Stream</returns>
 public static Stream GetStream(string url)
 {
  var stream = CreateResponse(url, false).GetResponseStream();
  if (stream == null)
  {

  throw new Exception("Response error,the response stream is null");
  }
  else
  {
  return stream;

  }
 }

 /// <summary>
 /// http post request
 /// </summary>
 /// <param name="url"></param>
 /// <param name="postData">post data</param>
 /// <returns>Response Stream</returns>
 public static Stream PostStream(string url, string postData)
 {
  var stream = CreateResponse(url, true, postData).GetResponseStream();
  if (stream == null)
  {

  throw new Exception("Response error,the response stream is null");
  }
  else
  {
  return stream;

  }
 }


 }
}

Get home page data


string res = HttpUtil.GetString("https://www.cnblogs.com");

Parsing the data

We have successfully obtained the HTML, but how do we extract the information we need (article title, URL, summary, author, publish time)? This is where HtmlAgilityPack comes in: a component that parses web pages using XPath.

Load the HTML we obtained earlier:


HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

Inspecting the page source, each article on the homepage sits in a div with class post_item, and the content we need is inside its child div with class post_item_body. First, select all of those body divs:

//Get all article data items
var itemBodys = doc.DocumentNode.SelectNodes("//div[@class='post_item_body']");

Continuing the analysis: the article title is the a tag under the h3 tag inside the post_item_body div, the summary is in the p tag with class post_item_summary, and the publish time and author are in the div with class post_item_foot. With that analysis done, we can extract the data we want:

foreach (var itemBody in itemBodys)
{
 //Title element
 var titleElem = itemBody.SelectSingleNode("h3/a");
 //Get title
 var title = titleElem?.InnerText;
 // get URL
 var url = titleElem?.Attributes["href"]?.Value;

 //Summary elements
 var summaryElem = itemBody.SelectSingleNode("p[@class='post_item_summary']");
 //Get summary
 var summary = summaryElem?.InnerText.Replace("\r\n", "").Trim();

 //Data item bottom element
 var footElem = itemBody.SelectSingleNode("div[@class='post_item_foot']");
 //Get author
 var author = footElem?.SelectSingleNode("a")?.InnerText;
 //Get article release time
 var publishTime = Regex.Match(footElem?.InnerText ?? "", @"\d+-\d+-\d+ \d+:\d+").Value;
 Console.WriteLine($"Title: {title}");
 Console.WriteLine($"Url: {url}");
 Console.WriteLine($"Summary: {summary}");
 Console.WriteLine($"Author: {author}");
 Console.WriteLine($"PublishTime: {publishTime}");
 Console.WriteLine("--------Gorgeous split line--------");
}

Run it, and we successfully get the information we want. Next, define a Blog class to hold the data:

public class Blog
{
 /// <summary>
 /// Title
 /// </summary>
 public string Title { get; set; }

 /// <summary>
 /// Blog URL
 /// </summary>
 public string Url { get; set; }

 /// <summary>
 /// Summary
 /// </summary>
 public string Summary { get; set; }

 /// <summary>
 /// Author
 /// </summary>
 public string Author { get; set; }

 /// <summary>
 /// Publish time
 /// </summary>
 public DateTime PublishTime { get; set; }
}

Retrying failed HTTP requests

We use Polly to retry when an HTTP request fails, configured to retry 3 times:

//Initialize the retry policy
_retryTwoTimesPolicy =
 Policy
 .Handle<Exception>()
 .Retry(3, (ex, count) =>
 {
  _logger.Error("Execution failed! Retry {0}", count);
  _logger.Error("Exception from {0}", ex.GetType().Name);
 });

Test it:

As you can see, Polly retries three times when an exception is thrown, and gives up if all three retries fail.
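To see the retry behavior without a real network failure, here is a minimal, self-contained sketch (the class and method names are mine, not from the repo) that runs an always-failing action under the same kind of 3-retry policy:

```csharp
using System;
using Polly; // NuGet package "Polly"

public class RetryDemo
{
    // Executes an always-failing action under a 3-retry policy and
    // returns how many times the action actually ran.
    public static int RunWithRetry()
    {
        int attempts = 0;
        var policy = Policy
            .Handle<Exception>()
            .Retry(3, (ex, count) => Console.WriteLine($"Execution failed! Retry {count}"));

        try
        {
            policy.Execute(() =>
            {
                attempts++;
                throw new Exception("simulated network error");
            });
        }
        catch (Exception)
        {
            // All retries exhausted; Polly rethrows the last exception.
        }
        return attempts; // 1 initial try + 3 retries = 4
    }

    public static void Main()
    {
        Console.WriteLine($"Action ran {RunWithRetry()} times");
    }
}
```

The action runs four times in total: the initial attempt plus three retries, after which Polly rethrows the exception to the caller.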

Send mail

MailKit is used to send mail. It supports the IMAP, POP3, and SMTP protocols and works well across platforms. The following class library is adapted from what other Cnblogs users have shared:

using System.Collections.Generic;
using CnBlogSubscribeTool.Config;
using MailKit.Net.Smtp;
using MimeKit;

namespace CnBlogSubscribeTool
{
 /// <summary>
 /// send email
 /// </summary>
 public class MailUtil
 {
 private static bool SendMail(MimeMessage mailMessage,MailConfig config)
 {
  try
  {
   var smtpClient = new SmtpClient();
   smtpClient.Timeout = 10 * 1000; //set timeout to 10s
   smtpClient.Connect(config.Host, config.Port, MailKit.Security.SecureSocketOptions.None); //connect to the remote SMTP server
   smtpClient.Authenticate(config.Address, config.Password);
   smtpClient.Send(mailMessage); //send the mail
   smtpClient.Disconnect(true);
  return true;

  }
  catch
  {
  throw;
  }

 }

 /// <summary>
 /// Send mail
 /// </summary>
 /// <param name="config">Mail configuration</param>
 /// <param name="receives">Recipients</param>
 /// <param name="sender">Reply-to address</param>
 /// <param name="subject">Subject</param>
 /// <param name="body">Body</param>
 /// <param name="attachments">Attachment bytes</param>
 /// <param name="fileName">Attachment file name</param>
 /// <returns></returns>
 public static bool SendMail(MailConfig config,List<string> receives, string sender, string subject, string body, byte[] attachments = null,string fileName="")
 {
  var fromMailAddress = new MailboxAddress(config.Name, config.Address);
  
  var mailMessage = new MimeMessage();
  mailMessage.From.Add(fromMailAddress);
  
  foreach (var add in receives)
  {
  var toMailAddress = new MailboxAddress(add);
  mailMessage.To.Add(toMailAddress);
  }
  if (!string.IsNullOrEmpty(sender))
  {
  var replyTo = new MailboxAddress(config.Name, sender);
  mailMessage.ReplyTo.Add(replyTo);
  }
  var bodyBuilder = new BodyBuilder() { HtmlBody = body };

   //Attachments
   if (attachments != null)
   {
   if (string.IsNullOrEmpty(fileName))
   {
    fileName = "Unnamed file.txt";
  }
  var attachment = bodyBuilder.Attachments.Add(fileName, attachments);

   //Fix garbled Chinese attachment file names
  var charset = "GB18030";
  attachment.ContentType.Parameters.Clear();
  attachment.ContentDisposition.Parameters.Clear();
  attachment.ContentType.Parameters.Add(charset, "name", fileName);
  attachment.ContentDisposition.Parameters.Add(charset, "filename", fileName);

   //Fix: file names longer than 41 characters would otherwise be mangled
  foreach (var param in attachment.ContentDisposition.Parameters)
   param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
  foreach (var param in attachment.ContentType.Parameters)
   param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
  }

  mailMessage.Body = bodyBuilder.ToMessageBody();
  mailMessage.Subject = subject;

  return SendMail(mailMessage, config);

 }
 }
}

Test it:
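The message construction can also be exercised offline, without an SMTP server. The sketch below builds a MimeMessage the same way MailUtil does but does not send it; the class name and the sender address are placeholders of mine, not from the repo:

```csharp
using System.Collections.Generic;
using MimeKit; // NuGet package "MimeKit"

public class MailDemo
{
    // Builds the message like MailUtil.SendMail does, without actually
    // sending, so the construction can be verified without a mail server.
    public static MimeMessage Build(List<string> receives, string subject, string htmlBody)
    {
        var message = new MimeMessage();
        // Placeholder sender; the real tool reads this from MailConfig.
        message.From.Add(new MailboxAddress("CnBlogSubscribe", "sender@example.com"));
        foreach (var addr in receives)
        {
            message.To.Add(new MailboxAddress(addr));
        }
        message.Subject = subject;
        message.Body = new BodyBuilder { HtmlBody = htmlBody }.ToMessageBody();
        return message;
    }
}
```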

Notes

The scheduling of data grabbing and mail sending, handling of data when the program exits unexpectedly, and so on are not covered here; if you are interested, take a look at the source code (GitHub address at the end of the article).

The data is fetched incrementally. The reason for not using an RSS subscription is that the RSS feed updates slowly.
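One simple way to fetch incrementally is to remember the URLs already collected and keep only new items on each grab. This is a sketch of the idea, not necessarily the exact approach in the repo (the type and member names are mine):

```csharp
using System.Collections.Generic;

// Minimal item with just the fields used here.
public class BlogItem
{
    public string Title { get; set; }
    public string Url { get; set; }
}

public class IncrementalStore
{
    // URLs already collected; in the real tool this would be persisted to disk
    // so a restart does not resend old articles.
    private readonly HashSet<string> _seenUrls = new HashSet<string>();
    private readonly List<BlogItem> _archive = new List<BlogItem>();

    // Adds only items whose URL has not been seen yet; returns how many were new.
    public int AddNew(IEnumerable<BlogItem> fetched)
    {
        int added = 0;
        foreach (var item in fetched)
        {
            if (_seenUrls.Add(item.Url)) // Add returns false for duplicates
            {
                _archive.Add(item);
                added++;
            }
        }
        return added;
    }

    public IReadOnlyList<BlogItem> Archive => _archive;
}
```

Every 5-minute grab feeds its results through AddNew, so the archive accumulates each day's homepage without duplicates.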

Screenshot of complete program operation:

Each time an email is sent, the program resets its recorded timestamp to 9:00 of the current day. After each data grab it checks whether the current time minus the recorded time is greater than or equal to 24 hours; if so, it sends an email and updates the recorded time.
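That check is pure time arithmetic, and can be sketched like this (the names are mine, not necessarily those used in the repo):

```csharp
using System;

public static class SendSchedule
{
    // True when at least 24 hours have passed since the recorded time,
    // i.e. it is time to send the next mail.
    public static bool ShouldSend(DateTime recordedTime, DateTime now)
    {
        return (now - recordedTime).TotalHours >= 24;
    }

    // After sending, the recorded time is reset to 9:00 of the current day,
    // so the next mail goes out at about 9:00 the following morning.
    public static DateTime ResetToNineToday(DateTime now)
    {
        return new DateTime(now.Year, now.Month, now.Day, 9, 0, 0);
    }
}
```

After every grab the program calls ShouldSend; when it returns true, it emails the archived articles and resets the recorded time with ResetToNineToday.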

Screenshot of received mail:

In the screenshot the email subject says the 13th while the content is from the 14th; that is because I copied today's (the 14th's) data over the 13th's data to demonstrate the effect. Don't be misled.

An attachment is also included for easy archiving:

OK, that's it for the introduction. I have deployed the tool to a server; anyone who wants to enjoy the service can leave their email address in the comments. :)

Source code: https://github.com/stulzq/cnblogsubscribetool