Extracting the plain-text content of a page with the tags removed

Time:2021-9-22
The code is as follows:
NodeList body_nodes=this.getParser().parse(body_filter);
for(int i=0;i<body_nodes.size();i++)
{
    Node node=body_nodes.elementAt(i);

    Parser body_parser=new Parser(node.toHtml());
    TextExtractingVisitor visitor=new TextExtractingVisitor();
    body_parser.visitAllNodesWith(visitor);
    body.append(visitor.getExtractedText());
}

TextExtractingVisitor, visitAllNodesWith, and the related classes and methods are very important, but they are rarely covered in discussions of visitors.
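To make the idea concrete, here is a minimal, dependency-free sketch of how a text-extracting visitor can work. It is only a simplified model of the pattern, not HTMLParser's actual implementation: the names `Visitor`, `TagNode`, `TextNode`, `TextCollector` and `extractText` are all hypothetical and were invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {

    // Hypothetical visitor interface (not HTMLParser's NodeVisitor).
    interface Visitor {
        void visitTag(String name);
        void visitText(String text);
    }

    // Hypothetical node type for a tiny document tree.
    interface Node {
        void accept(Visitor v);
    }

    // A leaf holding plain text.
    static final class TextNode implements Node {
        private final String text;
        TextNode(String text) { this.text = text; }
        public void accept(Visitor v) { v.visitText(text); }
    }

    // A tag with child nodes; it reports itself, then forwards the visitor to its children.
    static final class TagNode implements Node {
        private final String name;
        private final List<Node> children = new ArrayList<>();
        TagNode(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
        public void accept(Visitor v) {
            v.visitTag(name);
            for (Node child : children) {
                child.accept(v);
            }
        }
    }

    // Ignores every tag and accumulates only the text, in document order.
    static final class TextCollector implements Visitor {
        private final StringBuilder out = new StringBuilder();
        public void visitTag(String name) { /* tags are dropped */ }
        public void visitText(String text) { out.append(text); }
        String getText() { return out.toString(); }
    }

    // Plays the role that visitAllNodesWith plus getExtractedText play in the snippet above.
    static String extractText(Node root) {
        TextCollector collector = new TextCollector();
        root.accept(collector);
        return collector.getText();
    }

    public static void main(String[] args) {
        Node td = new TagNode("td",
                new TagNode("p", new TextNode("Hello, "), new TagNode("b", new TextNode("world"))),
                new TagNode("p", new TextNode("!")));
        System.out.println(extractText(td)); // prints "Hello, world!"
    }
}
```

The tree decides the traversal order and the visitor decides what to keep, which is why swapping in a different visitor changes the result without touching the parsing code.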
The full source code of the extractor is attached below:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Date;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.visitors.TextExtractingVisitor;

import com.extractor.Extractor;

public class ExtractorHangdian extends Extractor{
    public void extract()
    {
        BufferedWriter bw=null;
        String indextime;
        String title;
        StringBuffer body=new StringBuffer();
        NodeFilter time_filter=new AndFilter(new TagNameFilter("font"),new HasAttributeFilter("color","#808080"));
        NodeFilter title_filter1=new AndFilter(new TagNameFilter("td"),new HasChildFilter(new TagNameFilter("b")));
        NodeFilter body_filter=new AndFilter(new TagNameFilter("td"),new HasChildFilter(new TagNameFilter("p")));

        try
        {
            NodeList title_nodes=this.getParser().parse(title_filter1);
            Node node=title_nodes.elementAt(0);
            NodeList node2=node.getChildren();
            //title=node2.elementAt(0).toHtml(); /* '\r\n' */
            //title=node2.elementAt(1).toHtml(); /* font color="#000080" style="font-size:14.4px */
            //title=node2.elementAt(2).toHtml(); /* b */
            title=node2.elementAt(3).toHtml(); /* the title text, e.g. "Notice on subscription of teaching materials and registration of teachers' books" */

            bw=new BufferedWriter(new FileWriter(new File(this.getOutputPath()+title+".txt")));

            // Rebuild the page URL from the input file path (the offsets depend on the local directory layout)
            String url_seg1=getInputFilePath().substring(3,30);
            int end=getInputFilePath().lastIndexOf(".");
            String url_seg2=getInputFilePath().substring(30, end);
            String url_seg=url_seg1+".asp?"+url_seg2;
            url_seg=url_seg.replaceAll("\\\\","/"); // turn backslashes into forward slashes
            String url="http://"+url_seg;

            bw.write(url+NEWLINE);
            bw.write(title+NEWLINE);

        }
        catch(Exception e)
        {
            e.printStackTrace();
        }

        this.getParser().reset();
        try
        {
            NodeList time_nodes=this.getParser().parse(time_filter);
            Node time_node=time_nodes.elementAt(1); // index 1 is the second node matched by time_filter
            indextime=time_node.getNextSibling().toHtml();

            bw.write(indextime+NEWLINE);
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }

        this.getParser().reset(); // Extract all the plain text with the tags removed
        try
        {
            NodeList body_nodes=this.getParser().parse(body_filter);
            for(int i=0;i<body_nodes.size();i++)
            {
                Node node=body_nodes.elementAt(i);

                Parser body_parser=new Parser(node.toHtml());
                TextExtractingVisitor visitor=new TextExtractingVisitor();
                body_parser.visitAllNodesWith(visitor);
                body.append(visitor.getExtractedText());
            }
            bw.write(body+NEWLINE);

        }
        catch(Exception e)
        {
            e.printStackTrace();
        }

        // Close bw so that the buffered output is flushed to the file
        try
        {
            if(bw!=null)
                bw.close();
        }catch(IOException e)
        {
            e.printStackTrace();
        }
    }
}

By the way, if bw is not closed, the output file cannot be read back properly, because a BufferedWriter holds its data in memory until it is flushed or closed. This cost me several frustrating days, so watch out for it!
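The effect is easy to reproduce with the standard library alone. The following is a minimal sketch, not part of the extractor above: the class name `CloseWriterSketch` and its helper methods are hypothetical, and it assumes the text is short enough to fit in the writer's default buffer.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;

public class CloseWriterSketch {

    // Writes text and returns the file length while the writer is still open.
    // A short string stays in the BufferedWriter's in-memory buffer until
    // flush() or close() is called, so nothing has reached the file yet.
    static long lengthBeforeClose(File file, String text) throws IOException {
        BufferedWriter bw = new BufferedWriter(new FileWriter(file));
        try {
            bw.write(text);
            return file.length();
        } finally {
            bw.close();
        }
    }

    // Writes text with try-with-resources, which closes (and therefore
    // flushes) the writer even if write() throws.
    static void writeAndClose(File file, String text) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
            bw.write(text);
        }
    }

    // Reads the whole file back as a string (platform default charset).
    static String read(File file) throws IOException {
        return new String(Files.readAllBytes(file.toPath()));
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("extract", ".txt");
        file.deleteOnExit();
        System.out.println("bytes on disk before close: " + lengthBeforeClose(file, "hello"));
        writeAndClose(file, "hello");
        System.out.println("content after close: " + read(file));
    }
}
```

Using try-with-resources (Java 7 and later) instead of a separate close block at the end makes it much harder to forget the close call.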