How can Java quickly read a TXT-format corpus and match a specified string?
lionel_sun asked 2 months ago

I have a corpus of about 9 million lines in a single 4 GB file. I need to output every sentence that contains a specified verb.
But the file is so large that reading it one line at a time and matching each line takes a long time. Is there any way to speed up the processing?

BufferedReader cpreader = new BufferedReader(new InputStreamReader(new FileInputStream(this.getCorpusPath())));
String line = cpreader.readLine();
while (line != null)
{
    ArrayList<String> verbList = new ArrayList<>();
    matcher_line = Pattern.compile("(.*\\%\\&\\$cook\\%\\&\\$VB.*)").matcher(line);
    if (matcher_line.find())
    {
        System.out.println(line);
    }
    line = cpreader.readLine();
}


Reinterpretation replied 2 months ago

I’m also working on my graduation project in this area. Could I add you on QQ to ask a few questions?

5 Answers
araraloren answered 2 months ago

Reading the file itself should not be the problem, but you could try reading in fixed-size buffered chunks, because line lengths vary and that can hurt efficiency.
If you are matching a single fixed word, there are faster matching methods than a regular expression. If it really has to be a regular expression, I’m not sure.

lionel_sun replied 2 months ago

Thank you. Using if (line.contains(…)) is much faster for me. It finishes in about a minute.
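
For readers who want to try the same change, here is a minimal sketch of the contains approach. It assumes the literal to look for is the regex from the question with the escapes removed, and that the corpus is opened as a BufferedReader; the class name LiteralLineFilter and the sink parameter are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

final class LiteralLineFilter {
    // Assumption: the regex in the question, stripped of escapes, is this plain literal.
    static final String NEEDLE = "%&$cook%&$VB";

    /**
     * Reads the corpus line by line and hands every line that contains
     * the literal to the sink. Returns the number of matching lines.
     */
    static long filter(BufferedReader reader, String needle, Consumer<String> sink) {
        long matches = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Plain substring search: no regex engine involved.
                if (line.contains(needle)) {
                    sink.accept(line);
                    matches++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }
}
```

A caller could open the reader with Files.newBufferedReader(Paths.get(corpusPath), StandardCharsets.UTF_8) inside a try-with-resources block and pass System.out::println as the sink. The UTF-8 charset is a guess; use whatever encoding the corpus was written in.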

min answered 2 months ago

Your program processes the file line by line on a single thread, which is bound to be slow. Use multiple threads: each thread processes one line and then requests the next. Rather than handing out one line at a time, it is better to read a batch of lines into a buffer and distribute the batches across the threads, so that the CPU is fully used.
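
The batching idea in this answer could look roughly like the sketch below. It is only one possible arrangement, not a tuned implementation: one thread reads the file, groups lines into batches, and a fixed thread pool scans the batches. The class name, the batchSize and threads parameters, and the cap of 2 * threads batches in flight are all assumptions made for illustration. Whether it actually helps depends on whether the disk or the CPU is the bottleneck.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class BatchedParallelFilter {
    /** Scans batches of lines on a thread pool; matches reach the sink in file order. */
    static long filter(BufferedReader reader, String needle, int batchSize, int threads,
                       Consumer<String> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Deque<Future<List<String>>> inFlight = new ArrayDeque<>();
        long matches = 0;
        try {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    inFlight.add(pool.submit(scan(batch, needle)));
                    batch = new ArrayList<>(batchSize);
                    // Cap the number of pending batches so memory use stays bounded.
                    if (inFlight.size() >= 2 * threads) {
                        matches += emit(inFlight.remove(), sink);
                    }
                }
            }
            if (!batch.isEmpty()) {
                inFlight.add(pool.submit(scan(batch, needle)));
            }
            while (!inFlight.isEmpty()) {
                matches += emit(inFlight.remove(), sink);
            }
            return matches;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    // Work done by a pool thread: keep the lines of one batch that contain the needle.
    private static Callable<List<String>> scan(List<String> batch, String needle) {
        return () -> {
            List<String> hits = new ArrayList<>();
            for (String l : batch) {
                if (l.contains(needle)) {
                    hits.add(l);
                }
            }
            return hits;
        };
    }

    // Waits for the oldest batch and forwards its hits on the calling thread.
    private static long emit(Future<List<String>> future, Consumer<String> sink) {
        try {
            List<String> hits = future.get();
            hits.forEach(sink);
            return hits.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Because results are taken from the oldest pending batch first, the output order matches the file even though the batches are scanned concurrently.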

lionel_sun replied 2 months ago

Thank you. I’m a novice and don’t know how to use threads. Using cache will

KaiLee answered 2 months ago

NIO + multithreading
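
One compact way to combine the two, offered only as a sketch of what this answer might mean: Files.lines from java.nio.file opens the file as a lazy stream, and parallel() lets the common fork/join pool spread the filtering over several threads. The class name NioParallelFilter and the UTF-8 charset are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class NioParallelFilter {
    /** Returns, in file order, every line of the corpus that contains the needle. */
    static List<String> filter(Path corpus, String needle) {
        // try-with-resources closes the underlying file when the stream is done.
        try (Stream<String> lines = Files.lines(corpus, StandardCharsets.UTF_8)) {
            return lines.parallel()
                        .filter(line -> line.contains(needle))
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This collects the matches into a list, which is fine when only a small share of the lines match. For very large result sets it would be better to write each match out as it is found.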

lihuanghe121 answered 2 months ago
Pattern.compile("(.*\\%\\&\\$cook\\%\\&\\$VB.*)")

This call is inside the loop, so the regular expression is recompiled for every line, which is very slow. Move it outside the while loop.
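
A minimal sketch of the hoisted version, assuming the regex is still wanted: the Pattern is compiled once as a constant and a single Matcher is reused with reset. The class and method names are invented for illustration. Since find() already searches anywhere in the line, the surrounding .* and the capture group from the original pattern are left out here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrecompiledLineFilter {
    // Compiled once, outside any loop. Only '$' needs escaping in this pattern.
    static final Pattern COOK_VERB = Pattern.compile("%&\\$cook%&\\$VB");

    /** Passes every line in which the pattern is found to the sink; returns the match count. */
    static long filter(BufferedReader reader, Consumer<String> sink) {
        // One Matcher, re-pointed at each new line instead of being re-created.
        Matcher matcher = COOK_VERB.matcher("");
        long matches = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (matcher.reset(line).find()) {
                    sink.accept(line);
                    matches++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }
}
```

A Pattern is immutable and safe to share between threads, while a Matcher is not, so in a multi-threaded version each thread would need its own Matcher.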

morriaty_the_murderer answered 2 months ago

Use an Aho-Corasick (AC) automaton. The constructed trie should take up less than 4 GB, so an ordinary laptop should be enough.
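
For anyone unfamiliar with the technique, below is a small textbook-style sketch of an Aho-Corasick automaton, which finds any of several fixed strings in one pass over each line. It is not taken from a particular library; the class and method names are made up, and the map-based nodes favour clarity over speed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // patterns ending at this state
        Node fail;                                      // longest proper suffix that is also a trie path
    }

    private final Node root = new Node();

    AhoCorasick(List<String> patterns) {
        // 1. Build the trie of all patterns (empty patterns are ignored).
        for (String p : patterns) {
            if (p.isEmpty()) continue;
            Node n = root;
            for (int i = 0; i < p.length(); i++) {
                n = n.next.computeIfAbsent(p.charAt(i), c -> new Node());
            }
            n.outputs.add(p);
        }
        // 2. Breadth-first pass to set the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit the patterns that end at the failure state.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every pattern occurrence in the text, in order of end position. */
    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < text.length(); i++) {
            n = step(n, text.charAt(i));
            found.addAll(n.outputs);
        }
        return found;
    }

    /** Returns true as soon as any pattern is seen in the text. */
    boolean containsAny(String text) {
        Node n = root;
        for (int i = 0; i < text.length(); i++) {
            n = step(n, text.charAt(i));
            if (!n.outputs.isEmpty()) return true;
        }
        return false;
    }

    // Follows failure links until a transition on c exists, or falls back to the root.
    private Node step(Node n, char c) {
        while (n != root && !n.next.containsKey(c)) {
            n = n.fail;
        }
        Node t = n.next.get(c);
        return (t != null) ? t : root;
    }
}
```

The automaton is built from the search strings, not from the corpus, so for a list of verbs it stays small. It mainly pays off when many verbs have to be matched at once; for a single literal, line.contains is simpler.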