I built a web-wide search engine from scratch!

Time: 2021-8-20

Preface

Because I am very interested in search engine technology, I set out to build one myself. After studying the limited information available on the Internet and learning on my own, I finally developed a small web-wide search engine. The project address and search test screenshots are at the bottom of this post.

This project is written in PHP, though the language matters less than the ideas, architecture, and algorithms.


General flow of a search engine

1、 Web page collection

Web page collection requires web crawlers. Because network conditions across the Internet are varied and unstable, a robust crawler system is needed to cope with complex situations. Crawling strategies are generally divided into depth-first and breadth-first; which one to choose depends on the situation. An HTTP request is very time-consuming, taking anywhere from one second to several seconds, so pages need to be fetched concurrently (I use curl_multi). If resources allow, crawling can also be distributed across a cluster.
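
As a rough illustration of the breadth-first strategy, here is a minimal Python sketch (not the project's PHP code). The in-memory link graph FAKE_WEB and the helper fake_fetch_links are hypothetical stand-ins for real HTTP requests, and concurrency is left out to keep the idea visible.

```python
from collections import deque


def crawl_breadth_first(seeds, fetch_links, max_pages=100):
    """Visit pages level by level, starting from the seed URLs."""
    frontier = deque(seeds)  # FIFO queue gives breadth-first order
    seen = set(seeds)        # URLs already queued, so nothing is fetched twice
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        try:
            links = fetch_links(url)
        except Exception:
            # An unstable network is the normal case: skip the page and go on.
            continue
        order.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical link graph standing in for the real web.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def fake_fetch_links(url):
    """Return the outgoing links of a page; raises KeyError for unknown pages."""
    return FAKE_WEB[url]


print(crawl_breadth_first(["a"], fake_fetch_links))
```

Replacing popleft() with pop() turns the queue into a stack and the traversal becomes depth-first.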

2、 Preprocessing

Preprocessing is the most complex part of a search engine; most ranking algorithms do their work at this stage. During preprocessing, the search engine processes the data in the following steps:

Extract keywords

The page captured by the spider is the same as the source code we see in the browser. This code is usually messy, and much of it has nothing to do with the main content of the page. Therefore, the search engine needs to do the following:
① Code denoising. Remove all the code from the web page, leaving only the text.
② Remove keywords that are not part of the main text, such as the navigation bar and other common areas shared across different pages.
③ Remove stop words. Stop words are words without specific meaning, such as the Chinese particles "de" and "zai".

Once the search engine has the cleaned text of the page, it uses its own word segmentation system to split the text into a list of terms, then stores the list in the database, mapped one-to-one to the URL of the page.
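
The steps above can be sketched in Python using only the standard library. This is an illustration under assumptions, not the project's implementation: the stop-word list is a tiny made-up English sample, and the regular-expression tokenizer only works for space-delimited languages (Chinese text would need a real word segmenter).

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


# Illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "of", "is", "and"}


def extract_keywords(html):
    """Strip markup, lowercase, tokenize, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


page = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><h1>The Search Engine</h1><p>Index of a page</p></body></html>"
)
print(extract_keywords(page))
```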

Web page deduplication

There is a lot of duplicate content on the Internet. If it is not handled, it seriously degrades the search experience. This step involves deduplication over massive data. Because a plain string comparison cannot tell whether two web pages are duplicates, the usual approach is to extract a fingerprint from each page (which involves natural language processing, word vectors, and so on) and then compare the fingerprints to remove duplicates.
Common techniques for measuring similarity include "cosine similarity" and "Hamming distance".
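
One well-known way to combine a fingerprint with Hamming distance is SimHash. The Python sketch below is my choice of example, not necessarily what this project uses; the 64-bit width, the use of MD5 as the token hash, and the threshold of 3 differing bits are all assumptions.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(tokens):
    """Fold the hashes of all tokens into one 64-bit fingerprint."""
    weights = [0] * FINGERPRINT_BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(tokens_a, tokens_b, threshold=3):
    """Treat two pages as duplicates if their fingerprints are close."""
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

Pages that share most of their tokens tend to get fingerprints that differ in only a few bits, so comparing fingerprints is far cheaper than comparing full texts.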

Web page denoising

During web page denoising, useless content such as tags is removed, and the most important phrases on the page are identified by making full use of the page markup (such as heading tags and strong tags), keyword density, internal link anchor text, and so on.
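
As a sketch of how markup can feed into phrase importance, the Python example below gives each word a score based on the heaviest tag that encloses it. The weights in TAG_WEIGHTS are made-up numbers for illustration, not values from this project.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed weights: text inside these tags counts more than plain text (1).
TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "strong": 2, "a": 2}


class WeightedTermCounter(HTMLParser):
    """Accumulate a score per word, weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._open_tags = []

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self._open_tags), default=1)
        for word in data.lower().split():
            self.scores[word] += weight


def score_terms(html):
    """Return a Counter mapping each word to its tag-weighted score."""
    counter = WeightedTermCounter()
    counter.feed(html)
    counter.close()
    return counter.scores


scores = score_terms("<h1>search engine</h1><p>a <strong>fast</strong> search</p>")
print(scores.most_common())
```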

Data storage and updates

Once the amount of data grows, every small problem becomes a big one. With a large volume of data to process, the choice and design of the database are particularly important, because we have to support both fast insertion and fast querying of massive data. For data that has already been saved, we also need to consider how it is refreshed and design an update strategy. Crawling and updating this much content places high demands on both the performance and the number of servers.

Web page importance analysis

Determine the weight of the web page and, combined with the important-phrase analysis described above, establish a ranking coefficient for each keyword in the page's keyword set P.

Inverted index

A search engine can look up matching content quickly because it uses an index. An index is a data structure. Generally speaking, search engines use an inverted index: the web page content is segmented into terms, and each term is mapped to the list of IDs of all documents that contain it. The details are well documented elsewhere. A search engine needs a high recall rate while still returning good results, so the word segmenter and the segmentation strategy must be chosen carefully.
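
A minimal Python sketch of the inverted index idea follows. The function names and the tiny document set are made up for illustration; a real index would live on disk and store positions and weights as well.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it.

    docs is a dict of {doc_id: list of terms}.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: ["search", "engine"],
    2: ["search", "index"],
    3: ["engine", "index"],
}
index = build_inverted_index(docs)
print(index["search"])
print(search_all(index, ["search", "index"]))
```

A lookup touches only the posting lists of the query terms instead of scanning every document, which is where the speed comes from.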

Indexing is divided into full indexing and incremental indexing. A full index rebuilds everything in one pass, which is relatively time-consuming. An incremental index only indexes the newly added content each time; its results are then merged with those of the old index at query time.
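
One way to picture this is a large full index plus a small delta index, sketched below in Python. The names full_index and delta_index and the two helper functions are hypothetical; both indexes are assumed to map a term to a list of document IDs.

```python
def search_with_delta(full_index, delta_index, term):
    """Query both indexes and merge the document IDs for one term."""
    merged = set(full_index.get(term, ())) | set(delta_index.get(term, ()))
    return sorted(merged)


def merge_indexes(full_index, delta_index):
    """Fold the delta into a new full index, e.g. during a periodic rebuild."""
    merged = {term: list(ids) for term, ids in full_index.items()}
    for term, ids in delta_index.items():
        merged[term] = sorted(set(merged.get(term, [])) | set(ids))
    return merged


full_index = {"search": [1, 2], "engine": [1]}
delta_index = {"search": [7], "crawler": [7]}  # built from newly crawled pages

print(search_with_delta(full_index, delta_index, "search"))
print(merge_indexes(full_index, delta_index))
```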

3、 Query service

As the name suggests, the query service handles the query requests that users submit in the search interface. The search engine builds a searcher and then processes each request in four steps.

Query rewrite

A considerable share of search queries are unclear or incomplete. If such a query is segmented and searched exactly as typed, the results will not be ideal. Query rewriting is therefore applied so that the search terms express the searcher's intent more accurately, which in turn achieves a high recall rate.

Segment the query according to query method and keywords

First, the query entered by the user is segmented into a keyword sequence. We call the query Q for now, so after segmentation Q = {q1, q2, q3, …, qn}.

Then the importance of each term for the query results is determined from the way the user wrote the query (for example, whether all the words are run together or separated by spaces) and from the part of speech of each keyword in Q.

Content filtering and elimination

Many web pages inevitably contain illegal content, so this content has to be filtered out to keep it from being displayed on the front end. Searchers also sometimes search for sensitive content, so the search query itself needs to be screened as well.
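
A very simple form of this is a blocklist check applied both to result text and to the incoming query, sketched below in Python. The blocklist entries, the result fields title and summary, and the function names are all placeholders; real filtering is usually far more elaborate than exact word matching.

```python
# Hypothetical blocklist; real lists are much larger and maintained separately.
BLOCKED_TERMS = {"badword", "forbidden"}


def contains_blocked_term(text):
    """True if any whitespace-separated word of the text is on the blocklist."""
    words = set(text.lower().split())
    return not words.isdisjoint(BLOCKED_TERMS)


def filter_results(results):
    """Drop results whose title or summary contains a blocked term."""
    return [
        r for r in results
        if not contains_blocked_term(r["title"] + " " + r["summary"])
    ]


def screen_query(query):
    """Return the query unchanged, or None if it should not be searched."""
    return None if contains_blocked_term(query) else query


results = [
    {"title": "Search basics", "summary": "How indexing works"},
    {"title": "Some page", "summary": "contains badword here"},
]
print(filter_results(results))
print(screen_query("forbidden topic"))
```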

Sorting search results

Given the search term set Q, we calculate the importance of each keyword in Q relative to each document, combine these values in an overall ranking algorithm, and obtain the ordered search results. The ranking algorithm is the core of a search engine and determines how accurate the results are. In practice, ranking is computed along many dimensions and is extremely complex.
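
A classic baseline for "importance of a keyword relative to a document" is TF-IDF. The Python sketch below uses it purely as an illustration; it is not the ranking formula of this project, and the smoothed IDF variant log(1 + N/df) is just one of several common choices.

```python
import math
from collections import Counter


def rank_documents(query_terms, docs):
    """Rank document IDs by the summed TF-IDF of the query terms.

    docs is a dict of {doc_id: list of terms}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in docs.items():
        if not tokens:
            continue
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(1 + n_docs / doc_freq[term])
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "a": ["search", "engine", "search"],
    "b": ["search", "index"],
    "c": ["cooking", "recipes"],
}
print(rank_documents(["search"], docs))
```

Production rankers add many more signals on top of this, such as page weight, tag-based term weights, and freshness.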

Show search results and document summaries

Once the results are ready, the search engine displays them in the user interface. Generally, the search terms are highlighted in red within each result to make them easier to spot.
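
Highlighting can be sketched in Python as follows. The inline red span is an assumed markup choice, and the function name is made up; the summary is HTML-escaped piece by piece so that the text itself cannot inject markup.

```python
import html
import re


def highlight(summary, terms):
    """Wrap every occurrence of a search term in a red <span>."""
    # Longest terms first so that "search engine" wins over "search".
    wanted = [re.escape(t) for t in sorted(terms, key=len, reverse=True) if t]
    if not wanted:
        return html.escape(summary)
    pattern = re.compile("(" + "|".join(wanted) + ")", re.IGNORECASE)
    out = []
    # With one capturing group, split() alternates: plain text, match, plain text...
    for i, piece in enumerate(pattern.split(summary)):
        text = html.escape(piece)
        if i % 2 == 1:
            text = '<span style="color:red">' + text + "</span>"
        out.append(text)
    return "".join(out)


print(highlight("Fast search engine", ["search"]))
```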


Other

Further optimizations include caching search results with Redis or other caching tools, and using a CDN and similar measures to keep responses fast.
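
The caching idea can be sketched in Python with a small in-memory cache that expires entries after a fixed time. The TTLCache class here is only a stand-in for Redis so that the example runs without a server; the names and the 60-second lifetime are assumptions.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: behave as if it was never cached
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def cached_search(query, cache, run_search):
    """Serve repeated queries from the cache; run the real search otherwise."""
    key = query.strip().lower()  # normalize so trivial variants share an entry
    results = cache.get(key)
    if results is None:
        results = run_search(key)
        cache.set(key, results)
    return results


cache = TTLCache(ttl_seconds=60)
print(cached_search("  Search Engine ", cache, lambda q: [q + " result"]))
```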


Search results test

[Six search result screenshots appeared here in the original post.]


Summary

The above is only a brief description of the general flow of a search engine; there are many more details in practice. The project address is www.qsask.com, and you are welcome to try it.
Limited by my current resources, the amount of data stored so far is not large: only in the tens of millions.