What happens in the background when you enter a web address


As a software developer, you should have a complete, layered understanding of how network applications work, including the technologies these applications rely on: the browser, HTTP, HTML, the web server, request handlers, and so on.

This article takes a closer look at what happens in the background when you enter a web address.

1. You enter a web address into the browser

2. The browser looks up the IP address of the domain name


The first step of navigation is to find the IP address of the domain name visited. The DNS lookup process is as follows:

Browser cache: The browser caches DNS records for some time. Interestingly, the operating system does not tell the browser the time-to-live of each DNS record, so each browser caches records for its own fixed duration (ranging from 2 to 30 minutes).

System cache: If the record is not found in the browser cache, the browser makes a system call (gethostbyname in Windows). This returns any record held in the operating system's cache.

Router cache: Next, the query is sent on to the router, which usually has its own DNS cache.

ISP DNS cache: The next place to check is the ISP's caching DNS server. The record can usually be found here.

Recursive search: Your ISP's DNS server begins a recursive search, starting from the root name servers, through the .com top-level name servers, to Facebook's name servers. The DNS server normally has the addresses of the .com name servers in its cache, so a query to the root name servers is usually unnecessary.
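The layered lookup described above can be sketched as a chain of caches that are checked in order. This is only an illustration under stated assumptions: the names DnsCache and resolve, the TTL values, and the address 192.0.2.1 (a reserved documentation address) are all hypothetical and do not reflect how any particular browser or resolver is implemented.

```python
import time


class DnsCache:
    """A tiny TTL cache standing in for one layer (browser, OS, router, ISP)."""

    def __init__(self, name, ttl_seconds):
        self.name = name
        self.ttl_seconds = ttl_seconds
        self._records = {}  # hostname -> (ip, expiry timestamp)

    def get(self, hostname, now=None):
        now = time.monotonic() if now is None else now
        entry = self._records.get(hostname)
        if entry is None:
            return None
        ip, expires_at = entry
        if now >= expires_at:
            # The record is stale, so drop it and report a miss.
            del self._records[hostname]
            return None
        return ip

    def put(self, hostname, ip, now=None):
        now = time.monotonic() if now is None else now
        self._records[hostname] = (ip, now + self.ttl_seconds)


def resolve(hostname, layers, recursive_lookup, now=None):
    """Check each cache layer in order; fall back to a recursive lookup."""
    for index, layer in enumerate(layers):
        ip = layer.get(hostname, now)
        if ip is not None:
            # Refill the layers closer to the user that missed.
            for missed in layers[:index]:
                missed.put(hostname, ip, now)
            return ip, layer.name
    ip = recursive_lookup(hostname)
    for layer in layers:
        layer.put(hostname, ip, now)
    return ip, "recursive"
```

The second value returned by resolve names the layer that answered, which makes it easy to see how a record that has expired in a short-lived cache can still be served by a longer-lived one.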

DNS recursive lookup is shown in the following figure:


One worrying thing about DNS is that an entire domain such as wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are several ways to mitigate this bottleneck:

Round-robin DNS is a solution in which the DNS lookup returns multiple IP addresses rather than just one. For example, facebook.com actually maps to four IP addresses.

A load balancer is a piece of hardware that listens on a particular IP address and forwards requests to the servers in a cluster. Major sites typically use this kind of expensive, high-performance load balancer.

Geographic DNS improves scalability by mapping a domain name to different IP addresses depending on the user's geographic location. It is well suited to static content, where the different servers do not have to keep shared state in sync.

Anycast is a routing technique in which a single IP address maps to multiple physical servers. Anycast does not fit well with TCP, so it is rarely used in that scenario.

Most DNS servers themselves use anycast to achieve efficient, low-latency DNS lookups.
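Round-robin (circular) DNS, the first technique in the list above, can be illustrated with a small sketch. It assumes a name server that rotates its list of addresses after every answer; the class name RoundRobinZone and the 192.0.2.x documentation addresses are hypothetical.

```python
from collections import deque


class RoundRobinZone:
    """Returns all addresses for a name, rotating the order on every answer."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        result = list(self._addresses)
        # Move the first address to the back so the next client sees a new head.
        self._addresses.rotate(-1)
        return result
```

Clients usually try the first address in the answer, so rotating the order spreads new connections across the servers without any server having to know about the others.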


3. The browser sends an HTTP request to the web server


Dynamic pages such as Facebook's home page expire from the browser cache very quickly, or even immediately, so the page cannot be served from the cache.

So the browser sends the following request to the Facebook server:

GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]

The GET request names the URL to fetch: "http://facebook.com/". The browser identifies itself (the User-Agent header) and states which types of response it will accept (the Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for subsequent requests.

The request also contains the cookies that the browser has stored for this domain. As you may already know, cookies are key-value pairs that track the state of a website across different page requests. Cookies store the name of the logged-in user, a secret value assigned to the user by the server, some of the user's settings, and so on. Cookies are stored in a text file on the client and are sent to the server with every request.
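As a rough sketch of how the key-value pairs in a Cookie header can be read, the following uses Python's standard http.cookies module. The helper name parse_cookie_header and the cookie values are hypothetical; the real values in the request above are elided and are not reproduced here.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header_value):
    """Parse the value of a Cookie request header into a plain dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie values, for illustration only.
cookies = parse_cookie_header("locale=en_US; theme=dark")
```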

There are many tools for viewing raw HTTP requests and their responses. The author prefers Fiddler, but there are other tools such as Firebug. These tools are very helpful when optimizing a website.

In addition to GET requests, there is another type of request, POST, which is typically used to submit forms. A GET request passes its parameters in the URL (e.g. http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, after the headers.
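The difference between the two request types (GET, and the form-submitting POST) is mostly a matter of where the parameters travel. The sketch below builds both kinds of request with Python's standard urllib without sending anything over the network. Whether this particular address accepts POST at all is an assumption made purely for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"id": "85"}

# GET: the parameters travel in the query string of the URL.
get_request = Request("http://robozzle.com/puzzle.aspx?" + urlencode(params))

# POST: the same parameters travel in the request body, after the headers.
post_request = Request(
    "http://robozzle.com/puzzle.aspx",
    data=urlencode(params).encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
```

Here urllib chooses the method from the presence of a body, so get_request.get_method() reports GET and post_request.get_method() reports POST.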

The trailing slash in "http://facebook.com/" is important. In this case the browser can safely add the slash. For an address like "http://example.com/folderOrFile", the browser cannot add a slash automatically because it does not know whether folderOrFile is a folder or a file. The browser then requests the address without the slash, and the server responds with a redirect, which costs an unnecessary round trip.
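The rule about when a slash can be added safely can be sketched as follows. This is a simplified illustration, not a description of any browser's actual URL handling; the function name add_root_slash is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def add_root_slash(url):
    """Add '/' only when the URL has no path at all (just scheme and host)."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    # A non-empty path is left alone: it could name a folder or a file.
    return urlunsplit(parts)
```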

4. The Facebook server responds with a permanent redirect


This is the response that the Facebook server sent back to the browser:

HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP="DSP LAW"
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT; path=/; domain=.facebook.com; httponly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, 12 Feb 2010 05:09:51 GMT
Content-Length: 0

The server responds with a 301 Moved Permanently response, telling the browser to go to "http://www.facebook.com/" instead of "http://facebook.com/".

Why does the server insist on a redirect instead of immediately sending the web page the user wants to see? There are several interesting answers to this question.

One reason has to do with search engine rankings. If a page has two addresses, such as http://www.igoro.com/ and http://igoro.com/, a search engine may treat them as two different sites, each with fewer incoming links and therefore a lower ranking. Search engines understand what a 301 permanent redirect means, so they combine the links to the www and non-www addresses into the ranking of a single site.

Another reason is that multiple addresses for the same content are not cache-friendly. When a page has several names, it may appear several times in caches.

5. The browser follows the redirect


The browser now knows that "http://www.facebook.com/" is the correct address to visit, so it sends another GET request:

GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=XW[...]; c_user=21[...]; x-referer=[...]
Host: www.facebook.com

The header information has the same meaning as in the previous request.
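The redirect-then-retry exchange in steps 4 and 5 can be sketched as a small loop. To keep the example self-contained, the network is replaced by a lookup table that mirrors the two responses shown in this article; the names follow_redirects and FAKE_SERVER and the limit of five redirects are assumptions made for illustration.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_redirects=5):
    """Call fetch(url) -> (status, headers) until a non-redirect arrives."""
    visited = [url]
    for _ in range(max_redirects):
        status, headers = fetch(url)
        if status not in REDIRECT_CODES or "Location" not in headers:
            return status, visited
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, headers["Location"])
        visited.append(url)
    raise RuntimeError("too many redirects")


# A stand-in for the network, mirroring the exchange described above.
FAKE_SERVER = {
    "http://facebook.com/": (301, {"Location": "http://www.facebook.com/"}),
    "http://www.facebook.com/": (200, {}),
}

status, visited = follow_redirects("http://facebook.com/", FAKE_SERVER.__getitem__)
```

The cap on the number of redirects is there because two misconfigured addresses can redirect to each other indefinitely.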

6. The server “processes” the request


The server receives the GET request, processes it, and sends back a response.

This may seem like a straightforward task, but a lot of interesting things happen along the way, even on a simple site like the author's blog, let alone on a heavily visited site like Facebook.

Web server software
Web server software (such as IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle it. A request handler is a program (written in ASP.NET, PHP, Ruby, ...) that reads the request and generates the HTML for the response.

In the simplest case, request handlers are stored in a file hierarchy whose structure mirrors the address structure of the website. For example, the address http://example.com/folder1/page1.aspx maps to the file /httpdocs/folder1/page1.aspx. The web server software can also be configured to route addresses to request handlers manually, so that the public address of page1.aspx can be http://example.com/folder1/page1.
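The mapping from addresses to handler files can be sketched as below. The document root /httpdocs and the page1 example come from the text above; the ROUTES table and the function name handler_path are hypothetical, and real web servers such as IIS or Apache express this through their own configuration rather than code like this.

```python
import posixpath

DOC_ROOT = "/httpdocs"  # document root named in the text above

# Hypothetical manual route: a public address without the file extension.
ROUTES = {"/folder1/page1": "/folder1/page1.aspx"}


def handler_path(url_path):
    """Map the path part of a URL to the handler file that should serve it."""
    url_path = ROUTES.get(url_path, url_path)
    # Normalize so that '..' segments cannot climb above the document root.
    safe = posixpath.normpath("/" + url_path.lstrip("/"))
    return DOC_ROOT + safe
```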

Request handler
The request handler reads the request, its parameters, and its cookies. It reads, and possibly updates, some data stored on the server. Then the request handler generates an HTML response.

Every dynamic website faces one interesting difficulty: how to store data. Small websites often have a single SQL database to store their data. Websites that store a lot of data or receive a large number of visits have to find a way to split the database across multiple machines. Solutions include sharding (splitting a table across multiple databases based on the primary key), replication, and simplified databases with relaxed consistency semantics.
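Sharding by primary key can be sketched in a few lines. This is a minimal illustration that assumes a fixed number of shards and a simple hash-modulo placement; the function name shard_for is hypothetical, and real systems often use more elaborate schemes so that shards can be added without moving most of the data.

```python
import zlib


def shard_for(primary_key, shard_count):
    """Pick a shard index from a stable hash of the primary key."""
    # crc32 is used instead of the built-in hash() because its result does
    # not change between processes, so every server picks the same shard.
    digest = zlib.crc32(str(primary_key).encode("utf-8"))
    return digest % shard_count
```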

Delegating work to a batch job is a cheap technique for keeping data up to date. For example, Facebook has to update the news feed in a timely fashion, but the data behind the "People you may know" feature may only need to be updated nightly (this is the author's guess; how the feature is actually implemented is not known). Batch updates leave some less important data stale, but they can make data updates much faster and simpler.

7. The server sends back an HTML response


Here is the response that the server generated and sent back:

HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
P3P: CP="DSP LAW"
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, 12 Feb 2010 09:05:55 GMT

[...]

The whole response is 35 kB, most of it in the byte blob at the end, which is truncated here.

The Content-Encoding header tells the browser that the response body is compressed with the gzip algorithm. After decompressing the blob, you can see the expected HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" id="facebook" class=" no_js">
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-language" content="en" />
...

In addition to compression, the headers specify whether and how to cache the page, which cookies to set (none in the response above), privacy information, and so on.
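The gzip step can be reproduced with Python's standard gzip module. The sketch below assumes the headers are available as a plain dictionary and handles only the gzip encoding; the function name decode_body and the sample HTML are hypothetical.

```python
import gzip


def decode_body(headers, body):
    """Undo the compression named by the Content-Encoding header, if any."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body


# What a server might put on the wire for a small page.
html = b"<html><head><title>Example</title></head><body>Hello</body></html>"
compressed = gzip.compress(html)
restored = decode_body({"Content-Encoding": "gzip"}, compressed)
```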

Note that the Content-Type header is set to "text/html". This header tells the browser to render the response as HTML rather than, say, download it as a file. The browser uses the header to decide how to interpret the response, but it also takes other factors into account, such as the extension of the URL.
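Reading a Content-Type header amounts to separating the media type from its parameters. A minimal sketch using Python's standard email.message module follows; the function names are hypothetical, and the check ignores the other factors a real browser weighs.

```python
from email.message import Message


def describe_content_type(header_value):
    """Split a Content-Type header into its media type and charset."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


def should_render_as_html(header_value):
    """Decide, from the header alone, whether to render the body as HTML."""
    media_type, _charset = describe_content_type(header_value)
    return media_type == "text/html"
```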

8. The browser begins rendering the HTML

Even before the browser has received the entire HTML document, it begins rendering the page.


9. The browser sends requests for the objects embedded in the HTML


As the browser renders the HTML, it notices tags that require content from other addresses. The browser then sends a request to retrieve each of these files.

Here are a few kinds of URLs that had to be fetched during this visit to facebook.com:

CSS style sheet
JavaScript file

Each of these addresses goes through a process similar to the one for the HTML page: the browser looks up the domain name in DNS, sends a request, follows redirects, and so on.

Unlike dynamic pages, however, static files can be cached by the browser. Some files may be read directly from the cache without contacting the server at all. The server's response includes expiration information for each static file, so the browser knows how long to cache it. In addition, each response may contain an ETag (entity tag) header that works like a version number. If the browser sees the ETag of a version of the file it already has, it can stop the transfer immediately.
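One common way the ETag check plays out in HTTP is a conditional request: the browser sends the ETag it already holds in an If-None-Match header, and the server answers 304 Not Modified with an empty body when the file has not changed. The sketch below shows the server side under simplifying assumptions; the function names and the choice of a truncated SHA-256 digest as the tag are hypothetical.

```python
import hashlib


def make_etag(content):
    """Derive a quoted version tag from the bytes of the file."""
    return '"' + hashlib.sha256(content).hexdigest()[:16] + '"'


def respond(content, request_headers):
    """Return (status, headers, body) for a request for a static file."""
    etag = make_etag(content)
    if request_headers.get("If-None-Match") == etag:
        # The browser already has this version, so send no body at all.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, content
```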

Can you guess what "fbcdn.net" in the addresses stands for? A safe bet is "Facebook content delivery network". Facebook uses a content delivery network (CDN) to distribute static files such as images, style sheets, and JavaScript files. These files are therefore replicated in many CDN data centers around the world.

Static content often accounts for most of a site's bandwidth, and it can easily be replicated across a CDN. Websites often use a third-party CDN. Facebook's static files, for example, are hosted by Akamai, the largest CDN provider.

For example, when you ping static.ak.fbcdn.net, you may get a response from an akamai.net server. Interestingly, if you ping again, a different server may respond, which shows the load balancing at work behind the scenes.

10. The browser sends asynchronous (Ajax) requests


In the spirit of Web 2.0, the client keeps communicating with the server even after the page has been rendered.

Take Facebook chat as an example. It stays in contact with the server to keep the list of your online and offline friends up to date. To update this list, JavaScript code running in the browser sends an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special address. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to find out which of your friends are online.

This pattern is usually called "Ajax", which stands for asynchronous JavaScript and XML, even though there is no particular reason why the server has to respond in XML. Facebook, for example, returns snippets of JavaScript in response to asynchronous requests.

Among other things, Fiddler lets you see the asynchronous requests sent by the browser. In fact, you can not only watch these requests passively but also modify and resend them. The fact that Ajax requests are this easy to spoof causes a lot of grief for developers of online games with scoreboards. (Of course, please don't cheat that way.)

Facebook chat is an example of an interesting problem with Ajax: pushing data from the server to the client. Because HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see whether any new messages have arrived.

Long polling is an interesting technique for reducing the load on the server in this scenario. If the server has no new messages when it is polled, it simply does not respond yet. If a message for this client arrives before the timeout, the server finds the outstanding request and returns the message as its response.
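The idea behind long polling can be sketched with a blocking queue: the server parks the request until a message arrives or the timeout passes. This is a toy model of the server side only, and it is not how Facebook chat is implemented; the class name Mailbox and its methods are hypothetical.

```python
import queue


class Mailbox:
    """Holds pending chat messages for one client."""

    def __init__(self):
        self._messages = queue.Queue()

    def deliver(self, message):
        """Called when a new message for this client arrives."""
        self._messages.put(message)

    def long_poll(self, timeout_seconds):
        """Block until a message arrives or the timeout passes."""
        try:
            return self._messages.get(timeout=timeout_seconds)
        except queue.Empty:
            # Nothing arrived in time; the client is expected to poll again.
            return None
```

Compared with polling every few seconds, the client sends one request per message (or per timeout), and a new message reaches it as soon as it is delivered rather than at the next polling interval.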