Installation of CentOS
- First, prepare the two CentOS image files. Here we use CentOS 6.5.
- Then open VMware and create a new virtual machine (select the typical, i.e. "classic", mode). Note that the virtual disk should be stored as a single file, so that no pieces get lost when the virtual machine is moved or copied later.
- After the virtual machine is started, the first boot takes about 20 minutes (because many files need to be updated, QAQ).
Network environment preparation: NAT configuration
Related knowledge: the difference between bridge mode and NAT mode
- Right-click the lit icon of two computers (the network adapter) in the lower right corner and open its settings; the interface shown in the figure appears.
Among the four network connection modes, bridge mode and NAT mode are the ones commonly used.
Choosing bridge mode is equivalent to treating the virtual machine as a separate computer sitting next to the Windows host on the same network. However, bridge mode is not suitable everywhere. For example, in companies and schools each person's host is sometimes assigned a fixed IP address. If the virtual machine then joins the network as a peer host, it needs an IP address of its own, which may conflict with the IP address of another computer. Therefore, we often use NAT mode in schools and companies.
NAT mode lets the virtual machine reach the Internet without needing an IP address in the same network segment as the host. It is equivalent to building a small LAN inside your own Windows host, so that the virtual machine accesses the Internet through the Windows host's network connection.
Related knowledge: how to fix the ipconfig command not being found in Windows?
The first time you enter the ipconfig command, this error may occur: 'ipconfig' is not recognized as an internal or external command, operable program or batch file.
First, search the system for ipconfig.exe and copy the path of the folder that contains it (normally C:\Windows\System32).
Then set the environment variable. The Path environment variable acts like a list of shortcuts that tells the system where to look for commands entered on the command line. Go to Control Panel – System and Security – System – Advanced system settings – Environment Variables, edit the Path variable under System variables, append the folder that contains ipconfig.exe, and click OK to complete the configuration.
At this point ipconfig may still report the error in the console. This is because the console was opened before the change and still holds the old settings. Close the console and open it again.
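To make the role of the Path variable more concrete, here is a minimal sketch of how a command name is looked up in the listed directories. The function `find_command` and the temporary directories are hypothetical stand-ins for illustration only; the real Windows lookup has more rules (current directory, PATHEXT, and so on).

```python
import os
import tempfile

def find_command(name, path_value, extensions=(".exe", ".bat", ".cmd")):
    """Return the first file matching `name` in the directories of `path_value`, or None."""
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None

# Two temporary directories: one plays the role of the folder that
# contains ipconfig.exe, the other is an unrelated folder.
with tempfile.TemporaryDirectory() as fake_system32, tempfile.TemporaryDirectory() as other_dir:
    open(os.path.join(fake_system32, "ipconfig.exe"), "w").close()

    broken_path = other_dir                                  # folder missing from Path
    fixed_path = os.pathsep.join([other_dir, fake_system32]) # folder appended to Path

    missing = find_command("ipconfig", broken_path)
    found = find_command("ipconfig", fixed_path)

print(missing)  # None: the command cannot be located
print(os.path.basename(found))  # ipconfig.exe
```

The lookup only succeeds once the folder that holds the executable is part of the searched list, which is what editing Path achieves.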
- First of all, we need to allocate a new network segment for the NAT network so that it does not conflict with the host's network segment. Click Edit – Virtual Network Editor in the menu bar, remove the VMnet8 network (NAT mode), and then add VMnet8 again. You can see that the subnet address of VMnet8 is 192.168.186.0, which is in a different network segment from the host's IP address 192.168.0.103 (third octet 186 versus 0), thus avoiding an IP address conflict.
Note that after NAT mode is set up, the adapter needs to be initialized again. We first switch the adapter from NAT mode to bridge mode, and then switch it back to NAT mode to complete the initialization.
At this point, run ifconfig; it shows that the network is still not connected.
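The check that the NAT subnet does not collide with the host's network can be sketched with Python's standard `ipaddress` module. The helper name `nat_conflicts_with_host` is hypothetical, and the host mask 255.255.255.0 is an assumption, since the text only gives the host address.

```python
import ipaddress

def nat_conflicts_with_host(nat_subnet, host_ip, host_mask="255.255.255.0"):
    """Return True if the NAT subnet overlaps the network the host lives in."""
    nat_net = ipaddress.ip_network(nat_subnet)
    # strict=False lets us pass a host address and get its enclosing network
    host_net = ipaddress.ip_network(f"{host_ip}/{host_mask}", strict=False)
    return nat_net.overlaps(host_net)

# Values from the text: VMnet8 subnet and the Windows host address
ok_case = nat_conflicts_with_host("192.168.186.0/24", "192.168.0.103")
# A made-up bad choice: NAT subnet identical to the host's segment
bad_case = nat_conflicts_with_host("192.168.0.0/24", "192.168.0.103")

print(ok_case)   # False
print(bad_case)  # True
```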
- Next, we need to edit the network configuration files in the Linux system
First, modify the ifcfg-eth0 file (/etc/sysconfig/network-scripts/ifcfg-eth0 on CentOS 6), which initially looks like this
Change it as follows, then press Esc and type :wq to save and exit
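The figure with the edited file is not reproduced here. As a rough sketch built only from the settings explained in the next section, the static configuration might look like what the hypothetical helper `render_ifcfg` produces. The DNS1 value is a placeholder assumption (the VMware NAT gateway address, which often forwards DNS queries); the author used a Unicom DNS server whose address is not given in the text. The DEVICE and ONBOOT lines are likewise assumed, not taken from the original figure.

```python
def render_ifcfg(device, ipaddr, netmask, gateway, dns1):
    """Build the text of a static ifcfg-<device> file from the given values."""
    settings = [
        ("DEVICE", device),
        ("ONBOOT", "yes"),        # assumed: bring the interface up at boot
        ("BOOTPROTO", "static"),  # fixed address instead of DHCP
        ("IPADDR", ipaddr),
        ("NETMASK", netmask),
        ("GATEWAY", gateway),
        ("DNS1", dns1),
    ]
    return "".join(f"{key}={value}\n" for key, value in settings)

ifcfg_text = render_ifcfg(
    device="eth0",
    ipaddr="192.168.186.14",   # from the text
    netmask="255.255.255.0",   # from the text
    gateway="192.168.186.2",   # from the Virtual Network Editor
    dns1="192.168.186.2",      # placeholder assumption, see the note above
)
print(ifcfg_text)
```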
Explanation of the changes
1. BOOTPROTO: by default the IP address is obtained automatically through DHCP. The Dynamic Host Configuration Protocol (DHCP) is a network protocol used on LANs that lets a server dynamically assign IP addresses and configuration information to clients. We change it to static so that the IP address is fixed.
2. IPADDR assigns an IP address to this virtual machine. 186 is the network segment we set up, and 14 is the host part we choose ourselves. Taking 14 is a personal habit: I start from 10 and add 1 for each virtual machine. This is the fourth, so it gets 14.
3. NETMASK sets the subnet mask. A subnet mask cannot exist alone and must be used together with an IP address. It has only one function: to divide an IP address into two parts, the network address and the host address.
The subnet mask must follow certain rules. It must be a run of consecutive 1s followed by a run of consecutive 0s, 8 * 4 = 32 bits in total. The subnet mask 255.255.255.0 is 11111111.11111111.11111111.00000000 in binary. The 1 bits mark the network address part and the 0 bits mark the host address part. With 8 host bits, 255.255.255.0 covers 2^8 = 256 addresses. Two of them cannot be assigned to computers: the address whose host bits are all 0 is the network address, and the one whose host bits are all 1 is the broadcast address. Subtracting these two leaves 254 usable addresses.
The subnet mask for an IP address cannot be chosen arbitrarily. If the mask makes the subnet too large (too few network bits), then, following the subnet routing rules, a destination machine that is actually outside the local subnet is likely to be mistaken for a local one. The packets then stay within the local subnet until they time out and are discarded, so the data never reaches its destination. If the mask makes the subnet too small (too many network bits), communication between machines that really are in the same subnet is treated as cross-subnet traffic and handed to the default gateway, which increases the load on the gateway and reduces network efficiency. Therefore, the subnet mask should be set according to the size of the network.
4. GATEWAY sets the gateway, which is the value 192.168.186.2 shown in the Virtual Network Editor. A gateway is essentially the IP address through which one network reaches other networks. For example, take network A and network B. The IP address range of network A is 192.168.1.1 ~ 192.168.1.254 with subnet mask 255.255.255.0; the IP address range of network B is 192.168.2.1 ~ 192.168.2.254 with subnet mask 255.255.255.0. Without a router, the two networks cannot communicate over TCP/IP. Even if both are connected to the same switch (or hub), TCP/IP decides from the subnet mask (255.255.255.0) that the hosts are in different networks. To let the two networks communicate, a gateway must be used.
If a host in network A finds that the destination of a packet is not in the local network, it forwards the packet to its own gateway, which forwards it to the gateway of network B, which in turn delivers it to the host in network B. The same holds for packets from network B to network A. So TCP/IP can only communicate across networks once the gateway's IP address is set. Which machine does this address belong to? The gateway's IP address is the address of a device with routing capability: a router, a server with a routing protocol enabled (essentially a router), or a proxy server (also equivalent to a router).
5. DNS1 is set to a commonly used Unicom DNS server. The Domain Name System (DNS) is an Internet service. As a distributed database that maps domain names and IP addresses to each other, it lets people use the Internet more conveniently, without having to remember the numeric IP addresses that machines read directly.
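The subnet mask arithmetic and the gateway decision described above can be worked through in a few lines. This is only an illustrative sketch: the function names are hypothetical, the gateway address 192.168.1.1 for network A is an assumed example value, and `usable_hosts` only makes sense for masks that leave at least two host bits.

```python
import ipaddress

def mask_to_binary(mask):
    """Show a dotted-decimal mask as four groups of 8 bits."""
    return ".".join(f"{int(octet):08b}" for octet in mask.split("."))

def usable_hosts(mask):
    """Addresses in the subnet minus the network and broadcast addresses."""
    host_bits = 32 - ipaddress.ip_network(f"0.0.0.0/{mask}").prefixlen
    return 2 ** host_bits - 2

def next_hop(src_ip, dst_ip, mask, gateway):
    """Deliver directly if dst is in the sender's subnet, otherwise use the gateway."""
    local_net = ipaddress.ip_network(f"{src_ip}/{mask}", strict=False)
    if ipaddress.ip_address(dst_ip) in local_net:
        return dst_ip
    return gateway

print(mask_to_binary("255.255.255.0"))  # 11111111.11111111.11111111.00000000
print(usable_hosts("255.255.255.0"))    # 254

gw_a = "192.168.1.1"  # assumed gateway of network A
# Same network: delivered directly
same_net = next_hop("192.168.1.10", "192.168.1.20", "255.255.255.0", gw_a)
# Network A to network B: handed to the gateway
other_net = next_hop("192.168.1.10", "192.168.2.20", "255.255.255.0", gw_a)
# Subnet too large: a host in network B is wrongly treated as local
too_large = next_hop("192.168.1.10", "192.168.2.20", "255.255.0.0", gw_a)
# Subnet too small: a real neighbour is wrongly sent to the gateway
too_small = next_hop("192.168.1.10", "192.168.1.200", "255.255.255.128", gw_a)
print(same_net, other_net, too_large, too_small)
```

The last two calls reproduce the two failure modes of a badly chosen mask: direct delivery to a host that is not actually reachable locally, and unnecessary traffic through the default gateway.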
- After the configuration is completed, open another console and restart the network service (service network restart)
Now run ifconfig; the IP address matches the one we configured, which shows that the file was modified successfully
Baidu can now be reached, so the virtual machine is connected to the Internet
Terminal configuration: SecureCRT
In addition, we usually do not log in to the machine directly, but operate it remotely through a terminal. Here we use SecureCRT.
Open SecureCRT, click New Session, enter the virtual machine's IP address in Hostname and the virtual machine's user name in Username, and click Finish.
For how to install and activate SecureCRT, please search Baidu for "Installation and activation of SecureCRT".
Building the Hadoop cluster
Note that a distributed environment requires several machines to form a cluster. Here we take three machines as an example.
We first clone the virtual machine twice to get three virtual machines. Note that the virtual machine must be suspended or shut down while it is copied, because the state of a running operating system keeps changing and information is easily lost.
After cloning, note that slave1 and slave2 are identical to the master, so they conflict because they have the same IP address. We need to change that first.
First, we configure the IP addresses of slave1 and slave2 in the same way as on the first machine (each with its own address) and restart, but we find that they still cannot access the network.
This is because a clone also copies the attributes of the original network card. ifconfig shows that although the IP address has changed, the network card's hardware address (HWaddr) is still the same.
Click Network Adapter – Settings, remove the network adapter, and add it again. Running ifconfig again shows that the network card has changed, and the machine can go online.
In the same way, all three machines are set up and can connect to the Internet.
Since Hadoop needs Java, we have to install Java in the virtual machine first
To make copying files easier, we set up a shared folder between the host and the virtual machine
Afterwards, check whether the sharing works. If it does, the share_folder folder appears in the hgfs directory under /mnt. Copy both files to the src directory and run the JDK installer.
After installing Java, we need to configure environment variables. Run vim ~/.bashrc and edit the file as shown in the figure
Note that after editing we need to run source ~/.bashrc, otherwise the changes will not take effect
Installing the Hadoop cluster
First, decompress Hadoop
Enter the Hadoop directory and create a tmp directory in it to store the temporary files generated while Hadoop runs
Then enter the conf directory and first modify the masters file
Then modify the slaves file
Modify the core-site.xml file. This file configures the cluster's global parameters. It mainly defines system-level parameters such as the HDFS URL and the Hadoop temporary directory
Modify the mapred-site.xml file. This file configures the MapReduce parameters, including the jobhistory server and application parameters such as the default number of reduce tasks and the default upper and lower limits of memory that tasks can use
Modify the hdfs-site.xml file. This file configures HDFS, for example the storage locations of the name node and data nodes, the number of file replicas, and file read permissions. The value here sets the number of replicas HDFS stores, three by default
Configure the hadoop-env.sh file: add an export JAVA_HOME line at the end
So far, we have configured six files in total. Next, configure the local network
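The three XML files share one simple layout: a `configuration` element holding `property` entries with a `name` and a `value`. Since the figures with the actual contents are not available, the following sketch only illustrates that layout. The property names are the ones used by Hadoop 1.x (which the conf directory with masters and slaves files suggests); the host name master, the ports 9000 and 9001, and the path `/path/to/hadoop/tmp` are assumptions for illustration, and `render_site_xml` is a hypothetical helper.

```python
from xml.sax.saxutils import escape

def render_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"

# core-site.xml: HDFS URL and temporary directory (values assumed)
core_site = render_site_xml({
    "fs.default.name": "hdfs://master:9000",
    "hadoop.tmp.dir": "/path/to/hadoop/tmp",
})
# mapred-site.xml: where the JobTracker runs (value assumed)
mapred_site = render_site_xml({"mapred.job.tracker": "master:9001"})
# hdfs-site.xml: number of replicas, three as stated in the text
hdfs_site = render_site_xml({"dfs.replication": 3})

print(core_site)
```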
Open /etc/hosts. The reason for editing this file: without it, a node can only be reached through its IP address. Once the mapping is in place, a node can be reached by its host name
Modify /etc/sysconfig/network to set the host name of each virtual machine
We have now modified two files that define the mapping between the host name and the IP address of the current machine
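The mapping that /etc/hosts provides can be sketched as follows. The host names master, slave1 and slave2 follow the text; the IP addresses are made-up example values in the 192.168.186.0 segment, and both helper functions are hypothetical.

```python
def render_hosts(nodes):
    """Build /etc/hosts lines ("IP<TAB>hostname") from a hostname -> IP dict."""
    return "".join(f"{ip}\t{hostname}\n" for hostname, ip in nodes.items())

def resolve(hosts_text, hostname):
    """Look up a host name the way a hosts-file lookup would: first match wins."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[0]
    return None

# Example addresses only; use the addresses actually assigned to each machine
cluster = {
    "master": "192.168.186.10",
    "slave1": "192.168.186.11",
    "slave2": "192.168.186.12",
}
hosts_text = render_hosts(cluster)
print(hosts_text)
print(resolve(hosts_text, "slave1"))   # 192.168.186.11
print(resolve(hosts_text, "unknown"))  # None
```

The same three lines would go into /etc/hosts on every node, so that each machine can find the others by name.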
Next we copy Hadoop to the other virtual machines over the network, so that none of the Hadoop files has to be edited again there. Only the local network configuration remains to be done on each machine
To avoid network failures later (network transmission problems are hard to track down), first stop the system firewall. The command is: /etc/init.d/iptables stop
Also switch SELinux to permissive mode. The command is: setenforce 0
Next, we need to establish mutual trust between the machines, that is, no password should be required when one machine accesses another remotely by IP address or host name
First run ssh-keygen, and then enter the hidden .ssh directory. ls shows two files in .ssh: the public key and the private key. Let's first create authorized_keys with touch
Then copy the full contents of the public keys of all three virtual machines into the authorized_keys file of each virtual machine.
Finally, note that after creating authorized_keys you need to change its permissions; otherwise a password is still requested. For SSH public key login to work, at least the following two conditions must be met:
1. The permissions of the .ssh directory must be 700;
2. The permissions of the .ssh/authorized_keys file must be 600.
We execute the following commands to do this
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
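The same steps (collect the public keys, write authorized_keys, set the two permissions) can be sketched in Python. This is an illustration under assumptions, not the author's procedure: the function `install_authorized_keys` is hypothetical, the key strings are placeholders rather than real keys, the file is overwritten instead of appended to, and the demonstration runs in a temporary directory instead of the real ~/.ssh. The permission bits behave as shown on Linux.

```python
import os
import stat
import tempfile

def install_authorized_keys(ssh_dir, public_keys):
    """Write the given public keys to authorized_keys and apply modes 700/600."""
    os.makedirs(ssh_dir, exist_ok=True)
    os.chmod(ssh_dir, 0o700)                      # condition 1: directory is 700
    auth_path = os.path.join(ssh_dir, "authorized_keys")
    # Strip blanks and drop duplicate keys while keeping their order
    unique = list(dict.fromkeys(k.strip() for k in public_keys if k.strip()))
    with open(auth_path, "w") as handle:
        handle.write("\n".join(unique) + "\n")
    os.chmod(auth_path, 0o600)                    # condition 2: file is 600
    return auth_path

def mode_of(path):
    """Return only the permission bits of a path, e.g. 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)

# Placeholder key lines standing in for the three machines' id_rsa.pub contents
keys = [
    "ssh-rsa PLACEHOLDER_KEY_MASTER root@master",
    "ssh-rsa PLACEHOLDER_KEY_SLAVE1 root@slave1",
    "ssh-rsa PLACEHOLDER_KEY_SLAVE2 root@slave2",
    "ssh-rsa PLACEHOLDER_KEY_MASTER root@master",  # duplicate, will be dropped
]

with tempfile.TemporaryDirectory() as home:
    ssh_dir = os.path.join(home, ".ssh")
    auth_path = install_authorized_keys(ssh_dir, keys)
    dir_mode = mode_of(ssh_dir)
    file_mode = mode_of(auth_path)
    with open(auth_path) as handle:
        key_lines = handle.read().splitlines()

print(oct(dir_mode), oct(file_mode), len(key_lines))
```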
Question: why does the remote connection still ask for a password after authorized_keys has been created?