Building a big data platform: quick setup of Hadoop 2.7.4 + Spark 2.2.0

Time:2020-5-22

About Apache Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is an open-source general parallel framework in the style of Hadoop MapReduce, created by UC Berkeley's AMPLab. Spark keeps the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output can be kept in memory, so there is no need to read and write HDFS between stages. Spark is therefore better suited to MapReduce-style algorithms that need iteration, such as data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it perform better on certain workloads. In particular, Spark keeps distributed datasets in memory, which not only enables interactive queries but also speeds up iterative workloads.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, and Scala can operate on distributed datasets as easily as on local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system; this can be arranged through a third-party cluster framework called Mesos. Spark, developed by UC Berkeley's AMPLab, can be used to build large, low-latency data analysis applications.

Preparation

Environment

JDK: 1.8
Spark: 2.2.0
Hadoop: 2.7.4
CentOS: 7.3

Host name       IP address        Installed services
spark-master    192.168.252.121   jdk, hadoop, spark, scala
spark-slave01   192.168.252.122   jdk, hadoop, spark
spark-slave02   192.168.252.123   jdk, hadoop, spark
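The cluster machines need to resolve one another's hostnames. If that is not already configured, a minimal /etc/hosts sketch on every node could look like the following (the hostname-to-IP mapping is taken from the table above; adjust it to your own network):

# append to /etc/hosts on every node
192.168.252.121 spark-master
192.168.252.122 spark-slave01
192.168.252.123 spark-slave02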

Dependencies

Because Spark is implemented in Scala and integrates tightly with it, we also install Scala.

Scala

Scala 2.13.0 installation and configuration

Hadoop

Hadoop 2.7.4 cluster quick setup

Install

Download and unzip

su hadoop
cd /home/hadoop/
wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark-2.2.0

Environment variables

To make the variables effective for all users, edit the /etc/profile file (vi /etc/profile).
To make them effective only for the current user, edit ~/.bashrc (vi ~/.bashrc).

sudo vi /etc/profile
#spark (define SPARK_HOME before adding it to PATH)
export SPARK_HOME=/home/hadoop/spark-2.2.0/
export PATH=${SPARK_HOME}/bin:$PATH

To make the environment variables take effect, run source /etc/profile.
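As a quick sanity check (just a sketch), confirm that the variable is set and that the Spark binaries are on the PATH:

source /etc/profile
echo $SPARK_HOME          # should print /home/hadoop/spark-2.2.0/
which spark-submit        # should resolve to a path under spark-2.2.0/bin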

Modify configuration

Modify spark-env.sh

cd /home/hadoop/spark-2.2.0/conf
mv spark-env.sh.template spark-env.sh
vi spark-env.sh
#java
export JAVA_HOME=/lib/jvm

#IP address of spark master node
export SPARK_MASTER_IP=192.168.252.121

#Port number of spark master node
export SPARK_MASTER_PORT=7077

A brief introduction to these variables:

  • JAVA_HOME: Java installation directory
  • SCALA_HOME: Scala installation directory
  • HADOOP_HOME: Hadoop installation directory
  • HADOOP_CONF_DIR: directory containing the Hadoop cluster's configuration files
  • SPARK_MASTER_IP: IP address of the Spark cluster's master node
  • SPARK_WORKER_MEMORY: maximum memory each worker node can allocate to executors
  • SPARK_WORKER_CORES: number of CPU cores used by each worker node
  • SPARK_WORKER_INSTANCES: number of worker instances started on each machine
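For reference, a fuller spark-env.sh that sets several of these variables could look like the sketch below. The paths follow this guide's layout, and the memory, core, and instance values are assumptions for illustration; tune them to your machines.

export JAVA_HOME=/lib/jvm
export SCALA_HOME=/lib/scala
export HADOOP_HOME=/home/hadoop/hadoop-2.7.4
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_MASTER_IP=192.168.252.121
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=2g        # assumed value, adjust to available RAM
export SPARK_WORKER_CORES=2          # assumed value, adjust to available CPUs
export SPARK_WORKER_INSTANCES=1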

Modify slaves

cd /home/hadoop/spark-2.2.0/conf
mv slaves.template slaves
vi slaves
spark-master
spark-slave01
spark-slave02
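Note that Spark's start-all.sh starts a worker over SSH on every host listed in slaves, so the hadoop user on spark-master must be able to log in to those hosts without a password (this is usually already in place from the Hadoop setup). A sketch, assuming the hostnames from the table above:

# run as the hadoop user on spark-master; skip ssh-keygen if a key already exists
ssh-keygen -t rsa
ssh-copy-id hadoop@spark-master
ssh-copy-id hadoop@spark-slave01
ssh-copy-id hadoop@spark-slave02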

Configure cluster

Copy to the other nodes

Go to the directory containing the Spark installation, create an archive, and send it to the other nodes.

cd /home/hadoop/

tar zcvf spark.tar.gz spark-2.2.0

scp spark.tar.gz [email protected]:/home/hadoop/
scp spark.tar.gz [email protected]:/home/hadoop/

On the spark-slave01 and spark-slave02 nodes, extract the archive:

cd /home/hadoop/

tar -zxvf spark.tar.gz

Environment variables

Make sure every node has all of the required environment variables:

#jdk
export JAVA_HOME=/lib/jvm
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

#hadoop
export HADOOP_HOME=/home/hadoop/hadoop-2.7.4/

#scala
export SCALA_HOME=/lib/scala

#spark
export SPARK_HOME=/home/hadoop/spark-2.2.0/

#PATH last, so all the *_HOME variables above are already defined when it is expanded
export PATH=${SPARK_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
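A quick way to confirm each node picked up the variables (a sketch to run on every node after sourcing the profile):

source /etc/profile
java -version        # expect 1.8.x
hadoop version       # expect Hadoop 2.7.4
echo $SPARK_HOME     # expect /home/hadoop/spark-2.2.0/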

Start cluster

Turn off firewall

systemctl stop firewalld.service
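Run this on every node. If the firewall should also stay off after a reboot, disable the service as well (requires root or sudo):

sudo systemctl stop firewalld.service
sudo systemctl disable firewalld.service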

Start Hadoop

cd /home/hadoop/hadoop-2.7.4/sbin

./start-all.sh
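You can verify the Hadoop daemons with jps. Assuming a typical layout for this cluster (the exact placement depends on the Hadoop setup from the linked guide):

jps
# on spark-master, expect roughly: NameNode, SecondaryNameNode, ResourceManager
# on spark-slave01/02, expect roughly: DataNode, NodeManager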

Launch Spark

cd /home/hadoop/spark-2.2.0/sbin

./start-all.sh
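After Spark's start-all.sh, jps should additionally show the Spark daemons:

jps
# expect an additional "Master" process on spark-master
# expect a "Worker" process on every host listed in conf/slaves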

Launch the Spark shell

cd /home/hadoop/spark-2.2.0/bin

./spark-shell
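Run on its own, spark-shell starts in local mode. To attach it to the standalone cluster instead, pass the master URL; this is a sketch using the master address from this guide:

./spark-shell --master spark://192.168.252.121:7077
# at the scala> prompt, a quick smoke test such as
#   sc.parallelize(1 to 1000).count
# should return res0: Long = 1000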

Spark master web UI: http://192.168.252.121:8080

Spark shell (application) web UI: http://192.168.252.121:4040


Contact

  • Author: Peng Lei
  • Source: http://www.ymq.io
  • Email: [email protected]
  • The copyright belongs to the author. Please indicate the source when reprinting.
  • WeChat: follow the official account "Search Cloud Database", which focuses on development technology research and knowledge sharing.

