Py => Ubuntu: installing and configuring Hadoop (HDFS, YARN), Hive, and Spark


Environment

Java 8
Python 3.7
Scala 2.12.10
Spark 2.4.4
Hadoop 2.7.7
Hive 2.3.6
MySQL 5.7

R 3.1+ (optional)

Install Java

Earlier guide:

Install Python

Ubuntu ships with Python 3.7.

Install Scala

Download: https://downloads.lightbend.c

Extract the downloaded Scala archive with tar -zxvf.

Configure:

vi ~/.bashrc
    export SCALA_HOME=/home/lin/spark/scala-2.12.10
    export PATH=${SCALA_HOME}/bin:$PATH
Save and exit.

Activate configuration:

source ~/.bashrc

Install Hadoop

Note: if you do not use HDFS or YARN, you can skip the whole Hadoop installation and install Spark directly.
Detailed link:

Extract the downloaded Hadoop archive with tar -zxvf.

Configure Hadoop:

vi ~/.bashrc
    export HADOOP_HOME=/home/lin/hadoop/hadoop-2.7.7
    export PATH=${HADOOP_HOME}/bin:$PATH

Activate configuration:

source ~/.bashrc

HDFS configuration:

Enter etc/hadoop inside the extracted Hadoop directory (this is not the system /etc, but the etc directory of the extracted Hadoop).

echo $JAVA_HOME    # copy the printed path

Open the environment script in vi, find the line export JAVA_HOME, and replace it with the following:
    export JAVA_HOME=/home/lin/spark/jdk1.8.0_181
vi core-site.xml: (the value is hdfs:// followed by hostname:port; the hostname is what the terminal prompt shows after the @)
vi hdfs-site.xml: (likewise added between the <configuration> tags; /home/lin/hdfs is a directory that already exists on my machine)
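
Both files use Hadoop's <configuration>/<property> layout. As a rough illustration of that shape, here is a minimal Python sketch that renders such XML from a dict. The property names are standard Hadoop 2.x keys, but the values are assumptions rather than the author's actual settings: host lin and port 8020 are borrowed from the history-service section, and the name/data subdirectories under /home/lin/hdfs are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_hadoop_config(properties):
    """Return Hadoop-style configuration XML for a dict of name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed values: host "lin", port 8020, and subdirectories of /home/lin/hdfs.
core_site = build_hadoop_config({"fs.defaultFS": "hdfs://lin:8020"})
hdfs_site = build_hadoop_config({
    "dfs.namenode.name.dir": "/home/lin/hdfs/name",
    "dfs.datanode.data.dir": "/home/lin/hdfs/data",
    "dfs.replication": 1,
})

print(core_site)
print(hdfs_site)
```

The printed strings can be pasted into core-site.xml and hdfs-site.xml after adjusting the values to your own hostname and directories.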


Format HDFS:

hadoop namenode -format
# Then check the path configured above for newly created files: /home/lin/hdfs

Start HDFS (first enter the sbin directory; sbin sits at the same level as etc and bin inside the Hadoop directory):


# Answer yes at every prompt; if asked for a password, enter the password of the corresponding server (in my case everything is local).
# If you get a permission error, read on (side branch):
    sudo passwd root    # activate Ubuntu's root user and set its password
    vi /etc/ssh/sshd_config:
        PermitRootLogin yes    # add this anywhere (for example, right after the existing commented-out line)
    service ssh restart

To view the contents of the root directory / in HDFS:

hadoop fs -ls /

Upload a file to the HDFS root directory /:

echo test > test.txt    # first, create an arbitrary file

hadoop fs -put test.txt /    # upload the file to the HDFS root directory /

hadoop fs -ls /    # list again and test.txt is there

Read the file test.txt from the root directory / in HDFS:

hadoop fs -text /test.txt

In the Hadoop web UI, check that the file exists:

http://<hostname>:50070/    # 50070 is the default port

Open the "Utilities" drop-down on the right and choose "Browse the file system".
Our test.txt is right there.

YARN configuration

Still in etc/hadoop:

cp mapred-site.xml.template mapred-site.xml

vi mapred-site.xml:
vi yarn-site.xml:

Start YARN (still in the sbin directory):


# As before, if prompted for a password, enter the machine's password

To view YARN in the Hadoop web UI:

Install MySQL

MySQL is used later on, so its installation and configuration get their own section.
MySQL would not normally need separate coverage, but this install went differently from my earlier experience, so here are the details:

apt-get install mysql-server-5.7

Installation is easy. The configuration has pitfalls that vary between MySQL versions (I am on Ubuntu 19):

vi /etc/mysql/mysql.conf.d/mysqld.cnf:
    bind-address    # find this line and modify it

Change the password and grant remote connection permission (there is no password by default):

mysql    # no parameters needed; you get straight in
use mysql
update mysql.user set authentication_string=password("123") where user="root";
update user set plugin="mysql_native_password";
flush privileges;

select host from user;
update user set host ='%' where user ='root';
flush privileges;

Restart service:

systemctl restart mysql

Server connection test:

mysql -uroot -p
# password: 123



Install Hive



tar -zxvf apache-hive-2.3.6-bin.tar.gz

Configure Hive:

vi ~/.bashrc
    export HIVE_HOME=/home/lin/hive/apache-hive-2.3.6-bin
    export PATH=${HIVE_HOME}/bin:$PATH

Activate configuration:

source ~/.bashrc

Other Hive configuration (likewise, enter the conf directory inside the extracted Hive directory):


Hive's MySQL configuration (also in the conf directory):

vi hive-site.xml: (pay special attention to the last two <property> entries, and change the username and password to your own)

Download the MySQL JDBC driver and put it into Hive (needed because the hive-site.xml above uses MySQL):

  1. Download it:
  2. Put the jar in Hive's lib directory (at the same level as conf):
  3. Copy the jar into Spark's jars directory as well (so that Jupyter can connect to MySQL directly, without going through Hive)

Initialize the schema (first make sure HDFS and MySQL are running):

schematool -dbType mysql -initSchema

# Note 1: only Hive 2.x and later needs this initialization command
# Note 2: initialize only once; running it repeatedly creates duplicate keys in MySQL and fails with errors.

Start the Metastore service:

nohup hive --service metastore &

Check that initialization worked (look at the tables in MySQL):

use hive
show tables;
# If tables are listed, initialization succeeded

Start Hive:


Create a test database and table (note: do not use keywords such as user as a table name):

In Hive, enter:
    create database mydatabase;
    use mydatabase;
    create table person (name string);
To view the Hive table's metadata in MySQL:
    select * from TBLS;          # all table metadata
    select * from COLUMNS_V2;    # all column metadata

To import a file into Hive:

vi hive_data.txt: (write the following two lines)
    tom catch jerry
    every one can learn AI
load data local inpath '/home/lin/data/hive_data.txt' into table person;


select * from person;

PySpark client code for connecting to Hive:

import findspark
findspark.init()    # locate the Spark installation before importing pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Spark Hive Example")\
    .config("hive.metastore.uris", "thrift://localhost:9083")\
    .enableHiveSupport()\
    .getOrCreate()
spark.sql("use mydatabase").show()
spark.sql('show tables').show()

Install Spark

Download spark-2.4.4-bin-hadoop2.7.tgz:
Quick link:
Detailed link:


Extract the downloaded archive with tar -zxvf.

To configure spark:

vi ~/.bashrc
    export SPARK_HOME=/home/lin/spark/spark-2.4.4-bin-hadoop2.7
    export PATH=${SPARK_HOME}/bin:$PATH

Activate configuration:

source ~/.bashrc
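
After sourcing ~/.bashrc it can be worth checking that each *_HOME variable is set and that its bin directory really is on PATH. The following is a minimal sketch, not part of the original setup; the function name check_homes is made up for illustration, and the variable names are the ones exported in this article.

```python
import os


def check_homes(env, names=("SCALA_HOME", "HADOOP_HOME", "HIVE_HOME", "SPARK_HOME")):
    """Report, per variable, whether it is set and its bin directory is on PATH."""
    path_entries = env.get("PATH", "").split(os.pathsep)
    report = {}
    for name in names:
        home = env.get(name)
        if not home:
            report[name] = "not set"
        elif os.path.join(home, "bin") not in path_entries:
            report[name] = "set, but bin is not on PATH"
        else:
            report[name] = "ok"
    return report


# Check the environment of the current process.
for name, status in check_homes(os.environ).items():
    print(f"{name}: {status}")
```

Run it from a shell that has already sourced ~/.bashrc; a variable reported as "not set" usually means the export line is missing or the file was not re-sourced.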

Last step (a possible Python environment error)

By default, the PySpark scripts call an executable named “python”, while a default Ubuntu install has no plain “python” command, only versioned ones such as “python3”.
So create the following symlink, so that typing python finds Python 3.7 directly (do not use an alias):

ln -s /usr/bin/python3.7 /usr/bin/python


On the server, run the command directly:


Or open in a remote browser:


Remote connection with jupyter

Jupyter Notebook is an alternative to running code in the pyspark console.

Install Jupyter Notebook:

pip3 install jupyter
# On a fresh environment, install pip first: apt-get install python3-pip

There are two ways to connect:
First, configure environment variables (this conflicts with the pyspark console; not recommended, so omitted here).
Second, use the third-party module findspark.

pip install findspark    # on the Linux server

Start the Jupyter Notebook service (--ip must be specified; without --allow-root it may report an error):

jupyter notebook --allow-root --ip    # on the Linux server

The rest happens in the Jupyter Notebook client (Windows 10).
The following two lines must come first in every .py script:

import findspark
findspark.init()

Then write the rest of the code as usual:

from pyspark import SparkConf, SparkContext

sc = SparkContext(
    master='local[*]',      # explained below
    appName='mypyspark',    # any name you like
)
# This statement starts Spark; its web UI is then reachable in the browser on port 4040
# If you are fluent in Python's magic methods, the word "context" should bring the with statement to mind
# The parameters are optional; without them Spark runs locally: sc = SparkContext()

raw_data = [1, 2, 3]
rdd_data = sc.parallelize(raw_data)    # Python list -> Spark RDD
raw_data = rdd_data.collect()          # Spark RDD -> Python list

sc.stop()    # shut Spark down; the web UI is no longer reachable

The master parameter of SparkContext:

  1. “local”: runs locally with a single thread.
  2. “local[*]”: runs locally with as many threads as there are CPU cores.
  3. “local[n]”: runs locally with n threads.
  4. “spark://host:port”: connects to a standalone cluster at that address
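
To make the forms above concrete, here is a small sketch that classifies a master string. It is only an illustration: describe_master is a made-up helper, not a Spark API, and it covers only the forms used in this article (including yarn, which appears in the YARN section). Spark itself accepts more variants.

```python
import re


def describe_master(master):
    """Describe what a SparkContext master string means."""
    if master == "local":
        return "local, 1 thread"
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match:
        n = match.group(1)
        threads = "one thread per CPU core" if n == "*" else f"{n} threads"
        return f"local, {threads}"
    match = re.fullmatch(r"spark://([^:/]+):(\d+)", master)
    if match:
        host, port = match.groups()
        return f"standalone cluster at {host}, port {port}"
    if master == "yarn":
        return "YARN cluster"
    raise ValueError(f"unrecognized master: {master!r}")


for m in ("local", "local[*]", "local[4]", "spark://lin:7077", "yarn"):
    print(f"{m:20} -> {describe_master(m)}")
```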

A recap of the environment, and what “local” means:

  1. The full Spark environment is installed on Linux.
  2. Jupyter is installed on Linux, and the Jupyter Notebook service runs there.
  3. Business code is written from Windows 10 by connecting remotely to the Jupyter Notebook service on Linux (acting as a client).

So “local” as used above is ultimately relative to Linux: Linux is where our code is written and executed all along.

Using spark-submit (the usual way)

First, write a script that uses the various PySpark APIs.
If you use Jupyter Notebook as recommended above, the file is in .ipynb format, which is easy to convert to .py.
Finally, submit the .py script:

spark-submit --master local[*] --name myspark /xx/xx/

# --master and --name are the same options we set in code above; here they are passed on the command line instead.
# /xx/xx/ is the absolute path of the .py script: feed it to Spark and let it do the work.
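
If you prefer to launch the submission from Python (for example, from a helper script), the same command can be assembled as an argument list. This is a minimal sketch under assumptions: spark_submit_command is a hypothetical helper, and /xx/xx/job.py is a placeholder path, not a file from this article.

```python
import shlex


def spark_submit_command(script, master="local[*]", name="myspark"):
    """Build the argument list for spark-submit without running it."""
    return ["spark-submit", "--master", master, "--name", name, script]


# "/xx/xx/job.py" is a placeholder path for illustration only.
cmd = spark_submit_command("/xx/xx/job.py")

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually launch it (requires spark-submit on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to subprocess avoids shell globbing of local[*]; when typing the command in a shell by hand, quoting that argument is the safer habit.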

Standalone deployment spark


Standalone deployment needs both of the following running:
    Master
    Slave (worker)
With the configuration below, the final ./ script starts both at once.

Check the JAVA_HOME environment variable:

echo $JAVA_HOME 

# Note the result and copy it

Enter the conf directory (conf sits at the same level as bin inside the Spark directory) and configure:

Open the Spark environment script in vi and set:
    JAVA_HOME=<the result above>

cp slaves.template slaves
vi slaves: (change localhost to this machine's hostname)

After the configuration above, enter the sbin directory (at the same level as conf) and run:

./

# If you get a permission error, read on (side branch):
sudo passwd root    # activate Ubuntu's root user and set its password
vi /etc/ssh/sshd_config:
    PermitRootLogin yes    # add this anywhere (for example, right after the existing commented-out line)
service ssh restart

If startup succeeds without errors, the absolute path of a log file (call it XXX) is printed.

cat XXX shows the startup status and the log messages.

Look for these messages:
    Successfully started service 'WorkerUI' on port 8082
    Successfully registered with master spark://lin:7077 (the URL to use when creating the context in code)
Some of these messages may not be printed; in that case try ports 8080-8082 in the browser.

Check the startup status with:

jps    # if both a Worker and a Master are listed, startup succeeded
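
The jps check can also be scripted. Below is a minimal sketch that scans jps output for both process names; spark_daemons_running is a made-up helper, and the sample output (including the PIDs) is hypothetical, written only to show the "<pid> <name>" line shape.

```python
def spark_daemons_running(jps_output):
    """Return True if both Master and Worker appear in `jps` output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:    # each line is "<pid> <main class name>"
            names.add(parts[1])
    return {"Master", "Worker"} <= names


# Hypothetical jps output; the PIDs are made up for illustration.
sample = """\
4821 Master
4903 Worker
5120 Jps
"""
print(spark_daemons_running(sample))
```

On a real machine the text could come from subprocess.run(["jps"], capture_output=True, text=True).stdout instead of the sample string.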


pyspark --master spark://lin:7077

# The worker page of the web UI now shows that a job has been added

Yarn deploy spark

Configure:

    # mine is /home/lin/hadoop/hadoop-2.7.7
Enter the conf directory inside the extracted Spark directory:
Open the Spark environment script in vi and point it at the Hadoop configuration directory (your Hadoop path, such as mine above, followed by etc/hadoop; the etc/hadoop part is the same for everyone).

Start Spark:

spark-submit --master yarn --name myspark  script/
# Note that the value of --master changes to yarn; everything else stays the same.

Or:
     pyspark --master yarn
If it starts successfully, the configuration works.

Spark history service configuration

Pain point: once our Spark context stops, its web UI becomes inaccessible,
so information about unfinished or past jobs can no longer be viewed.
Configuring the history service lets us inspect those jobs after the context has stopped.

Create a new HDFS directory, myhistory (under the root path), to hold the event logs:

hadoop fs -mkdir /myhistory        

First, enter the conf directory inside the extracted Spark directory:

cp spark-defaults.conf.template spark-defaults.conf

vi spark-defaults.conf: (uncomment the following lines; lin is this machine's hostname, and myhistory is the directory under the HDFS root path)
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs://lin:8020/myhistory
Open the Spark environment script in vi: (we already copied the template earlier, so it can be edited directly this time)
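
A common slip here is leaving one of the two event-log lines commented out. The sketch below parses spark-defaults.conf-style text and checks both settings; parse_spark_defaults is a hypothetical helper that handles only whitespace-separated "key value" lines, which is the form used in this article.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)    # split key from value on any whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


# The two settings from this article, plus one commented-out line.
conf_text = """\
# spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://lin:8020/myhistory
"""

settings = parse_spark_defaults(conf_text)
print("event log enabled:", settings.get("spark.eventLog.enabled") == "true")
print("event log dir:", settings.get("spark.eventLog.dir"))
```

To check a real file, read its contents with open(...).read() and pass the text to the same function.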

Start it (enter the sbin directory inside the extracted Spark directory):


# cat the log file that is printed to check whether startup succeeded
# Default web UI:


Visit the history web UI in the browser:
It shows nothing yet. This is expected, because we have not run a Spark context program.
Run a Spark context script:
spark-submit script/
    # The script can be anything; its content does not matter. But there is one point worth noting!
    # The script creates a context, finishes, and stops it,
    # so we never get the chance to visit the context's own web UI while it is running.
    # But we have just configured and started the Spark history service,
    # so the context's information was written to the history directory we configured.
    # When we visit the history web UI again, the content is there.
Visit the history web UI again:
It now has content (the Spark history service is already working for us).

Password free login

Environment: Ubuntu (this should also work on CentOS, which I rarely use)
Set up passwordless login:

cd ~
ssh-keygen -t rsa -P ""
cat .ssh/ >> .ssh/authorized_keys
chmod 600 .ssh/authorized_keys

A few things to note:

If you are root, run the commands above from /root/.
If you are a normal user, run them from /home/xxx/.

Note that after sudo -s the working directory does not always switch automatically,
so you may need to change to the home directory manually.
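
Since the result depends on which home directory you are in, a quick check of where authorized_keys lives and whether it has mode 600 can save some confusion. This is a minimal sketch; both helper names are made up for illustration.

```python
import os
import stat


def authorized_keys_path(home=None):
    """Path of authorized_keys for the given home (default: current user's)."""
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".ssh", "authorized_keys")


def has_mode_600(path):
    """True if the file's permission bits are exactly rw------- (600)."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600


path = authorized_keys_path()
if os.path.exists(path):
    print(path, "mode 600:", has_mode_600(path))
else:
    print(path, "does not exist yet")
```

Running it once as a normal user and once after sudo -s shows which home directory each session resolves to.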

Custom script start service

The following is only for personal convenience. I am not fluent in shell, so I use a Python script; use whatever you like.
vi (this script starts the HDFS, YARN, Spark history, and Jupyter Notebook services configured above)

import os
import subprocess as sub

######Start HDFS + yarn###############
hadoop_path = os.environ['HADOOP_HOME']
hadoop_sbin = os.path.join(hadoop_path, 'sbin')


######Launch sparkhistory##############
spark_path = os.environ['SPARK_HOME']
spark_sbin = os.path.join(spark_path, 'sbin')

######Start jupyter notebook###############
# home_path = os.environ['HOME']
home_path = '/home/lin'

os.chdir(home_path)
# Append the server IP after --ip before running
sub.run('jupyter notebook --allow-root --ip'.split())

After that, you no longer need to enter each directory after a reboot. Just run:

sudo python
nohup hive --service metastore &

Web UIs for the services this script starts:


Other web UIs:


Standalone mode on a specified port (if you use standalone mode instead of local, you may need the following):

pyspark --master spark://lin:7077