• Four common data sources of Spark SQL


    General load/save methods and manually specified options. The DataFrame interface of Spark SQL supports operations on multiple data sources. A DataFrame can be operated on with RDD-style methods or registered as a temporary table; after registering the DataFrame as a temporary table, you can run SQL queries against it. The default data source […]
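As a concrete sketch of the generic load/save methods described above: a minimal example, assuming a local SparkSession; the JSON input path, the Parquet output path, and the view name are hypothetical, not taken from the article.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LoadSaveExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-save-example")
      .master("local[*]") // assumption: running locally for illustration
      .getOrCreate()

    // Generic load: manually specify the data source format.
    val df: DataFrame = spark.read.format("json").load("examples/people.json")

    // Register the DataFrame as a temporary view and query it with SQL.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    // Generic save: manually specify the output format and save mode.
    df.write.format("parquet").mode(SaveMode.Overwrite).save("examples/people.parquet")

    spark.stop()
  }
}
```

If no format is given, Spark falls back to its default data source (Parquet, unless configured otherwise via spark.sql.sources.default).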

  • Learning notes TF065: TensorFlowOnSpark


    The Hadoop big data ecosystem is divided into YARN, HDFS, and the MapReduce computing framework. Distributed TensorFlow is analogous to the MapReduce computing framework, and Kubernetes is analogous to the YARN scheduling system. TensorFlowOnSpark uses remote direct memory access (RDMA) to handle storage and scheduling, realizing the fusion of deep learning and big data. TensorFlowOnSpark (TFoS), an open […]

  • An example of Spark SQL data loading and saving


    1. Prerequisite knowledge. What matters in Spark SQL is operating on the DataFrame, which provides save and load operations. Load can create a DataFrame; save writes the data in a DataFrame to a file. A specific format indicates the type of file we want to read, and the specific format indicates the type of file […]

  • Deployment of the Spark development environment under IntelliJ IDEA on Windows


    0x01 Environment description. Blog address: http://www.cnblogs.com/ning-wang/p/7359977.html 1.1 Local: OS: Windows 10; JDK: jdk1.8.0_121; Scala: scala-2.11.11; IDE: IntelliJ IDEA Ultimate 2017.2.1. 1.2 Server: OS: CentOS_6.5_x64; JDK: jdk1.8.111; Hadoop: hadoop-2.6.5; Spark: spark-1.6.3-bin-hadoop2.6; Scala: scala-2.11.11. 0x02 Windows configuration. 2.1 Install the JDK and configure the environment variables JAVA_HOME, CLASSPATH, and Path. 2.2 Configure hosts. 2.2.1 File location: C:\Windows\System32\drivers\etc 2.2.2 New content: the contents of the hosts file are the […]

  • Spark series (1): A first look at Spark


    Spark series (1): A first look at Spark. By studytime. Original: https://www.studytime.xin/ What is Spark? Official website: http://spark.apache.org/ Spark is a high-performance DAG computing engine and a fast, general-purpose cluster computing platform. It is a general in-memory parallel computing framework developed by the AMP Lab at the University of California, Berkeley, to build large-scale, low-latency data analysis […]

  • Spark series (2): Spark pseudo-distributed installation


    Spark series (2): Spark pseudo-distributed installation. By studytime. Original: https://www.studytime.xin/ Download the Spark installation package: http://spark.apache.org/downloads.html Preparation before installation: Java 8 installed; Hadoop 2.7.5 installed. Modify the Hadoop yarn-site.xml configuration: vim ~/App/hadoop-2.7.3/etc/hadoop/yarn-site.xml

    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.log.server.url</name>
      <value>http://bigdata:19888/jobhistory/logs</value>
    </property>
    <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>

    Restart the YARN service: stop-yarn.sh start-yarn.sh […]

  • Spark series (3): Building a Spark development environment in IDEA


    Spark series (3): Building a Spark development environment in IDEA. By studytime. Original: https://www.studytime.xin I. Create a Maven project. II. Set the groupId and artifactId. III. Set up the project directory. IV. Complete the build. V. Upload the project to GitHub:

    cd spark-learning-example
    git init
    git remote add origin [email protected]:baihe/test.git
    git add .
    git commit -m "Initial commit"
    git push -u origin […]

  • Spark series (4): Spark RDD


    Spark series (4): Spark RDD. By studytime. Original: https://www.studytime.xin I. Overview of RDDs. 1.1 What is an RDD? RDD stands for Resilient Distributed Dataset. It is the most basic data abstraction in Spark, representing an immutable, partitionable collection of elements that can be computed in parallel. 1.2 What are the main properties of an RDD? An RDD is the […]

  • Spark series (6): Shared variables in Spark


    Spark series (6): Shared variables in Spark. By studytime. Original: https://www.studytime.xin What are shared variables? All transformation operators in Spark are executed in parallel by tasks distributed across multiple nodes. When a custom function is passed to a Spark operator (such as map or reduce), the variables referenced by the function are propagated to the remote […]
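The excerpt above introduces shared variables; Spark provides two kinds, broadcast variables and accumulators. A minimal sketch, assuming a local SparkContext; the lookup map and the accumulator name are illustrative, not from the article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

    // Broadcast variable: shipped to each executor once and read-only
    // inside tasks, instead of being re-sent with every closure.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))

    // Accumulator: tasks may only add to it; the driver reads the total.
    val matched = sc.longAccumulator("matched")

    sc.parallelize(Seq(1, 2, 3))
      .foreach { k => if (lookup.value.contains(k)) matched.add(1) }

    println(matched.value) // 2 of the 3 keys appear in the broadcast map
    sc.stop()
  }
}
```

Broadcast variables address the propagation cost the excerpt describes: without them, every variable captured by the function is serialized into each task's closure.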

  • Questions raised by sortByKey (job, shuffle, cache)


    Just for fun, I wrote a demo:

    val rdd = sc.parallelize(Seq((1, "a"), (2, "c"), (3, "b"), (2, "c")))
    val sorted = rdd.sortByKey()
    sorted.foreach(println)
    val c = sorted.count()

    1. Job. Open the Spark UI, as shown. sortByKey is a transformation operator. Why does a transformation operator trigger a job? Look at the source code: def sortByKey(ascending: Boolean = true, numPartitions: Int […]

  • WordCount examples under Spark 2.x and Java 8


    Running environment: JDK 1.8.0, Hadoop 2.6.0, Scala 2.11.8, Spark 2.1.2. RDD version, no lambdas, using reduceByKey. Imports:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;
    import java.util.Arrays;
    import java.util.Iterator;

    main:

    public class WordCount {
        public static void main(String[] args) {
            // Input file
            String wordFile = "/user/walker/input/wordcount/idea.txt";
            SparkSession spark = […]

  • Spark RDD transformation and action operations


    This article is excerpted from Spark Rapid Big Data Analysis. Summary: RDDs support two kinds of operations: transformations and actions. A transformation, such as map() or filter(), returns a new RDD. An action returns a result to the driver or writes the result to an external system, which triggers […]
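The transformation/action distinction in the excerpt above can be illustrated with a short sketch, assuming a local SparkContext; the sample data is illustrative. Transformations like filter() and map() are lazy and only build a new RDD, while an action like collect() triggers the actual computation and returns the result to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformVsAction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-ops").setMaster("local[*]"))

    val nums = sc.parallelize(Seq(1, 2, 3, 4))

    // Transformations are lazy: nothing is computed yet, a new RDD
    // lineage is simply recorded.
    val squaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n)

    // The action collect() triggers the job and brings the result
    // back to the driver program.
    println(squaresOfEvens.collect().mkString(", ")) // prints 4, 16

    sc.stop()
  }
}
```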