Tag: spark

  • Learning notes tf065: TensorFlowOnSpark

    Time:2020-1-9

    The Hadoop big data ecosystem is divided into YARN, HDFS, and the MapReduce computing framework. Distributed TensorFlow corresponds to the MapReduce computing framework, while Kubernetes corresponds to the YARN scheduling system. TensorFlowOnSpark uses remote direct memory access (RDMA) to address storage and scheduling, combining deep learning with big data. TensorFlowOnSpark (TFoS), an open […]

  • An example of Spark SQL data loading and saving

    Time:2019-12-29

    1. Prerequisite knowledge: the core of Spark SQL is operating on DataFrames, which provide the save and load operations. Load creates a DataFrame from a data source; save writes the data in a DataFrame to a file. A specific format can be given to indicate the type of file we want to read, and the specific format to indicate the type of file […]
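
    A minimal sketch of the load/save pattern described above, assuming an existing SparkSession named spark; the file paths and formats are illustrative:

      // Load: create a DataFrame, with format() indicating the type of file to read
      val peopleDF = spark.read.format("json").load("examples/people.json")
      // Save: write the DataFrame out, with format() indicating the type of file to write
      peopleDF.write.format("parquet").save("output/people.parquet")
      // Without an explicit format, read.load and write.save default to Parquet
      val parquetDF = spark.read.load("output/people.parquet")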

  • Deploying a Spark development environment with IntelliJ IDEA on Windows

    Time:2019-12-21

    0x01 Environment description. Blog address: http://www.cnblogs.com/ning-wang/p/7359977.html 1.1 Local: OS: Windows 10; JDK: jdk1.8.0_121; Scala: scala-2.11.11; IDE: IntelliJ IDEA Ultimate 2017.2.1. 1.2 Server: OS: CentOS_6.5_x64; JDK: jdk1.8.111; Hadoop: hadoop-2.6.5; Spark: spark-1.6.3-bin-hadoop2.6; Scala: scala-2.11.11. 0x02 Windows configuration. 2.1 Install the JDK; configure the environment variables JAVA_HOME, CLASSPATH, and Path. 2.2 Configure hosts. 2.2.1 File location: C:\Windows\System32\drivers\etc 2.2.2 New content: the contents of the hosts file are the […]

  • Spark series (1): A first look at Spark

    Time:2019-12-6

    Spark series (1): A first look at Spark. By studytime. Original: https://www.studytime.xin/ What is Spark? Official website: http://spark.apache.org/ Spark is a high-performance DAG computing engine and a fast, general-purpose cluster computing platform. It is a general in-memory parallel computing framework developed by the AMP Lab at the University of California, Berkeley, to build large-scale, low-latency data analysis […]

  • Spark series (2): Spark pseudo-distributed installation

    Time:2019-12-5

    Spark series (2): Spark pseudo-distributed installation. By studytime. Original: https://www.studytime.xin/ Download the Spark installation package: http://spark.apache.org/downloads.html Preparation before installation: Java 8 installed; Hadoop 2.7.5 installed. Modify the Hadoop configuration file yarn-site.xml: vim ~/App/hadoop-2.7.3/etc/hadoop/yarn-site.xml <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> <property> <name>yarn.log.server.url</name> <value>http://bigdata:19888/jobhistory/logs</value> </property> <property> <name>yarn.nodemanager.pmem-check-enabled</name> <value>false</value> </property> <property> <name>yarn.nodemanager.vmem-check-enabled</name> <value>false</value> </property> Restart the YARN service: stop-yarn.sh start-yarn.sh […]
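
    The yarn-site.xml properties quoted above, laid out as they appear in the file (property names, values, and the bigdata hostname are taken directly from the excerpt):

      <property>
          <name>yarn.log-aggregation-enable</name>
          <value>true</value>
      </property>
      <property>
          <name>yarn.log.server.url</name>
          <value>http://bigdata:19888/jobhistory/logs</value>
      </property>
      <property>
          <name>yarn.nodemanager.pmem-check-enabled</name>
          <value>false</value>
      </property>
      <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
      </property>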

  • Spark series (3): Building a Spark development environment in IDEA

    Time:2019-12-4

    Spark series (3): Building a Spark development environment in IDEA. By studytime. Original: https://www.studytime.xin I. Create a Maven project II. Set the groupId and artifactId III. Set up the project directory IV. Complete the build V. Upload the project to GitHub cd spark-learning-example git init git remote add origin git@github.com:baihe/test.git git add . git commit -m "Initial commit" git push -u origin […]

  • Spark series (4): Spark RDD

    Time:2019-12-3

    Spark series (4): Spark RDD. By studytime. Original: https://www.studytime.xin I. Overview of RDD 1.1 What is an RDD? RDD (resilient distributed dataset) is the most basic data abstraction in Spark, representing an immutable, partitionable collection of elements that can be computed in parallel. 1.2 What are the main properties of an RDD? RDD is the […]
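
    A minimal sketch of those properties, assuming an existing SparkContext named sc; the collection and partition count are illustrative:

      // An RDD is built from an existing collection and split into partitions
      val rdd = sc.parallelize(1 to 10, 4)         // 4 partitions, computed in parallel
      // RDDs are immutable: transformations return a new RDD instead of modifying rdd
      val doubled = rdd.map(_ * 2)
      println(doubled.getNumPartitions)            // 4: the partitioning is preserved
      println(doubled.collect().mkString(", "))    // 2, 4, ..., 20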

  • Spark series (6): Shared variables in Spark

    Time:2019-12-2

    Spark series (6): Shared variables in Spark. By studytime. Original: https://www.studytime.xin What are shared variables? All transformation operators in Spark are parallelized as tasks distributed to multiple nodes. When a custom function is passed to a Spark operator (such as map or reduce), the variables it references are propagated to the remote […]
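
    A minimal sketch of the two kinds of shared variables (broadcast variables and accumulators), assuming an existing SparkContext named sc and the Spark 2.x accumulator API; the data is illustrative:

      // Broadcast variable: a read-only value shipped to each executor once
      val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
      // Accumulator: tasks can only add to it; the driver reads the total
      val matched = sc.longAccumulator("matched")
      val words = sc.parallelize(Seq("a", "b", "c", "a"))
      val codes = words.map { w =>
        if (lookup.value.contains(w)) matched.add(1)
        lookup.value.getOrElse(w, 0)
      }
      codes.collect()          // the action triggers the tasks
      println(matched.value)   // 3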

  • Questions raised by sortByKey (job, shuffle, cache)

    Time:2019-11-29

    Just for fun, I wrote a demo: val rdd = sc.parallelize(Seq((1, "a"), (2, "c"), (3, "b"), (2, "c"))) val sorted = rdd.sortByKey() sorted.foreach(println) val c = sorted.count() 1. Job. Open the Spark UI, as shown: sortByKey is a transformation operator, so why does a transformation operator cause a job? Looking at the source code: def sortByKey(ascending: Boolean = true, numPartitions: Int […]
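
    The demo from the excerpt, reformatted with comments; the note about RangePartitioner sampling reflects how sortByKey behaves in Spark and is not quoted from the truncated article:

      val rdd = sc.parallelize(Seq((1, "a"), (2, "c"), (3, "b"), (2, "c")))
      val sorted = rdd.sortByKey()   // a transformation, yet the UI shows a job here:
                                     // sortByKey builds a RangePartitioner, which samples
                                     // the data to pick partition bounds, and that sampling
                                     // runs as its own job
      sorted.foreach(println)        // action: one job (with a shuffle for the sort)
      val c = sorted.count()         // action: another job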

  • WordCount examples with Spark 2.x and Java 8

    Time:2019-11-28

    Running environment: JDK 1.8.0, Hadoop 2.6.0, Scala 2.11.8, Spark 2.1.2. RDD, no lambda, reduceByKey. Imports: import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.api.java.function.Function2; import org.apache.spark.api.java.function.PairFunction; import org.apache.spark.sql.SparkSession; import scala.Tuple2; import java.util.Arrays; import java.util.Iterator; main: public class WordCount { public static void main(String[] args) { //Input file String wordFile = "/user/walker/input/wordcount/idea.txt"; SparkSession spark = […]

  • Spark RDD transformation and action operations

    Time:2019-11-27

    This article is excerpted from Spark Rapid Big Data Analysis. Summary: RDD supports two types of operations: transformations and actions. A transformation, such as map() or filter(), returns a new RDD. An action returns a result to the driver or writes the result to an external system, which will trigger […]
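
    A minimal sketch of the lazy-transformation / eager-action split described above, assuming an existing SparkContext named sc; the sample data is illustrative:

      // Transformations are lazy: they only describe a new RDD
      val lines = sc.parallelize(Seq("spark", "rdd", "spark sql"))
      val withSpark = lines.filter(_.contains("spark"))   // nothing runs yet
      val lengths = withSpark.map(_.length)                // still nothing
      // Actions return a result to the driver and trigger the actual computation
      println(lengths.count())                             // 2
      println(withSpark.collect().toList)                  // List(spark, spark sql)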

  • RDD partition algorithm in Spark

    Time:2019-11-26

    The partition algorithm of RDDs in Spark: def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = { (0 until numSlices).iterator.map { i => val start = ((i * length) / numSlices).toInt val end = (((i + 1) * length) / numSlices).toInt (start, end) } } /** numSlices is the number of partitions; (0 until numSlices).iterator converts […]
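
    The slicing function from the excerpt, reformatted, followed by a small worked example of the boundaries it produces (the example call and the output comments are added for illustration):

      // Splits a collection of `length` elements into `numSlices` half-open ranges
      def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
        (0 until numSlices).iterator.map { i =>
          val start = ((i * length) / numSlices).toInt
          val end = (((i + 1) * length) / numSlices).toInt
          (start, end)
        }
      }

      // For example, 10 elements split into 3 slices:
      positions(10, 3).foreach(println)
      // (0,3)  -> elements 0..2
      // (3,6)  -> elements 3..5
      // (6,10) -> elements 6..9 (the last slice absorbs the remainder)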