Tag: spark

  • Nmap installation practice

    Time:2019-11-17

    It is strongly recommended to download Nmap from the official website rather than from other third parties. Official download page: https://nmap.org/download.html This is what it looks like when opened; it feels a bit gloomy. BTW, who can escape the "true fragrance" law (the meme for ending up embracing what you once refused)? Windows installation: simply download the EXE installer and then follow the step-by-step […]

  • Solution for startup errors when Spark Streaming integrates Flume in pull mode

    Time:2019-11-16

    Flume configuration file:

      simple-agent.sources = netcat-source
      simple-agent.sinks = spark-sink
      simple-agent.channels = memory-channel
      # Describe/configure the source
      simple-agent.sources.netcat-source.type = netcat
      simple-agent.sources.netcat-source.bind = centos
      simple-agent.sources.netcat-source.port = 44444
      # Describe the sink
      simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
      simple-agent.sinks.spark-sink.hostname = centos
      simple-agent.sinks.spark-sink.port = 41414
      simple-agent.channels.memory-channel.type = memory
      simple-agent.channels.memory-channel.capacity = 1000
      simple-agent.channels.memory-channel.transactionCapacity = 100
      simple-agent.sources.netcat-source.channels = memory-channel
      simple-agent.sinks.spark-sink.channel = memory-channel

    However, when Flume is started, the following error is […]
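    For reference, a minimal sketch of the Spark Streaming side of this pull-based setup, assuming the spark-streaming-flume artifact matching your Spark/Scala version is on the classpath and reusing the hostname centos and port 41414 from the sink configuration above:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.flume.FlumeUtils

      object FlumePullWordCount {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("FlumePullWordCount").setMaster("local[2]")
          val ssc = new StreamingContext(conf, Seconds(5))

          // Pull mode: Spark polls the SparkSink that Flume writes into (host/port from the config above)
          val flumeStream = FlumeUtils.createPollingStream(ssc, "centos", 41414)

          // Each event body is a byte array; decode it and count words per batch
          flumeStream.map(e => new String(e.event.getBody.array()))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .print()

          ssc.start()
          ssc.awaitTermination()
        }
      }

    In pull mode Spark polls Flume's SparkSink rather than Flume pushing events to a Spark receiver, which is what distinguishes it from the push-based integration.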

  • Solution for errors when Spark Streaming integrates Flume in pull mode

    Time:2019-11-15

    Let's start with the versions used: Spark 2.4.3, Scala 2.11.12, Flume 1.6.0. Flume configuration file:

      simple-agent.sources = netcat-source
      simple-agent.sinks = spark-sink
      simple-agent.channels = memory-channel
      # Describe/configure the source
      simple-agent.sources.netcat-source.type = netcat
      simple-agent.sources.netcat-source.bind = centos
      simple-agent.sources.netcat-source.port = 44444
      # Describe the sink
      simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
      simple-agent.sinks.spark-sink.hostname = centos
      simple-agent.sinks.spark-sink.port = 41414
      simple-agent.channels.memory-channel.type = memory
      simple-agent.sources.netcat-source.channels = memory-channel
      simple-agent.sinks.spark-sink.channel = memory-channel

    Start script: flume-ng agent --name simple-agent --conf $FLUME_HOME/conf […]

  • Spark Core analysis: RDD

    Time:2019-11-13

    Introduction: Spark Core is the core component of Spark and the foundation of Spark SQL, Spark Streaming, Spark MLlib, and the other modules. Spark Core provides the scaffolding for developing distributed applications, so developers of other modules or applications don't need to care about how complex distributed computation is implemented. They just need to use […]
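    To illustrate that point, a minimal word-count sketch written against nothing but the RDD API (the input path and the local master are placeholders):

      import org.apache.spark.{SparkConf, SparkContext}

      object RddWordCount {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("RddWordCount").setMaster("local[*]"))

          // The RDD API hides partitioning, scheduling and shuffles behind ordinary-looking collection operations
          val counts = sc.textFile("input.txt")   // hypothetical local file
            .flatMap(_.split("\\s+"))
            .map(word => (word, 1))
            .reduceByKey(_ + _)

          counts.take(10).foreach(println)
          sc.stop()
        }
      }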

  • Scala concurrent programming practice: executor thread pool

    Time:2019-11-12

    Creating a thread is a heavyweight operation because it has to call into the operating system kernel's API, so it is better not to create and destroy threads frequently. To reuse threads that have already been created, the common approach is a thread pool. Executor: the java.util.concurrent package provides several interfaces and classes to […]
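    As a small illustration, a sketch in Scala of a fixed-size thread pool built with the java.util.concurrent Executors factory (the pool size and the tasks are only examples):

      import java.util.concurrent.{Executors, TimeUnit}

      object ThreadPoolDemo {
        def main(args: Array[String]): Unit = {
          // A fixed-size pool reuses 4 threads instead of creating one thread per task
          val pool = Executors.newFixedThreadPool(4)

          (1 to 10).foreach { i =>
            pool.submit(new Runnable {
              override def run(): Unit =
                println(s"task $i on ${Thread.currentThread().getName}")
            })
          }

          // Stop accepting new tasks and wait for the submitted ones to finish
          pool.shutdown()
          pool.awaitTermination(1, TimeUnit.MINUTES)
        }
      }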

  • Py => Ubuntu Hadoop YARN HDFS Hive Spark installation and configuration

    Time:2019-11-11

    Environment: Java 8, Python 3.7, Scala 2.12.10, Spark 2.4.4, Hadoop 2.7.7, Hive 2.3.6, MySQL 5.7, mysql-connector-java-5.1.48.jar, R 3.1+ (may not need to be installed). Install Java: see the earlier post: https://segmentfault.com/a/11 Install Python: Ubuntu already ships with Python 3.7. Install Scala: download from https://downloads.lightbend.c Decompress: tar -zxvf the downloaded Scala archive. Configure: vi ~/.bashrc export SCALA_HOME=/home/lin/spark/scala-2.12.10 export PATH=${SCALA_HOME}/bin:$PATH Save and exit. Activate the configuration: source ~/.bashrc Install […]

  • PY => PySpark - Spark Core (RDD)

    Time:2019-11-10

    Preface: the first post in the series: https://segmentfault.com/a/119000020841646 Understanding RDDs: What is an RDD? RDD stands for resilient distributed dataset. There are several ways to create an RDD: 1. parallelize: rdd = sc.parallelize([1,2,3,4,5]), which takes ordinary Python collections; 2. reading files / databases / ES (Elasticsearch) and so on. Here, take reading files as an example: rdd = […]

  • Introduction to Spark: concepts and framework

    Time:2019-11-9

    What is Spark? Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large, low-latency data analysis applications. Spark entered Apache in June 2013 and became […]

  • Spark framework: building a Scala development environment on Windows 10

    Time:2019-11-8

    I. Scala environment basics: Scala wraps the relevant Java classes and interfaces, so it depends on a JVM environment. JDK 1.8 (Scala dependency), Scala 2.11 (installed version), IDEA 2017.3 (development tool). II. Configure the decompressed Scala distribution: 1) make sure the path contains no spaces or Chinese characters; 2) configure the environment variables and add to the PATH directory […]

  • Sub-second interactive analysis with Spark Relational Cache

    Time:2019-11-7

    Video link: https://developer.aliyun.com/live/1548?spm=a2c6h.12873581.0.0.71671566xoloy3z&groupcode=apachespark PPT: https://www.slidestalk.com/alispark/sparkrelationalcache2019_ This talk is divided into four parts: project introduction, technical analysis, how to use it, and performance analysis. I. Project introduction. Project background: Alibaba Cloud EMR is a big data solution built on open-source components. At present, many open-source components have been […]

  • Spark learning notes: using Spark Streaming

    Time:2019-11-6

    1. Spark Streaming: Spark Streaming is a real-time computing framework built on Spark Core that can consume and process data from many data sources. One of the most basic abstractions in Spark Streaming is the DStream (discretized stream), which is essentially a series of continuous RDDs; a DStream is in fact a wrapper around RDDs. A DStream can be […]
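    To make the "series of continuous RDDs" idea concrete, a minimal sketch that wraps a socket text source in a DStream (host and port are placeholders) and applies per-batch transformations:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object StreamingWordCount {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
          val ssc = new StreamingContext(conf, Seconds(5))  // each 5-second batch becomes one RDD

          // DStream wrapping the RDDs produced from a TCP text source (placeholder host/port)
          val lines = ssc.socketTextStream("localhost", 9999)

          lines.flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .print()  // prints the word counts of each batch's RDD

          ssc.start()
          ssc.awaitTermination()
        }
      }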

  • Spark in action on Kubernetes: playground setup and architecture analysis

    Time:2019-11-5

    Preface: Spark is a very popular big data processing engine. Data scientists use Spark and the surrounding big data ecosystem to carry out a great deal of data analysis and mining across rich scenarios, and Spark has gradually become the industry standard in the field of data processing. But Spark's own design leans more toward using […]