    Many people ask what Alibaba's Apsara big data platform, Yunti 2 ("Ladder 2"), MaxCompute, and real-time computing really are, and how the Hadoop platform differs from Alibaba's own platform. Let's talk about Hadoop first. What is Hadoop? Hadoop is an open source, highly reliable, and extensible distributed big data computing framework, which is mainly […]

    There are already plenty of articles and books about Hadoop and its ecosystem. When the concept of big data took off in 2016, I was lucky enough to enter the data industry. Although the past two years have not met my initial expectations, I did take that step. Here, let's talk about Hadoop and its […]

    In the first article, I discussed what constitutes a small file and why Hadoop has a small file problem. I defined a small file as any file smaller than 75% of the Hadoop block size, and explained that Hadoop prefers fewer, larger files because of NameNode memory usage and MapReduce performance. In this article, I'll […]
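The 75% rule in the excerpt above can be written as a short check. This is only a minimal sketch: the 128 MB block size is an assumption (the usual `dfs.blocksize` default in Hadoop 2.x and 3.x), and the function names are hypothetical, not taken from the article.

```python
# Assumed block size: 128 MB, the usual dfs.blocksize default in Hadoop 2.x/3.x.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024
# Threshold stated in the excerpt: smaller than 75% of the block size.
SMALL_FILE_RATIO = 0.75


def is_small_file(size_bytes, block_size=BLOCK_SIZE_BYTES):
    """Return True if the file is smaller than 75% of the block size."""
    return size_bytes < SMALL_FILE_RATIO * block_size


def count_small_files(sizes, block_size=BLOCK_SIZE_BYTES):
    """Count how many of the given file sizes (in bytes) qualify as small."""
    return sum(1 for size in sizes if is_small_file(size, block_size))
```

With the assumed block size, the cutoff is 96 MB: a file of exactly 96 MB is not small, while anything below it is.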

    This article looks at some less commonly used alternatives for solving MapReduce performance problems and at the factors to consider when choosing a solution. Solving MapReduce performance problems: the following solutions alleviate MapReduce performance problems: changing the ingestion process or interval; batch file merging; sequence files; HBase; S3DistCp (if Amazon EMR is used); using CombineFileInputFormat; Hive […]
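To make the "batch file merge" option in the list above a little more concrete, here is a minimal sketch of one way merge batches might be planned before the actual merge job runs. It is an illustration only, not the article's method: the function name, the greedy grouping strategy, and the idea of targeting a fixed batch size are all assumptions.

```python
def plan_merge_batches(file_sizes, target_bytes):
    """Greedily group files into batches for merging.

    file_sizes maps a file name to its size in bytes. Files are visited in
    name order, and a new batch starts whenever adding the next file would
    push the running total above target_bytes. A single file larger than
    target_bytes ends up in a batch of its own.
    """
    batches = []
    current = []
    current_size = 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be concatenated into one larger file by whatever merge job is in use, so the number of files the NameNode has to track drops.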

    Environment: Java 8, Python 3.7, Scala 2.12.10, Spark 2.4.4, Hadoop 2.7.7, Hive 2.3.6, MySQL 5.7, mysql-connector-java-5.1.48.jar, R 3.1+ (optional)
    Install Java: see the prerequisite guide: https://segmentfault.com/a/11
    Install Python: Ubuntu ships with Python 3.7
    Install Scala: Download: https://downloads.lightbend.c
    Decompress: tar -zxvf followed by the downloaded Scala archive
    Configure: vi ~/.bashrc
    export SCALA_HOME=/home/lin/spark/scala-2.12.10
    export PATH=${SCALA_HOME}/bin:$PATH
    Save and exit. Apply the configuration: source ~/.bashrc
    Install […]

    Environment: Hadoop 3.1, Python 3.6, Ubuntu 18.04. Hadoop is developed in Java, and Java is the recommended way to operate HDFS. Sometimes, however, we need to operate HDFS from Python. This article discusses how to use Python to operate HDFS: uploading files, downloading files, viewing folders, and using Python to […]
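The excerpt above does not say which Python approach the article uses. One simple possibility, sketched here purely as an assumption, is to shell out to the standard `hdfs dfs` command-line tool (`-put`, `-get`, `-ls`). The helper names are hypothetical, and the `hdfs` binary is assumed to be on the `PATH`.

```python
import subprocess


def hdfs_dfs_command(*args):
    """Build the argument list for an `hdfs dfs` invocation."""
    return ["hdfs", "dfs"] + list(args)


def upload_command(local_path, hdfs_path):
    """Command that copies a local file into HDFS."""
    return hdfs_dfs_command("-put", local_path, hdfs_path)


def download_command(hdfs_path, local_path):
    """Command that copies a file from HDFS to the local file system."""
    return hdfs_dfs_command("-get", hdfs_path, local_path)


def list_command(hdfs_path):
    """Command that lists the contents of an HDFS directory."""
    return hdfs_dfs_command("-ls", hdfs_path)


def run(command):
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        command,
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; works on Python 3.6
    )
    return result.stdout

# Example (needs a running cluster): print(run(list_command("/")))
```

Keeping command construction separate from execution makes the helpers easy to check without a cluster. A dedicated client library would avoid spawning a JVM per call, at the cost of an extra dependency.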

    Every time the HDFS NameNode is formatted (hdfs namenode -format), the NameNode's cluster ID is regenerated. When that happens, first check the DataNode logs to confirm that the cluster IDs are inconsistent. Then go to the tmp/dfs/current directory of HDFS and update the cluster ID of the DataNode to […]
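The cluster ID comparison described above can be sketched as follows. HDFS stores the ID as a `clusterID=...` line in a `VERSION` file under each storage directory's `current` folder. The helper names and the idea of passing in the file contents as strings are assumptions for illustration; the actual file locations depend on your configuration.

```python
def parse_version_file(text):
    """Parse the key=value lines of a VERSION file, skipping comments and blanks."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def cluster_ids_match(namenode_version_text, datanode_version_text):
    """Return True only if both VERSION texts contain the same clusterID."""
    nn_id = parse_version_file(namenode_version_text).get("clusterID")
    dn_id = parse_version_file(datanode_version_text).get("clusterID")
    return nn_id is not None and nn_id == dn_id
```

In practice you would read the NameNode's and the DataNode's `VERSION` files from disk and pass their contents to `cluster_ids_match`. A `False` result corresponds to the mismatch reported in the DataNode log.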

    2019-07-01 16:45:24,933 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from Call#3 Retry#0
    java.io.IOException: File /a1.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1620)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3350)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:678)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:491)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    […]

    After setting up a pseudo-distributed HBase service, the system runs normally and listing all tables works, but querying the details of a table fails with: java.net.ConnectException: Call to localhost/ failed on connection exception. The error message also shows that the master node should be the name […]

    Environment configuration: setting up a Hadoop environment is really painful, and unexpected problems can arise at any time, such as: Failing this attempt. Diagnostics: Call From hadoop001/ to hadoop001:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused. It's a strange problem. All the configurations are correct; the problem lies with IPv6. Just […]