• Elasticsearch (I): basic principles and usage


    I. Basic concepts. 1. Introduction to Elasticsearch. Lucene is a full-text search engine toolkit (a framework for building full-text search engines) written in Java; "full-text" means that all text content is analyzed and indexed so that it can be searched. It processes plain-text data and provides interfaces for indexing and executing searches, but […]

  • 03 MapReduce framework principles: 3.4 The InputSplit class (source code)


    2. The InputSplit class. 1.0 Role of the class: an InputSplit logically contains all the key-value pairs provided to the mapper that processes that InputSplit […]

  • Tencent Kandian builds a real-time data warehouse and real-time query system based on Flink


    This article was compiled by community volunteer Lu Peijie from a talk given by Wang Zhanxiong, a senior engineer on the Tencent Kandian data team, at Flink Forward Asia 2020. It covers: background; architecture design; real-time data warehouse; real-time data query system; summary of real-time system application results. I. Background. 1. Business […]

  • Redis cache performance practice and summary


    I. Foreword. In Internet applications, caching has become a key component of high-concurrency architectures. This post covers typical cache usage scenarios, practical case analyses, Redis usage guidelines, and routine Redis monitoring. II. Comparison of common caches. Common caching schemes include local caching, such as HashMap / ConcurrentHashMap, Ehcache, Memcache, Guava Cache, etc. caching […]

  • How to use Horovod for multi-GPU distributed training in SageMaker Pipe mode


    Today, a variety of techniques make it possible to train deep learning models with a small amount of data, including transfer learning for image classification tasks, few-shot learning, and even one-shot learning. We can also fine-tune language models based on pre-trained BERT or GPT-2 models. However, in some […]

  • [MYCAT] As a core developer of MYCAT, how could I not write a series of MYCAT articles?


    Preface. MYCAT is built on Cobar, Alibaba’s open source product. Cobar’s stability, reliability, excellent architecture and performance, and many mature use cases gave MYCAT a good starting point from the beginning. Standing on the shoulders of giants, we can see further. Excellent open source projects and innovative ideas in […]

  • Explainer: comparing the Pulsar and Kafka architectures


    The author is David Kjerrumgaard, currently a contributor to the Splunk, Apache Pulsar, and Apache NiFi projects. Translator: [email protected]. Original link: https://searchdatamanagement…., translation authorized. About Apache Pulsar. Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation cloud-native distributed messaging and streaming platform. It integrates messaging, storage […]

  • P2P download: a Golang implementation of a fragment download server


    There are many kinds of P2P download. The simplest is to offer the whole file to others for download once it has been downloaded; with this approach, a node can act as a source only after its own download is complete. I’ve seen the file transfer system […]

  • ClickHouse – 01


    1. ClickHouse and its features. In big data processing scenarios, the technologies used for stream processing and batch processing are roughly as follows: [Figure: big data processing scenario flow] In batch processing, data from the source business system is extracted into HDFS by a data extraction tool (for example, Sqoop). In this process, […]

  • Future design plans for ElasticJob


    Overview of this article: product positioning; architecture design; adjustments to ElasticJob Lite and ElasticJob Cloud; module planning (task triggering, resource governance, task governance, product form); about the community. 1. Product positioning. ElasticJob is currently a sharding-based scheduling middleware built on scheduled tasks. ElasticJob Cloud adds resource governance capabilities. In the future, ElasticJob hopes to divide the […]

  • ElasticJob’s product positioning and the design concept of its new version


    Introduction: Scheduling is a very broad concept in computing. CPU scheduling, memory scheduling, process scheduling, and so on can all be called scheduling. It means allocating appropriate resources at a specific time to handle predetermined tasks, and it is used to trigger an application containing business logic at the appropriate moment. Scheduling is a very […]

  • Breaking news! ElasticJob 3.0.0-beta officially released


    Highlights of this issue: This week, the Apache ShardingSphere team is pleased to announce that the new versions Apache ShardingSphere ElasticJob-3.0.0-beta and ShardingSphere ElasticJob UI-3.0.0-beta have been officially released! ElasticJob is a distributed scheduling solution that provides sharding, elastic scaling, and automatic discovery of distributed tasks, as well as multiple task types based on time-driven, data-driven, […]