Tag: Pipeline

  • Robinhood’s next-generation data lake practice based on Apache Hudi


    Abstract: Robinhood’s mission is to democratize finance for all. Continuous data analysis and data-driven decision-making at every level of Robinhood underpin this mission. We have a variety of data sources: OLTP databases, event streams, and various third-party sources. Fast, reliable, secure, and privacy-centric data lake ingestion services are required to […]

  • Tencent Cloud Logstash in practice (2): synchronizing data from MySQL to Elasticsearch


    Logstash can also be used to synchronize data from relational databases such as MySQL and PostgreSQL to other storage systems. The following describes how to use Tencent Cloud Logstash to synchronize data out of MySQL. Create a pipeline: on the “Pipeline Management” page, click the “New Pipeline” button to create one, then open the pipeline configuration page, […]
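A sync like the one described is typically driven by a pipeline configuration using Logstash’s JDBC input plugin and Elasticsearch output plugin. A minimal sketch; the hostnames, credentials, table, and index names below are illustrative placeholders, not values from the article:

```
input {
  jdbc {
    # Hypothetical connection details -- replace with your own instance.
    jdbc_connection_string => "jdbc:mysql://10.0.0.5:3306/shop"
    jdbc_user => "sync_user"
    jdbc_password => "sync_password"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    statement => "SELECT id, name, price FROM products"
    schedule => "* * * * *"   # poll once a minute
  }
}
output {
  elasticsearch {
    hosts => ["http://10.0.0.6:9200"]
    index => "products"
    document_id => "%{id}"    # reuse the MySQL primary key so updates overwrite
  }
}
```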

  • Python crawler framework


    Crawler frameworks (III). Developing a crawler with the Scrapy framework takes only four steps: create a project (`scrapy startproject <name>`; project names are case-insensitive); define the target (write items.py): specify the data you want to capture; make the spider (spiders/xxspider.py): write the spider that crawls the pages; store the content (pipelines.py): design pipelines to store the crawled data. 1. […]

  • Shell (bash) script programming III: redirection


    In the previous article we covered some basics of input/output redirection and pipes; this article continues the topic of redirection. Before we start, a word about quoting in the shell. Quoting: like many programming languages, bash supports character escaping, which removes the special meaning of characters so that certain metacharacters (e.g. `&`) can appear in the […]

  • Go programming patterns I: Pipeline


    Series articles: Go programming patterns I: Pipeline; Go programming patterns II: Functional Options. package main import “fmt” /* Pipeline pattern: 1. PipeFunc is the pipeline function: it receives a channel and returns a new channel. 2. The pipeline takes (1) a data-source channel and (2) a list of pipeline functions. */ type PipeFunc func(<-chan int) <-chan int […]

  • Palfish: upgrading the machine-learning feature system with Flink


    Introduction: Flink is used for machine-learning feature engineering, solving the difficulty of serving features online, together with how SQL + Python UDFs are used in production practice. The author, Chen Yisheng, describes the upgrade of the machine-learning feature system on the Palfish platform. In terms of architecture, it changed […]

  • Pipeline communication between Linux processes


    1. Overview of inter-process communication. What is inter-process communication, and what is inter-thread communication? Inter-process communication: because processes have separate address spaces, it cannot be done purely in user space and must go through the Linux kernel. Inter-thread communication: it can be done in user space, for example through global variables. 2. IPC mechanisms used by Linux. Pipe communication: […]
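As a minimal sketch of kernel-mediated pipe IPC between related processes (POSIX-only, since it uses `os.fork`; the message text is arbitrary):

```python
import os

# Parent writes a message into an anonymous pipe; the child reads it.
# The pipe's file descriptors are inherited across fork, which is why
# anonymous pipes only work between related processes.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: use only the read end.
    os.close(w)
    data = os.read(r, 1024)
    os.close(r)
    os._exit(0 if data == b"hello from parent" else 1)
else:
    # Parent: use only the write end, then reap the child.
    os.close(r)
    os.write(w, b"hello from parent")
    os.close(w)
    _, status = os.waitpid(pid, 0)
    child_ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```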

  • Inter-process communication (IPC)


    Inter-process communication is abbreviated IPC. Two IPC methods are introduced below: pipes and System V IPC. Table of contents: pipes; what a pipe is; classification of pipes; anonymous pipes; the concept of an anonymous pipe; how anonymous pipes work; `pipe(int fd[2])`; slow reader, fast writer; fast reader, slow writer; the […]

  • HDFS data storage process


    HDFS is the Hadoop Distributed File System. The process of writing data to HDFS is as follows: 1. The client interacts with the NameNode. 1.1 The client sends a request to the NameNode; the NameNode checks whether the client has write permission. If it does, the NameNode checks whether a file with the same name already exists. If there […]

  • [Advanced MongoDB queries] The aggregation pipeline (I): a first look


    Foreword: ordinary queries can be handled with the find method, but for more complex queries or data statistics, find may not be enough; what you need then may be aggregate. What is an aggregation pipeline? The English documentation calls it the aggregation pipeline. It […]
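To show the idea that documents flow through stages, each transforming the stream, here is a toy pure-Python emulation of a `$match` + `$group` pipeline; the collection, stage helpers, and field names are illustrative, and this is not the pymongo API:

```python
# Sample documents, as they might sit in an "orders" collection.
orders = [
    {"item": "pen",    "qty": 5,  "status": "A"},
    {"item": "pen",    "qty": 10, "status": "A"},
    {"item": "pencil", "qty": 3,  "status": "D"},
]

def match(docs, predicate):
    """Toy $match stage: keep only documents satisfying the predicate."""
    return [d for d in docs if predicate(d)]

def group_sum(docs, key, field):
    """Toy $group stage with a $sum accumulator, grouping on `key`."""
    totals = {}
    for d in docs:
        totals[d[key]] = totals.get(d[key], 0) + d[field]
    return [{"_id": k, "total": v} for k, v in totals.items()]

# Roughly equivalent to:
#   db.orders.aggregate([
#       {"$match": {"status": "A"}},
#       {"$group": {"_id": "$item", "total": {"$sum": "$qty"}}},
#   ])
stage1 = match(orders, lambda d: d["status"] == "A")
result = group_sum(stage1, "item", "qty")
# result: [{"_id": "pen", "total": 15}]
```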

  • Implementing multiprocessing in Python (III)


    This article continues Thoughts on Python multitasking programming (I) and (II), taking up the topic of Python multiprocessing and presenting the last piece of Python multiprocess programming: methods of inter-process communication. Because processes have independent address spaces, they cannot access each other’s resources directly. […]

  • [Azure DevOps series] Multi-stage builds in Azure DevOps


    Stage pipelines are in fact particularly useful: we can split build, test, and deployment into multiple stages, deploying an application to multiple environments and gradually promoting it from one environment to the next. For example, you can automatically deploy to the dev environment after the unit tests run in CI, then deploy to the […]
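The dev-then-promote flow described above can be sketched in `azure-pipelines.yml`; the script commands, environment names, and `deploy.sh` helper below are hypothetical placeholders:

```yaml
# Three stages: run unit tests in CI, then promote through environments.
stages:
- stage: Build
  jobs:
  - job: Test
    steps:
    - script: make test          # hypothetical test command
      displayName: Run unit tests
- stage: DeployDev
  dependsOn: Build               # runs only after Build succeeds
  jobs:
  - deployment: Dev
    environment: dev             # hypothetical environment name
    strategy:
      runOnce:
        deploy:
          steps:
          - script: ./deploy.sh dev     # hypothetical deploy script
- stage: DeployStaging
  dependsOn: DeployDev           # gradual promotion to the next environment
  jobs:
  - deployment: Staging
    environment: staging
    strategy:
      runOnce:
        deploy:
          steps:
          - script: ./deploy.sh staging
```

Approvals and checks can additionally be attached to each environment so promotion is gated, not fully automatic.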