A detailed guide to Databricks DataInsight: an enterprise-grade, fully managed Spark big data analytics platform, with case studies

Time: 2021-06-10

Introduction: get to know Databricks DataInsight in 5 minutes

Open Source Big Data Community & Alibaba Cloud EMR live stream series, session 4

Subject: Databricks DataInsight, an enterprise-grade fully managed Spark big data analytics platform, with case studies
Lecturer: Zongze, technical expert at Alibaba Cloud and head of the open platform ecosystem enterprise team in the Computing Platform Division

Content framework:

Introduction to Databricks DataInsight
Features
Typical scenarios
Customer cases
Product demo

1、 Introduction to Databricks DataInsight

1. Introduction to Databricks
2. What is Alibaba Cloud Databricks DataInsight

01 Databricks

① Databricks was founded by the creators of Apache Spark, is the largest code contributor to Spark, and is the commercial company behind the Spark technology ecosystem.
It was founded in 2013 by the Apache Spark founding team from the AMPLab at the University of California, Berkeley.

② Core products and technologies that lead and promote the Spark open source ecosystem
Apache Spark, Delta Lake, Koalas, MLflow, One Lakehouse Platform

③ Company positioning
Databricks is a data + AI company that provides customers with data analytics, data engineering, data science, and artificial intelligence services on an integrated Lakehouse architecture
Open source vs. commercial edition: most of the company's R&D resources are invested in commercial products
Multi-cloud strategy: Databricks partners with top cloud service providers to offer data development, data analytics, machine learning, and other products as an integrated data + AI analytics platform

④ Market position
A technology unicorn and industry benchmark that sets the direction for the overall Spark technology ecosystem
One of the most anticipated technology IPOs of 2021

02 Valuation and funding history of Databricks

(from the official Databricks website)
① Series F in October 2019, valued at $6.2 billion
② Series G in early February 2021, valued at $28 billion

  • In this round, the three major cloud service providers (AWS, GCP, and Microsoft Azure) and Salesforce all participated as follow-on investors, which shows how much importance cloud vendors attach to the growth of Databricks
  • IPO expectation: an IPO is planned for 2021, and many predict that the valuation of Databricks could then reach $35 billion, or even as high as $50 billion

03 A high-quality Spark big data analytics platform jointly built by Databricks and Alibaba Cloud

  • The commercial company behind Apache Spark, founded by the original Spark team; a US technology unicorn
  • More than 5,000 customers and 450 partners worldwide, with strong brand recognition
  • Placed in the Leaders quadrant of Gartner's 2020 Magic Quadrant for Data Science and Machine Learning (DSML) Platforms

04 Databricks + Alibaba Cloud = Databricks DataInsight

Product core:

  • A fully managed big data analytics and AI platform based on commercial Spark
  • The built-in commercial Spark engine, Databricks Runtime, provides efficiency and stability at the compute layer
  • Integrates with Alibaba Cloud products to provide enterprise-grade features such as data security, elastic scaling, and monitoring and alerting

Product engine and service:

  • 100% compatible with open source Spark, and jointly developed and optimized by Alibaba Cloud and Databricks
  • Backed by a commercial SLA and 24/7 expert support from Databricks

Core components of DDI product capabilities

Key product information and advantages

2、 DDI product features

1. Overall architecture
2. Engine capabilities
3. Performance
4. Features
5. Cost

01 Alibaba Cloud Databricks DataInsight (DDI) architecture

02 Engine: enterprise-grade performance optimizations that improve compute engine efficiency and data read/write performance

Enterprise-grade high performance, stability, and reliability

03 Enterprise-grade Databricks Runtime vs. community open source Spark

04 Cost comparison of HDFS and OSS under a compute-storage separation architecture

05 Accelerating OSS access: optimizing data access performance with JindoFS

06 Interactive analysis and data aggregation with notebooks

Optimized Apache Zeppelin

  • Multi-language support: Scala, Python, Spark SQL, and R
  • Interactive analysis
  • Data visualization
  • Integrated scheduling
  • One-stop development platform
  • Multi-user collaborative development

07 Data development: job submission and workflow scheduling

  • Supports submitting jobs as JAR packages and scheduling them
  • Supports Spark, Spark Streaming, and notebook jobs
  • Mixed scheduling of different job types within one workflow
  • Supports scheduling operations and maintenance, audit logs, version control, and more
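
To make the idea of a workflow that mixes job types more concrete, the following is a minimal sketch of dependency-ordered scheduling using only the Python standard library. It is not the DDI scheduler or its API; the job names, job types, and the `execution_waves` helper are all hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: job names and job types are illustrative only.
JOB_TYPES = {
    "import_orders": "jar",
    "clean_orders": "spark",
    "daily_report": "notebook",
    "train_model": "spark",
}
# Each job maps to the set of jobs that must finish before it can start.
DEPENDENCIES = {
    "import_orders": set(),
    "clean_orders": {"import_orders"},
    "daily_report": {"clean_orders"},
    "train_model": {"clean_orders"},
}

def execution_waves(dependencies):
    """Group jobs into waves; jobs in one wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock later jobs
    return waves

waves = execution_waves(DEPENDENCIES)
for number, wave in enumerate(waves, start=1):
    print(number, [(job, JOB_TYPES[job]) for job in wave])
```

Jobs that land in the same wave could be submitted in parallel regardless of their type, which is the essence of mixed scheduling in a DAG workflow.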

08 Rich data source support

09 Metadata management

Three metadata options

3、 Typical scenarios

1. Customer pain points and how DDI addresses them
2. From the Lambda architecture to a unified batch and stream architecture
3. Evolution of the Lakehouse architecture
4. Combining DDI with other Alibaba Cloud products

01 Common pain points of customers running open source big data platforms

02 DataInsight helps customers improve productivity in four scenarios

03 Delta Lake: project background and the problems it solves

04 Big data enters the Lakehouse era

05 Using DDI to build a unified batch and stream data warehouse that simplifies complex architectures

06 Combining DDI with Alibaba Cloud products

07 Typical Databricks DataInsight architecture

Deep integration of DDI with Alibaba Cloud products (typical scenarios)

Data acquisition

Receives streaming data generated in real time as well as batch data on external cloud storage.

Data ETL

Continuously and efficiently processes incremental data, supports data rollback and deletion, and provides ACID transaction support.
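
The ETL requirements above (incremental updates, deletion, rollback, and all-or-nothing commits) can be illustrated with a toy model. The sketch below is a deliberately simplified in-memory table that keeps one snapshot per commit. It is not Delta Lake and does not use any Delta Lake or Spark API; the `VersionedTable` class and the sample rows are hypothetical.

```python
import copy

class VersionedTable:
    """Toy in-memory table keyed by "id" that keeps every committed snapshot."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Return a copy of the rows at a version (latest by default)."""
        index = self.version if version is None else version
        return copy.deepcopy(self._snapshots[index])

    def commit(self, upserts=(), deletes=()):
        """Apply one incremental batch atomically: all changes land or none do."""
        staged = self.read()  # work on a private copy of the latest snapshot
        for row in upserts:
            staged[row["id"]] = dict(row)  # a row without "id" aborts the batch
        for key in deletes:
            staged.pop(key, None)
        self._snapshots.append(staged)  # single publish step
        return self.version

    def rollback(self, version):
        """Restore an earlier snapshot by committing it as a new version."""
        self._snapshots.append(self.read(version))
        return self.version


table = VersionedTable()
v1 = table.commit(upserts=[{"id": 1, "temp": 21.5}, {"id": 2, "temp": 19.0}])
v2 = table.commit(upserts=[{"id": 2, "temp": 20.0}], deletes=[1])  # update + delete
v3 = table.rollback(v1)  # the table content matches version 1 again
print(table.read())
```

Because a batch is staged on a copy and published in one step, a failure halfway through leaves readers on the previous version, which is the property the transaction support is meant to guarantee.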

BI reporting and interactive analysis

Supports ad hoc queries, visual analysis in notebooks, and seamless integration with a variety of BI tools.

AI data exploration

Supports machine learning with MLlib and other AI scenarios in the Spark ecosystem.

Upstream and downstream connectivity

For example, upstream sources include Kafka, OSS, and EMR HDFS, and downstream targets include Elasticsearch, RDS, and OSS storage.

4、 Customer cases for typical scenarios

1. Stepone: migrating a self-built cluster to the cloud
2. Data analysis at a leading industrial manufacturing company

Customer case 01: Stepone's cloud migration to Databricks DataInsight

This architecture shows how Databricks DataInsight is used to solve big data computing problems

  • Data storage: self-built Hive data warehouse → OSS
  • Big data analysis: self-built CDH → Databricks DataInsight (fully managed Spark, high-performance runtime engine, notebook interactive analysis, workflow DAG scheduling, easy installation of Python libraries, and so on)
  • Metadata: self-built CDH → RDS MySQL metastore database, or the DDI unified metastore
  • Data migration: use DistCp or Jindo DistCp to migrate data to OSS; result data continues to be synchronized by the existing scheduled Sqoop tasks
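
As an illustration of the data migration step, here is a minimal sketch that assembles a Hadoop DistCp command line in Python. The `-update` and `-m` options are standard DistCp flags; the HDFS path, the bucket name, and the `build_distcp_command` helper are hypothetical, and the sketch assumes the cluster is already configured to write to `oss://` paths. The Jindo DistCp variant is not shown.

```python
import shlex

def build_distcp_command(source, destination, max_maps=None, update=False):
    """Return the argv list for a Hadoop DistCp copy from source to destination."""
    command = ["hadoop", "distcp"]
    if update:
        # Copy only files that are missing or differ at the destination.
        command.append("-update")
    if max_maps is not None:
        # Upper bound on the number of simultaneous map tasks.
        command += ["-m", str(max_maps)]
    command += [source, destination]
    return command

# Hypothetical warehouse path and bucket name.
command = build_distcp_command(
    "hdfs://namenode:8020/user/hive/warehouse/sales.db",
    "oss://example-bucket/warehouse/sales.db",
    max_maps=20,
    update=True,
)
print(shlex.join(command))  # shell-safe string for logs or a run book
```

Keeping the command as a list means it can be passed directly to `subprocess.run(command, check=True)` on a cluster node without any shell quoting concerns.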

Customer cost-benefit analysis

  • The fully managed Spark cluster requires no operations and maintenance, which saves labor costs (one O&M engineer and one big data engineer, and performance tuning is no longer needed)
  • Resources are three times those of the self-built machines; in addition, the overall performance of Databricks Runtime is nine times better than that of open source Spark
  • Notebook interactive analysis plus DAG workflow scheduling improve the data development and analysis experience
  • The technical solution is unified and compute is separated from storage; OSS lowers the customer's storage cost and paves the way for a future data lake and multi-engine computing architecture
  • Delta Lake solves the customer's incremental data update problem

Customer case 02: a leading industrial manufacturer of air conditioners, big data analytics solution architecture

  • Data collection and storage: receives streaming data generated in real time as well as batch data on external cloud storage
  • Data ETL: continuously and efficiently processes incremental data, supports data rollback and deletion, and provides ACID transaction support
  • BI data analysis and interactive analysis: supports queries, visual analysis in notebooks, and seamless integration with a variety of BI tools
  • Data science: supports machine learning and deep learning
  • Ecosystem integration: upstream connects to Kafka, OSS, EMR HDFS, and others; downstream connects to Elasticsearch, RDS, OSS storage, and others

This article is original Alibaba Cloud content and may not be reproduced without permission.