
As enterprise data sources become increasingly diverse, data silos have become a common problem. When insurance companies build customer data platforms (CDPs), data silos lead to component-intensive computing layers and scattered data storage. To solve these problems, the insurance company discussed in this article adopted CDP 2.0 built on Apache Doris, using Doris' unified data warehouse capabilities to break down data silos, simplify data processing pipelines, and improve data processing efficiency.

Breaking down data silos using a unified data warehouse: CDP based on Apache Doris

The data silo problem is like arthritis for online businesses: almost every business encounters it as it ages. Businesses interact with customers through websites, mobile apps, HTML5 pages, and end devices. For various reasons, integrating data from all these sources is tricky. The data stays where it was collected and cannot be correlated for further analysis. This is how data silos form. The larger your business becomes, the more diverse your sources of customer data are, and the more likely you are to become trapped in data silos.

That’s exactly what happened to the insurance company I’m going to discuss in this article. By 2023, they had served more than 500 million customers and signed 57 billion insurance contracts. When they began building their Customer Data Platform (CDP) to accommodate data at such a scale, they used multiple components.

Data silos in CDP

Like most data platforms, their CDP 1.0 has both batch pipelines and real-time streaming pipelines. Offline data is loaded into Impala via a Spark job, where it is labeled and divided into groups. At the same time, Spark also sends it to NebulaGraph for OneID calculation (more on this later in this article). On the other hand, real-time data is tagged by Flink and then stored in HBase for query.

This results in a component-intensive computing layer in the CDP: Impala, Spark, NebulaGraph, and HBase.

As a result, offline tags, real-time tags, and graph data are scattered across multiple components. Integrating them to provide further data services is costly due to redundant storage and large data transfers. More importantly, due to storage differences, they had to expand both the CDH cluster and the NebulaGraph cluster, increasing resource and maintenance costs.

CDP based on Apache Doris

For CDP 2.0, they decided to introduce a unified solution to clean up the mess. In the computing layer of CDP 2.0, Apache Doris is responsible for real-time and offline data storage and calculation.

To ingest offline data, they utilize the Stream Load method. Their 30-thread ingestion test showed that it can perform over 300,000 upserts per second. To load real-time data, they use a combination of Flink-Doris-Connector and Stream Load. Additionally, for real-time reports that require pulling data from multiple external data sources, they leverage the Multi-Catalog feature for federated queries.
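As a rough illustration of such a federated query, the sketch below registers an external Hive catalog and joins it with an internal Doris table. The catalog name, database names, and table names are all hypothetical, and the exact catalog properties depend on the Doris version:

```sql
-- Register a hypothetical external Hive catalog (Multi-Catalog feature).
CREATE CATALOG hive_src PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hive-metastore:9083"
);

-- Query an external table and an internal Doris table together,
-- without moving the data into Doris first.
SELECT p.one_id, p.city, o.policy_count
FROM internal.db1.user_profile p
JOIN hive_src.ods.policy_summary o ON p.one_id = o.one_id;
```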

The customer analysis workflow on this CDP is as follows. First, they organize customer information and then label each customer. They group customers according to tags for more targeted analysis and actions.

Next, I'll dig into these workloads and show you how Apache Doris accelerates them.

One ID

Have you ever encountered this situation when your products and services have different user registration systems? You collect User ID A's email from one product page, and User ID B's Social Security number from another. You then discover that User ID A and User ID B actually belong to the same person because they use the same phone number.

This is why the idea of OneID emerged: collect the user registration information from all business lines into one large table in Apache Doris, sort it out, and ensure that each user has a unique OneID.

This is how they leverage functionality in Apache Doris to determine which registrations belong to the same user.
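The article does not spell out the exact OneID logic, so the following is only a minimal single-pass sketch of the idea, assuming hypothetical registrations and one_id_mapping tables and using the phone number as the only linking key. Real identity resolution typically requires iterative or graph-based computation to follow chains of identifiers across several keys:

```sql
-- Simplified sketch: all registrations from every business line land in one
-- table, and users sharing a phone number are assigned the same OneID
-- (the smallest user id among them).
INSERT INTO one_id_mapping (user_id, one_id)
SELECT r.user_id, MIN(r2.user_id) AS one_id
FROM registrations r
JOIN registrations r2 ON r.phone = r2.phone
GROUP BY r.user_id;
```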

Tag Service

This CDP accommodates the information of 500 million customers, which comes from more than 500 source tables and carries more than 2,000 tags in total.

Based on timeliness, tags can be divided into real-time tags and offline tags. Real-time tags are computed by Apache Flink and written to flat tables in Apache Doris, while offline tags are computed by Apache Doris, as they originate from user attribute tables, business tables, and user behavior tables in Doris. The following are the company’s best practices in data tagging:

1. Offline tags

During peak periods of data writing, a full update is very difficult to perform due to the large data scale, and it can easily cause OOM errors. To avoid this, they leverage Apache Doris' INSERT INTO SELECT functionality and enable partial column updates. This significantly reduces memory consumption and maintains system stability during data loading.

set enable_unique_key_partial_update=true;
insert into tb_label_result(one_id, labelxx)
select one_id, label_value as labelxx
from .....

2. Real-time tags

Partial column updates can also be used for real-time tags, because different real-time tags update at different frequencies. All that is required is to set partial_columns to true.

curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://127.0.0.1:48037/api/db1/user_profile/_stream_load

3. High concurrency point query

At the current business scale, the company receives tag query requests at a concurrency of over 5,000 QPS. They use a combination of strategies to ensure high performance. First, they use prepared statements to precompile and pre-execute SQL. Second, they fine-tune the parameters of the Doris backend and the tables to optimize storage and execution. Finally, they enable row caching as a complement to column-oriented storage in Apache Doris.

Fine-tune Doris’ backend parameters be.conf:

disable_storage_row_cache = false
storage_page_cache_limit = 40%

Fine-tuning table parameters when creating a table:

enable_unique_key_merge_on_write = true
store_row_column = true
light_schema_change = true
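For context, the three table parameters above go into the PROPERTIES clause at table creation time. A hypothetical user_profile table (the column and bucket choices are illustrative, not from the article) might be created like this:

```sql
-- Merge-on-write unique key table with row storage enabled,
-- tuned for high-concurrency point queries on id.
CREATE TABLE user_profile (
    id BIGINT,
    balance DECIMAL(18, 2),
    last_access_time DATETIME
)
UNIQUE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "store_row_column" = "true",
    "light_schema_change" = "true"
);
```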

4. Tag calculation (Join)

In practice, many tag services are implemented through multi-table joins in the database, typically involving more than 10 tables. To obtain the best computing performance, they adopted the colocation group strategy (Colocate Join) in Doris.
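As an illustration of the colocation strategy, the sketch below creates two hypothetical tables in the same colocation group. Tables in one group must share the same distribution columns and bucket count, so joins on one_id can then be executed locally on each node without shuffling data:

```sql
-- Two hypothetical tables placed in the same colocation group "cdp_group".
CREATE TABLE user_tags (
    one_id BIGINT,
    tag_value VARCHAR(64)
)
DUPLICATE KEY(one_id)
DISTRIBUTED BY HASH(one_id) BUCKETS 32
PROPERTIES ("colocate_with" = "cdp_group");

CREATE TABLE user_behavior (
    one_id BIGINT,
    event VARCHAR(64)
)
DUPLICATE KEY(one_id)
DISTRIBUTED BY HASH(one_id) BUCKETS 32
PROPERTIES ("colocate_with" = "cdp_group");

-- A join on the distribution key can now run without a network shuffle.
SELECT t.one_id, t.tag_value, b.event
FROM user_tags t JOIN user_behavior b ON t.one_id = b.one_id;
```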

Customer Grouping

The customer grouping pipeline in CDP 2.0 works like this: Apache Doris receives SQL from the customer service, performs the computation, and exports the result set to S3 object storage via SELECT INTO OUTFILE. The company has divided its customers into 1 million groups. A customer grouping task that used to take 50 seconds in Impala now takes only 10 seconds in Doris.
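A sketch of such an export is shown below. The table, bucket path, and credentials are placeholders, and the exact property names for S3 access vary across Doris versions:

```sql
-- Export one customer group's result set to S3 as CSV files
-- with the given filename prefix.
SELECT one_id, tag_value
FROM user_tags
WHERE tag_value = 'high_value'
INTO OUTFILE "s3://cdp-bucket/groups/high_value_"
FORMAT AS CSV
PROPERTIES (
    "s3.endpoint" = "http://s3.example.com",
    "s3.region" = "us-east-1",
    "s3.access_key" = "<access-key>",
    "s3.secret_key" = "<secret-key>"
);
```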

In addition to grouping customers for more fine-grained analysis, sometimes they also perform reverse analysis. That is, for a certain customer, find out which groups he/she belongs to. This helps analysts understand the characteristics of customers and how different customer groups overlap.

In Apache Doris, this is achieved through BITMAP functions: BITMAP_CONTAINS is a quick way to check whether a customer belongs to a certain group, while BITMAP_OR, BITMAP_INTERSECT, and BITMAP_XOR are the choices for cross analysis.
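Assuming a hypothetical customer_group table that stores each group's members as a BITMAP column, the checks described above might look like this (BITMAP_AND here computes the pairwise overlap of two groups):

```sql
-- Reverse analysis: does customer 123 belong to group 1001?
SELECT BITMAP_CONTAINS(members, 123)
FROM customer_group
WHERE group_id = 1001;

-- Cross analysis: customers present in both group 1001 and group 1002.
SELECT BITMAP_TO_STRING(BITMAP_AND(a.members, b.members))
FROM customer_group a, customer_group b
WHERE a.group_id = 1001 AND b.group_id = 1002;
```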

Conclusion

From CDP 1.0 to CDP 2.0, the insurance company used the unified data warehouse Apache Doris to replace Spark, Impala, HBase, and NebulaGraph, improving data processing efficiency by breaking down data silos and simplifying data processing pipelines. In CDP 3.0, they hope to group customers by combining real-time tags and offline tags for more diverse and flexible analysis. The Apache Doris community and the VeloDB team will continue to be supporting partners during this upgrade.
