As enterprise data sources become increasingly diverse, data silos have become a common problem. When an insurance company built its customer data platform (CDP), it faced a component-heavy computing layer and scattered data storage caused by data silos. To solve these problems, the company adopted CDP 2.0 built on Apache Doris, relying on Doris' unified data warehouse capabilities to break down data silos, simplify data processing pipelines, and improve data processing efficiency.
The data silo problem is like arthritis for online businesses: almost every business develops it as it ages. Businesses interact with customers through websites, mobile apps, HTML5 pages, and end devices, and the data collected from these channels ends up in separate systems. It stays where it lands and cannot be correlated for further analysis. This is how data silos form. The larger your business grows, the more diverse your customer data sources become, and the more likely you are to be trapped in data silos.
That's exactly what happened to the insurance company discussed in this article. By 2023, they had served more than 500 million customers and signed 57 billion insurance contracts. When they began building a Customer Data Platform (CDP) to accommodate data at this scale, they used multiple components.
Data silos in CDP
Like most data platforms, their CDP 1.0 had both batch and real-time streaming pipelines. Offline data was loaded into Impala by a Spark job, where it was tagged and divided into groups. Meanwhile, Spark also sent it to NebulaGraph for OneID computation (more on this later in this article). Real-time data, on the other hand, was tagged by Flink and then stored in HBase for queries.
This resulted in a component-heavy computing layer in the CDP: Impala, Spark, NebulaGraph, and HBase.
As a result, offline tags, real-time tags, and graph data were scattered across multiple components. Integrating them to provide further data services was costly because of redundant storage and bulky data transfers. More importantly, because of the storage differences, they had to expand both the CDH cluster and the NebulaGraph cluster, which increased resource and maintenance costs.
CDP based on Apache Doris
For CDP 2.0, they decided to introduce a unified solution to clean up the mess. In the computing layer of CDP 2.0, Apache Doris handles both real-time and offline data storage and computation.
To ingest offline data, they use the Stream Load method. Their ingestion test with 30 threads showed that it can perform over 300,000 upserts per second. To load real-time data, they use a combination of Flink-Doris-Connector and Stream Load. In addition, for real-time reports that require pulling data from multiple external data sources, they leverage the Multi-Catalog feature for federated queries.
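For illustration, here is a minimal sketch of how a federated query with the Multi-Catalog feature might look, assuming a Hive metastore as the external source; the catalog name, metastore URI, and table names are placeholders rather than the company's actual setup.

-- Register an external catalog pointing to a Hive metastore (placeholder URI).
CREATE CATALOG hive_src PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://hive-metastore-host:9083"
);

-- Join external Hive data with a table in Doris' internal catalog in one query.
SELECT d.one_id, h.policy_status
FROM internal.cdp.tb_label_result d
JOIN hive_src.ods.policy_snapshot h ON d.one_id = h.one_id;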
The customer analysis workflow on this CDP is as follows: first, they collect and organize customer information; then they attach tags to each customer; finally, they group customers based on those tags for more targeted analysis and operations.
Next, I'll dig into these workloads and show you how Apache Doris accelerates them.
OneID
Have you ever run into this situation: your products and services have separate user registration systems? You might collect the email address of UserID A from one product page and the Social Security number of UserID B from another, only to find out later that UserID A and UserID B actually belong to the same person because they share the same phone number.
This is why the idea of OneID emerged. It means pooling the user registration information from all business lines into one large table in Apache Doris, sorting it out, and making sure that each user has a unique OneID.
They leverage SQL functionality in Apache Doris to determine which registrations belong to the same user.
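As a rough illustration of the idea (not the company's actual logic), assume a hypothetical table user_registration that unions registrations from all business lines; records sharing the same phone number can then be collapsed into one OneID:

-- Hypothetical table: one row per registration, across all business lines.
-- Registrations sharing a phone number are treated as the same person.
SELECT
    MIN(user_id)  AS one_id,        -- reuse the smallest user ID as the unified OneID
    phone_number,
    COUNT(*)      AS registrations  -- how many separate accounts were merged
FROM user_registration
GROUP BY phone_number;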
Tag Service
This CDP holds the information of 500 million customers, which comes from more than 500 source tables and carries more than 2,000 tags in total.
By timeliness, tags are divided into real-time tags and offline tags. Real-time tags are computed by Apache Flink and written to flat tables in Apache Doris, while offline tags are computed by Apache Doris, since they are derived from the user attribute tables, business tables, and user behavior tables stored in Doris. Below are the company's best practices in data tagging:
1. Offline tags
During peak periods of data writing, a full update is difficult at this data scale and easily causes OOM errors. To avoid this, they leverage Apache Doris' INSERT INTO SELECT functionality and enable partial column update, which significantly reduces memory consumption and keeps the system stable during data loading.
set enable_unique_key_partial_update=true;
insert into tb_label_result (one_id, labelxx)
select one_id, label_value as labelxx
from .....
2. Real-time tags
Partial column update also works for real-time tags, because different real-time tags are updated at different paces. All that is required is to set partial_columns to true.
curl --location-trusted -u root: \
    -H "partial_columns:true" \
    -H "column_separator:," \
    -H "columns:id,balance,last_access_time" \
    -T /tmp/test.csv \
    http://127.0.0.1:48037/api/db1/user_profile/_stream_load
3. High-concurrency point queries
At the current business scale, the company receives tag query requests at a concurrency of over 5,000 QPS. They use a combination of strategies to guarantee high performance. First, they use prepared statements to pre-compile and pre-execute SQL. Second, they fine-tune the parameters of the Doris backends and tables to optimize storage and execution. Finally, they enable row cache as a complement to the column-oriented storage of Apache Doris.
Fine-tune the Doris backend parameters in be.conf:
disable_storage_row_cache = false
storage_page_cache_limit = 40%
Fine-tune the table parameters when creating a table:
enable_unique_key_merge_on_write = true
store_row_column = true
light_schema_change = true
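Putting these together, here is a minimal sketch of what such a table and a point query on it might look like; the table and column names simply mirror the Stream Load example above and are not the company's actual schema.

-- Merge-on-write Unique Key table with row store enabled for point queries.
CREATE TABLE user_profile (
    id               BIGINT,
    balance          DECIMAL(20, 2),
    last_access_time DATETIME
)
UNIQUE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "store_row_column"                 = "true",
    "light_schema_change"              = "true"
);

-- A typical point query; with prepared statements, the WHERE value is bound as a parameter.
SELECT id, balance, last_access_time
FROM user_profile
WHERE id = 1001;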
4. Tag calculation (Join)
In practice, many tag services are implemented through multi-table joins in the database, typically involving more than 10 tables. To get the best computing performance, they adopt the colocation group strategy in Doris, which keeps the joined data on the same nodes and avoids shuffling data across the network.
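As a sketch with made-up table names, colocated tables are created by bucketing them identically and assigning them to the same colocation group, so joins on the bucketing key run locally:

-- Two tag tables bucketed identically and placed in the same colocation group.
CREATE TABLE tag_user_attribute (
    one_id   BIGINT,
    attr_tag VARCHAR(64)
)
UNIQUE KEY (one_id)
DISTRIBUTED BY HASH(one_id) BUCKETS 32
PROPERTIES ("colocate_with" = "tag_group");

CREATE TABLE tag_user_behavior (
    one_id       BIGINT,
    behavior_tag VARCHAR(64)
)
UNIQUE KEY (one_id)
DISTRIBUTED BY HASH(one_id) BUCKETS 32
PROPERTIES ("colocate_with" = "tag_group");

-- The join on one_id is executed node-locally, with no network shuffle.
SELECT a.one_id, a.attr_tag, b.behavior_tag
FROM tag_user_attribute a
JOIN tag_user_behavior b ON a.one_id = b.one_id;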
Customer Grouping
The customer grouping pipeline in CDP 2.0 works like this: Apache Doris receives SQL from the customer-facing service, performs the computation, and exports the result set to S3 object storage via SELECT INTO OUTFILE. The company has divided its customers into 1 million groups. A customer grouping task that used to take 50 seconds in Impala now takes only 10 seconds in Doris.
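A minimal sketch of such an export, assuming an S3 bucket and a filter condition that are purely illustrative (the exact export property names can vary across Doris versions):

-- Compute a customer group and write the result set directly to S3 as CSV.
SELECT one_id
FROM tb_label_result
WHERE labelxx = 'high_value'
INTO OUTFILE "s3://cdp-bucket/customer-groups/group_high_value_"
FORMAT AS CSV
PROPERTIES (
    "s3.endpoint"   = "s3.example.com",
    "s3.region"     = "us-east-1",
    "s3.access_key" = "your_access_key",
    "s3.secret_key" = "your_secret_key"
);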
In addition to grouping customers for more fine-grained analysis, they sometimes perform reverse analysis: for a certain customer, find out which groups he or she belongs to. This helps analysts understand the characteristics of customers and how different customer groups overlap.
In Apache Doris, this is achieved through BITMAP functions: BITMAP_CONTAINS is a quick way to check whether a customer belongs to a certain group, while BITMAP_OR, BITMAP_INTERSECT, and BITMAP_XOR are the choices for cross analysis.
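For illustration, assume a hypothetical table customer_group(group_id, customer_ids) where customer_ids is a BITMAP of member IDs; a reverse lookup and a pairwise overlap check (using the two-argument BITMAP_AND here) might look like this:

-- Reverse lookup: which groups does customer 1234567 belong to?
SELECT group_id
FROM customer_group
WHERE BITMAP_CONTAINS(customer_ids, 1234567);

-- Cross analysis: how many customers do groups 1 and 2 share?
SELECT BITMAP_COUNT(BITMAP_AND(a.customer_ids, b.customer_ids)) AS overlap_count
FROM customer_group a, customer_group b
WHERE a.group_id = 1 AND b.group_id = 2;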
Conclusion
From CDP 1.0 to CDP 2.0, the insurance company adopted Apache Doris as a unified data warehouse to replace Spark, Impala, HBase, and NebulaGraph, improving data processing efficiency by breaking down data silos and simplifying the data processing pipelines. In CDP 3.0, they hope to group customers by combining real-time tags and offline tags for more diverse and flexible analysis. The Apache Doris community and the VeloDB team will continue to be their support partners during this upgrade.