
SecretFlow open-sources SCQL, the first industrial-grade multi-party secure data analysis system: privacy computing as easy as writing SQL

王林
2023-04-04 12:45:06

SCQL, the first industrial-grade multi-party secure data analysis system, has been open-sourced by SecretFlow. It fills a gap in the industry and will further extend the chain of secure data collaboration and broaden the scenarios in which data value can circulate.

The rise of large models has once again made people keenly aware of the importance of data as a new factor of production. As an important way to balance data utilization and data security, privacy computing has further demonstrated its academic and practical value. In recent years, driven by policy and market demand, privacy computing technology and the industry around it have developed steadily, with applications in finance, communications, the Internet, government affairs, and medical care. Overall, however, because of technical thresholds and construction costs, genuinely large-scale production deployments exist but remain very few.

On March 29, at the first SecretFlow Open Source Community Open Day, the SecretFlow open-source framework released a new version with a feature the industry had been anticipating: the multi-party secure data analysis system SCQL (Secure Collaborative Query Language). It is the first system in the industry to bring SQL to secure multi-party computation (MPC), delivering industrial-grade multi-party secure data analysis. It is now open source in the SecretFlow GitHub community and free for developers worldwide.


Wang Lei, head of the SecretFlow framework, releases the SCQL system at the SecretFlow Open Source Community Open Day.

SCQL is an important product capability that the SecretFlow team open-sourced only after three years of work and multiple rounds of technical validation, once its performance and security met the demands of industrial-grade scenarios. The team hopes it will further extend the chain of secure data collaboration and broaden the scenarios in which data value can circulate, in particular by meeting the data analysis needs of long-tail companies and the many small and medium-sized institutions.

Wang Lei, head of the SecretFlow framework and general manager of Ant Group's Privacy Intelligent Computing Technology Department, said in an exclusive interview with Machine Heart that as privacy computing lands in industry and solves practical problems for users at scale, data analysis and BI will have broader application scenarios than AI. SQL is currently the most familiar BI analysis tool, and SecretFlow released SCQL in the hope that users can understand and use privacy computing at low cost within a familiar workflow.

Privacy computing is now entering a new stage. The security and compliance of individual technologies have been verified in pilots. Especially since the promulgation of the Twenty Data Articles, the core challenges have become combining big data with privacy computing, making privacy-computing BI usable and easy to use, and lowering the technical threshold. Only by continuing to expand the breadth and depth of applications can the field be ready for a future in which large-scale data elements are processed entirely in encrypted form.

Wang Lei said the SecretFlow team has made technical breakthroughs in SQL parsing, MPC performance optimization, and protection against inferring source data from query results, has proposed new solutions to several of these problems, and has achieved good results in real projects. He also stressed that more challenging open problems remain, and that he looks forward to more people joining in to build the privacy computing open-source community and explore more application scenarios together.

SecretFlow SCQL: the first open-source industrial-grade multi-party secure data analysis system

According to the "Intelligence Maturity of Chinese Enterprises Report (2022)", released by the China Academy of Information and Communications Technology in January this year, 84% of enterprises are still in the basic stage of digital construction and remain some distance from intelligent operations and innovative development. These enterprises have substantial BI needs.

Most BI technologies available today can protect data at rest or in transit but cannot protect data during computation. For organizations with privacy and security requirements, adding that protection breaks through the boundaries of traditional BI and lets it apply to more scenarios. As the marketization of data elements advances, privacy-computing BI analysis will face a serious industry gap.

Against this background, SecretFlow launched the SCQL project, which combines SQL, the most commonly used tool in BI analysis, with secure multi-party computation (MPC) from privacy computing. It is a first step toward bringing privacy computing into industry and applying it at scale in a large and complex ecosystem.

SCQL focuses on multi-party joint data analysis. The Trusted Execution Environment (TEE) route requires a hardware root of trust, and the maturity of domestically produced hardware still needs time to be verified and refined. By comparison, the MPC route has distinct advantages, such as stronger control over data and no reliance on special hardware. In addition, where one party's data is thin, joint analysis can improve quality by expanding the samples or the data dimensions, that is, by combining multi-party data for joint decisions, ultimately yielding better results in business effect analysis, strategy upgrades, and business model innovation. For example:

  • In financial scenarios: different financial institutions cooperate to identify whether potential customers are high-risk by querying rules such as number of loans, loan amounts, and credit records, without revealing user privacy;
  • In marketing scenarios: different platforms cooperate to build complementary user profiles, analyze user content preferences, and increase user activity through better content recommendations;
  • In medical scenarios: different hospitals, or even different departments within the same hospital, jointly analyze patients' medical records to guide registration or pre-diagnosis decisions, improving the efficiency of medical services.

However, combining SQL with MPC poses major technical challenges. First, SQL is complex: using it for privacy computing requires parsing the SQL language, which carries a high technical threshold and complicates the architecture design. Second, SQL users expect fast responses and generally want to see results right after submitting a query, yet MPC computation is slow, so how can it be optimized? Third, how can a flexible SQL query language be prevented from extracting sensitive information that data owners do not want exposed?

The SecretFlow team built the multi-party secure data analysis system SCQL on the SPU (SecretFlow Processing Unit), an abstract device at the core of its MPC technology. The SPU is the SecretFlow platform's encrypted-state computing unit and provides secure computing services to the framework. SCQL supports a SQL-like query language that inherits SQL's popularity, ease of learning, and maturity as a common data analysis language, so users can produce joint-analysis statistics while barely being aware of the MPC semantics underneath.
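To make the idea behind MPC more concrete, the sketch below shows additive secret sharing, one of the basic building blocks of MPC, used to compute a joint sum. It is a minimal illustration only: the function names and the 64-bit modulus are assumptions made for this example, all parties are simulated in one process, and it does not represent the protocols the SPU actually implements.

```python
import secrets

MODULUS = 2**64  # ring size for the shares; an illustrative choice


def share(value: int, n_parties: int) -> list[int]:
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine shares; any proper subset of them looks uniformly random."""
    return sum(shares) % MODULUS


def secure_sum(private_inputs: list[int]) -> int:
    """Simulate a joint sum in which no party sees another party's input."""
    n = len(private_inputs)
    # distributed[i][j] is the share that party i sends to party j.
    distributed = [share(v, n) for v in private_inputs]
    # Each party j locally adds the shares it received.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Only the partial sums are published and combined.
    return reconstruct(partial_sums)
```

Addition on shares is a purely local step, while operations such as comparison or multiplication need extra rounds of communication between parties, which is one reason MPC queries run slower than plaintext SQL.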

The SCQL architecture is shown in the figure below. It has two parts. SCDB, the upper part, can be regarded as SCQL's database: it translates a query into an encrypted-state execution graph and dispatches the graph to the SCQL Engines deployed at the data participants. The SCQL Engine is SCQL's execution engine: it cooperates with the other participants' SCQL Engines to execute the graph and reports the result to SCDB.

[Figure: SCQL architecture]

Specifically, an external user submits an ordinary SQL request. The Parser converts the request into an abstract syntax tree, and the Planner turns that into a logical plan. The biggest challenge is going from the logical plan to the execution graph, where the Translator must select the optimal protocol under multiple constraints. This is key to making SQL work under privacy computing: security constraints apply throughout the computation, so the Translator must jointly consider data type, data source, and data status, and the data status keeps changing as the computation proceeds.
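The translation step can be pictured with a toy model. The sketch below is an illustration under stated assumptions, not SCQL's Translator: the `Status` values, the `Column` type, and the `translate_binary_op` function are hypothetical names, and a real translator would also weigh data types, protocol cost, and security constraints. It only shows how the choice of execution strategy depends on data source and data status, and how the status of a result differs from the status of its inputs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PRIVATE = "private"  # plaintext held by exactly one party
    SECRET = "secret"    # secret-shared across the parties
    PUBLIC = "public"    # plaintext visible to every party


@dataclass(frozen=True)
class Column:
    name: str
    owner: Optional[str]  # owning party when PRIVATE, otherwise None
    status: Status


def translate_binary_op(op: str, left: Column, right: Column) -> tuple[str, Column]:
    """Pick a strategy for `left op right`; return (strategy, output column)."""
    out_name = f"({left.name} {op} {right.name})"
    statuses = {left.status, right.status}
    # Both inputs are public, so the operation can run in plaintext anywhere.
    if statuses == {Status.PUBLIC}:
        return "plaintext", Column(out_name, None, Status.PUBLIC)
    # Every private input belongs to the same party and nothing is
    # secret-shared, so that party can compute locally in plaintext.
    private = [c for c in (left, right) if c.status is Status.PRIVATE]
    if Status.SECRET not in statuses and len({c.owner for c in private}) == 1:
        owner = private[0].owner
        return f"local@{owner}", Column(out_name, owner, Status.PRIVATE)
    # Inputs span several parties or are already secret-shared: fall back
    # to an MPC protocol, and the output stays secret-shared.
    return "mpc", Column(out_name, None, Status.SECRET)


amount = Column("a.amount", "alice", Status.PRIVATE)
fee = Column("a.fee", "alice", Status.PRIVATE)
limit = Column("b.limit", "bob", Status.PRIVATE)

step1, total = translate_binary_op("+", amount, fee)   # stays local to alice
step2, over = translate_binary_op(">", total, limit)   # crosses parties
```

In this toy run the sum of two of alice's columns stays a private, locally computed value, while comparing it with bob's column forces an MPC step whose output is secret-shared. This is the kind of status migration the paragraph above describes.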

Here, the SecretFlow team introduced the CCL (Column Control List) mechanism to reconcile SQL's flexibility and functionality with the requirements of multi-party secure computing. CCL gives data owners a tool for describing, ahead of query review, the constraints on how each column of their data may be used. The data analysis engine executes a query only if those constraints are strictly met.


SCQL currently provides six types of constraints, and the SecretFlow framework will continue to improve and refine this area.
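A column-level check of this kind might look like the following sketch. Everything in it is an assumption made for illustration: the four constraint names are hypothetical and are not the six constraint types SCQL defines, and `check_query` is not an SCQL API. The point is only that each (column, grantee) pair carries a rule, and a query is rejected unless every column it reveals satisfies its rule.

```python
from enum import Enum


class Constraint(Enum):
    # Hypothetical constraint kinds, chosen for illustration only.
    PLAINTEXT = 1        # the grantee may see raw values
    AFTER_AGGREGATE = 2  # visible only after an aggregate such as SUM or COUNT
    AFTER_COMPARE = 3    # visible only as the result of a comparison
    ENCRYPTED_ONLY = 4   # usable in computation, never revealed


# Which kinds of use each constraint permits before a value is revealed.
ALLOWED_USES = {
    Constraint.PLAINTEXT: {"raw", "aggregate", "compare"},
    Constraint.AFTER_AGGREGATE: {"aggregate"},
    Constraint.AFTER_COMPARE: {"compare"},
    Constraint.ENCRYPTED_ONLY: set(),
}


def check_query(ccl: dict, grantee: str, revealed: dict) -> list[str]:
    """Return the violations for one query.

    ccl maps (column, grantee) to a Constraint; revealed maps each column
    the query would reveal to the way it is used ("raw", "aggregate", ...).
    """
    violations = []
    for column, use in revealed.items():
        # Columns with no explicit grant are denied by default.
        constraint = ccl.get((column, grantee), Constraint.ENCRYPTED_ONLY)
        if use not in ALLOWED_USES[constraint]:
            violations.append(
                f"{column}: '{use}' not permitted under {constraint.name}"
            )
    return violations


ccl = {
    ("claims.amount", "isv"): Constraint.AFTER_AGGREGATE,
    ("claims.user_id", "isv"): Constraint.ENCRYPTED_ONLY,
}
```

With this table, `check_query(ccl, "isv", {"claims.amount": "aggregate"})` reports no violations, while asking for the raw `claims.amount` values is rejected. The deny-by-default lookup is a design choice of the sketch.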

In summary, SCQL makes a solid attempt at the technical challenges of correctness, timeliness, and security, and offers the following features:

  • Easy to use and integrate: SCQL supports a SQL-like query language, so it is easy to pick up at low cost. It also provides a simple API that is easy to integrate and wrap, and supports common data sources (currently MySQL, with CSV, Postgres, Hive, and others planned), meeting a business's multi-party collaborative analysis needs at low cost;
  • Fine-grained data authorization: SCQL proposes the CCL (Column Control List) mechanism, which lets data parties authorize how their data is used, with control as fine as individual table fields (columns);
  • Rich functionality and flexible scenarios: supports most commonly used SQL syntax and functions, covering the joint analysis needs of most scenarios;
  • Production-ready performance: while protecting data privacy, SCQL applies multi-level optimizations guided by the principle of minimizing the amount of encrypted-state computation.

Ant Insurance already uses SCQL with partner insurance companies for claim verification. Built on its claims technology platform and the SecretFlow framework, the "Claims Brain" intelligent claims system has multi-party joint data analysis as one of its core modules. The module helps insurance companies and their external medical-data ISVs run joint analysis while the original data never leaves its local domain and the value of the data is protected.

In this solution, the insurance company provides "user claim data", including the insured disease type, policy effective time, and time of the incident, as well as "pre-existing disease exemption rules", which map insured disease types to their exemption rules. The ISV provides "user medical treatment data", including the diagnosed disease type and the time of treatment. The joint analysis task is described and executed using a combination of "SELECT FROM", "INNER JOIN", and "WHERE" clauses and comparison expressions, determining whether a user meets the claim conditions while protecting the data privacy of both the insurer and the ISV.
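The shape of such a query can be sketched with ordinary SQL. The example below runs in plaintext on an in-memory SQLite database purely to show the query structure; in SCQL the two tables would belong to different parties and the join and comparison would run under MPC. The table names, column names, sample rows, and the pre-existing-condition rule itself are all hypothetical, not the actual "Claims Brain" schema or rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Held by the insurance company (hypothetical schema).
    CREATE TABLE claims (
        user_id TEXT, insured_disease TEXT,
        policy_effective TEXT, incident_time TEXT
    );
    -- Held by the medical-data ISV (hypothetical schema).
    CREATE TABLE treatments (
        user_id TEXT, diagnosed_disease TEXT, treatment_time TEXT
    );
    INSERT INTO claims VALUES
        ('u1', 'diabetes', '2022-01-01', '2022-06-10'),
        ('u2', 'asthma',   '2022-03-01', '2022-09-05');
    INSERT INTO treatments VALUES
        ('u1', 'diabetes', '2021-11-20'),
        ('u2', 'asthma',   '2022-08-30');
    """
)

# Flag claims where the insured disease was diagnosed before the policy
# took effect (an illustrative pre-existing-condition rule).
QUERY = """
SELECT c.user_id
FROM claims AS c
INNER JOIN treatments AS t ON c.user_id = t.user_id
WHERE t.diagnosed_disease = c.insured_disease
  AND t.treatment_time < c.policy_effective
"""


def flagged_users(connection: sqlite3.Connection) -> list[str]:
    """Return the user IDs whose claims match the exemption rule."""
    return [row[0] for row in connection.execute(QUERY)]
```

The dates are stored as ISO 8601 strings so that string comparison orders them correctly. In this sample data only `u1` was diagnosed before the policy took effect.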

This customized multi-party joint analysis solution for health insurance helps surface positive leads, reduce the risk of wrongful payouts, and control claims operating costs through digital investigation and review.

Going forward, the SecretFlow open-source community will publish SCQL co-construction tasks, including adding data sources (such as CSV files), improving syntax and functions, and enhancing ease of use. Readers are welcome to follow the SecretFlow GitHub community, and can submit their own co-construction proposals through pull requests to become part of the effort and help build and improve SCQL.

Expanding data circulation scenarios requires privacy computing that is practical and easy to use

Wang Lei believes that privacy computing technology and its market are still in their infancy, and that the industry's understanding of privacy computing remains insufficient. SecretFlow hopes to set an easy-to-use, general-purpose benchmark and help activate applications across the privacy computing industry. Of course, "Security is the core of privacy computing. Only on the premise of ensuring security can we talk about accuracy, performance, and ease of use. Otherwise, it can be achieved with other technologies."

The idea of generality has run through SecretFlow's research and development from beginning to end.

Wang Lei explained that Ant Group began exploring privacy computing in 2016, practicing and reflecting through internal business use and industry research. Along the way, engineers found that privacy computing has many technical routes and differing architectures. An ideal privacy computing architecture should therefore first be complete: it should support the mainstream technical frameworks and leave room for new technologies. Second, to allow continuous iteration, the architecture must be decoupled from the bottom layer to the top. It should also have a well-layered design that separates security from algorithms, makes application easier, and broadens the reach of privacy computing while lowering the barrier to participation. Business integration and large-scale production capability matter as well, which calls for good interface design and production features such as grayscale release, rollback, elastic scaling, and multi-version management.

Under this concept, SecretFlow was proposed as a general-purpose privacy computing framework. It adheres to the following principles so that the framework is as inclusive and extensible as possible and can accommodate the future development of privacy computing technologies and applications.

  • Completeness: Supports a variety of privacy computing technologies and can be flexibly assembled to meet the needs of different scenarios.
  • Transparency: Build a unified technical framework and keep iteration of the underlying technology as transparent to the upper layers as possible, with high cohesion and low coupling.
  • Openness: People with different professional directions can easily participate in the construction of the framework and jointly accelerate the development of privacy computing technology.
  • Connectivity: Data in scenarios supported by different underlying technologies can be connected to each other.

The SecretFlow framework supports the current mainstream privacy computing routes so that it can better adapt to the needs of different scenarios. This also makes it easier to integrate and migrate between routes and to combine their strengths. At the same time, at a higher planning level, SecretFlow is designing a technical solution for the "separation of three rights" over data elements (ownership, usage rights, and operation rights), in order to realize technically the guiding principle proposed in the Twenty Data Articles.

Privacy computing is not about showy technique; it is about working out what the industry actually needs. Wang Lei's team has been thinking about how to provide solutions that are more secure, more efficient, better performing, and more flexible. Large-scale applications in industry currently fall into two categories: BI and AI. BI can be subdivided further, for example into traditional SQL data analysis, Python-based data analysis, big data processing, and stream and batch processing. Privacy-preserving machine learning for AI scenarios is by now relatively mature, with many technical solutions and products on the market.

Consider institutions with smaller data holdings. They tend to start with data volumes in the millions or tens of millions, because that scale covers many application scenarios and is more feasible in terms of investment and return.

"Because small and medium-sized institutions are at an early stage of digitalization and their data is still at small-sample scale, AI machine learning methods are unnecessary and not cost-effective, and BI analysis centered on the SQL language is the most feasible solution."

For large-scale data scenarios, BI data analysis is likewise an indispensable method. "From big data to small data, SCQL can meet the needs of secure encrypted-state data analysis," Wang Lei emphasized.

Of course, SecretFlow will keep iterating on AI applications as well. For example, because the XGB algorithm is widely used in industry, a faster version will be released.

Wang Lei said that the future of privacy computing will necessarily involve technology integration, both across technical fields and across technical tools. Viewed across the whole privacy computing stack, the eventual solution will combine multiple technologies to solve different problems, with the technology for each scenario chosen according to its deployment conditions and security requirements.

SecretFlow hopes to use technology to build the industry's trust in privacy computing as infrastructure for data element circulation, and to support a range of applications through both the hub model and the pipeline model, with a view to supporting large-scale expansion across the industry.


SecretFlow is open source for two reasons: so that more people can use privacy computing, and so that more people can build the community together. The number of co-construction participants is still relatively small, so SecretFlow plans this year to open up more directions for co-construction and improve the co-construction process. The team looks forward to exploring more possibilities for privacy computing with everyone.

SecretFlow official website:

https://www.secretflow.org.cn

SecretFlow community:

https://github.com/secretflow

https://gitee.com/secretflow


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for removal.