
Privacy Protection: AI Anonymizes Healthcare Clinical Data


Since the onset of the COVID-19 pandemic, we have witnessed record-breaking data breaches, and a recent IBM report found that the cost of those breaches is also rising sharply.

Healthcare is undoubtedly one of the industries hit hardest by data breaches, with each incident costing an average of $9.2 million. The information most often exposed in these breaches is sensitive customer data.

Pharmaceutical and healthcare companies must operate under strict guidance on protecting patient data, so any breach can be costly. For example, companies collect, process and store personally identifiable information (PII) throughout the drug discovery phase, and once trials conclude and clinical applications are submitted, they must take care to protect patient privacy in the published results.

The European Medicines Agency (EMA) Policy 0070 and Health Canada's "Public Release of Clinical Information" guidance both make specific recommendations on data anonymization, aiming to minimize the risk that published results can be used to re-identify patients.

In addition to advocating for data privacy, these regulations also require the sharing of trial data so that the research community can build on it. This undoubtedly puts companies in a dilemma.

So, how do pharmaceutical companies strike a balance between data privacy and transparency while publishing research results in a timely, cost-effective and efficient manner? In practice, AI can take on more than 97% of the workload in the submission process, greatly reducing companies' operational burden.

Why is it so difficult to anonymize clinical study reports (CSRs)?

In the process of implementing anonymization of clinical submissions, companies mainly face three core challenges:

Unstructured data is difficult to process: Most clinical trial data is unstructured. Study results contain large amounts of text, scanned images and tables, which makes processing inefficient. Study reports often run to thousands of pages, and identifying sensitive information in them is like finding a needle in a haystack. Furthermore, there are no standardized, off-the-shelf training solutions that can automate this kind of processing.

Manual processes are cumbersome and error-prone: Today, pharmaceutical companies employ hundreds of people to anonymize clinical study submissions. The team must work through more than 25 complex steps, and a typical summary document can take up to 45 days to process. When reviewers work through thousands of pages by hand, the tedium often leads to errors.

Open interpretation of regulatory guidelines: Although the regulations offer many detailed recommendations, they remain incomplete. For example, Health Canada's "Public Release of Clinical Information" guidance requires that the risk of re-identification be kept below 9%, but it does not specify how that risk should be calculated.

Below, we look at concrete solutions to these anonymization needs from a problem-solving perspective.

Using augmented analytics to identify sensitive information in human language

The following three elements help build technology-driven anonymization solutions:

AI language models for natural language processing (NLP)

Nowadays, AI can create like an artist and diagnose like a doctor. Deep learning has driven many advances in AI, and language models are one of its backbones. As algorithms designed to process human language, they are particularly good at detecting named entities such as patient names, social security numbers and zip codes.

Almost without us noticing, these powerful AI models have spread into every corner of the public domain and are trained at scale on public documents. Beyond the well-known Wikipedia, the MIMIC-III v1.4 database, which contains de-identified data on 40,000 patients, has also become a valuable resource for training such models. Of course, to improve performance, domain experts still need to retrain the models on internal clinical trial reports.
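As a rough illustration of how such a model flags PII in free text, the sketch below combines the open-source spaCy library with simple regular expressions. The model name, entity labels and patterns are illustrative assumptions, not part of any specific production pipeline described in this article.

```python
# Minimal sketch: flagging candidate PII in clinical text with an NLP model.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative; a clinically tuned model would do better

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # US social security numbers
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")    # US zip codes

def find_candidate_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, span_text) pairs for likely PII mentions."""
    doc = nlp(text)
    hits = [(ent.label_, ent.text) for ent in doc.ents
            if ent.label_ in {"PERSON", "GPE", "DATE", "ORG"}]
    hits += [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    hits += [("ZIP", m.group()) for m in ZIP_RE.finditer(text)]
    return hits

if __name__ == "__main__":
    sample = "Patient John Doe (SSN 123-45-6789) of Boston, zip 02115, enrolled on 3 May 2021."
    print(find_candidate_pii(sample))
```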

Improving accuracy through human-in-the-loop design

The 9% risk threshold proposed by Health Canada translates roughly into a model accuracy requirement of about 95% (usually measured by recall or precision). AI algorithms can work through large amounts of data and run multiple training cycles to improve their accuracy. Technical improvements alone, however, are not enough to make them ready for clinical use; these models also require human guidance and support.
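For concreteness, precision and recall over detected PII spans can be checked against a small hand-labeled sample, as in the hedged sketch below; the span values are purely illustrative.

```python
# Sketch: measuring precision and recall of detected PII spans against a labeled sample.
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Both sets hold normalized PII spans (e.g. 'john doe', '123-45-6789')."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Illustrative values only:
gold = {"john doe", "123-45-6789", "02115"}
predicted = {"john doe", "02115", "boston"}
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```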

To address the subjectivity of clinical trial data and improve outcomes, these analytics solutions are designed to work alongside people, an approach known as augmented intelligence. Humans stay in the loop: they label data and train the model, and they keep providing feedback once the solution is in use, which steadily improves its accuracy and output quality.
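As a hedged sketch of how such a loop might be wired, the snippet below routes low-confidence entity predictions to a human reviewer and stores the corrections for retraining. The confidence threshold, queue and function names are all hypothetical.

```python
# Sketch of a human-in-the-loop review step (names and thresholds are illustrative).
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.90  # below this, a human reviewer decides

@dataclass
class Prediction:
    span: str
    label: str
    confidence: float

@dataclass
class ReviewQueue:
    pending: list = field(default_factory=list)
    corrections: list = field(default_factory=list)  # fed back into retraining

def triage(predictions: list[Prediction], queue: ReviewQueue) -> list[Prediction]:
    """Auto-accept confident predictions; send uncertain ones to a human."""
    accepted = []
    for p in predictions:
        if p.confidence >= CONFIDENCE_THRESHOLD:
            accepted.append(p)
        else:
            queue.pending.append(p)  # a reviewer confirms or corrects these later
    return accepted

def record_review(queue: ReviewQueue, prediction: Prediction, corrected_label: str) -> None:
    """Store the human decision so the model can be retrained on it."""
    queue.corrections.append((prediction.span, corrected_label))
```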

Solving problems through collaboration

Let’s assume that a study involves 1,000 patients, 980 of whom are from the continental United States and the remaining 20 from South America. Does the data of these 20 patients need to be redacted (blacked out) or anonymized? Should patient samples be grouped within the same country or continent? In what ways might an attacker combine this anonymized information with age, postal code and other data to ultimately re-identify a patient?
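One hedged way to reason about questions like these is to check how many records share each combination of quasi-identifiers (for example region, age band and postal prefix); the small groups are the risky ones. The pandas-based sketch below is illustrative only, and the column names and toy data are assumptions.

```python
# Sketch: spotting small equivalence classes among quasi-identifiers (illustrative columns).
import pandas as pd

records = pd.DataFrame({
    "region":     ["US"] * 8 + ["South America"] * 2,
    "age_band":   ["40-49", "40-49", "50-59", "50-59", "60-69",
                   "60-69", "40-49", "50-59", "40-49", "60-69"],
    "zip_prefix": ["021", "021", "100", "100", "941",
                   "941", "021", "100", "110", "110"],
})

# Group size k for each combination of quasi-identifiers; k = 1 means a unique,
# easily re-identifiable record.
class_sizes = (records
               .groupby(["region", "age_band", "zip_prefix"])
               .size()
               .rename("k"))

risky = class_sizes[class_sizes < 5]  # the threshold here is an illustrative choice
print(risky)
```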

Unfortunately, there are no standard answers to these questions. To interpret clinical submission guidance more clearly, pharmaceutical manufacturers, contract research organizations (CROs), technology solution providers and academic researchers need to collaborate closely.

AI-driven anonymization method

With these basic ideas in place, the next step is to piece them together into an end-to-end solution. The technologies in the anonymization pipeline should build on methods we already use in practice.

Clinical study reports contain a variety of structured data (numeric and identity entities, such as demographic information and address entries), as well as the unstructured data elements we discussed previously. All of it must be handled properly to prevent attackers from recovering sensitive named entities. Structured data is relatively easy to process; the real difficulty for AI algorithms lies in the unstructured data.

Unstructured data (usually scanned images or PDFs) is therefore first converted into machine-readable text using technologies such as optical character recognition (OCR) or computer vision. AI algorithms are then applied to the documents to detect personally identifiable information. To improve performance, users can give feedback on sample results so the system learns how to handle lower-confidence cases.
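A hedged end-to-end sketch of that first stage might look like the following, chaining pytesseract OCR to an entity detector similar to the one sketched earlier. The file path, helper names and redaction rules are assumptions for illustration.

```python
# Sketch: OCR a scanned page, then run PII detection on the extracted text.
# Assumes Tesseract is installed along with: pip install pytesseract pillow spacy
from PIL import Image
import pytesseract
import spacy

nlp = spacy.load("en_core_web_sm")  # illustrative general-purpose model

def extract_text(image_path: str) -> str:
    """Convert a scanned page into plain text via OCR."""
    return pytesseract.image_to_string(Image.open(image_path))

def redact_candidates(text: str) -> str:
    """Replace detected person names and locations with placeholder tags."""
    doc = nlp(text)
    redacted = text
    for ent in reversed(doc.ents):  # replace from the end so offsets stay valid
        if ent.label_ in {"PERSON", "GPE"}:
            redacted = redacted[:ent.start_char] + f"[{ent.label_}]" + redacted[ent.end_char:]
    return redacted

# Example usage (the path is hypothetical):
# page_text = extract_text("csr_page_0412.png")
# print(redact_candidates(page_text))
```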

[Figure: AI-driven anonymization method]

After anonymization is completed, the residual risk of re-identification must also be assessed. This usually means referencing the background population and combining data from other, similar trials. The assessment focuses on three standard risk scenarios, the prosecutor, the journalist and the marketer, each of whom would try to re-identify patients for their own purposes.

Until the measured risk falls below the recommended 9% threshold, the anonymization process keeps iterating, adding business rules and algorithm improvements in successive cycles. By integrating with other applications and establishing a machine learning operations (MLOps) process, the entire anonymization solution can then be folded into the actual workflow.
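As a rough sketch of that iterative check, the snippet below estimates a simple prosecutor-style risk as the share of records sitting in small equivalence classes, then keeps generalizing a quasi-identifier until the estimate drops below 9%. The generalization step, column names and default k are purely illustrative assumptions.

```python
# Sketch: iterate generalization until an (illustrative) re-identification risk estimate
# falls below the 9% threshold referenced in Health Canada guidance.
import pandas as pd

THRESHOLD = 0.09

def risk_estimate(df: pd.DataFrame, quasi_ids: list[str], k: int = 11) -> float:
    """Fraction of records in equivalence classes smaller than k (a crude proxy: 1/11 < 9%)."""
    sizes = df.groupby(quasi_ids)[quasi_ids[0]].transform("size")
    return float((sizes < k).mean())

def generalize_zip(df: pd.DataFrame) -> pd.DataFrame:
    """One illustrative generalization step: drop the last digit of the zip prefix."""
    out = df.copy()
    out["zip_prefix"] = out["zip_prefix"].str[:-1]
    return out

def anonymize(df: pd.DataFrame, quasi_ids: list[str]) -> pd.DataFrame:
    while risk_estimate(df, quasi_ids) > THRESHOLD and df["zip_prefix"].str.len().max() > 0:
        df = generalize_zip(df)  # in practice: richer rules, suppression and human review
    return df
```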

A more difficult challenge than algorithms: data quality

For pharmaceutical companies, such anonymization solutions can shorten the submission cycle by up to 97%. More importantly, this semi-automated workflow improves efficiency while ensuring human involvement. But what are the biggest challenges in building AI-powered anonymization solutions?

In fact, as in most data science practice, the biggest obstacle is not the AI algorithm used to identify named entities but converting study reports into high-quality data the AI can process. Content ingestion pipelines often struggle with documents that vary in format, style and structure.

AI anonymization solutions therefore need constant fine-tuning to adapt to new document encodings and to accurately detect where images and tables begin and end in scans. This is by far the most time-consuming and labor-intensive part of AI anonymization.

New challenges of anonymization in clinical research

As technology advances rapidly, will anonymizing clinical research keep getting easier and more efficient? AI-driven solutions are impressive, but new challenges will demand attention.

First, consumer data collected through social media, device usage and online tracking is greatly increasing the risk of re-identification. Attackers can combine this public information with clinical research data to accurately identify patients. More worrying still, malicious hackers are quick to adopt AI themselves and may even stay a step ahead of pharmaceutical companies.

Finally, regulations continue to evolve to accommodate country-specific practices. Some countries may soon announce specific rules on the anonymization of clinical submissions, which will certainly add complexity and cost for companies trying to stay compliant. But as the saying goes, the future is bright even if the road is winding. The maturing of AI technology at least gives the industry hope of overcoming these problems.
