
The impact of data management on generative AI

WBOY
2023-06-07 11:15:36

2023 will be the year we remember as the mainstream start of the AI era, driven by the technology everyone is talking about: ChatGPT.

Generative AI language models like ChatGPT have captured our imagination because, for the first time, we can see an AI converse with us like a real person and generate essays, poetry, and other new content that strikes us as creative. Generative AI solutions have breakthrough potential to increase the speed of innovation, productivity, and efficiency in delivering value. Despite that promise, awareness of their data privacy limitations and of data management best practices still has much room for improvement.

Lately, many in the technology and security fields have sounded the alarm over the lack of understanding, and of adequate regulatory guardrails, around the use of artificial intelligence technology. Concerns have been raised about the reliability of AI tool outputs, intellectual property rights, exposure of sensitive data, and privacy and security violations.

Samsung's incident with ChatGPT made headlines when the tech giant inadvertently leaked its secrets to ChatGPT. Samsung isn't alone: a Cyberhaven study found that 4% of employees had fed sensitive company data into large language models. Many people don't realize that when their corporate data is used to train a model, the AI company may reuse that data elsewhere.

As if we needed more cybercrime fodder, cybersecurity intelligence company Recorded Future revealed: "Within days of ChatGPT's release, we identified many threat actors sharing flawed but fully functional malware, social engineering tutorials, money-making schemes, and more, all created by using ChatGPT."

On the privacy side, when a person signs up for a tool like ChatGPT, it gains access to their IP address, browser settings, and browsing activity, just like today's search engines, said Jose Blaya, director of engineering at Private Internet Access. But the stakes are higher, because such data could reveal political beliefs or sexual orientation without a person's consent, and could mean embarrassing or even career-destroying information being released.

Clearly, we need better regulations and standards for implementing these new AI technologies. But the discussion has largely overlooked the important role of data governance and data management, which can be key to the safe enterprise adoption and use of artificial intelligence.

It’s all about the data

Here are the three areas you should focus on:

With proprietary pre-trained AI models and large language models (LLMs), the core data governance and transparency issue lies in the training data. Machine learning programs built on LLMs ingest enormous data sets from many sources. The problem is that an LLM is a black box that provides little transparency into its source data. We have no way to verify the credibility of the sources, whether they carry bias, or whether they include unlawfully obtained personally identifiable information or fraudulent data. OpenAI, for example, does not share its source data. The Washington Post analyzed Google's C4 data set, which spans 15 million websites, and found dozens of objectionable sites containing inflammatory content, personally identifiable information, and other questionable material. Data governance requires transparency into data sources and assurance of the validity and trustworthiness of the knowledge derived from them. For example, your AI bot might have been trained on data from unverified sources or fake news sites, and that biased knowledge may now be feeding into your company's new policy or R&D program.

Currently, different artificial intelligence vendors have different strategies for handling user data privacy, including data isolation and data domains. Your employees may unknowingly provide data to an LLM without realizing that it will be incorporated into the model's knowledge base, so companies may unintentionally leak trade secrets, software code, and personal data to the public. Some AI solutions offer workarounds, such as APIs that protect data privacy by excluding your data from the pre-trained model, but this limits their value: the ideal use case is to augment a pre-trained model with case-specific data while keeping that data private. One solution is to have pre-trained AI tools understand the concept of data "domains": "common" training data is used for pre-training and shared between entities, while model extensions trained on "proprietary" data are safely restricted within the boundaries of the organization. Data management ensures that these boundaries are created and preserved.
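The "common" versus "proprietary" domain split described above can be made concrete by tagging every training record with a domain label and partitioning before any data leaves the organization. This is a minimal sketch under assumed names (`Record`, `partition_by_domain` are illustrative, not any vendor's API):

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    domain: str  # "common" (shareable) or "proprietary" (must stay internal)

def partition_by_domain(records):
    """Split records so only 'common' data ever crosses the org boundary."""
    shared = [r for r in records if r.domain == "common"]
    internal = [r for r in records if r.domain == "proprietary"]
    return shared, internal

records = [
    Record("Public product FAQ text", "common"),
    Record("Internal pricing model notes", "proprietary"),
]
shared, internal = partition_by_domain(records)
```

In practice the domain label would be assigned by a data-classification process, and the `internal` partition would feed only an organization-scoped model extension.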

Derivative works produced with AI make up the third area of data management, concerning the AI process and the ultimate owners of the data. Let's say I use an AI bot to solve a coding problem. Ordinarily I would know who was responsible for investigating and fixing the work, because if something were mishandled there would be a bug or error to trace. But with AI, my organization is responsible for any errors or adverse consequences resulting from the tasks I ask the AI to perform, even though we have no transparency into the process or the source data. You can't blame the machine: somewhere along the way, a human caused the mistake or bad outcome. And what about IP? Do you own the IP of a work created with a generative AI tool? How would you defend it in court? According to Harvard Business Review, the art world is already starting to file lawsuits.

Data Management Strategies to Consider Now

In these early stages, we don't know what we don't know about AI, including the risks around bad data, privacy and security, intellectual property, and other sensitive data sets. Artificial intelligence is also a broad field with multiple approaches, such as LLMs and logic-based automation. Here are some of the topics to explore through a combination of data governance policies and data management practices:

  • Pause experimentation with generative AI until you have an oversight strategy, policies, and procedures for mitigating risks and validating results.

  • Consolidate data management guidelines: Start with a solid understanding of your data, no matter where it resides. Where is your sensitive personal information and customer data? How much IP data do you have, and where are those files? Can you monitor usage to ensure these types of data are not inadvertently fed into AI tools, preventing security or privacy breaches?

Avoid providing unnecessary data to AI applications, and never share sensitive proprietary data. Lock down or encrypt IP and customer data to prevent it from being shared.
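One lightweight way to keep sensitive data from inadvertently reaching an AI tool is to redact prompts before they leave the organization. The sketch below uses simple regular expressions for illustration; the patterns and labels are assumptions, and a real deployment would rely on a vetted DLP service or library tuned to the organization's data:

```python
import re

# Illustrative patterns only; real systems need far more robust detection.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "API_KEY": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def redact(prompt: str) -> str:
    """Replace sensitive-looking substrings before text leaves the org."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label} REDACTED]", prompt)
    return prompt

print(redact("Contact jane.doe@example.com, key sk-abcdefghij0123456789"))
# → Contact [EMAIL REDACTED], key [API_KEY REDACTED]
```

A filter like this can sit in a proxy between employees and external AI APIs, with every redaction logged for the usage monitoring the guideline above calls for.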

  • Understand how, and whether, AI tools can be transparent about their data sources.

Can vendors protect your data? Google shared this statement on its blog, though the "how" remains unclear: "Whether a company is training a model on Vertex AI or building a customer service experience on Generative AI App Builder, private data is kept confidential and will not be used in the broader base model training corpus." Review each AI tool's contract terms carefully to understand whether any data you provide will be kept confidential.

  • Mark derivative works with the data owner, or with the person or department that commissioned the project. This helps because your company may ultimately be responsible for any work it produces, and you will want to know how AI was incorporated into the process and by whom.
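Marking ownership of a derivative work can be as simple as attaching provenance metadata at the moment an AI-assisted artifact is produced. A minimal sketch, with hypothetical field names and values (`tag_derivative`, `review_status`, and the example model name are all assumptions):

```python
from datetime import datetime, timezone

def tag_derivative(content: str, owner: str, tool: str, model: str) -> dict:
    """Attach provenance metadata to an AI-assisted work product so
    ownership and the tools involved remain auditable later."""
    return {
        "content": content,
        "owner": owner,  # person or department that commissioned the work
        "generated_with": {"tool": tool, "model": model},
        "created_at": datetime.now(timezone.utc).isoformat(),
        "review_status": "pending_human_review",
    }

record = tag_derivative(
    "draft code fix", "Platform Team", "chat-assistant", "example-model-v1"
)
```

Stored alongside the artifact, a record like this answers "how was AI incorporated, and by whom" if questions of responsibility or IP ownership arise later.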

  • Ensure data portability across domains. For example, a team may want to strip the IP and identifying features from its data and feed the remainder into a common training data set for future use. Automating and tracking this process is critical.

  • Stay informed of any industry regulations and guidance that are being developed, and engage with peers in other organizations to understand how they are implementing risk mitigation and data management.

  • Before embarking on any generative AI project, consult a legal expert to understand the risks and the processes to follow in the event of a data breach, privacy or intellectual property violations, malicious actors, or false or erroneous results.

Practical Approaches to Artificial Intelligence in Enterprises

Artificial intelligence is developing at an unprecedented rate, with great potential to drive innovation, reduce costs, and improve user experience. Like most powerful tools, AI must be applied carefully, in the appropriate environment, with appropriate data governance and data management measures in place to ensure security. There are not yet clear standards for data management in AI, and the area needs continued research. Before applying AI, enterprises should exercise caution and make sure they fully understand the risks of data exposure, data breaches, and related data security issues.


Statement: This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.