Hidden Gem
Background Review: Previous articles stated that "the goal of IntaLink is to achieve automated data linking in the field of data integration." From that discussion, it is clear that IntaLink addresses the problem of automatically linking relational data across multiple tables.
Now, let's discuss whether this problem has broad application scenarios, or whether it is merely a contrived question with no practical demand behind it.
Large models, big data platforms, and related technologies can exploit many kinds of information, including documents, images, audio, and video; multimodal generative AI can even produce videos and hold voice conversations. However, their results are often open-ended and subjective, and occasionally suffer from "hallucinations." Using them for reference or assistance is acceptable, but in rigorous working environments we cannot rely on such information, or on large models, to complete tasks. In sectors like banking, finance, transportation, trading, accounting, production, and energy, core business data must be managed as structured relational data.
(1) The Design Paradigm of Relational Databases requires data to be reasonably decomposed to avoid significant redundancy. If the data generated during the construction phase were heavily redundant, not only would the data collection workload be duplicated, but data consistency would also be difficult to ensure. From another perspective, if all related data were stored in a single table, yet the data items came from different business sources, with different collectors and generation times, maintaining such records would be impossible. Data construction therefore naturally organizes data around objects and business activities, distributing it across different tables.
(2) Data Must Originate from Multiple Systems. Since information systems are not built all at once, development inevitably proceeds in stages; even within the same system, implementation timelines can vary. Moreover, different application scenarios call for different technology choices; for instance, business data, real-time data, and log information may be implemented with different technologies, making data inherently multi-sourced.
(3) Data Must Be Integrated for Application. The demand for data integration applications takes many forms. For example, integrating production data with planning data can assess plan completion; integrating production data with sales data can identify product backlogs or the fulfillment of order deliveries; and integrating production data with financial data can evaluate production costs and profitability. Data integration is therefore the most effective way to maximize data value and empower business processes.
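The kind of integration described above boils down to joining tables from different sources. The sketch below illustrates the production-versus-plan example with SQLite; the table and column names (`plan`, `production`, `planned_qty`, `produced_qty`) are invented for illustration, not taken from any real system.

```python
import sqlite3

# Hypothetical tables for the production-vs-plan example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plan (product TEXT, month TEXT, planned_qty INTEGER);
CREATE TABLE production (product TEXT, month TEXT, produced_qty INTEGER);
INSERT INTO plan VALUES ('widget', '2024-01', 100);
INSERT INTO production VALUES ('widget', '2024-01', 80);
""")

# Integrating the two sources answers "how complete is the plan?"
row = conn.execute("""
SELECT p.product,
       pr.produced_qty * 100.0 / p.planned_qty AS completion_pct
FROM plan p
JOIN production pr ON pr.product = p.product AND pr.month = p.month
""").fetchone()
print(row)  # -> ('widget', 80.0)
```

The same join pattern, with different tables, answers the sales-backlog and cost-profitability questions mentioned above.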
In summary, the integration application of relational data will remain one of the most important data application scenarios for a long time. As long as this scenario exists, IntaLink will have broad adaptability.
T2SQL (Text to SQL) and NL2SQL (Natural Language to SQL) automatically generate the required data queries from text or natural language input. The two terms describe the same concept: using AI to turn semantic understanding into data operations. This is an active research direction in data applications, and with the emergence of large model technologies it has advanced significantly in recent years. I have studied technical reports from Alibaba and Tencent and tried open-source projects such as DB-GPT. These technologies are largely similar, at least in their underlying technical logic, while IntaLink's approach is entirely different.
Let’s set aside the underlying technical logic for now and conduct a comparative analysis based on implementation methods:
Suppose we have a set of tables named T1, T2, ..., Tn, each containing several data items labeled C1, C2, ..., with the number of items varying per table. Consider a simulated record for table T1 as follows:
C1 | C2 | C3 | C4 | C5 | C6 |
---|---|---|---|---|---|
Orange | 5 | 3 | 3 | 2 | 1 |
From this content alone, we cannot derive any useful information; the meaning of the data is unclear. Let's simulate two possible meanings for the same record:
Fruit Type | Warehouse No. | Shelf No. | Stock | Shelf Life | Warehouse Manager ID |
---|---|---|---|---|---|
Orange | 5 | 3 | 3 | 2 | 1 |
Hotel Name | Popularity Ranking | Star Rating | Years in Business | Remaining Rooms | Discount Available |
---|---|---|---|---|---|
Orange | 5 | 3 | 3 | 2 | 1 |
We won't dwell on the validity of these datasets or the existence of such tables. However, it is evident that without understanding the meaning of the tables and data items, the data cannot be applied. One cannot link data application needs to the data itself, let alone discuss more complex data operations.
Using a standard NL2SQL benchmark dataset, let's illustrate the application pattern of large model technology in this field.
The Spider dataset is a Text-to-SQL dataset for multi-database, multi-table, single-round queries and is widely regarded as the most challenging large-scale cross-domain evaluation benchmark. It was proposed by Yale University in 2018 and annotated by 11 Yale students. The dataset contains 10,181 natural language questions and 5,693 SQL statements, covering more than 200 databases across 138 different domains; 7,000 questions are used for training, 1,034 for development, and 2,147 for testing. In other words, by providing questions along with their corresponding answers (SQL), the large model learns how to use the data. For simplicity, the logic can be condensed as follows:
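A minimal sketch of that condensed logic: the training data is a collection of (question, SQL) pairs, and the model learns the mapping between them. The questions, table names, and columns below are invented for illustration and are not actual Spider entries.

```python
# Illustrative (question, SQL) training pairs in the Spider style.
# The schema (a hypothetical `student` table) is supplied alongside
# each pair so the model can ground column names.
training_pairs = [
    ("How many students are in each class?",
     "SELECT class, COUNT(*) FROM student GROUP BY class"),
    ("List the names of students older than 18.",
     "SELECT name FROM student WHERE age > 18"),
]

# Training teaches the model: natural-language question -> SQL.
for question, sql in training_pairs:
    print(question, "->", sql)
```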
After training the model with such a dataset, we can pose the following test question:
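As a hedged illustration of such a test question (the question and the generated SQL here are hypothetical, not taken from the Spider test set): the model receives an unseen question over a known schema and must emit the corresponding query.

```python
# A hypothetical unseen test question over the same invented schema,
# and the SQL a trained NL2SQL model might generate for it.
question = "Which class has the highest average age?"
generated_sql = (
    "SELECT class FROM student "
    "GROUP BY class ORDER BY AVG(age) DESC LIMIT 1"
)
print(question, "->", generated_sql)
```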
From this, we see that NL2SQL emphasizes deriving plausible SQL queries from semantic and contextual understanding, relying on the dataset it was trained on.
IntaLink's data integration does not require users to provide any training data. The relationships between data are generated through an inter-table relationship analysis model. This relationship generation does not require understanding the actual significance of the tables and data items but is derived through a set of methods that analyze the data's characteristic values to deduce associations between tables. Below, we illustrate the establishment of inter-table relationships using two sample tables.
Tab_1
Name | Student_ID | CLASS | Age | Sex |
---|---|---|---|---|
Zhang San | 2021_0001 | 2021_01 | 19 | Male |
Li Si | 2021_0002 | 2021_01 | 18 | Female |
Wang Wu | 2021_0003 | 2021_01 | 19 | Male |
Tab_2
Student_ID | Course | Grade | Rank |
---|---|---|---|
2021_0001 | Math | 135 | 18 |
2021_0001 | Chinese | 110 | 23 |
2021_0002 | Math | 120 | 25 |
2021_0002 | Chinese | 125 | 10 |
In Tab_1, the Student_ID column matches the Student_ID column in Tab_2, sharing the same characteristic values. Therefore, to link these two tables, the condition Tab_1.Student_ID = Tab_2.Student_ID holds. This analysis of inter-table linkage must account for many factors. In IntaLink, a replica of the data's characteristic values is maintained in an in-memory database that serves as the analysis tool, and a set of optimized analytical methods produces the inter-table relationship results. Given the complexity involved, we will not elaborate further here; a separate article will discuss the implementation logic.
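The core idea can be sketched in a few lines: compare the value sets of every column pair across two tables and propose join conditions where they overlap strongly. This is a simplified illustration only; IntaLink's actual relationship model and its in-memory characteristic-value analysis are considerably more elaborate, and the `join_candidates` function and its overlap threshold are inventions of this sketch.

```python
# Simplified value-overlap analysis over the Tab_1 / Tab_2 samples above.
tab_1 = {
    "Name": ["Zhang San", "Li Si", "Wang Wu"],
    "Student_ID": ["2021_0001", "2021_0002", "2021_0003"],
    "CLASS": ["2021_01", "2021_01", "2021_01"],
}
tab_2 = {
    "Student_ID": ["2021_0001", "2021_0001", "2021_0002", "2021_0002"],
    "Course": ["Math", "Chinese", "Math", "Chinese"],
}

def join_candidates(t1, t2, threshold=0.5):
    """Propose join conditions where column value sets overlap strongly."""
    candidates = []
    for c1, v1 in t1.items():
        for c2, v2 in t2.items():
            s1, s2 = set(v1), set(v2)
            # Share of the smaller value set that also appears in the other.
            overlap = len(s1 & s2) / min(len(s1), len(s2))
            if overlap >= threshold:
                candidates.append((c1, c2, overlap))
    return candidates

print(join_candidates(tab_1, tab_2))
# -> [('Student_ID', 'Student_ID', 1.0)]
```

Note that no column names or semantics were consulted: the Student_ID link emerges purely from the data's characteristic values.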
Large model technologies excel at semantic understanding and content generation, while IntaLink's strengths lie in data association analysis, with lower upfront workload and higher accuracy. Ideally, the two would be combined: a large model understands the user's request and converts it into the required data tables and items; IntaLink then generates the linked data set; finally, the large model produces the desired output (e.g., reports, charts) for presentation to the user.
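That division of labor could be orchestrated roughly as follows. Every function name here is a placeholder; none exists in IntaLink or in any LLM SDK, and the returned values are canned stand-ins for what real components would produce.

```python
# Hypothetical three-stage pipeline: LLM -> IntaLink -> LLM.

def understand_request(user_text):
    """LLM step (stubbed): map the request to candidate tables and items."""
    return {"tables": ["production", "finance"], "items": ["cost", "output"]}

def link_and_fetch(required):
    """IntaLink step (stubbed): derive inter-table links, build the dataset."""
    return [{"product": "widget", "cost": 12.5, "output": 100}]

def render_result(dataset):
    """LLM step (stubbed): turn the dataset into a report for the user."""
    return f"{len(dataset)} row(s) ready for presentation"

required = understand_request("Evaluate production cost and profitability")
dataset = link_and_fetch(required)
print(render_result(dataset))  # -> 1 row(s) ready for presentation
```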
We would love for you to be a part of the IntaLink journey! Connect with us and contribute to our project:
- GitHub Repository: IntaLink
- Join our Discord Community
Be a part of the open-source revolution and help us shape the future of intelligent data integration!