Application of Electronic Technique (AET)
The design of Text-to-SQL datasets for domestic databases
Journal: Cyber Security and Data Governance
Li Guoshen1, Liu Yingjun2, Yu Lina2, Ji Tao2, Zhang Hang1, Wu Jibing1
1. National Key Laboratory of Big Data and Decision; 2. National Key Laboratory of Intelligent Geospatial Information
Abstract: With the development of intelligent technology, the number and scale of databases have surged, and traditional data access techniques are slow and inefficient when faced with massive data processing demands. Text-to-SQL has therefore become an important bridge between user needs and database access. However, existing methods are usually trained on open-source, non-domestic datasets, and in practice they suffer from inconsistent database operation languages, a lack of domain knowledge, and poor reliability. In line with the trend toward domestically produced software and hardware in the database field, this paper designs a Text-to-SQL dataset for domestic databases, adopts a two-stage training technique for large language models based on synthetic data, and proposes a large-language-model-based Text-to-SQL method for domestic databases. The effectiveness of the method is verified through experiments.
CLC number: TP311.138; Document code: A; DOI: 10.19358/j.issn.2097-1788.2025.11.009
Citation format: Li Guoshen, Liu Yingjun, Yu Lina, et al. The design of Text-to-SQL datasets for domestic databases[J]. Cyber Security and Data Governance, 2025, 44(11): 52-59.
Key words: fine-tuning of large language models; synthetic dataset; preference learning; domestic database

Introduction

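Text-to-SQL (T2S), the conversion of text into Structured Query Language, is an important research area at the intersection of natural-language questions and database tools. It refers to the process of converting natural language into SQL query statements that a computer can execute, and it addresses a series of problems in mapping unstructured natural language and a database schema to structured SQL. The core of T2S lies in automatically identifying technical terms, domains, relationships, and structural features in text, and then building the corresponding mappings. Traditionally, building these mappings has relied heavily on manual, rule-governed work by domain experts. When the knowledge base is continually updated, or when domain experts are scarce, this approach proves time-consuming, costly, and error-prone. With the rapid development of natural language processing, combining large language models with T2S has become a new trend.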

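Traditional T2S methods rely on rule-based syntactic parsing and template matching, which require extensive manual annotation or hand-built rules [1]. Large language models, by contrast, have strong language understanding and generation capabilities [2]: they can understand text, extract key information, and identify semantic relationships. Pre-training a large language model on large-scale text allows it to learn entities, relationships, and database schemas automatically, and thus to build and update the mapping from text to SQL, reducing the workload of domain experts in data annotation and rule construction. However, progress in Text-to-SQL research is still limited by the quality and scale of datasets [3]. Mainstream datasets such as Spider, WikiSQL, and BIRD have made progress in multi-domain coverage and complex query annotation, but they still suffer from unbalanced domain distribution, insufficient simulation of real business scenarios, and high annotation costs [4], and so cannot meet the diverse SQL query needs of practical applications. Meanwhile, synthetic data techniques, being efficient and low-cost, show great potential [5]; they perform particularly well in data augmentation and in improving model generalization when training data is scarce.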
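To make the template-matching approach concrete, the following is a minimal sketch of a rule-based Text-to-SQL translator. It is an illustration under stated assumptions, not a method from the paper: the two templates, the `employees` table, and the helper `text_to_sql` are all hypothetical, and SQLite is used only because it ships with Python.

```python
import re
import sqlite3

# Hypothetical templates: each pairs a question pattern with a SQL skeleton.
# Table and column names are illustrative only, not from the paper's dataset.
TEMPLATES = [
    (re.compile(r"how many (\w+) are there", re.I), "SELECT COUNT(*) FROM {0}"),
    (re.compile(r"list the (\w+) of all (\w+)", re.I), "SELECT {0} FROM {1}"),
]


def text_to_sql(question, allowed_identifiers):
    """Return SQL for the first matching template, or None if no rule applies."""
    for pattern, skeleton in TEMPLATES:
        match = pattern.search(question)
        if match is None:
            continue
        slots = [s.lower() for s in match.groups()]
        # Fill slots only with known schema identifiers, so arbitrary
        # question text never reaches the SQL string.
        if all(s in allowed_identifiers for s in slots):
            return skeleton.format(*slots)
    return None


# A tiny in-memory database to run the generated queries against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [("Ann", "ops"), ("Bo", "it")])
schema = {"employees", "name", "dept"}

sql = text_to_sql("How many employees are there?", schema)
count = conn.execute(sql).fetchone()[0]
names_sql = text_to_sql("List the name of all employees", schema)
# A question outside the hand-written templates yields no SQL at all.
unmatched = text_to_sql("Which department has the most employees?", schema)
```

The last call returns `None` because no template covers a superlative question. Every new phrasing needs another hand-written rule, which is the maintenance cost the paragraph above attributes to rule-based methods.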

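Accordingly, this paper uses the domestic Dameng database (DM) for dataset design. As one of China's domestic database systems, Dameng is gradually replacing foreign databases such as Oracle in key areas such as military and government affairs. Targeting the "duty" business scenario, this paper designs a domestic database system and builds a dedicated dataset containing 300 high-quality annotated samples, mainly covering typical military business query scenarios. The schema and permission design of the Dameng database follows the Dameng Database Technical Documentation [6]. In addition, a two-stage training technique for large language models based on synthetic data is adopted; comparative experiments evaluate how consistent the distribution of the synthetic data is with that of the real data and how much the synthetic data improves model performance. The work explores ways to adapt large language models to domestic database environments and provides technical support for data support services. The experimental results show that the dataset not only effectively supplements existing data resources but also, through validation with synthetic data, offers a new technical path for building and evaluating Text-to-SQL datasets.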
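As a rough illustration of what an annotated Text-to-SQL sample can look like, the sketch below pairs a question with its SQL and checks that the SQL at least executes against the schema. Everything in it is an assumption made for illustration: the `Sample` record, the `duty_roster` schema, and both example questions are hypothetical and are not taken from the paper's dataset, and SQLite stands in for DM, whose SQL dialect differs.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Sample:
    question: str  # natural-language question
    sql: str       # annotated SQL for that question


# Hypothetical schema and samples, not taken from the paper's dataset.
SCHEMA = "CREATE TABLE duty_roster (person TEXT, post TEXT, duty_date TEXT)"
SAMPLES = [
    Sample(
        "Who is on duty on 2025-01-01?",
        "SELECT person FROM duty_roster WHERE duty_date = '2025-01-01'",
    ),
    Sample(
        "How many posts are staffed?",
        # Deliberate typo in the table name, to show a failing annotation.
        "SELECT COUNT(DISTINCT post) FROM duty_rosters",
    ),
]


def executes(sample, schema=SCHEMA):
    """Return True if the annotated SQL runs against an empty copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema)
        conn.execute(sample.sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# One boolean per sample: does its SQL run against the schema?
results = [executes(s) for s in SAMPLES]
```

An executability check of this kind catches only syntax and schema errors. It does not show that a query answers its question, so it complements manual annotation review and does not replace it.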


For the full text of this article, please download:

http://ihrv.cn/resource/share/2000006862


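Author information:

Li Guoshen1, Liu Yingjun2, Yu Lina2, Ji Tao2, Zhang Hang1, Wu Jibing1

(1. National Key Laboratory of Big Data and Decision, Changsha, Hunan 410073;

2. National Key Laboratory of Intelligent Geospatial Information, Beijing 100029)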



This content is original to the AET website. Reproduction without authorization is prohibited.