1. Technical background

(1) History of machine translation research

Machine translation research began in the 1950s. Early work relied mainly on rule-based methods and progressed relatively slowly. The US Automatic Language Processing Advisory Committee (ALPAC) later issued a report questioning the feasibility of machine translation, which held back research in the field. In the 1990s IBM proposed its well-known word-based translation models and opened the era of statistical machine translation. Phrase-based and syntax-based models followed, and translation quality improved markedly. In the last two years neural machine translation has begun to emerge; it overcomes many limitations of statistical methods and has become the current research focus.

(2) Statistical machine translation

The basic idea of statistical machine translation is to use machine learning to acquire translation rules and their probability parameters automatically from large-scale bilingual parallel corpora, and then to decode source-language sentences with those rules. For a given source sentence, statistical machine translation assumes that any target-language sentence could be its translation, only with a different probability. The task is then to find, among all target-language sentences, the translation with the highest probability.
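The search for the most probable translation described above is commonly written as follows. This is a sketch in standard notation rather than a formula from the original text: $f$ stands for the source sentence and $e$ for a candidate target sentence, and the second equality is the noisy-channel decomposition (translation model times language model) that classical word-based systems are usually presented with.

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)
```

The denominator $P(f)$ from Bayes' rule is dropped because it does not depend on $e$.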
(3) Neural machine translation

Neural machine translation (NMT) is a new approach to machine translation that has emerged in recent years. Its basic idea is to use a neural network to map source-language text directly to target-language text. This encoder-decoder architecture can be trained end to end, optimizing all model parameters at once. Traditional machine translation, by contrast, centers on transformation rules over discrete symbols and requires a pipeline of word alignment, rule extraction, probability estimation and parameter tuning, which is prone to error propagation. NMT models the translation process with continuous vector representations, and can therefore fundamentally overcome problems of traditional machine translation such as poor generalization and overly strong independence assumptions.

2. Post-editing / interactive machine translation

(1) Post-editing

Post-editing, simply put, means completing a translation by manually correcting the machine translation output directly. It is the simplest form of human-computer interaction. Computer-aided translation tools such as SDL Trados typically support APIs such as Google Translate for fetching machine translation output directly, so post-editing is currently the most popular form of assistance. When output quality is high, little manual correction is needed and post-editing can effectively raise translator productivity. In industry practice, however, post-editing faces many practical challenges and is sometimes only marginally better than nothing. The main reason is that the output quality of current machine translation systems is still far below what users expect in human translation settings. When output quality is poor, translators are forced to analyze and repair an error-ridden full-sentence translation just to save a few keystrokes, at a cost far higher than translating from scratch. Stiff phrasing and plausible-but-wrong terminology dampen translators' enthusiasm for machine translation, while the tedium of repeatedly correcting the same errors and the frustration of revising again and again without reaching a satisfactory result leave users discouraged.

Over the past two years NMT has developed rapidly and translation quality has improved markedly, but it has also brought new challenges, such as output that is fluent but unfaithful and results that are hard to intervene in. NMT will therefore still need considerable time before it can significantly improve the post-editing experience in practice.
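The trade-off described above, where post-editing pays off only when few corrections are needed, can be made concrete with a word-level edit distance between the machine output and the final translation. The sketch below is only an illustration: the function names are hypothetical, whitespace tokenization is an assumption (Chinese text would need a segmenter or character-level tokens), and the ratio is a rough proxy for effort rather than a metric defined in the original text.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a token of the MT output
                            curr[j - 1] + 1,      # insert a missing token
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def post_edit_effort(mt_output, final_translation):
    """Edits per word of the final translation (0.0 means nothing was changed)."""
    hyp, ref = mt_output.split(), final_translation.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)


if __name__ == "__main__":
    print(post_edit_effort("the cat sat on mat", "the cat sat on the mat"))
```

A ratio close to 1.0 means that almost every word had to be changed, which is the situation in which translating from scratch is likely to be cheaper.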
(2) Interactive machine translation

In interactive machine translation, the system dynamically generates candidate continuations, based on the part the user has already translated, for the user's reference. The translator translates from scratch, so there is no automatic translation to correct; the translator simply selects the acceptable parts during the translation process. The technique aims to combine, through interaction between the translator and the machine translation engine, the accuracy of human translators with the efficiency of machine translation engines.

Compared with post-editing, interactive machine translation places higher demands on the implementation: left-to-right forced decoding and smooth real-time response. It also places an extra burden on the user, because the translator has to read and understand the latest suggested translation again and again. For these reasons, popular online translation systems and computer-aided translation tools do not currently support an interactive mode, and existing interactive machine translation systems remain at the prototype stage. Encouragingly, recent advances in machine translation, especially in NMT-based interactive machine translation, suggest that interactive machine translation may become one of the options for human translation in the future.

3. Chinese input method integrating machine translation

From our analysis of the actual human translation process, we found that automatic translations almost always contain perfect fragments that can be used directly. Under current technical conditions, we therefore believe the most important thing is to exploit the correct parts of the machine translation output in the simplest possible way, while keeping translators from being distracted by the incorrect parts.

To this end, we propose a Chinese input method that integrates statistical machine translation. Aimed at human translation scenarios, the input method takes the user's keystrokes and draws on information from statistical translation, such as translation rules, the translation hypothesis list and the n-best list, so that accurate translations can be produced with fewer keystrokes. With this input method, translators can get help from machine translation without ever reading the automatic translation. Compared with post-editing, it can therefore significantly improve the translator's interaction experience even when the quality of the automatic translation is low.
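The idea of folding n-best information into input-method candidate ranking can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the lexicon, its base scores, the `boost` weight and the function names are all hypothetical, and each n-best entry is assumed to be a list of segmented words with a score.

```python
from collections import defaultdict

# Hypothetical toy lexicon: pinyin -> candidate words with a base input-method score.
LEXICON = {
    "jincheng": {"進城": 0.5, "進程": 0.3, "金城": 0.2},
    "tongxun": {"通訊": 1.0},
}


def mt_word_weights(nbest):
    """Sum the hypothesis scores of every word that occurs in the n-best list."""
    weights = defaultdict(float)
    for words, score in nbest:
        for word in set(words):
            weights[word] += score
    return weights


def rank_candidates(keys, nbest, boost=1.0):
    """Rank lexicon words whose pinyin starts with the keys typed so far."""
    weights = mt_word_weights(nbest)
    scored = {}
    for pinyin, words in LEXICON.items():
        if not pinyin.startswith(keys):
            continue
        for word, base in words.items():
            # Words that the MT system also proposed are promoted.
            scored[word] = base + boost * weights.get(word, 0.0)
    return sorted(scored, key=lambda w: (-scored[w], w))


if __name__ == "__main__":
    nbest = [(["進程", "間", "通訊"], 0.6), (["進城", "通訊"], 0.1)]
    print(rank_candidates("jinc", nbest))
```

With the n-best list supplied, the word favored by the translation hypotheses moves to the top after only a partial pinyin prefix, and the translator never has to read the full automatic translation.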
In addition, to guide the statistical machine translation system toward output better suited to the input method, we propose an automatic machine translation evaluation metric oriented to the input method, so that the input method works from more suitable statistical translation results and further improves human translation efficiency.

4. Terminology translation methods

(1) Mining term translations from bilingual parenthetical sentences

From the standpoint of improving the quality of the final machine translation output, we believe that quality matters more than scale for term translation knowledge. We therefore turn to the bilingual parenthetical sentences that occur in large numbers on monolingual web pages. A bilingual parenthetical sentence must satisfy three conditions at once: it contains one or more pairs of parentheses; a term appears immediately to the left of the parentheses; and the translation of that term is inside the parentheses. Such sentences carry rich term translation knowledge, such as the context of the target-language term. Compared with parallel or comparable corpora, they are subject to fewer restrictions, are updated more promptly, and make term translation knowledge relatively easy to extract. We therefore consider them an ideal corpus for mining term translation knowledge.
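The three conditions above can be approximated with a simple pattern match. The following is a minimal sketch, assuming that the term to the left of the parentheses is Chinese and the parenthesized translation starts with a Latin letter; the regular expression and function names are illustrative assumptions, not the filter used by the authors.

```python
import re

# A run of CJK characters immediately followed by a parenthesis (ASCII or
# full-width) whose content starts with a Latin letter.
PAREN = re.compile(r"([\u4e00-\u9fff]+)[((]([A-Za-z][^()()]*)[))]")


def find_candidates(sentence):
    """Return (left_context, parenthesized_text) pairs found in one sentence."""
    return [(m.group(1), m.group(2).strip()) for m in PAREN.finditer(sentence)]


def is_bilingual_parenthetical(sentence):
    """True if the sentence contains at least one candidate pair."""
    return bool(find_candidates(sentence))


if __name__ == "__main__":
    s = "所以隻能使用進程間通訊(interprocess communication, IPC),而不能直接共享信息。"
    print(find_candidates(s))
```

The left context returned by such a filter is usually longer than the term itself, so a further step still has to decide where the term actually begins.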
As the following example shows, the main task in mining term translation knowledge is to determine the left boundary of the target term, because its right boundary is already given by the parenthesis and the boundaries of the source-language term are fixed.

各個進程有自己的内存空間、數據棧等,所以隻能使用進程間通訊(interprocess communication,IPC),而不能直接共享信息。
Each process has its own memory space, data stack and so on, so processes can only use interprocess communication (IPC) and cannot share information directly.

The method takes seed URLs and a seed term dictionary as input, and outputs a table of term translation rules with probabilities, similar to the phrase translation table in statistical translation. Intermediate results in the workflow include the web pages and URLs fetched by the topical crawler, the bilingual parenthetical sentences selected by the bilingual parenthetical sentence filter, the list of term translation candidates produced by the term left-boundary classifier, and the incrementally updated seed term dictionary.

(2) Joint word alignment with bilingual term recognition

Word alignment is a core task in statistical machine translation. It discovers mutually translated fragments in bilingual parallel corpora and is the main source of translation knowledge. In practice, some word alignment errors are caused by terms, and they degrade the quality of the final translation. If the term correspondences in parallel sentence pairs could be recognized automatically, word alignment quality would improve, which in turn promises better translation of both terms and sentences.
In term recognition, rule-based methods have largely been abandoned. Statistical methods are not restricted to a domain, but they perform poorly on multi-word terms and low-frequency terms, so the extracted terms contain considerable noise. If term recognition results are used directly as constraints on word alignment, recognition errors propagate to later stages, and the quality of the final translation may fail to improve. How to improve term recognition and word alignment performance, and with them the quality of the final machine translation output, is therefore a pressing problem.

To minimize the effect of error propagation in the training pipeline and improve the extraction of term translation knowledge, we propose a joint word alignment method that incorporates bilingual term recognition. First, to reduce dependence on training data, the method starts from a weak monolingual term recognition classifier trained on naturally annotated data such as Wikipedia. Second, to reduce the negative effect of errors propagating between term recognition and word alignment, the method exploits the mutual constraints between bilingual terms and word alignment: it performs monolingual term recognition, bilingual term alignment and word alignment jointly, and ultimately obtains better bilingual term recognition and word alignment results.
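One way to picture the mutual constraints between term pairs and word alignment is the consistency criterion familiar from phrase extraction: a candidate term pair is plausible only if no alignment link crosses its boundary, and a trusted term pair can in turn rule out links that cross it. The sketch below illustrates only this idea. It is not the authors' joint model, and the span convention (token indices, end exclusive) and function names are assumptions.

```python
def in_span(index, span):
    """True if a token index lies inside a (start, end) span, end exclusive."""
    return span[0] <= index < span[1]


def is_consistent(src_span, tgt_span, links):
    """Check a candidate term pair against word alignment links.

    links is a set of (source_index, target_index) pairs. The pair is
    consistent if no link crosses its boundary and at least one link
    lies inside it.
    """
    has_inner_link = False
    for s, t in links:
        s_in, t_in = in_span(s, src_span), in_span(t, tgt_span)
        if s_in != t_in:
            return False  # a link leaves the candidate term pair
        has_inner_link = has_inner_link or s_in
    return has_inner_link


def prune_links(term_pairs, links):
    """Drop alignment links that contradict trusted term pairs."""
    kept = set(links)
    for src_span, tgt_span in term_pairs:
        kept = {(s, t) for s, t in kept
                if in_span(s, src_span) == in_span(t, tgt_span)}
    return kept


if __name__ == "__main__":
    links = {(0, 0), (1, 2), (2, 1), (3, 3)}
    print(is_consistent((1, 3), (1, 3), links))
    print(sorted(prune_links([((1, 3), (1, 3))], links | {(1, 0)})))
```

In a joint procedure the two checks would be applied together with confidence scores rather than as hard filters, so that an error on one side is not passed unchanged to the other.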
(3) Statistical translation term decoding with term boundary information

Named entities such as person names, place names and organization names have distinct boundary features and are relatively easy to recognize and align. In general, applying direct translation of named entities in a statistical translation decoder yields fairly good results. However, "directly translating" terms in the same way as named entities does not noticeably improve the quality of the automatic translation. The main reason is that current term recognition models are not yet good enough: their accuracy is far below that of named entity recognition. Moreover, because terms are highly domain-specific, training a high-performance term recognition classifier for a target domain requires large amounts of high-quality, manually annotated in-domain training data, which makes term recognition harder still. Under these circumstances, if term recognition results are used directly as constraints on word alignment, recognition errors propagate to later stages, and the quality of the final translation may fail to improve. How to improve term recognition and word alignment performance, and with them the quality of the final machine translation output, is therefore a pressing problem.