Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computation and Language, Computer Science

Improving Machine Translation Performance Without Punctuation Processing


In this article, the authors describe their research on developing a machine translation model for ancient Korean texts. They used BERT as a pre-trained model and added a linear layer on top to build their own model. BERT is effective at capturing context thanks to its Transformer architecture, and it is easy to fine-tune. Specifically, the authors used the "bert-base-chinese" model, which consists of 12 Transformer layers with a hidden size of 768, and adapted it for token classification to predict punctuation and spacing.
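To make this setup concrete, here is a minimal sketch of what "BERT plus a linear layer for token classification" could look like, assuming the HuggingFace transformers library. The label set and the example sentence are hypothetical illustrations, not the authors' actual configuration, and the model would still need fine-tuning before its predictions are meaningful.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# "bert-base-chinese": 12 Transformer layers, hidden size 768, as described above.
MODEL_NAME = "bert-base-chinese"

# Hypothetical label set: for each character, predict which punctuation mark
# (if any) should follow it; "O" means no punctuation.
labels = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION", "COLON", "SEMICOLON"]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
# BertForTokenClassification places a single linear layer on top of the
# per-token hidden states, matching the "BERT + linear layer" setup.
model = BertForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# Illustrative unpunctuated classical input (Analects opening, for demonstration only).
text = "子曰學而時習之不亦說乎"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()  # one predicted label id per token
```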
The authors encountered two main issues during preprocessing: the punctuation marks were used inconsistently across the two datasets, and it was unclear how to define the labels for a token classification model. They sought expert advice on preprocessing and ultimately chose six significant punctuation marks that were used similarly in both datasets.
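One plausible labeling scheme for this kind of task, sketched below under the assumption that each punctuation mark is attached as a label to the character it follows, would strip the chosen marks from the raw text and keep a per-character label sequence. The specific six marks shown here are placeholders; the article does not list the exact set the authors selected.

```python
# Hypothetical choice of six marks shared by both datasets (illustrative only).
PUNCT2LABEL = {"。": "PERIOD", "，": "COMMA", "、": "ENUM_COMMA",
               "？": "QUESTION", "！": "EXCLAMATION", "：": "COLON"}

def make_labels(punctuated_text: str):
    """Return (unpunctuated characters, one label per character)."""
    chars, labels = [], []
    for ch in punctuated_text:
        if ch in PUNCT2LABEL:
            if labels:                      # attach the mark to the preceding character
                labels[-1] = PUNCT2LABEL[ch]
        else:
            chars.append(ch)
            labels.append("O")              # "O" = no punctuation after this character
    return chars, labels

chars, labels = make_labels("子曰。學而時習之，不亦說乎？")
print(list(zip(chars, labels)))
# [('子', 'O'), ('曰', 'PERIOD'), ('學', 'O'), ..., ('乎', 'QUESTION')]
```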
Overall, the article aims to improve machine translation for ancient Korean texts by developing a robust model that can handle various punctuation marks without compromising performance. By using BERT as a pre-trained model and adapting it for token classification, the authors demonstrate a promising approach toward that goal.