Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computation and Language, Computer Science

Improving Machine Translation Performance Without Punctuation Processing


In this article, the authors describe their research on developing a machine translation model for ancient Korean texts. They used BERT as a pre-trained model and added a linear layer on top to build their own model. BERT is effective at capturing context thanks to its Transformer architecture, and it is easy to fine-tune. Specifically, the authors used the "bert-base-chinese" model, which consists of 12 Transformer layers with a hidden size of 768, and adapted it for token classification to predict punctuation and spacing.
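To make this setup concrete, here is a minimal sketch of what "BERT plus a linear layer for token classification" could look like, assuming the HuggingFace transformers library. The label set and the example sentence are hypothetical illustrations, not the authors' actual configuration, and the model would still need fine-tuning before its predictions are meaningful.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# "bert-base-chinese": 12 Transformer layers, hidden size 768, as described above.
MODEL_NAME = "bert-base-chinese"

# Hypothetical label set: for each character, predict which punctuation mark
# (if any) should follow it; "O" means no punctuation.
labels = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION", "COLON", "SEMICOLON"]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
# BertForTokenClassification places a single linear layer on top of the
# per-token hidden states, matching the "BERT + linear layer" setup.
model = BertForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# Illustrative unpunctuated classical input (Analects opening, for demonstration only).
text = "子曰學而時習之不亦說乎"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()  # one predicted label id per token
```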
The authors encountered two main issues during preprocessing: the punctuation marks were used inconsistently across the two datasets, and it was unclear how to define the labels for a token classification model. They sought expert advice on preprocessing and ultimately chose six significant punctuation marks that were used similarly in both datasets.
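One plausible labeling scheme for this kind of task, sketched below under the assumption that each punctuation mark is attached as a label to the character it follows, would strip the chosen marks from the raw text and keep a per-character label sequence. The specific six marks shown here are placeholders; the article does not list the exact set the authors selected.

```python
# Hypothetical choice of six marks shared by both datasets (illustrative only).
PUNCT2LABEL = {"。": "PERIOD", "，": "COMMA", "、": "ENUM_COMMA",
               "？": "QUESTION", "！": "EXCLAMATION", "：": "COLON"}

def make_labels(punctuated_text: str):
    """Return (unpunctuated characters, one label per character)."""
    chars, labels = [], []
    for ch in punctuated_text:
        if ch in PUNCT2LABEL:
            if labels:                      # attach the mark to the preceding character
                labels[-1] = PUNCT2LABEL[ch]
        else:
            chars.append(ch)
            labels.append("O")              # "O" = no punctuation after this character
    return chars, labels

chars, labels = make_labels("子曰。學而時習之，不亦說乎？")
print(list(zip(chars, labels)))
# [('子', 'O'), ('曰', 'PERIOD'), ('學', 'O'), ..., ('乎', 'QUESTION')]
```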
Overall, the article aims to improve machine translation for ancient Korean texts by developing a robust model that can handle various punctuation marks without compromising performance. By using BERT as a pre-trained model and adapting it for token classification, the authors demonstrate a promising approach toward that goal.