Computation and Language, Computer Science

Fast WordPiece Tokenization: A Survey of Recent Approaches

Domain-Specific Language Model for In-Domain IR

In this article, the authors discuss the challenges of preparing an Islamic corpus in English and propose a domain-specific language model to address them. The model, called the Bilingual EN-AR teacher model, combines the embedding matrices of CL-AraBERT (for Arabic tokens) and the BPIT model (for English tokens). The authors evaluate this approach on a test dataset and report significant improvements over existing models.

Preparing an Islamic Corpus in English: Challenges and Solutions

The article highlights the difficulty of creating an Islamic corpus in English, which stems primarily from the limited availability of Islamic text. The authors note that most English Islamic texts are either translations from Arabic or other languages or were originally written in English. This scarcity makes it hard to assemble a dataset large enough to train machine learning models well, which results in poor performance.

To overcome this challenge, the authors propose a domain-specific language model that leverages both Arabic and English text. The Bilingual EN-AR teacher model takes its Arabic token embeddings from CL-AraBERT and its English token embeddings from the BPIT model. By drawing on both languages, the model can learn to recognize patterns in Islamic texts more accurately.
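
To make the idea of combining two embedding matrices concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `merge_embeddings`, the toy tokens, and the three-dimensional vectors are all hypothetical, and the sketch assumes both source models use the same embedding dimension. It also assumes that a token found in both tables keeps its Arabic row, a choice the summary does not specify.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def merge_embeddings(
    arabic: Dict[str, Vector],
    english: Dict[str, Vector],
) -> Tuple[Dict[str, int], List[Vector]]:
    """Build one vocabulary and one embedding matrix from two per-language tables.

    Rows are copied unchanged from their source table. If a token appears in
    both tables, the Arabic row wins (an arbitrary choice for this sketch).
    """
    dims = {len(v) for v in list(arabic.values()) + list(english.values())}
    if len(dims) != 1:
        raise ValueError("expected non-empty tables with one shared embedding dimension")

    vocab: Dict[str, int] = {}   # token -> row index in the merged matrix
    matrix: List[Vector] = []    # merged embedding matrix, one row per token
    for table in (arabic, english):
        for token, vector in table.items():
            if token not in vocab:
                vocab[token] = len(matrix)
                matrix.append(list(vector))
    return vocab, matrix


# Toy stand-ins for rows taken from an Arabic model and an English model.
arabic_rows = {"صلاة": [0.1, 0.2, 0.3], "زكاة": [0.4, 0.5, 0.6]}
english_rows = {"prayer": [0.7, 0.8, 0.9], "charity": [1.0, 1.1, 1.2]}

vocab, matrix = merge_embeddings(arabic_rows, english_rows)
```

In this sketch the Arabic rows come first, followed by the English rows, so a single lookup table serves tokens from either language. A real model would also need a tokenizer that maps text to this merged vocabulary.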

Performance Evaluation: Significant Improvements

The authors evaluate their approach on a test dataset and compare it with existing models such as CL-AraBERT. The Bilingual EN-AR teacher model achieves 85% accuracy in retrieving relevant Islamic texts, while CL-AraBERT achieves 73%.
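
The summary does not say how retrieval accuracy is defined. One common reading is top-1 accuracy: the share of queries for which the highest-ranked document is the relevant one. The sketch below computes that metric with cosine similarity over toy vectors. The function names, the two-dimensional vectors, and the query and document identifiers are hypothetical and do not reproduce the authors' evaluation or data.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_accuracy(
    query_vecs: Dict[str, Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    relevant: Dict[str, str],
) -> float:
    """Fraction of queries whose most similar document is the labeled one."""
    hits = 0
    for query_id, query_vec in query_vecs.items():
        best_doc = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
        if best_doc == relevant[query_id]:
            hits += 1
    return hits / len(query_vecs)


# Toy embeddings standing in for encoded queries and passages.
queries = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
docs = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
relevant = {"q1": "d1", "q2": "d2"}

accuracy = top1_accuracy(queries, docs, relevant)
```

Under this definition, a reported figure such as 85% would mean that the correct passage ranked first for 85 of every 100 test queries.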

Conclusion: A Novel Approach to In-Domain IR

In conclusion, the article presents a novel approach to preparing an Islamic corpus in English by leveraging both Arabic and English text. The Bilingual EN-AR teacher model demonstrates improved performance compared to existing models, highlighting its potential for in-domain information retrieval. By providing a domain-specific language model, the authors aim to bridge the gap between machine learning research and Islamic studies, enabling the development of more accurate and relevant models for Islamic text retrieval.