Publication | Open Access
<i>n</i>-Gram-Based Text Compression
Citations: 21
References: 19
Year: 2016
Engineering, Multilingual Pretraining, Corpus Linguistics, N-Gram Dictionaries, Text Mining, Natural Language Processing, Information Retrieval, Data Science, Text Compression, String Processing, Computational Linguistics, Vietnamese Text, Language Studies, Lossless Compression, Machine Translation, Data Compression, Neural Machine Translation, N-Gram, Text Processing, Linguistics
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. On the same dataset, it achieves a markedly better compression ratio than state-of-the-art methods. Given a text, the proposed method first splits it into n-grams and then encodes them using n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
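
The encoding phase described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a greedy longest-match strategy (try a five-gram first, then shrink the window down to a unigram fallback), whitespace tokenization, and codes represented as `(order, index)` pairs rather than the paper's two-to-four-byte layout, which the abstract does not specify. The names `build_dictionaries`, `encode`, and `decode` are hypothetical.

```python
from typing import Dict, List, Tuple

Gram = Tuple[str, ...]
Dictionaries = Dict[int, Dict[Gram, int]]

MAX_N = 5  # the abstract's window ranges from bigrams to five-grams


def build_dictionaries(corpus: List[str], max_n: int = MAX_N) -> Dictionaries:
    """Map each n-gram order (1..max_n) to a {n-gram: index} table."""
    dicts: Dictionaries = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Assign the next free index the first time a gram is seen.
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: Dictionaries, max_n: int = MAX_N) -> List[Tuple[int, int]]:
    """Greedy longest-match encoding (an assumption, not the paper's exact rule)."""
    words = text.split()
    codes: List[Tuple[int, int]] = []
    i = 0
    while i < len(words):
        # Try the widest window that still fits, then shrink it.
        for n in range(min(max_n, len(words) - i), 0, -1):
            gram = tuple(words[i:i + n])
            if gram in dicts[n]:
                codes.append((n, dicts[n][gram]))
                i += n
                break
        else:
            # No dictionary, not even the unigram one, contains this word.
            raise KeyError(f"word not in unigram dictionary: {words[i]!r}")
    return codes


def decode(codes: List[Tuple[int, int]], dicts: Dictionaries) -> str:
    """Invert encode() by looking each (order, index) pair up again."""
    inverse = {n: {idx: gram for gram, idx in table.items()}
               for n, table in dicts.items()}
    words: List[str] = []
    for n, idx in codes:
        words.extend(inverse[n][idx])
    return " ".join(words)
```

For example, with dictionaries built from the single sentence `"a b c d e f g"`, encoding that same sentence yields one five-gram code followed by one bigram code, and `decode` restores the text. The round trip is exact only for text whose words are separated by single spaces, since this sketch discards the original whitespace.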