I’ve done a fair amount of Chinese language segmentation programming - and yeah it’s not easy, especially as you reach for higher levels of accuracy.
You need to put in a significant amount of effort just for gains of less than a few percentage points in accuracy.
For my own tools, which focus on speed (and are used for finding frequently used words in large bodies of text), I ended up opting for a first-longest-match algorithm.
It has a relatively high error rate, but that's acceptable if you're only looking for the few hundred most frequently used words.
What segmenter are you using, or have you developed your own?
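For anyone curious what a first-longest-match segmenter looks like, here is a minimal sketch, not the actual code from my tools. The names `longest_match_segment`, `top_words`, and `max_word_len`, and the toy dictionary, are all made up for illustration. At each position it takes the longest dictionary word that fits and falls back to a single character.

```python
from collections import Counter


def longest_match_segment(text, dictionary, max_word_len=4):
    """Greedy forward longest-match segmentation (illustrative sketch)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        # A single character is always accepted, so the loop always advances.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def top_words(text, dictionary, n=300):
    """Count segmented tokens and return the n most frequent ones."""
    return Counter(longest_match_segment(text, dictionary)).most_common(n)


# Toy dictionary, purely hypothetical.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(longest_match_segment("研究生命起源", toy_dict))
# → ['研究生', '命', '起源']
```

The output shows the typical greedy mistake: 研究生 gets grabbed first, so 生命 is never found. Errors like that are why the error rate is high, but they tend to wash out when you only care about the most frequent words.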
routerl 22 hours ago [-]
Thanks for the kind words!
I'm using Jieba[0] because it strikes a nice balance between speed and accuracy. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation processing. For example, Jieba tends to split chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
[0] https://github.com/fxsjy/jieba
I'm also working on showing all the pronunciations/definitions for a given hanzi; it should be ready later this week.
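To give a rough idea of the chengyu post-segmentation step described above, here is a simplified sketch rather than the real pipeline. The function name `merge_chengyu`, the sample tokens, and the one-entry chengyu set are hypothetical. It assumes the segmenter has already produced a token list (for example from `jieba.lcut(text)` after loading a custom dictionary with `jieba.load_userdict(path)`) and only handles the case where a chengyu was split into exactly two adjacent tokens.

```python
def merge_chengyu(tokens, chengyu):
    """Rejoin adjacent token pairs whose concatenation is a known chengyu."""
    merged = []
    i = 0
    while i < len(tokens):
        # Look one token ahead: if the pair forms a chengyu, emit it as one word.
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in chengyu:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical segmenter output where a chengyu was split in two.
tokens = ["他", "一心", "一意", "地", "工作"]
print(merge_chengyu(tokens, {"一心一意"}))
# → ['他', '一心一意', '地', '工作']
```

Keeping this as a separate pass over the token list means the segmenter itself stays untouched, and further heuristics can be stacked as additional passes.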