Show HN: Mandarin Word Segmenter with Translation (mandobot.netlify.app)
imron 22 hours ago [-]
Nice work OP.

I’ve done a fair amount of Chinese language segmentation programming - and yeah it’s not easy, especially as you reach for higher levels of accuracy.

You need to put in a significant amount of effort just to gain a few percentage points of accuracy.

For my own tools, which focus on speed (and are used for finding frequently used words in large bodies of text), I ended up opting for a first-longest-match algorithm.

It has a relatively high error rate, but it’s acceptable if you’re only looking for the first few hundred frequently used words.
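A first-longest-match (greedy) segmenter like the one described can be sketched in a few lines: at each position, take the longest dictionary word starting there, falling back to a single character. The vocabulary below is a made-up example; the sample also shows the kind of error the commenter mentions, since greedy matching grabs 中国人 where 中国/人民 would be the better split.

```python
def longest_match_segment(text, vocab, max_len=4):
    """Greedy (first-longest-match) segmentation over a word set."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking toward one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if n == 1 or word in vocab:
                tokens.append(word)
                i += n
                break
    return tokens

vocab = {"中国", "人民", "银行", "中国人"}
# Greedy matching picks 中国人 first, mis-splitting 人民:
print(longest_match_segment("中国人民银行", vocab))  # ['中国人', '民', '银行']
```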

What segmenter are you using, or have you developed your own?

routerl 22 hours ago [-]
Thanks for the kind words!

I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.

[0] https://github.com/fxsjy/jieba
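One post-segmentation heuristic of the kind described, merging adjacent tokens back into a chengyu when their concatenation is a dictionary entry, might look like the following. The token list and the tiny chengyu set are hypothetical; in the app this would run over Jieba's output.

```python
def merge_chengyu(tokens, chengyu):
    """Merge adjacent tokens whose concatenation is a known chengyu."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in chengyu:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

chengyu = {"画蛇添足"}
# Jieba might split the idiom in two; the pass rejoins it:
print(merge_chengyu(["他", "画蛇", "添足", "了"], chengyu))  # ['他', '画蛇添足', '了']
```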

sarabande 20 hours ago [-]
纔 in this case should use the definition of 才 (cai2) not (shan1) which is extremely uncommon. Otherwise, cool app!
routerl 20 hours ago [-]
Could you post the text you used? This kind of thing goes straight into my unit tests.

I'm also working on showing all the pronunciations/definitions for a given hanzi, it should be ready later this week.
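Showing all pronunciations for a polyphonic hanzi could be as simple as mapping each character to a list of (pinyin, gloss) pairs; the data structure and entries below are a hypothetical sketch, using the 纔/才 case from the parent comment and 行 as examples.

```python
# Hypothetical per-hanzi reading table; real data would come from a dictionary.
READINGS = {
    "纔": [("cái", "just now; only then (variant of 才)"),
           ("shān", "rare, uncommon reading")],
    "行": [("xíng", "to walk; OK"),
           ("háng", "row; profession; bank")],
}

def lookup(hanzi):
    """Return every (pinyin, gloss) reading recorded for a character."""
    return READINGS.get(hanzi, [])

print([pinyin for pinyin, _ in lookup("行")])  # ['xíng', 'háng']
```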

maxglute 22 hours ago [-]
Very well executed.
bnly 20 hours ago [-]
Nicely done, this looks quite useful!
hassleblad23 22 hours ago [-]
Great work OP.