Show HN: Mandarin Word Segmenter with Translation (mandobot.netlify.app)

48 points by routerl 23 days ago | 35 comments

gwd 20 days ago [-]

> Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

That's not been my experience at all: As long as the content I'm reading is at the right level, I've been able to learn to segment as my vocabulary has grown, and there's always only a few new words that I haven't learned how to recognize in-context yet. Having a good built-in dictionary wherever you're reading (e.g., a Chrome plugin, or Pleco, or whatever) has been helpful here.

My fear would be that the longer you put off learning to segment in your head, the harder it will be.

My advice for this would be that you present the text as you'd normally see it (e.g., no segmentation), but add aids to help learners see or understand the segmentation. At very least you could have the dictionary pop-up be on the level of the full segmentation, rather than individual characters; and you could consider having it so that as you mouse over a character, it draws a little line under / border around the characters in the same segment. That could allow you to give your brain the little hint it needs to "see" the words segmented "in situ".

AlchemistCamp 19 days ago [-]

I had yet another experience. Before Chrome existed, I learned thousands of words by hearing them before I could read them.

A lot of the pain I see in foreigners learning Chinese is that they try to tackle the written language too early. Actually, I think that's sub-optimal in any language, but it's even more expensive in a language like Chinese or Thai where word segmentation isn't a part of the writing system. I totally get it since characters are cool and curiosity about them draws many learners to the language, but it's a lot easier to take on one challenge at a time!

imron 23 days ago [-]

Nice work OP.

I’ve done a fair amount of Chinese language segmentation programming - and yeah it’s not easy, especially as you reach for higher levels of accuracy.

You need to put in significant amounts of effort just for less than a few % point increases in accuracy.

For my own tools, which focus on speed (and are used for finding frequently used words in large bodies of text), I ended up opting for a first longest match algorithm.

It has a relatively high error rate, but it’s acceptable if you’re only looking for the first few hundred frequently used words.
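A minimal sketch of what a first-longest-match segmenter can look like, for readers who haven't seen one. This is not imron's code: the function name `longest_match_segment`, the `max_len` cutoff, and the toy dictionaries are assumptions made for illustration. At each position it takes the longest dictionary entry that starts there and falls back to a single character when nothing matches.

```python
from collections import Counter

def longest_match_segment(text, dictionary, max_len=4):
    """Greedy forward segmentation: at each position, take the longest
    dictionary entry starting there; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy dictionary, used here only to count word frequencies.
words = {"喜欢", "学习", "中文"}
tokens = longest_match_segment("我喜欢学习中文我学习", words)
print(Counter(tokens).most_common(2))

# A classic failure case: the greedy pass grabs 研究生 ("graduate student")
# and leaves 命 stranded, instead of producing 研究 / 生命 / 起源.
ambiguous = {"研究", "研究生", "生命", "起源"}
print(longest_match_segment("研究生命起源", ambiguous))
```

The second call shows where the error rate comes from: the algorithm never backtracks, so an early over-long match can break the words that follow it. For frequency counting over a large corpus, such errors tend to matter less than they would in a reading aid.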

What segmenter are you using, or have you developed your own?

routerl 23 days ago [-]

Thanks for the kind words!

I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.

[0] https://github.com/fxsjy/jieba
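One way a post-segmentation step like the chengyu merge could work is sketched below. This is not routerl's actual pipeline: `merge_known_entries`, the `max_parts` limit, and the sample data are assumptions for illustration. The input list stands in for segmenter output (with Jieba, something like the result of `jieba.lcut(text)`), hard-coded here so the example runs without any dependency.

```python
def merge_known_entries(tokens, entries, max_parts=3):
    """Rejoin adjacent tokens whose concatenation is a single known entry."""
    merged = []
    i = 0
    while i < len(tokens):
        # Prefer the longest run of adjacent tokens that forms a known entry.
        for parts in range(min(max_parts, len(tokens) - i), 1, -1):
            joined = "".join(tokens[i:i + parts])
            if joined in entries:
                merged.append(joined)
                i += parts
                break
        else:
            # No multi-token run matched here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical segmenter output in which a chengyu was split in two.
raw_tokens = ["他", "想", "一步", "登天"]
chengyu = {"一步登天"}
print(merge_known_entries(raw_tokens, chengyu))
```

Running the merge as a separate pass keeps the segmenter itself untouched, so each heuristic layer can be tested on its own.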

rmccrear 20 days ago [-]

Great project! It's fascinating how hard segmentation is and how many approaches there are. I thought I'd mention a trick that can let you segment without a backend. When you double click Chinese text in the browser, it will highlight an entire word. For example, try double clicking on the text here: 一步登天：走一步就到天堂美好境地。 It highlights/segments the first 4 characters as a chengyu, and the others as one or two character words. I haven't been able to discover what method Apple and Microsoft use to segment, but it seems to do a good job. You can even use JavaScript's Range.expand() function to do this programmatically. I once even made a little JS library that can run in the background and segment words on a page.

yorwba 20 days ago [-]

Last I checked, browsers basically wrap ICU's word-break iterator: https://unicode-org.github.io/icu/userguide/boundaryanalysis...

imron 19 days ago [-]

That’s neat!

greyman 20 days ago [-]

OP, thank you for your work, I will continue to watch it.

I tried to build something similar, but what I didn't discover, and think is crucial, is the proper FE: yes, word segmenting is useful, but if I have to click on each word to see its meaning, then for how I learn Chinese by reading texts, I still find the Zhongwen Chrome extension more useful, since I see the English meaning quicker, just by hovering the cursor over the word.

In my project, I was trying to display the English translation under each Chinese word, which I think would require AI to determine the correct translation, since one cannot just put the CC-CEDICT entry there.

P.S.: I don't know how you built your dictionary; it translated 气功师 as "Aerosolist", which I am not sure what is exactly, but this should actually be two words, not one - correct segmentation and translation is 气功师, "qigong master".

routerl 20 days ago [-]

Thanks for the kind words, and the bug report!

The (awful and incorrect) translation you've pointed out comes from the segmenter being too greedy, not finding the (non-existent) word in any dictionary, and therefore dispatching the word to be machine translated, without context. This is the final fallback in the segmentation pipeline, to avoid displaying nothing at all, and my priority right now is making the segmentation pipeline more robust so this rarely (or never) happens, since it sometimes produces hilariously bad results!
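To illustrate the kind of robustness step described here, the sketch below tries to re-split an unknown token into known dictionary entries before handing it to machine translation. It is only a guess at one possible approach, not the app's real pipeline: `split_into_known`, `gloss`, `fake_mt`, and the tiny dictionary with its glosses are all hypothetical.

```python
def split_into_known(token, dictionary):
    """Return dictionary entries that concatenate to `token`, or None."""
    if token in dictionary:
        return [token]
    # Try longer leading pieces first so the split uses few, long words.
    for cut in range(len(token) - 1, 0, -1):
        head, tail = token[:cut], token[cut:]
        if head in dictionary:
            rest = split_into_known(tail, dictionary)
            if rest is not None:
                return [head] + rest
    return None

def gloss(token, dictionary, machine_translate):
    """Look up a token, re-splitting it if needed; use MT only as a last resort."""
    parts = split_into_known(token, dictionary)
    if parts is None:
        return [(token, machine_translate(token))]
    return [(part, dictionary[part]) for part in parts]

# Hypothetical dictionary and a stand-in for a context-free MT call.
dictionary = {"气功": "qigong", "师": "master; teacher"}

def fake_mt(text):
    return f"<MT:{text}>"

print(gloss("气功师", dictionary, fake_mt))
print(gloss("魑魅", dictionary, fake_mt))
```

With this ordering, an over-greedy token such as 气功师 is resolved from dictionary entries, and the machine translator is only called when no split into known words exists.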

gs17 20 days ago [-]

What are you using for machine translation? I'm surprised anything could mistranslate 气功师 like that.

routerl 19 days ago [-]

For anonymous users, I'm using OpenNMT, via Argos. Logged-in users get DeepL translations, which correctly translate 气功师.

rahimnathwani 20 days ago [-]

This is cool. If you haven't already, you might like to take a look at Du Chinese and The Chairman's Bao. They might provide ideas or inspiration.

Also the 'clip reader' feature in Pleco is decent.

Also, supporting simplified as well as traditional might increase your potential audience.

routerl 20 days ago [-]

It supports traditional and simplified, as well as pinyin and bopomofo :)

It's already possible to switch instantly between pinyin and bopomofo, and I'm working on letting users switch between simplified/traditional, but this is also a non-trivial problem. For now, the app will follow the user's lead: if you enter traditional text, it will return traditional text, and same goes for simplified.

jonknebel 18 days ago [-]

i've been working on a new script converter (https://github.com/creolio/mandarinTamer) for simplified and traditional (taiwanese) mandarin. almost finished--needs a few tests yet, but should be more accurate than opencc and google translate. would love to chat more about your tool. i'll send you an email.

cannam 20 days ago [-]

This was my attempt at doing something a little bit like it, 27 years ago. It's mostly interesting as a historical artifact - certainly yours is a lot more sophisticated and much much prettier! This one just does greedy matching against CEDICT.

https://all-day-breakfast.com/chinese/

What is kind of interesting is that the script itself (a single Perl CGI script) has survived the passage of time better than the text documenting it.

Besides all the broken links, the text refers throughout to Big-5 encoding, and the form at https://all-day-breakfast.com/chinese/big5-simple.html has a warning that the popups only work in Netscape or MSIE 4. You can now ignore all of that because browsers are more encoding aware (it still uses Big-5 internally but you can paste in Unicode) and the popups work anywhere.

ipnon 20 days ago [-]

Awesome! Your Chinese must be pretty good now. Do you still have the Perl source? How did you even devise such a project in 1998?

cannam 18 days ago [-]

There's a link to the Perl code hidden in the third para of text ("The [Perl source] for this script is available...") Of course a big reason it still works is that it was written for Perl 5, which is still current!

What that link doesn't give you is the dictionary files I used as input for the preprocessing step - which of course were also 1998 vintage. There are copies on the server (https://all-day-breakfast.com/chinese/cedict.b5_saved, https://all-day-breakfast.com/chinese/big5-PY.tit)

My Chinese got somewhat better, then a lot worse, then a little bit better again - obviously mostly to do with whether I was actually using it, which on the whole I haven't been. But back then I was really working on it and I just wanted something to help - there were a few useful resources I knew of (CEDICT obviously, and Rick Harbaugh's zhongwen.com was mindblowing at the time) and this seemed like a way to glue them together that I actually knew how to do.

Writing learning tools is obviously not the same thing as learning though.

thaumasiotes 20 days ago [-]

> Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

That's not true at all; you can go a long way just by clicking on characters in Pleco, and Pleco's segmentation algorithm is awful. (Specifically, it's greedy "find the longest substring starting at the selected character for which a dictionary entry exists".)

Sometimes I go back through very old conversations in Chinese and notice that I completely misunderstood something. That's an unfortunate but normal part of the language-learning process. You don't need full comprehension to learn. What would babies do?

wonnage 20 days ago [-]

That's not terrible for a dictionary where you control what you're looking up. I found poor segmentation to be way more annoying when trying to do stuff like selecting words for highlighting on my Kindle - it invariably breaks up a chengyu or selects nonsensical groups of words. Same with the Mac 3d touch quick look.

thaumasiotes 19 days ago [-]

But that's the same interface every time. You select a character and a lookup pops up. The Pleco controls are "touch a character to pop up an entry", "move end of selection window left", "move end of selection window right", "jump to previous 'word'", and "jump to next 'word'". The popup will display an entry for whatever is the longest prefix of the selection window that has an entry. How do your touch interfaces differ?

rasulkireev 18 days ago [-]

You should add it to Built with Django - https://builtwithdjango.com/projects/new/

carom 20 days ago [-]

Did you find the library jieba? That is what I am using for segmentation. It seems to work fine on simplified despite not advertising it.

routerl 20 days ago [-]

I did! Jieba is the first step in my segmentation pipeline. As far as I can tell, Jieba's default config tends to work better for simplified, but in my case the custom dictionary I feed it has significantly more traditional entries than simplified entries, especially for historical terms and slang.

mindvirus 19 days ago [-]

This is great. I'd love it for flashcard creation - paste in a block of text I'm reading and extract vocabulary from it.

routerl 16 days ago [-]

OP here, I'm adding a feature that will allow users to save specific words to lists, and export the lists in formats that can be imported to flashcard apps.

sarabande 23 days ago [-]

纔 in this case should use the definition of 才 (cai2) not (shan1) which is extremely uncommon. Otherwise, cool app!

routerl 23 days ago [-]

Could you post the text you used? This kind of thing goes straight into my unit tests.

I'm also working on showing all the pronunciations/definitions for a given hanzi; it should be ready later this week.

sarabande 22 days ago [-]

I used the example sentence from your link.

routerl 22 days ago [-]

Got it, thanks!

georgeplusplus 20 days ago [-]

Have you used the app Pleco?

That app has been invaluable as someone learning Chinese.

That app breaks down Mandarin sentences into individual characters. I believe it’s made by a Taiwanese developer too.

I tried your app with a few sentences and it works really well!

AlchemistCamp 19 days ago [-]

I don't think it's a Taiwanese developer. If so, he got a lot of stroke orders wrong for traditional characters as per the MOE standards.

Anytime I need to look up a character or word, I go to https://www.moedict.tw/ first. Pleco is still great for having so many add-ons (including dictionaries) and, from what I've heard, some decent graded readers.

routerl 20 days ago [-]

Thank you, and thanks for checking it out!

I use Pleco almost every day :)

bnly 23 days ago [-]

Nicely done, this looks quite useful!

maxglute 23 days ago [-]

Very well executed.

hassleblad23 23 days ago [-]

Great work OP.
