unicode - Latin<->Han Conversion in ICU?



I am getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration and Chinese.

According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, this seems surprising to me, since Latin-Han conversion is particularly difficult without highly advanced statistical techniques (the closest I have seen is Google Transliterate, which does a great job without user input but is unfeasible for the present project), much less conversion without tone marks. I am skeptical that this is possible without resorting to the de facto foreign-name borrowing characters, such as 比尔·莫瑞. That is the approach taken by Google Maps in international domains, as you can see in this paper (PDF).

Anyhow, I was willing to suspend my disbelief, and after consulting the documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform a simple transliteration using them.

While Han-Latin worked pretty passably (about 80% accuracy on simple data), Latin-Han seemed not to work at all, returning the same "Latin" string as the input. This is consistent with the results from the online transform sample, and consistent with what I know about Chinese. I managed to find this table, which I think is used by both sources, as you can see here:

{ "latin-han", "file", "t_hani_latn", "reverse" }, { "han-latin", "file", "t_hani_latn", "forward" }, 

I presume this is meant to say that a given pinyin string could potentially be used to reproduce the original, but that does not seem to be the case.

I guess my general question is this: is this kind of transform possible with ICU, or with anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script pairs that ICU actually supports, if this is not possible?

Thank you for your time.

Note that the data comes from the CLDR project, http://cldr.unicode.org. The script pairs ICU supports are many, and ICU will attempt to use a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data set. Note that the top of the Han-Latin file says that it does not round trip.

