Enumerate a character's Unicode properties in Ruby?

Is there a way to enumerate all of a character's Unicode properties in Ruby? I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (e.g., some_char =~ /\p{P}/ to test whether some_char is punctuation). But since a character can have multiple properties (for example, both punctuation and ASCII), it would be nice to be able to list all of a character's properties.

I could do this by hand using unicode_data.txt, or whatever it's called, but it seems like the sort of thing that's been done somewhere. UnicodeUtils doesn't appear to have anything along these lines, and googling didn't turn up anything obvious. Thanks!

You can call out to the uniprops script.

$ uniprops -p delta greek:delta greek:delta
    U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}

$ uniprops \# ç π
    U+0023 ‹#› \N{ NUMBER SIGN }:
        \pP \p{Po}
        ascii assigned common zyyy po p gr_base
           grapheme_base graph grbase other_punctuation punct pat_syn
           pattern_syntax patsyn posixgraph posixprint posixpunct
           print punctuation
    U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        alnum alpha alphabetic assigned inlatin1 cased
           cased_letter lc changes_when_casemapped cwcm
           changes_when_titlecased cwt changes_when_uppercased cwu ll
           l gr_base grapheme_base graph grbase id_continue idc
           id_start ids letter l_ latin latn lowercase_letter lower
           lowercase print word xid_continue xidc xid_start xids
    U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        alnum alpha alphabetic assigned greek is_greek
           ingreek cased cased_letter lc changes_when_casemapped cwcm
           changes_when_titlecased cwt changes_when_uppercased cwu ll
           l gr_base grapheme_base graph grbase grek greek_and_coptic
           id_continue idc id_start ids letter l_ lowercase_letter
           lower lowercase print word xid_continue xidc xid_start xids

$ uniprops -a 'micro sign'
U+00B5 ‹µ› \N{MICRO SIGN}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    alnum alpha alphabetic assigned inlatin1 cased cased_letter lc changes_when_casefolded cwcf changes_when_casemapped cwcm
       changes_when_nfkc_casefolded cwkcf changes_when_titlecased cwt changes_when_uppercased cwu common zyyy ll l gr_base grapheme_base
       graph grbase id_continue idc id_start ids letter l_ latin_1 latin_1_supplement lowercase_letter lower lowercase print word
       xid_continue xidc xid_start xids x_posix_alnum x_posix_alpha x_posix_graph x_posix_lower x_posix_print x_posix_word
    age=1.1 bidi_class=l bidi_class=left_to_right bc=l block=latin_1 block=latin_1_supplement blk=latin1 canonical_combining_class=0
       canonical_combining_class=not_reordered ccc=nr canonical_combining_class=nr script=common decomposition_type=com
       decomposition_type=compat dt=com decomposition_type=non_canon decomposition_type=non_canonical dt=noncanon east_asian_width=neutral
       grapheme_cluster_break=other gcb=xx grapheme_cluster_break=xx hangul_syllable_type=na hangul_syllable_type=not_applicable hst=na
       joining_group=no_joining_group jg=nojoininggroup joining_type=non_joining jt=u joining_type=u line_break=al line_break=alphabetic
       lb=al numeric_type=none nt=none numeric_value=nan nv=nan present_in=1.1 in=1.1 present_in=2.0 in=2.0 present_in=2.1 in=2.1
       present_in=3.0 in=3.0 present_in=3.1 in=3.1 present_in=3.2 in=3.2 present_in=4.0 in=4.0 present_in=4.1 in=4.1 present_in=5.0 in=5.0
       present_in=5.1 in=5.1 present_in=5.2 in=5.2 present_in=6.0 in=6.0 sc=zyyy script=zyyy sentence_break=lo sentence_break=lower sb=lo
       word_break=aletter wb=le word_break=le _x_begin

$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
    \pP \p{Pd}
    assigned ingeneralpunctuation changes_when_nfkc_casefolded cwkcf common zyyy dash dash_punctuation pd p general_punctuation
       gr_base grapheme_base graph grbase punct pat_syn pattern_syntax patsyn print punctuation x_posix_graph x_posix_print x_posix_punct
    age=1.1 bidi_class=on bidi_class=other_neutral bc=on block=general_punctuation canonical_combining_class=0
       canonical_combining_class=not_reordered ccc=nr canonical_combining_class=nr script=common decomposition_type=nb
       decomposition_type=nobreak dt=nb decomposition_type=non_canon decomposition_type=non_canonical dt=noncanon east_asian_width=neutral
       grapheme_cluster_break=other gcb=xx grapheme_cluster_break=xx hangul_syllable_type=na hangul_syllable_type=not_applicable hst=na
       joining_group=no_joining_group jg=nojoininggroup joining_type=non_joining jt=u joining_type=u line_break=gl line_break=glue lb=gl
       numeric_type=none nt=none numeric_value=nan nv=nan present_in=1.1 in=1.1 present_in=2.0 in=2.0 present_in=2.1 in=2.1 present_in=3.0
       in=3.0 present_in=3.1 in=3.1 present_in=3.2 in=3.2 present_in=4.0 in=4.0 present_in=4.1 in=4.1 present_in=5.0 in=5.0 present_in=5.1
       in=5.1 present_in=5.2 in=5.2 present_in=6.0 in=6.0 sc=zyyy script=zyyy sentence_break=other sb=xx sentence_break=xx word_break=other
       wb=xx word_break=xx _x_begin

$ uniprops -l | grep greek | sort -dfu
    blk=greek
    block:ancient_greek_musical_notation
    block:ancient_greek_numbers
    block:greek
    block=greek_and_coptic
    block:greek_extended
    greek
    greek_and_coptic
    inancientgreekmusicalnotation
    inancientgreeknumbers
    ingreek
    ingreekextended
    is_greek
    script=greek
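If you would rather stay inside Ruby than shell out, the same idea can be roughly approximated with the Regexp property tests mentioned in the question. This is only a sketch, not how uniprops itself works: it probes a hand-picked and incomplete list of property names (an assumption on my part, extend it as needed) and reports the ones that match. The names CANDIDATE_PROPERTIES, PROPERTY_REGEXPS and unicode_properties are made up for illustration.

```ruby
# encoding: utf-8

# Hypothetical, incomplete sample of property names to probe.
CANDIDATE_PROPERTIES = %w[
  Alnum Alpha ASCII Blank Cntrl Digit Graph Lower Print Punct Space Upper Word
  L Ll Lu M N Nd P Po Pd S Z
  Latin Greek Common
].freeze

# Compile each \p{...} regexp once. Names that this Ruby's regexp engine
# does not recognise raise RegexpError and are skipped.
PROPERTY_REGEXPS = CANDIDATE_PROPERTIES.each_with_object({}) do |name, table|
  begin
    table[name] = Regexp.new("\\p{#{name}}")
  rescue RegexpError
    # Unknown property name on this Ruby version; leave it out.
  end
end.freeze

# Return the names of all probed properties that the character has.
def unicode_properties(char)
  PROPERTY_REGEXPS.select { |_name, re| re =~ char }.keys
end

p unicode_properties("#")
p unicode_properties("π")
```

The result is only as complete as the candidate list, so it will never be as thorough as the uniprops output above.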

If you want to go the other way, from properties to the characters that have them, there is unichars. Here are some examples of calling it:

$ unichars -gns '\p{cased}' '\p{number}'
$ unichars '\r'
$ unichars '\s' '[\v\h]'
$ unichars '\s' '\p{space}'
$ unichars '\pl' '\p{greek}'
$ unichars '\pl' '\p{greek}' | um
$ unichars '\p{age=6.0}'     | um
$ unichars '\p{lowercase}' '\p{lowercase_letter}'
$ unichars '\p{lower}'     '\p{ll}'  # same, but easier to type
$ unichars -a '\p{alphabetic}' '\p{letter}' | wc -l # 1006 code points
$ unichars -gas '\pl' '\p{cased}'
$ unichars -gas '\p{mark}' '\p{diacritic}'   #  209 code points
$ unichars -gas '\pm' '\p{bc=nsm}'
$ unichars -gas '\p{cased}' '[^\p{cwl}\p{cwt}\p{cwu}]'
$ unichars -gas '\p{dash}'
$ unichars -gas '\p{mark}' '\p{diacritic}'   # 1068 code points
$ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
$ unichars -gas 'uc ne ucfirst'
$ unichars -gasn num

Here is one example of its output:

$ unichars -gsn num 'int num ne num'
‭ 0  U+0030 gc=nd      0=nv  sc=common       DIGIT ZERO
‭ ¼  U+00BC gc=no    1/4=nv  sc=common       VULGAR FRACTION ONE QUARTER
‭ ½  U+00BD gc=no    1/2=nv  sc=common       VULGAR FRACTION ONE HALF
‭ ¾  U+00BE gc=no    3/4=nv  sc=common       VULGAR FRACTION THREE QUARTERS
‭ ٠  U+0660 gc=nd      0=nv  sc=common       ARABIC-INDIC DIGIT ZERO
‭ ۰  U+06F0 gc=nd      0=nv  sc=arabic       EXTENDED ARABIC-INDIC DIGIT ZERO
‭ ߀  U+07C0 gc=nd      0=nv  sc=nko          NKO DIGIT ZERO
‭ ०  U+0966 gc=nd      0=nv  sc=devanagari   DEVANAGARI DIGIT ZERO
‭ ০  U+09E6 gc=nd      0=nv  sc=bengali      BENGALI DIGIT ZERO
‭ ৴  U+09F4 gc=no   1/16=nv  sc=bengali      BENGALI CURRENCY NUMERATOR ONE
‭ ৵  U+09F5 gc=no    1/8=nv  sc=bengali      BENGALI CURRENCY NUMERATOR TWO
‭ ৶  U+09F6 gc=no   3/16=nv  sc=bengali      BENGALI CURRENCY NUMERATOR THREE
‭ ৷  U+09F7 gc=no    1/4=nv  sc=bengali      BENGALI CURRENCY NUMERATOR FOUR
‭ ৸  U+09F8 gc=no    3/4=nv  sc=bengali      BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
‭ ੦  U+0A66 gc=nd      0=nv  sc=gurmukhi     GURMUKHI DIGIT ZERO
‭ ૦  U+0AE6 gc=nd      0=nv  sc=gujarati     GUJARATI DIGIT ZERO
‭ ୦  U+0B66 gc=nd      0=nv  sc=oriya        ORIYA DIGIT ZERO
‭ ୲  U+0B72 gc=no    1/4=nv  sc=oriya        ORIYA FRACTION ONE QUARTER
‭ ୳  U+0B73 gc=no    1/2=nv  sc=oriya        ORIYA FRACTION ONE HALF
‭ ୴  U+0B74 gc=no    3/4=nv  sc=oriya        ORIYA FRACTION THREE QUARTERS
‭ ୵  U+0B75 gc=no   1/16=nv  sc=oriya        ORIYA FRACTION ONE SIXTEENTH
‭ ୶  U+0B76 gc=no    1/8=nv  sc=oriya        ORIYA FRACTION ONE EIGHTH
‭ ୷  U+0B77 gc=no   3/16=nv  sc=oriya        ORIYA FRACTION THREE SIXTEENTHS

etc.
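For the property-to-characters direction, a pure-Ruby approximation is also possible. The sketch below is my own illustration, not a port of unichars: it walks a code point range and keeps the characters that match every regexp it is given. The helper name chars_matching and the default range (the Basic Multilingual Plane) are assumptions chosen for the example.

```ruby
# encoding: utf-8

# Return every character in +range+ that matches all of +patterns+.
# chars_matching is a hypothetical helper, not part of any library.
def chars_matching(patterns, range = 0x0000..0xFFFF)
  range.each_with_object([]) do |codepoint, found|
    # Surrogate code points are not valid characters in UTF-8; skip them.
    next if (0xD800..0xDFFF).cover?(codepoint)
    char = [codepoint].pack("U")
    found << char if patterns.all? { |re| re =~ char }
  end
end

# Lowercase letters of the Greek script, scanning only U+0370..U+03FF.
greek_lower = chars_matching([/\p{Ll}/, /\p{Greek}/], 0x0370..0x03FF)
greek_lower.each do |char|
  printf("%s  U+%04X\n", char, char.ord)
end
```

Scanning a whole plane this way is slow compared with a precomputed table, but it needs nothing beyond the properties Ruby's regexp engine already knows.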

I describe these in the first of my OSCON Unicode talks. They are two of the tools in a suite of a couple of dozen.

