Is there a way to enumerate all of a character's Unicode properties in Ruby?

I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (e.g., some_char =~ /\p{P}/ to test whether some_char is punctuation, etc.). But since characters can have multiple properties (for example, both punctuation and ASCII), it would be nice to be able to get a list of all of a character's properties.

I could do this by hand using unicode_data.txt, or whatever it's called, but this seems like the sort of thing that's been done somewhere. UnicodeUtils doesn't appear to have anything along these lines, and googling didn't turn up anything obvious. Thanks!
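To make the limitation concrete, here is a minimal sketch of the one-property-at-a-time test. The helper name matching_properties and the candidate list are made up for illustration, and the list is nowhere near complete, which is exactly the problem:

```ruby
# encoding: utf-8

# Hand-picked, incomplete sample of property names (an assumption for
# illustration; a real list would have to come from somewhere else).
CANDIDATE_PROPERTIES = %w[
  ASCII Alpha Alnum Digit Punct Space Upper Lower
  L Ll Lu N Nd P Po Pd S Greek Latin Common
].freeze

# Return the candidate names whose \p{...} class matches the character.
# Names the regexp engine does not know are skipped instead of raising.
def matching_properties(char, candidates = CANDIDATE_PROPERTIES)
  candidates.select do |name|
    begin
      char =~ Regexp.new("\\p{#{name}}")
    rescue RegexpError
      false
    end
  end
end

p matching_properties("#")
```

This only ever reports names I already thought to ask about, so it is not a real enumeration.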
You can call out to the uniprops script.
$ uniprops -p delta greek:delta greek:delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}

$ uniprops \# ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
    \pP \p{Po}
    ascii assigned common zyyy po p gr_base grapheme_base graph grbase
    other_punctuation punct pat_syn pattern_syntax patsyn posixgraph
    posixprint posixpunct print punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    alnum alpha alphabetic assigned inlatin1 cased cased_letter lc
    changes_when_casemapped cwcm changes_when_titlecased cwt
    changes_when_uppercased cwu ll l gr_base grapheme_base graph grbase
    id_continue idc id_start ids letter l_ latin latn lowercase_letter
    lower lowercase print word xid_continue xidc xid_start xids
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    alnum alpha alphabetic assigned greek is_greek ingreek cased
    cased_letter lc changes_when_casemapped cwcm changes_when_titlecased
    cwt changes_when_uppercased cwu ll l gr_base grapheme_base graph
    grbase grek greek_and_coptic id_continue idc id_start ids letter l_
    lowercase_letter lower lowercase print word xid_continue xidc
    xid_start xids

$ uniprops -a 'micro sign'
U+00B5 ‹µ› \N{MICRO SIGN}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    alnum alpha alphabetic assigned inlatin1 cased cased_letter lc
    changes_when_casefolded cwcf changes_when_casemapped cwcm
    changes_when_nfkc_casefolded cwkcf changes_when_titlecased cwt
    changes_when_uppercased cwu common zyyy ll l gr_base grapheme_base
    graph grbase id_continue idc id_start ids letter l_ latin_1
    latin_1_supplement lowercase_letter lower lowercase print word
    xid_continue xidc xid_start xids x_posix_alnum x_posix_alpha
    x_posix_graph x_posix_lower x_posix_print x_posix_word
    age=1.1 bidi_class=l bidi_class=left_to_right bc=l block=latin_1
    block=latin_1_supplement blk=latin1 canonical_combining_class=0
    canonical_combining_class=not_reordered ccc=nr
    canonical_combining_class=nr script=common decomposition_type=com
    decomposition_type=compat dt=com decomposition_type=non_canon
    decomposition_type=non_canonical dt=noncanon east_asian_width=neutral
    grapheme_cluster_break=other gcb=xx grapheme_cluster_break=xx
    hangul_syllable_type=na hangul_syllable_type=not_applicable hst=na
    joining_group=no_joining_group jg=nojoininggroup
    joining_type=non_joining jt=u joining_type=u line_break=al
    line_break=alphabetic lb=al numeric_type=none nt=none
    numeric_value=nan nv=nan present_in=1.1 in=1.1 present_in=2.0 in=2.0
    present_in=2.1 in=2.1 present_in=3.0 in=3.0 present_in=3.1 in=3.1
    present_in=3.2 in=3.2 present_in=4.0 in=4.0 present_in=4.1 in=4.1
    present_in=5.0 in=5.0 present_in=5.1 in=5.1 present_in=5.2 in=5.2
    present_in=6.0 in=6.0 sc=zyyy script=zyyy sentence_break=lo
    sentence_break=lower sb=lo word_break=aletter wb=le word_break=le
    _x_begin

$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
    \pP \p{Pd}
    assigned ingeneralpunctuation changes_when_nfkc_casefolded cwkcf
    common zyyy dash dash_punctuation pd p general_punctuation gr_base
    grapheme_base graph grbase punct pat_syn pattern_syntax patsyn print
    punctuation x_posix_graph x_posix_print x_posix_punct
    age=1.1 bidi_class=on bidi_class=other_neutral bc=on
    block=general_punctuation canonical_combining_class=0
    canonical_combining_class=not_reordered ccc=nr
    canonical_combining_class=nr script=common decomposition_type=nb
    decomposition_type=nobreak dt=nb decomposition_type=non_canon
    decomposition_type=non_canonical dt=noncanon east_asian_width=neutral
    grapheme_cluster_break=other gcb=xx grapheme_cluster_break=xx
    hangul_syllable_type=na hangul_syllable_type=not_applicable hst=na
    joining_group=no_joining_group jg=nojoininggroup
    joining_type=non_joining jt=u joining_type=u line_break=gl
    line_break=glue lb=gl numeric_type=none nt=none numeric_value=nan
    nv=nan present_in=1.1 in=1.1 present_in=2.0 in=2.0 present_in=2.1
    in=2.1 present_in=3.0 in=3.0 present_in=3.1 in=3.1 present_in=3.2
    in=3.2 present_in=4.0 in=4.0 present_in=4.1 in=4.1 present_in=5.0
    in=5.0 present_in=5.1 in=5.1 present_in=5.2 in=5.2 present_in=6.0
    in=6.0 sc=zyyy script=zyyy sentence_break=other sb=xx
    sentence_break=xx word_break=other wb=xx word_break=xx _x_begin

$ uniprops -l | grep greek | sort -dfu
    blk=greek
    block:ancient_greek_musical_notation
    block:ancient_greek_numbers
    block:greek
    block=greek_and_coptic
    block:greek_extended
    greek
    greek_and_coptic
    inancientgreekmusicalnotation
    inancientgreeknumbers
    ingreek
    ingreekextended
    is_greek
    script=greek
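Calling out from Ruby could look roughly like the sketch below. It assumes uniprops is installed and on your PATH, that it prints UTF-8, and that its output has the shape shown above (a header line starting with the code point, then whitespace-separated property names on the following lines). The helper names parse_uniprops and uniprops are made up for this example; the character is passed as a hex code point, as in the `uniprops -a 2011` call above.

```ruby
require "open3"

# Split uniprops-style output into a flat list of property tokens.
# Assumption: header lines start with "U+<hex>" and every other line
# is a whitespace-separated list of property names.
def parse_uniprops(output)
  output.each_line
        .reject { |line| line =~ /\AU\+\h+/i }
        .flat_map { |line| line.split }
end

# Run uniprops for a single character and return its property tokens.
# Returns nil when the script is missing or exits with an error.
def uniprops(char, all: false)
  args = ["uniprops"]
  args << "-a" if all
  args << format("%04X", char.ord) # pass the code point in hex
  out, status = Open3.capture2(*args)
  status.success? ? parse_uniprops(out) : nil
rescue Errno::ENOENT
  nil
end

props = uniprops("#")
puts(props ? props.join(" ") : "uniprops is not available")
```

Keeping the parsing separate from the shelling out means the parser can be adjusted on its own if the real output differs from what this sketch assumes.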
If you want, unichars can go the other way. Here are examples of calling it:

$ unichars -gns '\p{cased}' '\p{number}'
$ unichars '\r'
$ unichars '\s' '[\v\h]'
$ unichars '\s' '\p{space}'
$ unichars '\pl' '\p{greek}'
$ unichars '\pl' '\p{greek}' | um
$ unichars '\p{age=6.0}' | um
$ unichars '\p{lowercase}' '\p{lowercase_letter}'
$ unichars '\p{lower}' '\p{ll}'    # same, but easier to type
$ unichars -a '\p{alphabetic}' '\p{letter}' | wc -l    # 1006 code points
$ unichars -gas '\pl' '\p{cased}'
$ unichars -gas '\p{mark}' '\p{diacritic}'    # 209 code points
$ unichars -gas '\pm' '\p{bc=nsm}'
$ unichars -gas '\p{cased}' '[^\p{cwl}\p{cwt}\p{cwu}]'
$ unichars -gas '\p{dash}'
$ unichars -gas '\p{mark}' '\p{diacritic}'    # 1068 code points
$ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
$ unichars -gas 'uc ne ucfirst'
$ unichars -gasn num
Here is one example of its output:

$ unichars -gsn num 'int num ne num'
 0  U+0030 gc=nd 0=nv    sc=common     DIGIT ZERO
 ¼  U+00BC gc=no 1/4=nv  sc=common     VULGAR FRACTION ONE QUARTER
 ½  U+00BD gc=no 1/2=nv  sc=common     VULGAR FRACTION ONE HALF
 ¾  U+00BE gc=no 3/4=nv  sc=common     VULGAR FRACTION THREE QUARTERS
 ٠  U+0660 gc=nd 0=nv    sc=common     ARABIC-INDIC DIGIT ZERO
 ۰  U+06F0 gc=nd 0=nv    sc=arabic     EXTENDED ARABIC-INDIC DIGIT ZERO
 ߀  U+07C0 gc=nd 0=nv    sc=nko        NKO DIGIT ZERO
 ०  U+0966 gc=nd 0=nv    sc=devanagari DEVANAGARI DIGIT ZERO
 ০  U+09E6 gc=nd 0=nv    sc=bengali    BENGALI DIGIT ZERO
 ৴  U+09F4 gc=no 1/16=nv sc=bengali    BENGALI CURRENCY NUMERATOR ONE
 ৵  U+09F5 gc=no 1/8=nv  sc=bengali    BENGALI CURRENCY NUMERATOR TWO
 ৶  U+09F6 gc=no 3/16=nv sc=bengali    BENGALI CURRENCY NUMERATOR THREE
 ৷  U+09F7 gc=no 1/4=nv  sc=bengali    BENGALI CURRENCY NUMERATOR FOUR
 ৸  U+09F8 gc=no 3/4=nv  sc=bengali    BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
 ੦  U+0A66 gc=nd 0=nv    sc=gurmukhi   GURMUKHI DIGIT ZERO
 ૦  U+0AE6 gc=nd 0=nv    sc=gujarati   GUJARATI DIGIT ZERO
 ୦  U+0B66 gc=nd 0=nv    sc=oriya      ORIYA DIGIT ZERO
 ୲  U+0B72 gc=no 1/4=nv  sc=oriya      ORIYA FRACTION ONE QUARTER
 ୳  U+0B73 gc=no 1/2=nv  sc=oriya      ORIYA FRACTION ONE HALF
 ୴  U+0B74 gc=no 3/4=nv  sc=oriya      ORIYA FRACTION THREE QUARTERS
 ୵  U+0B75 gc=no 1/16=nv sc=oriya      ORIYA FRACTION ONE SIXTEENTH
 ୶  U+0B76 gc=no 1/8=nv  sc=oriya      ORIYA FRACTION ONE EIGHTH
 ୷  U+0B77 gc=no 3/16=nv sc=oriya      ORIYA FRACTION THREE SIXTEENTHS
etc.
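If you would rather stay inside Ruby for the "other way" lookup, a brute-force scan over code points gets you part of the way. This is only a sketch of the idea, not a port of unichars: the helper name chars_matching is made up here, and it can only use property names that Ruby's own regexp engine understands, which is a smaller set than the Perl tools offer.

```ruby
# encoding: utf-8

# Collect every character in `range` that matches all of `patterns`.
# The default range is the whole Unicode code space, which is slow;
# pass a narrower range when you know where to look.
def chars_matching(patterns, range = 0..0x10FFFF)
  range.each_with_object([]) do |cp, found|
    next if (0xD800..0xDFFF).cover?(cp) # surrogates are not characters
    char = [cp].pack("U")               # code point -> UTF-8 string
    found << char if patterns.all? { |re| char =~ re }
  end
end

# Letters in the Greek script, limited to the Greek and Coptic block.
chars_matching([/\p{L}/, /\p{Greek}/], 0x0370..0x03FF).each do |c|
  printf("%s U+%04X\n", c, c.ord)
end
```

Each extra pattern narrows the result, much like passing several criteria to unichars.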
I describe these in the first of my OSCON Unicode talks. They are two of the tools in a suite of a couple of dozen of them.