#Unicode

⋮The⋮ ⋮Per⋮ils⋮ ⋮of⋮ ⋮Seg⋮ment⋮ing⋮ ⋮Text⋮

Published on December 4, 2020 · 2 min read

Breaking Unicode Text into Segments

Text processing applications need to segment text into pieces: words, sentences, paragraphs and so on. For Western languages this is not too hard a problem, but it can become an involved endeavor once you consider Arabic or Asian languages. From a typographic viewpoint, some of these languages present serious challenges for correct segmenting. The Unicode Consortium publishes recommendations and algorithms for various aspects of text segmentation in its Unicode Standard Annexes (UAX)....
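To make the difficulty concrete, here is a minimal Python sketch of two naive approaches and where they fall short. The function names `naive_words` and `simple_clusters` are made up for this illustration. Neither function implements the UAX algorithms; `simple_clusters` is only a rough stand-in that attaches combining marks to the preceding code point.

```python
import unicodedata


def naive_words(text):
    """Split on whitespace only. Works for space-delimited scripts,
    but returns a single 'word' for text written without spaces."""
    return text.split()


def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    Assumption: this is a simplified approximation of user-perceived
    characters. It ignores Hangul jamo, emoji sequences and the other
    rules a full segmentation algorithm has to handle.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


english = "The perils of segmenting text"
chinese = "分割文本很难"      # no spaces between words
cafe = "cafe\u0301"         # 'e' followed by COMBINING ACUTE ACCENT

print(naive_words(english))       # five separate words
print(naive_words(chinese))       # the whole string as one item
print(len(cafe))                  # counts code points
print(len(simple_clusters(cafe))) # counts base characters with their marks
```

Whitespace splitting is fine for the English sample but gives up entirely on the Chinese one, and counting code points disagrees with what a reader would call a character as soon as combining marks are involved.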