
#Segmentation

What is a Typesetting-Stack?

Published at November 20, 2021 ·  7 min read

The other day I had reason to think about what actually comprises a typesetting stack. Typesetting allows the documents you have compiled to be read on an output medium. For this post I’ll focus on screen output. Others have tried to give a systematic overview of what typesetting is. Nevertheless, I’ll jot down my own attempt, focussing on UAX, i.e. the Unicode recommendations and algorithms for various aspects of text segmentation.

A Walk through the Layout Process

It’s clear that the rules of various Unicode annexes play an important role during typesetting....


The Perils of Segmenting Text

Published at December 4, 2020 ·  2 min read

Breaking Unicode Text into Segments

Text processing applications need to segment text into pieces. Segments may be words, sentences, paragraphs and so on. For Western languages this is not too hard a problem, but it can become an involved endeavor once you consider Arabic or Asian languages. From a typographic viewpoint, some of these languages present serious challenges for correct segmentation. The Unicode Consortium publishes recommendations and algorithms for various aspects of text segmentation in its Unicode Standard Annexes (UAX)....
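
To make the difficulty concrete, here is a minimal sketch in Python. It is not the segmentation algorithm from any Unicode annex. It only shows how a naive whitespace-based approach holds up for English and falls apart for a script written without spaces. The helper `naive_words` and the sample strings are made up for illustration.

```python
import unicodedata


def naive_words(text: str) -> list[str]:
    """Deliberately naive word segmenter: split on whitespace only."""
    return text.split()


english = "Typesetting is fun"
chinese = "我喜欢排版"  # roughly "I like typesetting"; written without spaces

# Whitespace splitting finds the words of the English sample ...
print(naive_words(english))  # ['Typesetting', 'is', 'fun']

# ... but returns the whole Chinese sample as a single "word".
print(naive_words(chinese))  # ['我喜欢排版']

# Even "one character" is ambiguous: in decomposed form (NFD), "é" is
# stored as the letter "e" followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", "é")
print(len(decomposed))  # 2
```

A real segmenter needs per-character Unicode properties and boundary rules rather than a single delimiter, which is the kind of guidance the annexes aim to provide.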