
⋮Seg⋮ment⋮ing⋮ ⋮Uni⋮code⋮

Practical Unicode Text Segmentation in Go


What is a Typesetting-Stack?

Published at November 20, 2021 ·  7 min read

The other day I had reason to think about what actually comprises a typesetting stack. Typesetting allows the documents you compiled to be read on an output medium. For this post I’ll focus on screen output. Others have tried to give a systematic overview of what typesetting is. Nevertheless, I’ll jot down my own attempt, focussing on UAX, i.e. the Unicode recommendations and algorithms for various aspects of text segmentation. A Walk through the Layout Process: It’s clear that the rules of various Unicode annexes play an important role during typesetting....

Bidi: What You See isn't What You Get

Published at February 24, 2021 ·  4 min read

I stumbled across the problem of bidirectional text in terminals while trying to test a variant of the Unicode Bidirectional Algorithm. The Unicode consortium publishes a set of bidi test cases, which suffer from being somewhat “non-visual”. At the end of the day you want to deal with real sentences in real languages and scripts. When preparing such tests, you face a peculiar problem: how do you display your test output? After all, UAX#9 is about visual ordering of characters....

Text vs Strings

Published at January 16, 2021 ·  8 min read

In the unlikely event of a software developer and a typographer attending the same party and, even more unlikely, engaging in some small talk, they certainly are prone to a fundamental disagreement: What is the nature of text? As an IT guy with a heavy interest in typography, I even find myself in disagreement with myself, clearly an unhealthy state of mind. As a typographer I may think about beautiful books, fine typecases or craftsmanship in layout, as a bibliophile I may think about inspired literature or enlightened non-fiction, but as a programmer I mainly think of the hassle with UTF-8, emojis, limitations of string libraries and so on....
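To give a taste of the programmer's side of that disagreement, here is a minimal sketch using only Go's standard library. The helper name `counts` and the sample strings are my own illustration, not taken from the post. It compares the byte length of a string with its number of code points (runes):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper: it reports the UTF-8 byte length
// of s and the number of code points (runes) it contains.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"Text",                 // plain ASCII
		"e\u0301",              // 'e' followed by U+0301 COMBINING ACUTE ACCENT
		"\U0001F44D\U0001F3FD", // thumbs-up emoji followed by a skin-tone modifier
	}
	for _, s := range samples {
		b, r := counts(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

The second and third samples each render as a single character on screen, yet neither the byte count nor the rune count is 1. Grouping code points into what a reader perceives as one character is a segmentation problem in its own right.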

The Perils of Segmenting Text

Published at December 4, 2020 ·  2 min read

Breaking Unicode Text into Segments: Text processing applications need to segment text into pieces. Segments may be words, sentences, paragraphs and so on. For western languages this is not too hard a problem, but it may become an involved endeavor if you consider Arabic or Asian languages. From a typographic viewpoint some of these languages present serious challenges for correct segmenting. The Unicode consortium publishes recommendations and algorithms for various aspects of text segmentation in their Unicode Annexes (UAX)....
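To illustrate why whitespace alone is not enough, here is a small sketch with Go's standard library. The function name `naiveWords` and the sample sentences are assumptions for illustration, and the approach is deliberately naive: it is not the word-boundary algorithm from the Unicode annexes.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveWords is a hypothetical helper that treats every run of
// non-whitespace characters as one word. It knows nothing about
// Unicode word-boundary rules.
func naiveWords(s string) []string {
	return strings.Fields(s)
}

func main() {
	english := "Segments may be words"
	chinese := "这是一个句子" // a Chinese sentence written without spaces

	fmt.Println(len(naiveWords(english))) // 4
	fmt.Println(len(naiveWords(chinese))) // 1
}
```

The English sample splits into four words as expected, while the Chinese sentence comes back as a single segment, because nothing in it separates the words by whitespace. Cases like this are where rule-based segmentation along the lines of the UAX recommendations becomes necessary.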