Breaking Unicode Text into Segments

Text processing applications need to segment text into pieces. Segments may be words, sentences, paragraphs, and so on. For Western languages this is not too hard a problem, but it can become an involved endeavor once you consider Arabic or Asian languages. From a typographic viewpoint, some of these languages present serious challenges for correct segmenting. The Unicode Consortium publishes recommendations and algorithms for various aspects of text segmentation in its Unicode Standard Annexes (UAX).
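To make the difference concrete, here is a minimal sketch of the naive approach: splitting at white space with Go's standard `strings.Fields`. The helper name `naiveWords` and the sample strings are made up for this illustration, and this is not the word-boundary algorithm from any UAX. It works tolerably for an English phrase and returns a single undivided chunk for a Japanese one, because Japanese does not separate words with spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveWords splits text at runs of Unicode white space.
// It is a baseline for comparison only, not a UAX algorithm.
func naiveWords(text string) []string {
	return strings.Fields(text)
}

func main() {
	english := "Text needs segmenting"
	japanese := "テキストを分割する" // no spaces between the words

	fmt.Println(len(naiveWords(english)))  // 3
	fmt.Println(len(naiveWords(japanese))) // 1
}
```

Anything beyond this baseline needs knowledge about character properties and context, which is what the Unicode annexes provide.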

Text Segmentation in Go(lang)

A number of Unicode standards describe best practices for text segmentation. Unfortunately, implementations in Go are scarce. Marcel van Lohuizen from the Go Core Team seems to be working on text segmenting, but with low priority. In the long run, it will be best to wait for the standard library to include functions for text segmentation. For now, however, I will implement my own.

Tradeoffs

Handling character data is often part of the inner loop of applications, which calls for fast implementations. The flip side is tedious implementations, with lots and lots of boring code. This is something I learned from working on text processing: I am even more susceptible to boredom than I suspected. In the Unicode committee world there are lots of brave people who spend a lot of energy on getting the details right. Progress in international language processing wouldn’t be possible without them. My thanks to every single one of them.

I need a different approach, however. If your concern is all about efficiency and performance, you will probably shy away from UAX. I try to find a balance between performance and readability. Unicode algorithms are often stated as formal rules. Implementations in the wild usually follow these descriptions in a procedural manner, resulting in hard-to-read code that travels back and forth in byte buffers. I won’t do that. Instead, I prefer to design a programming environment which allows me to put in the problem description (i.e., the UAX rules) and get a working algorithm out. Oftentimes that’s tougher and more time-consuming than a straightforward implementation, but hey, open-source development is about fun, at least for me.
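As a rough illustration of the "rules in, algorithm out" idea, here is a toy sketch. Everything in it is hypothetical: the four character classes, the `rule` type, the `noBreak` table and the `segments` driver are invented for this example and are far simpler than real UAX rules, which work on many more classes and on longer contexts. The point is only the shape: the rules are plain data, and one generic loop interprets them.

```go
package main

import (
	"fmt"
	"unicode"
)

// class is a coarse, made-up character class for this sketch.
type class int

const (
	other class = iota
	letter
	digit
	space
)

// classify maps a rune to its simplified class.
func classify(r rune) class {
	switch {
	case unicode.IsLetter(r):
		return letter
	case unicode.IsDigit(r):
		return digit
	case unicode.IsSpace(r):
		return space
	}
	return other
}

// rule says: do not break between a rune of class before
// and a directly following rune of class after.
type rule struct {
	before, after class
}

// noBreak is the declarative part. Any pair of adjacent classes
// not listed here counts as a break opportunity.
var noBreak = []rule{
	{letter, letter},
	{digit, digit},
	{letter, digit},
	{digit, letter},
	{space, space},
}

// matches reports whether some rule forbids a break between a and b.
func matches(rules []rule, a, b class) bool {
	for _, r := range rules {
		if r.before == a && r.after == b {
			return true
		}
	}
	return false
}

// segments is the generic driver: it walks the text once and cuts
// wherever no rule forbids a break.
func segments(text string, rules []rule) []string {
	var out []string
	start := 0
	prev := other
	for i, r := range text {
		cur := classify(r)
		if i > 0 && !matches(rules, prev, cur) {
			out = append(out, text[start:i])
			start = i
		}
		prev = cur
	}
	if start < len(text) {
		out = append(out, text[start:])
	}
	return out
}

func main() {
	fmt.Printf("%q\n", segments("UAX29 is fun!", noBreak))
	// ["UAX29" " " "is" " " "fun" "!"]
}
```

Changing the segmentation behaviour here means editing the `noBreak` table, not the loop. A linear scan over the rule list is obviously not the fastest option, which is exactly the performance-versus-readability tradeoff discussed above.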

The Perils of Segmenting Text
4 December 2020

Related posts

⋮What⋮ ⋮is⋮ ⋮a⋮ ⋮Type⋮set⋮ting⋮-⋮Stack⋮?

Archives

⋮The⋮ ⋮Per⋮ils⋮ ⋮of⋮ ⋮Seg⋮ment⋮ing⋮ ⋮Text⋮

Breaking Unicode Text into Segments

Text Segmentation in Go(lang)

Tradeoffs

Tags

Recent posts

What is a Typesetting-Stack? 20 November 2021

Bidi: What You See isn't What You Get 24 February 2021

Text vs Strings 16 January 2021

The Perils of Segmenting Text 4 December 2020

Related posts

⋮What⋮ ⋮is⋮ ⋮a⋮ ⋮Type⋮set⋮ting⋮-⋮Stack⋮?

Archives

What is a Typesetting-Stack?
20 November 2021

Bidi: What You See isn't What You Get
24 February 2021

Text vs Strings
16 January 2021

The Perils of Segmenting Text
4 December 2020