
Text vs Strings

In the unlikely event of a software developer and a typographer attending the same party and, even more unlikely, engaging in some small talk, they are certainly prone to a fundamental disagreement: What is the nature of text? As an IT guy with a heavy interest in typography, I even find myself in disagreement with myself, clearly an unhealthy state of mind. As a typographer I may think about beautiful books, fine typecases or craftsmanship in layout; as a bibliophile I may think about inspired literature or enlightened non-fiction; but as a programmer I mainly think of the hassle with UTF-8, emojis, limitations of string libraries and so on.

Texts are not Strings

I’m convinced programmers should stop talking about text as if it were just a long string, and vice versa. In my opinion, text and strings are different animals, and the first step to having better library support for texts is to recognize the difference.

non-fiction text vs computer error message

A single data type to rule them all?

Text consists of letters and words, and so do strings. In real life we’re accustomed to defining things by their components or their attributes. But in mathematics (and software development) we are comfortable defining things by the set of operations on them. Taking this perspective, smart people have tried to work out the difference between text and strings or even encode it in their application, but that won’t stop me from pursuing my own attempt.

Similarities

Clearly strings are supposed to convey pieces of information. Presumably so do texts, but in a somewhat broader sense. Software applications typically don’t want to tell a story, but rather send me some kind of concise message. There is obviously a grey area, and I do not intend to argue from this point of view.

Both texts and strings are composed of characters, and often of words. But already slight differences start to creep in: What is a character? For texts, we think of a visual thing, closer to what is technically a glyph, while with a string we’re tempted to think of a byte or a UTF-16 code unit. Unicode introduces programmers to the term grapheme (more precisely, grapheme cluster) to mean a “perceived character”. This dilemma continues with the precise meaning of what a “word” is: in computer science the term is heavily overloaded, and in the real world… it’s complicated. Higher-level building blocks of texts do not have a sensible meaning for strings at all: sentences, paragraphs, pages, etc. My impression is that many programmers insist on strings having directly addressable characters, something along the lines of

dest_string[i] = source_string[j]

but these times are gone forever. Accessing graphemes in text is an inherently sequential operation, which is perfectly expressed by Rust’s unicode_segmentation crate, but other languages may prefer to mask this, as Swift does. For international text, fixed-size string cells are an illusion.
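To make the point concrete, here is a minimal Python sketch using only the standard library. The helper `approx_graphemes` is my own hypothetical name, and it is a deliberate simplification: it merely glues combining marks onto the preceding code point and is not an implementation of the UAX#29 rules (it will get emoji sequences, Hangul and many other cases wrong). It is only meant to show that "length" and "the i-th character" depend on which unit you count.

```python
import unicodedata

def approx_graphemes(s):
    """Group each code point with the combining marks that follow it.

    A rough approximation only, NOT full UAX#29 grapheme segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# "café" spelled with a plain "e" followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"

print(len(word))                           # code points: 5
print(len(word.encode("utf-8")))           # UTF-8 bytes: 6
print(len(word.encode("utf-16-le")) // 2)  # UTF-16 code units: 5
print(len(approx_graphemes(word)))         # perceived characters: 4

# Direct indexing hands back the bare accent, not a "character":
print(unicodedata.name(word[4]))           # COMBINING ACUTE ACCENT
```

Note that the grouping has to walk the string from the start: there is no way to jump to the fourth perceived character without looking at everything before it.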

Different Operations

In search of operations on text we could take a look at software applications dealing with text: web browsers and word processors.

Web browser window and Libre-Office window

Taking text seriously

One of the reasons there are so few text-centric software libraries around may be that there is not much demand for them. If you do not find yourself in a spot to program a web browser or are not a contributor to LibreOffice, you would probably be hard pressed to remember the last time you had to manipulate large numbers of sentences and paragraphs in your software (with one exception, which I will talk about in a minute). But perhaps the reverse is also true: there is a huge gap between the “professional text machines” and everyday applications because there are so few tools for text around.

The following (mental) exercise helps me to address this: Why are you not programming a web browser? Of course there are these minor complications with a cumbersome protocol, all kinds of security concerns, an API and virtual machine for a weird scripting language, platform-dependent issues with a host of media types, insanely complicated standards (which your peers prefer to interpret to their liking), restrictions of hand-held devices, and a lot more. But apart from these, what makes it hard to at least come up with a non-graphical browser?*

Imagine you are handed a library for your preferred programming language which addresses the following tasks linked to your brand-new browser project:

  • Allow entities of text to be split, merged, extracted, reordered, etc. with ease,
  • those entities being (at least) code-points, graphemes, syllables, words, sentences, lines, paragraphs, etc.
  • Arbitrary runs of characters within these entities may be attributed (a.k.a. styled), with synchronization of raw text and styles being transparent.
  • International text attributes (e.g., bidirectional text) are identified with ease and functions for correct visual ordering are available
  • and are transparently intertwined with sophisticated line-breaking for a multitude of languages and scripts.
  • At least the Unicode specifications UAX#9 (bidi), UAX#11 (East Asian width), UAX#14 (line breaking), UAX#29 (segmentation into entities of text), and UTS#51 (emoji) are available as actionable library functions.
  • Runs of text may be filled into containers, i.e. boxes on a page.
  • And performance won’t degenerate in the face of large amounts of text.
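As an illustration of the third item in this wish list (styled runs kept in sync with the raw text), here is a small Python sketch. Everything in it is hypothetical: the class name `StyledText`, its methods, and the choice to store runs as half-open code-point ranges are my assumptions, not an existing library API. A serious implementation would address graphemes rather than code points and would use a smarter structure than a flat list.

```python
from dataclasses import dataclass, field

@dataclass
class StyledText:
    """Raw text plus style runs stored as half-open ranges [start, end)."""
    text: str = ""
    runs: list = field(default_factory=list)   # entries: (start, end, style)

    def style(self, start, end, name):
        """Attach a named style to text[start:end]."""
        self.runs.append((start, end, name))

    def insert(self, pos, s):
        """Insert s at pos and adjust all runs so they still cover the same text."""
        n = len(s)
        self.text = self.text[:pos] + s + self.text[pos:]
        adjusted = []
        for start, end, name in self.runs:
            if start >= pos:
                start += n      # run begins at or after the insertion: shift it
            if end > pos:
                end += n        # run ends after the insertion: shift or stretch
            adjusted.append((start, end, name))
        self.runs = adjusted

    def styled(self, name):
        """Return the pieces of text that carry the given style."""
        return [self.text[a:b] for a, b, n in self.runs if n == name]

doc = StyledText("Texts are not strings")
doc.style(10, 13, "em")          # emphasise "not"
doc.insert(0, "Really: ")        # edit before the run: the run is shifted
print(doc.styled("em"))          # ['not']
doc.insert(19, "-")              # edit inside the run: the run is stretched
print(doc.styled("em"))          # ['n-ot']
```

The caller never touches the offsets after an edit, which is the kind of transparency the list item asks for.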

For a real web browser that would still leave you with tough problems of online typography, but that’s the reason we decided to stick to a text-based browser in the first place. Armed with this super-duper library, you are now free to focus on implementing a billion layout rules and a zillion properties, which (you keep repeating to yourself) is the fun part of writing a browser.

Just for the Pros?

The list above may be daunting. Who would take it upon herself to implement all this? Actually the community already did! In fact, some modules at the heart of browser styling engines are community projects. Every now and then some brave person tackles one of these challenges and with some luck the code ends up in a library, instead of a monolithic application.

Unfortunately all this code (and knowledge) is distributed in different applications and libraries, implemented in different programming languages. Attempts to integrate these into a text processing engine are heroic, but may be of limited benefit to the software development community.

Text is not a Sequence of Bytes

Strings are sequences of bytes. We may overlay these bytes with semantics by using an encoding, but the sequential nature of byte sequences persists. But is this the appropriate tool for managing text? RAM is cheap and copying blocks of data is very fast these days, but I am unsure whether byte sequences are the best data structure for text. Real-world text engines do not represent text as byte sequences, but more often than not as trees. Markup underlies the engines of web browsers (HTML) as well as word processors (the XML-based ODF and DOCX), and is inherently tree-like.

Perhaps tree structure and byte sequence should meet at the paragraph level: Although modern literature has produced some really looong paragraphs, usually they provide a good synchronisation point for sequential operations like iterating graphemes, breaking lines, bidi-ordering etc. If we are able to quickly access paragraphs in large texts (possibly employing concurrency), we could then treat paragraphs as sequential work packages.
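A rough Python sketch of the "paragraphs as work packages" idea follows. The function names are hypothetical, splitting on blank lines is just one assumed paragraph convention, and the per-paragraph worker is a trivial stand-in for real sequential work such as line breaking or bidi ordering.

```python
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor

def split_paragraphs(text):
    """Split on blank lines (an assumption about how paragraphs are delimited)."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]

def count_base_characters(paragraph):
    """Stand-in for sequential work: count code points that are not combining marks."""
    return sum(1 for ch in paragraph if unicodedata.combining(ch) == 0)

def process(text, worker=count_base_characters):
    """Run the worker over each paragraph; results come back in document order."""
    paragraphs = split_paragraphs(text)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(worker, paragraphs))

sample = "One.\n\nTwo two.\n\n\nThree"
print(process(sample))   # [4, 8, 5]
```

Each worker only ever sees one paragraph and walks it front to back, so the sequential operations stay simple while the document as a whole can be processed in parallel. In CPython, threads will not speed up CPU-bound work like this; a process pool or a different runtime may be the better fit, but the shape of the code stays the same.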

Trees are also easily implemented as persistent data structures, a bonus with multithreading. Some editors use ropes as their underlying data structure, which may be a good basis for other text manipulation purposes.
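To show why ropes and persistence go well together, here is a deliberately tiny Python sketch of an immutable rope. All names (`Rope`, `make_leaf`, `concat`, `split`, `insert`, `to_str`) are my own for illustration. It does no rebalancing and indexes by code point, both of which a production rope would handle differently.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rope:
    """Immutable node: a leaf holds text, an inner node holds two children."""
    left: Optional["Rope"] = None
    right: Optional["Rope"] = None
    leaf: str = ""
    length: int = 0

def make_leaf(s):
    return Rope(leaf=s, length=len(s))

def concat(a, b):
    return Rope(left=a, right=b, length=a.length + b.length)

def split(r, i):
    """Return ropes for r[:i] and r[i:], reusing untouched subtrees."""
    if r.left is None:                       # leaf: cut the string
        return make_leaf(r.leaf[:i]), make_leaf(r.leaf[i:])
    if i == r.left.length:                   # cut falls exactly between children
        return r.left, r.right
    if i < r.left.length:
        a, b = split(r.left, i)
        return a, concat(b, r.right)
    a, b = split(r.right, i - r.left.length)
    return concat(r.left, a), b

def insert(r, i, s):
    """Return a NEW rope with s inserted at position i; r is left unchanged."""
    a, b = split(r, i)
    return concat(concat(a, make_leaf(s)), b)

def to_str(r):
    if r.left is None:
        return r.leaf
    return to_str(r.left) + to_str(r.right)

v1 = concat(make_leaf("Texts are "), make_leaf("strings"))
v2 = insert(v1, 10, "not ")
print(to_str(v1))            # Texts are strings
print(to_str(v2))            # Texts are not strings
print(v2.right is v1.right)  # True: the old subtree is shared, not copied
```

Because nodes are never modified, a thread holding `v1` can keep reading it while another thread builds `v2`, with no locking needed.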

What Text is not

It is telling that text tenaciously escapes exact definitions: We humans utilize it for all kinds of activities, and often it’s easier to tell what a text library should not include.

Text vs Typography

While on one end of the spectrum a data type for text has to have distinct semantics different from strings, on the other end we should be careful with where typography starts. Obviously there is a grey area, too, but we may at least try to categorize.

It is not a matter of macroscopic vs microscopic view, as typography ranges from pixels on an output device to the layout of a book, but rather of having completely different logical entities. Typography is concerned with the visual representation of text. That’s it. This includes fonts, hyphenation, page layout, international scripts etc.

But just a second ago this guy talked about styled text, you ask? And you’re right. A word in italics is about visual representation, but it is also about semantics. In HTML lingo this is called <em> (emphasis). The line is blurry: welcome to the world of human communication!

Text as Data

The exception I alluded to above, a frequent use case for text, is text treated primarily as data. Of course, that sounds strange in a blog post about text processing on a computer, but I’ll put machine learning, full text search etc. in this camp. For now I will not think about inverted indexes, PoS-tagging or n-grams as part of a text API. But who knows what a web browser in 2030 will be capable of?

References

Edaqa Mortoray: Strings and Text are not the same

Computer Hope: Python Text Processing Modules

Wyrd Smythe: Data, Text and Strings (oh, my!)


*) Browsh provides a good example, as it is backed by a headless Firefox, delegating the heavy lifting to it while focussing on the user interface.