Do Unicode line break rules require the last character to be a required break?

Question

I am trying to use libunibreak (https://github.com/adah1972/libunibreak) to mark possible line breaks in some given Unicode text.

libunibreak returns one of four possible values for each code unit in the text:

LINEBREAK_MUSTBREAK
LINEBREAK_ALLOWBREAK
LINEBREAK_NOBREAK
LINEBREAK_INSIDEACHAR

Hopefully these are self-explanatory. I would expect MUSTBREAK to correspond to newline characters such as LF. However, for any given text, libunibreak always reports the last character as MUSTBREAK.

So, for example, with the string "abc", the output is [NOBREAK, NOBREAK, MUSTBREAK]. For "abc\n", the output is [NOBREAK, NOBREAK, NOBREAK, MUSTBREAK]. I use the MUSTBREAK attribute to start a new line when drawing text, so the first case ("abc") creates an extra line that should not be there.

Is this behavior required by the Unicode specification, or is it a quirk of the library implementation I'm using?

+4

unicode line-breaks

Prismatic Dec 04 '15 at 10:40

1 answer

nwellnhof · Accepted Answer · 2015-12-04T22:51:29+0000

Yes, this is what the Unicode line breaking algorithm specifies. Rule LB3 in UAX #14: Unicode Line Breaking Algorithm, Section 6.1 “Non-tailorable Line Breaking Rules”, states:

Always break at the end of the text.
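
Since the final MUSTBREAK only marks the end of the text, one way to avoid the extra line when drawing is to skip the break reported at the last index. The following is a minimal sketch of that idea, not part of libunibreak: the `enum brk` values are hypothetical stand-ins for the library's LINEBREAK_* constants, and `count_lines` is an illustrative helper name.

```c
#include <stddef.h>

/* Hypothetical stand-ins for libunibreak's LINEBREAK_* results;
   real code would use the library's own constants. */
enum brk { BRK_MUST, BRK_ALLOW, BRK_NO, BRK_INSIDE };

/* Count the lines to draw for a text of `len` code units.
   The break at the last index is treated as the end of text
   (rule LB3), so it never opens a new line. */
size_t count_lines(const enum brk *breaks, size_t len)
{
    if (len == 0)
        return 0;

    size_t lines = 1;
    /* Stop one short of the end: only interior mandatory breaks
       start another line. */
    for (size_t i = 0; i + 1 < len; i++) {
        if (breaks[i] == BRK_MUST)
            lines++;
    }
    return lines;
}
```

With this approach, both "abc" and "abc\n" are drawn as a single line. If a trailing newline should open an empty line, the drawing code would additionally need to check whether the last character is itself a line terminator.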

