You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
177 lines
6.8 KiB
177 lines
6.8 KiB
7 months ago
|
#!/bin/sh
|
||
|
|
||
|
# vim: indentexpr= nosmartindent autoindent
|
||
|
# vim: tabstop=2 shiftwidth=2 softtabstop=2
|
||
|
|
||
|
# This is a regex that I reverse engineered from the sentence boundary chain
|
||
|
# rules in UAX #29. Unlike the grapheme regex, which is essentially provided
|
||
|
# for us in UAX #29, no such sentence regex exists.
|
||
|
#
|
||
|
# I looked into how ICU achieves this, since UAX #29 hints that producing
|
||
|
# finite state machines for grapheme/sentence/word/line breaking is possible,
|
||
|
# but only easy to do for graphemes. ICU does this by implementing their own
|
||
|
# DSL for describing the break algorithms in terms of the chaining rules
|
||
|
# directly. You can see an example for sentences in
|
||
|
# icu4c/source/data/brkitr/rules/sent.txt. ICU then builds a finite state
|
||
|
# machine from those rules in a mostly standard way, but implements the
|
||
|
# "chaining" aspect of the rules by connecting overlapping end and start
|
||
|
# states. For example, given SB7:
|
||
|
#
|
||
|
# (Upper | Lower) ATerm x Upper
|
||
|
#
|
||
|
# Then the naive way to convert this into a regex would be something like
|
||
|
#
|
||
|
# [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}
|
||
|
#
|
||
|
# Unfortunately, this is incorrect. Why? Well, consider an example like so:
|
||
|
#
|
||
|
# U.S.A.
|
||
|
#
|
||
|
# A correct implementation of the sentence breaking algorithm should not insert
|
||
|
# any breaks here, exactly in accordance with repeatedly applying rule SB7 as
|
||
|
# given above. Our regex fails to do this because it will first match `U.S`
|
||
|
# without breaking them---which is correct---but will then start looking for
|
||
|
# its next rule beginning with a full stop (in ATerm) and followed by an
|
||
|
# uppercase letter (A). This will wind up triggering rule SB11 (without
|
||
|
# matching `A`), which inserts a break.
|
||
|
#
|
||
|
# The reason why this happens is because our initial application of rule SB7
|
||
|
# "consumes" the next uppercase letter (S), which we want to reuse as a prefix
|
||
|
# in the next rule application. A natural way to express this would be with
|
||
|
# look-around, although it's not clear that works in every case since you
|
||
|
# ultimately might want to consume that ending uppercase letter. In any case,
|
||
|
# we can't use look-around in our truly regular regexes, so we must fix this.
|
||
|
# The approach we take is to explicitly repeat rules when a suffix of a rule
|
||
|
# is a prefix of another rule. In the case of SB7, the end of the rule, an
|
||
|
# uppercase letter, also happens to match the beginning of the rule. This can
|
||
|
# in turn be repeated indefinitely. Thus, our actual translation to a regex is:
|
||
|
#
|
||
|
# [\p{sb=Upper}\p{sb=Lower}]\p{sb=ATerm}\p{sb=Upper}(\p{sb=ATerm}\p{sb=Upper}*
|
||
|
#
|
||
|
# It turns out that this is exactly what ICU does, but in their case, they do
|
||
|
# it automatically. In our case, we connect the chaining rules manually. It's
|
||
|
# tedious. With that said, we do no implement Unicode line breaking with this
|
||
|
# approach, which is a far scarier beast. In that case, it would probably be
|
||
|
# worth writing the code to do what ICU does.
|
||
|
#
|
||
|
# In the case of sentence breaks, there aren't *too* many overlaps of this
|
||
|
# nature. We list them out exhaustively to make this clear, because it's
|
||
|
# essentially impossible to easily observe this in the regex. (It took me a
|
||
|
# full day to figure all of this out.) Rules marked with N/A mean that they
|
||
|
# specify a break, and this strategy only really applies to stringing together
|
||
|
# non-breaks.
|
||
|
#
|
||
|
# SB1 - N/A
|
||
|
# SB2 - N/A
|
||
|
# SB3 - None
|
||
|
# SB4 - N/A
|
||
|
# SB5 - None
|
||
|
# SB6 - None
|
||
|
# SB7 - End overlaps with beginning of SB7
|
||
|
# SB8 - End overlaps with beginning of SB7
|
||
|
# SB8a - End overlaps with beginning of SB6, SB8, SB8a, SB9, SB10, SB11
|
||
|
# SB9 - None
|
||
|
# SB10 - None
|
||
|
# SB11 - None
|
||
|
# SB998 - N/A
|
||
|
#
|
||
|
# SB8a is in particular quite tricky to get right without look-ahead, since it
|
||
|
# allows ping-ponging between match rules SB8a and SB9-11, where SB9-11
|
||
|
# otherwise indicate that a break has been found. In the regex below, we tackle
|
||
|
# this by only permitting part of SB8a to match inside our core non-breaking
|
||
|
# repetition. In particular, we only allow the parts of SB8a to match that
|
||
|
# permit the non-breaking components to continue. If a part of SB8a matches
|
||
|
# that guarantees a pop out to SB9-11, (like `STerm STerm`), then we let it
|
||
|
# happen. This still isn't correct because an SContinue might be seen which
|
||
|
# would allow moving back into SB998 and thus the non-breaking repetition, so
|
||
|
# we handle that case as well.
|
||
|
#
|
||
|
# Finally, the last complication here is the sprinkling of $Ex* everywhere.
|
||
|
# This essentially corresponds to the implementation of SB5 by following
|
||
|
# UAX #29's recommendation in S6.2. Essentially, we use it avoid ever breaking
|
||
|
# in the middle of a grapheme cluster.
|
||
|
|
||
|
CR="\p{sb=CR}"
|
||
|
LF="\p{sb=LF}"
|
||
|
Sep="\p{sb=Sep}"
|
||
|
Close="\p{sb=Close}"
|
||
|
Sp="\p{sb=Sp}"
|
||
|
STerm="\p{sb=STerm}"
|
||
|
ATerm="\p{sb=ATerm}"
|
||
|
SContinue="\p{sb=SContinue}"
|
||
|
Numeric="\p{sb=Numeric}"
|
||
|
Upper="\p{sb=Upper}"
|
||
|
Lower="\p{sb=Lower}"
|
||
|
OLetter="\p{sb=OLetter}"
|
||
|
|
||
|
Ex="[\p{sb=Extend}\p{sb=Format}]"
|
||
|
ParaSep="[$Sep $CR $LF]"
|
||
|
SATerm="[$STerm $ATerm]"
|
||
|
|
||
|
LetterSepTerm="[$OLetter $Upper $Lower $ParaSep $SATerm]"
|
||
|
|
||
|
echo "(?x)
|
||
|
(
|
||
|
# SB6
|
||
|
$ATerm $Ex*
|
||
|
$Numeric
|
||
|
|
|
||
|
# SB7
|
||
|
[$Upper $Lower] $Ex* $ATerm $Ex*
|
||
|
$Upper $Ex*
|
||
|
# overlap with SB7
|
||
|
($ATerm $Ex* $Upper $Ex*)*
|
||
|
|
|
||
|
# SB8
|
||
|
$ATerm $Ex* $Close* $Ex* $Sp* $Ex*
|
||
|
([^$LetterSepTerm] $Ex*)* $Lower $Ex*
|
||
|
# overlap with SB7
|
||
|
($ATerm $Ex* $Upper $Ex*)*
|
||
|
|
|
||
|
# SB8a
|
||
|
$SATerm $Ex* $Close* $Ex* $Sp* $Ex*
|
||
|
(
|
||
|
$SContinue
|
||
|
|
|
||
|
$ATerm $Ex*
|
||
|
# Permit repetition of SB8a
|
||
|
(($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
|
||
|
# In order to continue non-breaking matching, we now must observe
|
||
|
# a match with a rule that keeps us in SB6-8a. Otherwise, we've entered
|
||
|
# one of SB9-11 and know that a break must follow.
|
||
|
(
|
||
|
# overlap with SB6
|
||
|
$Numeric
|
||
|
|
|
||
|
# overlap with SB8
|
||
|
($Close $Ex*)* ($Sp $Ex*)*
|
||
|
([^$LetterSepTerm] $Ex*)* $Lower $Ex*
|
||
|
# overlap with SB7
|
||
|
($ATerm $Ex* $Upper $Ex*)*
|
||
|
|
|
||
|
# overlap with SB8a
|
||
|
($Close $Ex*)* ($Sp $Ex*)* $SContinue
|
||
|
)
|
||
|
|
|
||
|
$STerm $Ex*
|
||
|
# Permit repetition of SB8a
|
||
|
(($Close $Ex*)* ($Sp $Ex*)* $SATerm)*
|
||
|
# As with ATerm above, in order to continue non-breaking matching, we
|
||
|
# must now observe a match with a rule that keeps us out of SB9-11.
|
||
|
# For STerm, the only such possibility is to see an SContinue. Anything
|
||
|
# else will result in a break.
|
||
|
($Close $Ex*)* ($Sp $Ex*)* $SContinue
|
||
|
)
|
||
|
|
|
||
|
# SB998
|
||
|
# The logic behind this catch-all is that if we get to this point and
|
||
|
# see a Sep, CR, LF, STerm or ATerm, then it has to fall into one of
|
||
|
# SB9, SB10 or SB11. In the cases of SB9-11, we always find a break since
|
||
|
# SB11 acts as a catch-all to induce a break following a SATerm that isn't
|
||
|
# handled by rules SB6-SB8a.
|
||
|
[^$ParaSep $SATerm]
|
||
|
)*
|
||
|
# The following collapses rules SB3, SB4, part of SB8a, SB9, SB10 and SB11.
|
||
|
($SATerm $Ex* ($Close $Ex*)* ($Sp $Ex*)*)* ($CR $LF | $ParaSep)?
|
||
|
"
|