Tags

python regex string

Answers (3)

Accepted Answer
April 19, 2026 Score: 3 Rep: 9,096 Quality: High Completeness: 80%

Reconstructing the input with str.join is a good explanation for re.split behavior when the pattern has no capturing groups. It does not apply to capturing patterns, obviously, because there's no (general) string to use as a "joining" delimiter. However, we still want to be able to reason about re.split in some predictable way.
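For reference, the non-capturing case can be sketched like this. The sample string is made up for illustration and the delimiter is a plain comma:

```python
import re

text = "a,b,,c"                  # made-up sample input
parts = re.split(r",", text)     # no capturing group: delimiters are dropped
rebuilt = ",".join(parts)        # joining with the delimiter restores the input

print(parts)            # ['a', 'b', '', 'c']
print(rebuilt == text)  # True
```

This only works because every match is the same literal string. With a pattern like `a|c` there is no single string to join with.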

With a capturing group, the invariant is interleaved "bodies" and delimiters. Let's start from the simplest possible capture group (assume import re in all snippets below):

>>> re.split(r"(a)", "bacab")
['b', 'a', 'c', 'a', 'b']

All good, all clear: we get a list of the shape [body, delimiter, body, delimiter, ..., body] - 0 or more pairs of non-matching "bodies" and matching delimiters followed by a trailing "body" remainder.

Let's make the pattern more interesting:

>>> re.split(r"(a|d)", "bacabde")
['b', 'a', 'c', 'a', 'b', 'd', 'e']

Still the same shape. Now let's use a string with consecutive delimiters (matches):

>>> re.split(r"(a|c)", "bacab")
['b', 'a', '', 'c', '', 'a', 'b']

Note that the delimiters cannot be "collapsed" together, as the capture group promises exactly one character inside. Without these empty strings, c would land in a position where we expect a non-matching part of the string. So in your example ? and a space (matching \s) were consecutive, and re.split had to add an empty string to occupy the "body" position between them.
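A small sanity check of that shape, as a sketch. It assumes the pattern consists of exactly one capturing group that wraps the whole match (with more groups, or a group covering only part of the match, the layout differs):

```python
import re

text = "bacab"
parts = re.split(r"(a|c)", text)

bodies = parts[0::2]      # even indices: text between matches (possibly empty)
delimiters = parts[1::2]  # odd indices: what the group captured

# The list always has odd length: body, (delimiter, body)*.
assert len(parts) % 2 == 1
# Every odd-index item really is a delimiter.
assert all(re.fullmatch(r"a|c", d) for d in delimiters)
# Because the group captures the whole match, nothing is lost.
assert "".join(parts) == text

print(bodies)      # ['b', '', '', 'b']
print(delimiters)  # ['a', 'c', 'a']
```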

Of course it's all there in the source code, if you are ready to read some simple C.

The motivation (this part is my guess - not feeling like digging through 10+ years of sre history today) is supporting the ability to process the split result as pairs of (token, remainder) without re-checking whether a string is a token (matching delimiter) every time.

In general, you often want to know whether a given part of the re.split output is a delimiter or an unmatched part of the string, and the only way to do that in a simple list of strings is by position. The standard library authors did not want to deprive us of that ability, so the list is padded with empty strings if there's no "body part" left between two delimiters.
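Here is one way the "process it as pairs" idea could look. `split_pairs` is a hypothetical helper name, not something from the re module, and the sample string and pattern are only an illustration. It assumes a pattern with exactly one capturing group around the whole delimiter:

```python
import re
from itertools import zip_longest

def split_pairs(pattern, text):
    """Hypothetical helper: pair each body with the delimiter that follows it.

    Assumes `pattern` has exactly one capturing group wrapping the whole delimiter.
    """
    parts = re.split(pattern, text)
    bodies, delimiters = parts[0::2], parts[1::2]
    # There is always one more body than delimiters, so pad the last pair with "".
    return list(zip_longest(bodies, delimiters, fillvalue=""))

pairs = split_pairs(r"([!\s])", "Hello! How are you?")
print(pairs)
# [('Hello', '!'), ('', ' '), ('How', ' '), ('are', ' '), ('you?', '')]
```

The empty body in the second pair is exactly the padding discussed above: `!` and the space are adjacent, so nothing sits between them.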

April 20, 2026 Score: 1 Rep: 28,124 Quality: Medium Completeness: 80%

As the documentation says (albeit while talking about start/end):

That way, separator components are always found at the same relative indices within the result list.

For example if you split by vowels, they're all in the odd indices, and the other pieces are all in the even indices:

>>> import re
>>> t = re.split('([aeiou])', 'robertspierre')
>>> print(t)
['r', 'o', 'b', 'e', 'rtsp', 'i', '', 'e', 'rr', 'e', '']
>>> t[1::2]
['o', 'e', 'i', 'e', 'e']
>>> t[::2]
['r', 'b', 'rtsp', '', 'rr', '']

If empty strings were discarded, you wouldn't know what's what, you'd have to re-analyze the strings to know whether they're separators or not. In other words, it would be throwing useful information away. And the regularity is quite convenient. For example, the somewhat common exercise "reverse the vowels" can be nicely done with this:

>>> t[1::2] = reversed(t[1::2])
>>> print(t)
['r', 'e', 'b', 'e', 'rtsp', 'i', '', 'e', 'rr', 'o', '']
>>> ''.join(t)
'rebertspierro'
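To see what would be lost if the empty strings were discarded, here is a quick sketch that filters them out by hand (`compact` is just an illustrative name) and looks at the odd indices again:

```python
import re

t = re.split(r"([aeiou])", "robertspierre")
# Simulate a split that throws empty strings away.
compact = [piece for piece in t if piece]

print(t[1::2])        # ['o', 'e', 'i', 'e', 'e']
print(compact[1::2])  # ['o', 'e', 'i', 'rr']
```

After the adjacent vowels "ie", the positions shift by one, so 'rr' shows up where a separator is expected.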
April 20, 2026 Score: 0 Rep: 20,366 Quality: Medium Completeness: 60%

In addition to the other good explanatory answers, you may find it easier to design your regex to match runs of non-delimiters rather than splitting:

>>> re.findall(r"[^!\s]+", "Hello! How are you?")
['Hello', 'How', 'are', 'you?']

Or, if you want a collection of both, explicitly give a group to both words and delimiters:

>>> d = r"!\s"
>>> re.findall(fr"([^{d}]+|[{d}]+)", "Hello! How are you?")
['Hello', '! ', 'How', ' ', 'are', ' ', 'you?']

>>> [m.groups() for m in re.finditer(fr"([^{d}]+)([{d}]*)", "Hello! How are you?")]
[('Hello', '! '), ('How', ' '), ('are', ' '), ('you?', '')]
>>> list(zip(*_))  # unzip groups
[('Hello', 'How', 'are', 'you?'), ('! ', ' ', ' ', '')]
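As a rough check that the findall approach keeps everything, here is a sketch using the same `d` and sample string as above. The variable names `tokens` and `words` are only illustrative:

```python
import re

d = r"!\s"                    # delimiter characters, for use inside [...]
text = "Hello! How are you?"

# Alternate between runs of non-delimiters and runs of delimiters.
tokens = re.findall(fr"[^{d}]+|[{d}]+", text)
# Every character belongs to exactly one run, so joining restores the input.
assert "".join(tokens) == text

# Keep only the runs that are not made of delimiter characters.
words = [tok for tok in tokens if not re.fullmatch(fr"[{d}]+", tok)]
print(tokens)  # ['Hello', '! ', 'How', ' ', 'are', ' ', 'you?']
print(words)   # ['Hello', 'How', 'are', 'you?']
```

Unlike re.split with a capturing group, adjacent delimiters are merged into one run here, so there are no empty strings to handle, but the strict body/delimiter alternation by index is only guaranteed if the text starts with a non-delimiter.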