Text processing

This is a guide for common character and string tasks.

Data types

Scheme has had standard string and character datatypes since forever. They are fully distinct types: a character is neither a string nor an integer. String, bytevector, vector and list are also distinct types: none of them is a subtype of another.
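The standard type predicates make this distinctness easy to check. What follows is a minimal sketch, assuming an R7RS implementation (bytevector? and the import form come from R7RS); type-name is a hypothetical helper, not a standard procedure.

```scheme
(import (scheme base))

;; Report which basic type a value belongs to.
;; type-name is a hypothetical helper for illustration.
(define (type-name x)
  (cond ((char? x) 'char)
        ((string? x) 'string)
        ((bytevector? x) 'bytevector)
        ((vector? x) 'vector)
        ((list? x) 'list)
        ((integer? x) 'integer)
        (else 'other)))

(type-name #\a)              ; => char
(type-name "a")              ; => string (a one-character string is still not a character)
(type-name 97)               ; => integer
(type-name (bytevector 97))  ; => bytevector
(type-name (vector #\a))     ; => vector
(type-name (list #\a))       ; => list
(string? #\a)                ; => #f
```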

Unicode and multi-byte character encodings

A Scheme character object represents one Unicode codepoint. char->integer and integer->char convert between integer codepoints and character objects. (map char->integer (string->list s)) shows all the codepoints in the string s.
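As a small illustration of the codepoint conversions, here is a sketch that assumes an R7RS implementation with full Unicode support; string-codepoints is a hypothetical helper name.

```scheme
(import (scheme base))

;; List the codepoint of every character in a string.
;; string-codepoints is a hypothetical helper for illustration.
(define (string-codepoints s)
  (map char->integer (string->list s)))

(string-codepoints "abc")   ; => (97 98 99)
(char->integer #\x3BB)      ; => 955, Greek small letter lambda
(integer->char 955)         ; the lambda character again

;; One codepoint is one character, however many bytes an encoding needs for it.
(string-length (string (integer->char 955)))   ; => 1
```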

Mutable vs immutable strings

Scheme strings are generally mutable. This means you can change individual characters in the string using string-set! at any time after the string has been created.
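A minimal sketch of in-place mutation, assuming R7RS. The string is copied first because R7RS treats mutating a string literal as an error; string-copy returns a freshly allocated string that is safe to change.

```scheme
(import (scheme base))

;; Start from a fresh, mutable copy rather than the literal itself.
(define s (string-copy "hello"))

(string-set! s 0 #\j)   ; replace the character at index 0
s                       ; => "jello"
```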

However, actual Scheme code that mutates strings is somewhat rare. It is generally best to avoid mutating them if you can manage without. In the medium-to-long term, Scheme may evolve in a direction where strings are immutable, or where mutable strings are second-class citizens and immutable strings are the default thing to use.
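One way to manage without mutation is to build a new string from the old one. The sketch below assumes R7RS (string-map from (scheme base), char-upcase from (scheme char)); shout is a hypothetical example procedure.

```scheme
(import (scheme base) (scheme char))

;; Return a new upper-cased string with "!" appended.
;; The argument is never modified.
(define (shout s)
  (string-append (string-map char-upcase s) "!"))

(define greeting "hello")

(shout greeting)   ; => "HELLO!"
greeting           ; => "hello"
```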

Libraries

Scheme has had several standard char- and string- procedures since forever (R2RS). Since R6RS they have been Unicode-aware.

SRFI 13 (String Libraries) is the most popular library for extra convenience. SRFI 130 (Cursor-based string library) is mostly a drop-in replacement, but additionally supports string cursors for walking a string character-by-character while keeping track of the current position.

R7RS: If you don’t need string cursors, you can use the following cond-expand. It imports whichever of SRFI 130 and SRFI 13 is available in a given Scheme implementation, preferring SRFI 130 when both are present. Almost all R7RS Schemes come with one or both libraries.

(cond-expand
  ((library (srfi 130)) (import (srfi 130)))
  ((library (srfi 13)) (import (srfi 13))))
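The same idea can be wrapped in a library so the rest of a program does not care which SRFI was found. This is only a sketch: the library name (example text-utils) and the procedures join-words and starts-with? are hypothetical, and how library files are located is implementation-specific. It imports with only because SRFI 13 exports some names, such as string-map, that (scheme base) also defines with a different signature.

```scheme
(define-library (example text-utils)   ; hypothetical library name
  (export join-words starts-with?)
  (import (scheme base))
  (cond-expand
    ((library (srfi 130))
     (import (only (srfi 130) string-join string-prefix?)))
    ((library (srfi 13))
     (import (only (srfi 13) string-join string-prefix?))))
  (begin
    ;; Join a list of strings with single spaces.
    (define (join-words words)
      (string-join words " "))

    ;; Is prefix a prefix of s? Both SRFIs take the prefix first.
    (define (starts-with? prefix s)
      (string-prefix? prefix s))))

;; Usage, after (import (example text-utils)):
;;   (join-words '("text" "processing"))        ; => "text processing"
;;   (starts-with? "text" "text processing")    ; => #t
```

Only procedures that behave the same in both libraries are used here. Index-returning procedures such as string-index are a different matter, since SRFI 130 returns cursors where SRFI 13 returns integer indexes.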

Classify characters

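The standard character classification procedures are char-alphabetic?, char-numeric?, char-whitespace?, char-upper-case? and char-lower-case?. SRFI 175 (ASCII character library) has ASCII-only versions of these.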
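A short sketch of character classification with the standard predicates, assuming R7RS, where they live in (scheme char). char-class is a hypothetical helper. The SRFI 175 counterparts carry an ascii- prefix (for example ascii-alphabetic?), but are not used here so that the example needs no extra library.

```scheme
(import (scheme base) (scheme char))

;; Name the first class a character falls into.
;; char-class is a hypothetical helper for illustration.
(define (char-class c)
  (cond ((char-alphabetic? c) 'letter)
        ((char-numeric? c) 'digit)
        ((char-whitespace? c) 'whitespace)
        (else 'other)))

(map char-class (string->list "a1 ?"))   ; => (letter digit whitespace other)
(char-upper-case? #\A)                   ; => #t
(char-lower-case? #\A)                   ; => #f
```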