A string is a common data type, only somewhat less ubiquitous than the integer. Most programming languages have at least some degree of built-in support for a type that is called, or is equivalent to, a string.

There are variations in what is considered to constitute a string, which are discussed below. However, all string types have this much in common: they are an ordered sequence of characters of some known length. A string is the most common way to store textual information of any kind, including textual representations of numbers. However, the content of a string need not be textual or even human-readable.

Most languages allow string literals, that is, strings written directly in the source code, usually by enclosing the appropriate text in either single or double quotes.
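As a small illustration, here is how such literals look in Python, which is used here only as an example language. It happens to accept either kind of quote, and the two forms produce equal strings; other languages may give the two quote styles different meanings.

```python
# String literals in Python: single and double quotes are interchangeable.
single = 'hello'
double = "hello"

# Both literals denote the same five-character string.
same = (single == double)

# The other quote character can appear inside a literal without escaping.
quoted = "it's"
```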

Beyond creation and destruction, here are some of the most common operations that are performed on data of type string:

One very important variation in the implementation of strings is how the length of the string is determined. Ideally, it should be possible to determine the length of a string quickly and to expand the string as necessary.

  • One school of thought on length determination is the null-terminated string: the last actual character of the string is followed by a character having the value zero. Strings in this style are also known as ASCIIZ (or ASCIZ) strings, C-style strings, or just C strings. Traditionally, C strings are zero-indexed: when referring to a particular character within the string, the first character is numbered 0, the second 1, and so forth.

    The advantages of this implementation are that there is no explicit limit on the length of the string, it is fairly easy to implement, and the style is almost universally understood. The disadvantages are that calculating the length of the string is relatively slow, since every character must be examined until the terminator is found; the strings cannot contain a null character; and buffer-overrun bugs are difficult to avoid and debug.

  • The primary alternative to null-terminated strings is indexed strings, more commonly known as Pascal strings (named not for the mathematician and philosopher, but for the language that popularized them). These strings are prefixed by some encoding of the total length of the string; thus, the length can be determined without actually counting each character. Pascal strings may or may not be zero-indexed. The length count makes resizing and manipulation easier and safer, but unfortunately many languages select a naive implementation and allocate a fixed amount of space for each string, which avoids dynamic reallocation at the cost of unnecessary memory consumption. The other big problem with Pascal strings is that their size is limited to the largest value that the type storing their length can hold. Turbo Pascal uses a single length byte; therefore, all strings in Turbo Pascal are limited to 255 characters. Java, on the other hand, stores the length as a 32-bit integer, so string size is in practice limited mostly by available memory; unfortunately, this advantage is undercut by serialization methods such as DataOutputStream.writeUTF, which store the encoded length in two bytes and therefore throw an exception for any string whose encoding exceeds 65,535 bytes.
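The two layouts can be compared with a minimal sketch that models raw memory as Python bytes objects. The function names c_length, make_prefixed and prefixed_length are hypothetical, invented for this illustration, and the big-endian length field is an assumption; real implementations differ in prefix width and byte order.

```python
def c_length(buf: bytes) -> int:
    """Length of a null-terminated string: scan until the first zero byte.

    If the terminator is missing, the scan runs off the end of the buffer
    (an IndexError here; a buffer overrun in a language like C).
    """
    n = 0
    while buf[n] != 0:
        n += 1
    return n


def make_prefixed(data: bytes, prefix_bytes: int = 1) -> bytes:
    """Build a length-prefixed string (assumed big-endian length field)."""
    limit = 256 ** prefix_bytes - 1  # largest length the prefix can hold
    if len(data) > limit:
        raise ValueError(f"string too long: {len(data)} > {limit}")
    return len(data).to_bytes(prefix_bytes, "big") + data


def prefixed_length(buf: bytes, prefix_bytes: int = 1) -> int:
    """Length of a length-prefixed string: read the prefix, no scan needed."""
    return int.from_bytes(buf[:prefix_bytes], "big")


c_string = b"hello\x00"            # five characters plus the terminator
p_string = make_prefixed(b"hello")  # one length byte plus five characters
```

With a one-byte prefix, make_prefixed rejects anything longer than 255 bytes, which mirrors the Turbo Pascal limit; passing prefix_bytes=2 raises the ceiling to 65,535. Note also that c_length(b"ab\x00cd\x00") stops at the embedded zero, while the prefixed form can carry that zero byte as ordinary content.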

Another variation within the sphere of strings is the definition of a character. Platforms using the ASCII character set usually employ characters of one byte. Increasingly, modern languages have support for wide characters: that is, characters of two or more bytes. C++, for example, has a wchar_t type, whose size depends on the platform. The character set used within a wide-character type is also dependent on the underlying platform, but a common one is Unicode. Many languages allow manipulation of both single-byte and wide-character strings, in which case one must take care to convert between them when necessary. The Win32 API, for example, defines the types LPSTR and LPWSTR for pointers to 8-bit (ANSI) and wide (Unicode) strings respectively.
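The difference in character width, and the conversion between the two forms, can be sketched in Python by encoding the same text twice. The choice of ASCII and little-endian UTF-16 is an assumption made for the example; UTF-16 uses two bytes per character only for characters in the Basic Multilingual Plane.

```python
text = "hello"

# One byte per character: valid only for characters in the ASCII range.
narrow = text.encode("ascii")

# Two bytes per character (UTF-16, little-endian, no byte-order mark).
wide = text.encode("utf-16-le")

# Converting between the two means decoding with the source encoding
# and re-encoding with the target encoding.
wide_from_narrow = narrow.decode("ascii").encode("utf-16-le")
narrow_from_wide = wide.decode("utf-16-le").encode("ascii")
```

Going from wide to narrow can fail: a character outside the ASCII range has no one-byte representation in this scheme, and encoding it with "ascii" raises UnicodeEncodeError.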


C strings are usually declared in that language as char *: that is, a pointer to a character. This fact may be obscured by typedefs, but this is typically the underlying implementation. C provides a library of string-manipulation functions in the header string.h; some of these functions are listed along with the string operations above. Most of these functions are distinguished by the prefix "str".

C++ has a class called string that implements strings as a managed array. It is an instantiation of the template class basic_string for the type char, so one could easily instantiate a string class based around any other kind of character type, due to the magic of the STL and templates.

Pascal supports a built-in type called STRING, which, depending on the implementation, may or may not be a keyword. Likewise, depending on the implementation, the STRING type may or may not accept a maximum length as an argument, given in square brackets or parentheses. There are several standard Pascal functions for manipulating strings, although, unlike in C, determining which standard functions deal with strings is not always easy: for example, the Copy function extracts a substring, while the Move function has nothing to do with strings whatsoever.

Java has a class java.lang.String that receives special syntactic sugar from the compiler, mostly in the form of the + operator for concatenation. Double-quoted text in the language automatically creates an instance of this class.


AT has corrected facts in this node.