This is the third group of questions in the series:
Part 1: BitString (or bits)
Part 2: Binary (or bytes)
Part 3: String and Charlist
Here are some questions for String and Charlist in Elixir.
Q: Is the String definition in Elixir the same as in Erlang?
A: No.
Q: What is String in Erlang?
A: A string in Erlang can be:
- A binary with UTF-8-encoded Unicode codepoints.
- A list of Unicode codepoints. (exactly the Charlist in Elixir)
- A mix of the two above.
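The three shapes can be tried from Elixir by calling Erlang's `unicode` and `string` modules. This is a minimal sketch; the variable names are only illustrative.

```elixir
# Three shapes an Erlang "string" (chardata) can take.
as_binary = "abcd"            # UTF-8 binary
as_list = [?a, ?b, ?c, ?d]    # list of codepoints (a charlist)
mixed = ["ab", [?c], "d"]     # a mix of the two

shapes = [as_binary, as_list, mixed]

# :unicode.characters_to_binary/1 flattens any of them into a UTF-8 binary.
binaries = Enum.map(shapes, &:unicode.characters_to_binary/1)
IO.inspect(binaries)

# Erlang's string module accepts all three shapes directly.
lengths = Enum.map(shapes, &:string.length/1)
IO.inspect(lengths)
```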
Q: What is String in Elixir?
A: A binary with UTF-8-encoded Unicode codepoints.
Q: Are strings always binaries? (in Elixir, for this and later questions)
A: Yes.
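A quick way to see this in IEx is to inspect a string as a binary. The sample string below is an arbitrary choice for illustration.

```elixir
s = "h\u00E9llo"   # "héllo"; é is U+00E9 and takes two bytes in UTF-8

IO.inspect(is_binary(s))                # true
IO.inspect(byte_size(s))                # 6 (bytes)
IO.inspect(String.length(s))            # 5 (user-perceived characters)
IO.inspect(s, binaries: :as_binaries)   # <<104, 195, 169, 108, 108, 111>>
```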
Q: What is Unicode?
A: Unicode (https://www.unicode.org) is a set of specifications that lists the characters used by written languages and gives each one a unique code, or “codepoint”. (A “user-perceived character”, the thing a reader sees as one character, may be made up of multiple codepoints.)
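As a small illustration of one character made of several codepoints, the letter é can be written either as a single codepoint or as e followed by a combining accent. The variable names are only illustrative.

```elixir
composed = "\u00E9"      # é as one codepoint (U+00E9)
decomposed = "e\u0301"   # e (U+0065) + COMBINING ACUTE ACCENT (U+0301)

# One user-perceived character, but two codepoints.
IO.inspect(String.length(decomposed))              # 1
IO.inspect(length(String.codepoints(decomposed)))  # 2

# The bytes differ, yet the two forms are canonically equivalent.
IO.inspect(composed == decomposed)                    # false
IO.inspect(String.equivalent?(composed, decomposed))  # true
```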
The Unicode standard contains a lot of tables listing characters and their corresponding codepoints:
0061 ‘a’; LATIN SMALL LETTER A
0062 ‘b’; LATIN SMALL LETTER B
0063 ‘c’; LATIN SMALL LETTER C
…
007B ‘{’; LEFT CURLY BRACKET
…
2167 ‘Ⅷ’; ROMAN NUMERAL EIGHT
2168 ‘Ⅸ’; ROMAN NUMERAL NINE
…
265E ‘♞’; BLACK CHESS KNIGHT
265F ‘♟’; BLACK CHESS PAWN
…
1F600 ‘😀’; GRINNING FACE
1F609 ‘😉’; WINKING FACE
…
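Some of the table entries above can be checked in Elixir, where `?a` gives the codepoint of a character and `\u` escapes or the `::utf8` modifier build a string from a codepoint. This is just a sketch for exploring the table.

```elixir
IO.inspect(?a == 0x0061)        # true
IO.inspect("\u007B")            # "{"
IO.inspect("\u2167\u2168")      # "ⅧⅨ"
IO.inspect(<<0x265E::utf8>>)    # "♞"
IO.inspect(<<0x1F600::utf8>>)   # "😀"
```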
Q: Is there any limit on the number of characters in Unicode?
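A: No, not in practice. New characters are added with each version of the standard, although the codespace itself is capped at U+10FFFF (1,114,112 codepoints).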
Q: What is UTF-8?
A: It’s one of the character encodings for Unicode. Unicode itself only maps characters to codepoints; how those codepoints are represented in memory or on disk is the job of a character encoding. The same character can be stored as different bytes under different encodings.
character -> code points -> bytes in memory/disk
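To make the last arrow concrete, here is a sketch that encodes one codepoint from the table, U+265E, with three different encodings using Elixir's bitstring modifiers. The variable names are only illustrative.

```elixir
codepoint = 0x265E   # ♞ BLACK CHESS KNIGHT

utf8 = <<codepoint::utf8>>
utf16 = <<codepoint::utf16>>   # big-endian by default
utf32 = <<codepoint::utf32>>   # big-endian by default

# Same codepoint, different bytes.
IO.inspect(utf8 == <<0xE2, 0x99, 0x9E>>)          # true
IO.inspect(utf16 == <<0x26, 0x5E>>)               # true
IO.inspect(utf32 == <<0x00, 0x00, 0x26, 0x5E>>)   # true
```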
Q: Is "abcd" a valid string?
A: Yes.
Q: Is <<"abcd">> a valid string?
A: Yes.
Q: Is 'abcd' a valid string?
A: No. It’s a Charlist.
Q: Is ~s'abcd' a valid string?
A: Yes.
Q: Is <<237,160,128>> a valid string?
A: No. Decoded according to the UTF-8 rules, it yields the codepoint U+D800, which is a surrogate. Surrogates are reserved for UTF-16 and are not valid in UTF-8 according to the specification.
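The answers to the last five questions can be checked with `String.valid?/1` and `is_binary/1`. In this sketch the charlist is written with the `~c` sigil, which produces the same list as the single-quoted form.

```elixir
IO.inspect(String.valid?("abcd"))       # true
IO.inspect(String.valid?(<<"abcd">>))   # true
IO.inspect(String.valid?(~s'abcd'))     # true

# A charlist is a list, not a binary, so it is not a string.
IO.inspect(is_binary(~c"abcd"))         # false
IO.inspect(is_list(~c"abcd"))           # true

# A binary, but not valid UTF-8: it would decode to the surrogate U+D800.
IO.inspect(is_binary(<<237, 160, 128>>))       # true
IO.inspect(String.valid?(<<237, 160, 128>>))   # false
```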
Q: Given a raw string in memory, 6c f0 9f 8d ad 70, how do you decode it into user-perceived characters?
A: The simplest way is to type `<<0x6c, 0xf0, 0x9f, 0x8d, 0xad, 0x70>>` in IEx, but let’s have more fun and decode it by hand, the UTF-8 way.
Step 1, write each byte in binary:
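One way to produce the binary form is to let Elixir print each byte in base 2, padded to eight digits. This is a sketch; `raw` and `bits` are illustrative names.

```elixir
raw = <<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>

# Walk the binary byte by byte and render each byte as 8 binary digits.
bits =
  for <<byte <- raw>> do
    byte |> Integer.to_string(2) |> String.pad_leading(8, "0")
  end

IO.inspect(bits)
# ["01101100", "11110000", "10011111", "10001101", "10101101", "01110000"]
```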
Step 2, look for leading 1s.
The first byte 01101100 has no leading 1, which means it’s an ASCII character. Its codepoint is simply its integer value, which is 01101100 in binary, or 6c in hexadecimal.
The second byte 11110000 starts with four 1s. This means four bytes, the current byte and the three that follow, are used to represent one codepoint. We’ll decode these four bytes shortly. Let’s skip them for now and go to the sixth byte.
The sixth byte 01110000 is also an ASCII character because of its leading 0. So its codepoint is 01110000 in binary, or 70 in hexadecimal.
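The leading-bit check in this step can be sketched with binary pattern matching. `LeadByte` is a hypothetical helper written for this walkthrough, not part of Elixir, and it only covers the well-formed prefixes.

```elixir
defmodule LeadByte do
  # Hypothetical helper: classify one byte by its leading bits.
  # Bytes starting with 11111 are not valid in UTF-8 and are not handled here.
  def kind(<<0::1, _::7>>), do: 1                  # 0xxxxxxx: ASCII, one byte
  def kind(<<0b110::3, _::5>>), do: 2              # 110xxxxx: starts a 2-byte sequence
  def kind(<<0b1110::4, _::4>>), do: 3             # 1110xxxx: starts a 3-byte sequence
  def kind(<<0b11110::5, _::3>>), do: 4            # 11110xxx: starts a 4-byte sequence
  def kind(<<0b10::2, _::6>>), do: :continuation   # 10xxxxxx: continues a sequence
end

raw = <<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>
kinds = for <<byte::binary-size(1) <- raw>>, do: LeadByte.kind(byte)

IO.inspect(kinds)
# [1, 4, :continuation, :continuation, :continuation, 1]
```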
Step 3, let’s decode the four bytes from the second to the fifth.
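In a four-byte sequence, the first byte carries 3 payload bits after its 11110 prefix, and each of the following bytes carries 6 payload bits after its 10 prefix. A sketch of stripping the prefixes and joining the 21 payload bits (the names `a`, `b`, `c`, `d` are only illustrative):

```elixir
# 11110aaa 10bbbbbb 10cccccc 10dddddd
<<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6>> =
  <<0xF0, 0x9F, 0x8D, 0xAD>>

# Concatenate the payload bits (3 + 6 + 6 + 6 = 21) and read them as one integer.
<<codepoint::21>> = <<a::3, b::6, c::6, d::6>>

IO.inspect(codepoint, base: :hex)   # 0x1F36D
```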
The Unicode codepoint is 0x1F36D.
In summary, the codepoints are [U+006C, U+1F36D, U+0070].
Therefore, we know the characters are "\u006C\u{1F36D}\u0070". As user-perceived characters, they are l, 🍭, and p.
Sweet!
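The manual result can be cross-checked by letting Elixir do the UTF-8 decoding with the `::utf8` modifier in a comprehension. This is a sketch; `raw` and `codepoints` are illustrative names.

```elixir
raw = <<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>

# Pull one UTF-8-encoded codepoint at a time out of the binary.
codepoints = for <<cp::utf8 <- raw>>, do: cp

IO.inspect(codepoints == [0x6C, 0x1F36D, 0x70])   # true
IO.inspect(String.graphemes(raw))                 # ["l", "🍭", "p"]
```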
Q: How can I get the codepoints of a string as a list?
A: You can get them from a Charlist.
For example:
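A minimal sketch, with the charlist written using the `~c` sigil (it produces the same list as the single-quoted form):

```elixir
charlist = ~c"l\u{1F36D}p"

# A charlist is just a list of integer codepoints.
IO.inspect(charlist == [0x6C, 0x1F36D, 0x70])   # true
IO.inspect(charlist, charlists: :as_lists)      # [108, 127853, 112]
```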
Also, you can use String.to_charlist/1 to convert a string into a list of codepoints.
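A short sketch of the conversion and the way back with `List.to_string/1`. Note that `String.codepoints/1` also exists, but it returns each codepoint as a string rather than as an integer.

```elixir
string = "l\u{1F36D}p"

codepoints = String.to_charlist(string)
IO.inspect(codepoints, charlists: :as_lists)   # [108, 127853, 112]

# And back from codepoints to a string.
IO.inspect(List.to_string(codepoints) == string)   # true

# String.codepoints/1 gives one-codepoint strings instead of integers.
IO.inspect(String.codepoints(string))   # ["l", "🍭", "p"]
```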
Summary
We’ve answered some questions about BitString, Binary, String and Charlist in Elixir. Naming is hard in programming. If their meaning isn’t clear at first glance, here are some friendlier names to help you understand them better:
- BitString ➯ Bits
- Binary ➯ Bytes
- String ➯ String
- Charlist ➯ CodepointList
Elixir provides brilliant support for strings and raw binaries. I hope you have fun programming with binaries!