Questions for BitString, Binary, Charlist, and String in Elixir — Part 3: String and Charlist

qhwa
May 10, 2020

This is the third group of questions in the series:

Part 1: BitString (or bits)
Part 2: Binary (or bytes)
Part 3: String and Charlist

Here are some questions for String and Charlist in Elixir.

Q: Is the String definition in Elixir the same as in Erlang?
A: No.

Q: What is String in Erlang?
A: A string in Erlang can be:
- A binary with UTF-8-encoded Unicode codepoints.
- A list of Unicode codepoints, that is, integers. (exactly the Charlist in Elixir)
- A mix of the two above.
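To make the three shapes concrete, here is a small sketch written in Elixir syntax. The variable names are only for illustration, and it assumes OTP 20 or later, where `:string.length/1` accepts any of these forms.

```elixir
# Three shapes of an Erlang string, written in Elixir syntax.
as_binary = "abc"              # a UTF-8 binary
as_list = [?a, ?b, ?c]         # a list of codepoints (a Charlist in Elixir)
as_mix = ["a", [?b], "c"]      # a nested mix of the two

# Erlang's string module accepts all three shapes.
lengths = Enum.map([as_binary, as_list, as_mix], &:string.length/1)

# IO.chardata_to_string/1 flattens any of them into an Elixir string.
flattened = Enum.map([as_binary, as_list, as_mix], &IO.chardata_to_string/1)

true = lengths == [3, 3, 3]
true = flattened == ["abc", "abc", "abc"]
```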

Q: What is String in Elixir?
A: A binary with UTF-8-encoded Unicode codepoints.

Q: Are strings always binaries? (in Elixir, for this and later questions)
A: Yes.

Q: What is Unicode?
A: Unicode (https://www.unicode.org) is a set of specifications that lists the characters used by written languages and gives each one a unique code, or “codepoint”. (A “user-perceived character” may be made up of multiple codepoints.)

The Unicode standard contains a lot of tables listing characters and their corresponding codepoints:
0061 ‘a’; LATIN SMALL LETTER A
0062 ‘b’; LATIN SMALL LETTER B
0063 ‘c’; LATIN SMALL LETTER C

007B ‘{’; LEFT CURLY BRACKET

2167 ‘Ⅷ’; ROMAN NUMERAL EIGHT
2168 ‘Ⅸ’; ROMAN NUMERAL NINE

265E ‘♞’; BLACK CHESS KNIGHT
265F ‘♟’; BLACK CHESS PAWN

1F600 ‘😀’; GRINNING FACE
1F609 ‘😉’; WINKING FACE

Q: Is there any limit on the number of characters in Unicode?
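A: Not in practice. New characters are added with almost every release of the standard. The codespace itself is bounded, though: codepoints run from U+0000 to U+10FFFF, which leaves room for 1,114,112 of them.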

Q: What is UTF-8?
A: It’s one of the encoding methods of Unicode. Unicode itself only maps characters to codepoints; how those codepoints are represented in memory or on disk is the job of a character encoding. The same character can be stored as different bytes under different encodings.

character -> code points -> bytes in memory/disk
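The pipeline above can be tried out in IEx. This is a minimal sketch that uses the 😀 character from the table earlier; the variable names are just for illustration, and UTF-16 is included only to show that another encoding produces different bytes for the same codepoint.

```elixir
char = "😀"

# character -> codepoint
[codepoint] = String.to_charlist(char)
true = codepoint == 0x1F600

# codepoint -> bytes, under two different encodings
utf8_bytes = :binary.bin_to_list(<<codepoint::utf8>>)
utf16_bytes = :binary.bin_to_list(<<codepoint::utf16>>)

true = utf8_bytes == [0xF0, 0x9F, 0x98, 0x80]
true = utf16_bytes == [0xD8, 0x3D, 0xDE, 0x00]

# An Elixir string stores the UTF-8 form.
true = char == <<0xF0, 0x9F, 0x98, 0x80>>
```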

Q: Is "abcd" a valid string?
A: Yes.

Q: Is <<"abcd">> a valid string?
A: Yes.

Q: Is 'abcd' a valid string?
A: No. It’s a Charlist.

Q: Is ~s'abcd' a valid string?
A: Yes.

Q: Is <<237,160,128>> a valid string?
A: No. Decoded according to the UTF-8 rules, its codepoint would be U+D800, which is a surrogate codepoint reserved for UTF-16 and is not allowed in UTF-8.
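The last few answers can be checked in IEx with `String.valid?/1` and a couple of type guards. This is only a quick sketch; it writes the Charlist with the `~c` sigil, which is equivalent to the single-quoted `'abcd'` literal used in the question.

```elixir
double_quoted = "abcd"
binary_syntax = <<"abcd">>
sigil_s = ~s'abcd'
charlist = ~c"abcd"
surrogate = <<237, 160, 128>>

true = String.valid?(double_quoted)
true = binary_syntax == double_quoted   # same binary, different syntax
true = sigil_s == double_quoted         # ~s builds a string whatever the delimiter
true = is_list(charlist)                # a Charlist is a list...
false = is_binary(charlist)             # ...not a binary, so not a string
false = String.valid?(surrogate)        # would decode to U+D800
```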

Q: Given a raw string in memory, 6c f0 9f 8d ad 70, how to decode it to user-perceived characters?
A: The simplest way is to type `<<0x6c, 0xf0, 0x9f, 0x8d, 0xad, 0x70>>` in IEx, but it’s more fun to decode it by hand, following the UTF-8 rules.

Step 1, write each byte in binary:

6c -> 01101100
f0 -> 11110000
9f -> 10011111
8d -> 10001101
ad -> 10101101
70 -> 01110000
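If you would rather not convert the bytes by hand, a short comprehension can do it. This is one possible sketch; `raw` and `bits` are names made up for the example.

```elixir
raw = <<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>

# Walk the binary one byte at a time and render each byte
# as an 8-digit binary string.
bits =
  for <<byte <- raw>> do
    byte |> Integer.to_string(2) |> String.pad_leading(8, "0")
  end

IO.inspect(bits)
```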

Step 2, search for leading 1s.

The first byte 01101100 has no leading 1, which means it’s an ASCII character. Its codepoint is simply its integer value, which is 01101100 in binary, or 6c in hexadecimal.

The second byte 11110000 starts with four 1s. This means that four bytes, the current byte plus the three that follow, together represent one codepoint. We’ll decode these four bytes shortly; let’s skip them for now and go to the sixth byte.

The sixth byte 01110000 is also an ASCII character because of its leading 0. So its codepoint is 01110000 in binary, or 70 in hexadecimal.
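The "look at the leading bits" rule maps nicely onto binary pattern matching. Below is a rough sketch; `Utf8Peek` and `kind/1` are hypothetical names invented for this example, not part of Elixir. It only inspects the shape of a single byte and does not validate whole sequences.

```elixir
defmodule Utf8Peek do
  # Classify one byte by its leading bits (hypothetical helper).
  def kind(<<0::1, _::7>>), do: :ascii
  def kind(<<0b10::2, _::6>>), do: :continuation
  def kind(<<0b110::3, _::5>>), do: {:lead, 2}
  def kind(<<0b1110::4, _::4>>), do: {:lead, 3}
  def kind(<<0b11110::5, _::3>>), do: {:lead, 4}
  # Anything else (five or more leading 1s) never appears in UTF-8.
  def kind(<<_::8>>), do: :invalid
end

raw = <<0x6C, 0xF0, 0x9F, 0x8D, 0xAD, 0x70>>
kinds = for <<byte <- raw>>, do: Utf8Peek.kind(<<byte>>)

true =
  kinds ==
    [:ascii, {:lead, 4}, :continuation, :continuation, :continuation, :ascii]
```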

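Step 3, let’s decode the four bytes from the 2nd to the 5th. Drop the marker bits (11110 from the first byte, 10 from each of the following three) and keep what is left:

11110000 -> 000
10011111 -> 011111
10001101 -> 001101
10101101 -> 101101

Joined together, the remaining bits are 000011111001101101101.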

The Unicode codepoint is 0x1F36D.
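The same decoding can be sketched with binary pattern matching: match the fixed marker bits (11110 on the first byte, 10 on the other three), capture the remaining bits, and join them into one integer. The variable names are made up for the example, and the last match uses Elixir's built-in `utf8` segment type as a cross-check.

```elixir
four_bytes = <<0xF0, 0x9F, 0x8D, 0xAD>>

# Match the marker bits and capture the payload bits of each byte.
<<0b11110::5, a::3, 0b10::2, b::6, 0b10::2, c::6, 0b10::2, d::6>> = four_bytes

# Join the 3 + 6 + 6 + 6 = 21 payload bits and read them as one integer.
<<codepoint::21>> = <<a::3, b::6, c::6, d::6>>

# The built-in utf8 segment type performs the same decoding.
<<builtin::utf8>> = four_bytes

true = codepoint == 0x1F36D
true = builtin == codepoint
```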

In summary, the codepoints are [U+006C, U+1F36D, U+0070].

Therefore, we know the characters are "\u006C\u{1F36D}\u0070". As user-perceived characters, they are l, 🍭, and p.

Sweet!

Q: How can I get codepoints of a string as a list of codepoints?
A: You can get them from a Charlist.

For example:
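A Charlist is just a list of integers, so its elements are the codepoints. Here is a small sketch using the string decoded earlier; it writes the Charlist with the `~c` sigil, which is equivalent to the single-quoted form.

```elixir
charlist = ~c"l🍭p"

true = is_list(charlist)
true = charlist == [108, 127853, 112]
true = charlist == [0x6C, 0x1F36D, 0x70]

# Ask IEx to show the integers rather than a quoted Charlist.
IO.inspect(charlist, charlists: :as_lists)
```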

Also, you can use `String.to_charlist/1` to convert a string into a list of codepoints.
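A quick sketch of that conversion, next to two alternatives worth knowing: a binary comprehension with the `utf8` segment type gives the same integers, while `String.codepoints/1` returns one-codepoint strings instead of integers.

```elixir
string = "l🍭p"

from_to_charlist = String.to_charlist(string)
from_comprehension = for <<cp::utf8 <- string>>, do: cp
as_strings = String.codepoints(string)

true = from_to_charlist == [108, 127853, 112]
true = from_comprehension == from_to_charlist
true = as_strings == ["l", "🍭", "p"]   # strings, not integers
```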

Summary

We’ve answered some questions about BitString, Binary, String and Charlist in Elixir. Naming is hard in programming. If their meaning isn’t clear at first glance, here are friendlier names to help you understand them better:

  • BitString ➯ Bits
  • Binary ➯ Bytes
  • String ➯ String
  • Charlist ➯ CodepointList

Elixir provides excellent support for strings and raw binaries. I hope you have fun programming with binaries!
