Alvin Lucillo

Byte and codepoint index


As with substring operations, we must be careful when getting the index of a character. In the example below, the smiley in the with_emoji text is the 3rd code point according to $indexOfCP (the value 2 is a zero-based index), and it also starts at the 3rd UTF-8 byte (zero-based index 2) according to $indexOfBytes, because the two characters before it, H and i, each take 1 byte.

However, for multi_byte_chars, smileyIndexCP and smileyIndexBytes differ. This is because é occupies 2 bytes in UTF-8, so the characters before the smiley add up as follows:

H = 1
é = 2
l = 1
l = 1
o = 1
total = 6

This is why the smiley starts at byte index 6 (zero-based), which is the 7th UTF-8 byte, while its code point index is only 5.
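The byte arithmetic above can be checked outside MongoDB with a small plain JavaScript sketch. The helper names (utf8Length, indexOfCodePoint, indexOfUtf8Byte) are hypothetical and only mimic what $indexOfCP and $indexOfBytes report for this example. The sketch also assumes that é is stored as the single precomposed code point U+00E9, not as e followed by a combining accent.

```javascript
// Plain JavaScript sketch (runs in Node.js, no MongoDB required).
// All helper names here are hypothetical, made up for illustration.
const encoder = new TextEncoder(); // encodes strings as UTF-8

// Number of UTF-8 bytes needed to store the string.
function utf8Length(s) {
	return encoder.encode(s).length;
}

// Zero-based code point index of the first occurrence of needle, or -1.
function indexOfCodePoint(text, needle) {
	const unitIndex = text.indexOf(needle); // index in UTF-16 code units
	if (unitIndex === -1) return -1;
	// Array.from iterates a string by code point, not by code unit.
	return Array.from(text.slice(0, unitIndex)).length;
}

// Zero-based UTF-8 byte index of the first occurrence of needle, or -1.
function indexOfUtf8Byte(text, needle) {
	const unitIndex = text.indexOf(needle);
	if (unitIndex === -1) return -1;
	return utf8Length(text.slice(0, unitIndex));
}

const smiley = "\u263a"; // ☺
const text = "H\u00e9llo\u263a!"; // "Héllo☺!" with a precomposed é (assumption)

// Byte width of every character in the text.
const perChar = Array.from(text).map((ch) => [ch, utf8Length(ch)]);
console.log(perChar);

console.log(indexOfCodePoint(text, smiley)); // 5
console.log(indexOfUtf8Byte(text, smiley)); // 6
```

The per-character list also shows that the smiley itself takes 3 bytes, so any character after it would be shifted even further in byte terms than in code point terms.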

// test data
db.getCollection("unicode_demo").insertMany([
	{ label: "with_emoji", text: "Hi☺!" }, // emoji is multi-byte
	{ label: "multi_byte_chars", text: "Héllo☺!" }, // é and the emoji are both multi-byte
]);

db.getCollection("unicode_demo").aggregate([
	{
		$project: {
			label: 1,
			text: 1,
			smileyIndexCP: { $indexOfCP: ["$text", "☺"] },
			smileyIndexBytes: { $indexOfBytes: ["$text", "☺"] },
		},
	},
]);

Output:

[
  {
    _id: ObjectId('69b40f4dee717862bd955950'),
    label: 'with_emoji',
    text: 'Hi☺!',
    smileyIndexCP: 2,
    smileyIndexBytes: 2
  },
  {
    _id: ObjectId('69b40f4dee717862bd955951'),
    label: 'multi_byte_chars',
    text: 'Héllo☺!',
    smileyIndexCP: 5,
    smileyIndexBytes: 6
  }
]