Byte and codepoint length • Alvin Lucillo

The previous entry showed that some Unicode characters span multiple bytes, which is why substring operations need care. The example below illustrates the difference between bytes and codepoints. Hi☺! consists of 4 Unicode characters, each corresponding to one codepoint, so len_cp is 4. len_bytes, on the other hand, is 6: the ☺ character occupies 3 bytes in UTF-8, and the other 3 characters occupy 1 byte each. len_bytes_smiley and len_cp_smiley confirm this for the smiley on its own: it takes 3 bytes but counts as 1 codepoint.

// test data
db.getCollection("unicode_demo").insertMany([
	{ label: "with_emoji", text: "Hi☺!" }, // emoji is multi-byte
]);

db.getCollection("unicode_demo").aggregate([
	{
		$project: {
			label: 1,
			text: 1,
			len_bytes: { $strLenBytes: "$text" },
			len_cp: { $strLenCP: "$text" },
			len_bytes_smiley: { $strLenBytes: "☺" },
			len_cp_smiley: { $strLenCP: "☺" },
		},
	},
]);

Output:

[
  {
    _id: ObjectId('69b4077b23ee6c1be512c300'),
    label: 'with_emoji',
    text: 'Hi☺!',
    len_bytes: 6,
    len_cp: 4,
    len_bytes_smiley: 3,
    len_cp_smiley: 1
  }
]
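
As a cross-check outside the database, the same two counts can be reproduced in plain JavaScript. This is only a minimal sketch, assuming a runtime where TextEncoder is available as a global (recent Node.js versions and browsers). The helper names utf8ByteLength and codePointLength are made up for illustration and are not part of MongoDB.

```javascript
// Hypothetical helper: counts UTF-8 encoded bytes,
// the same quantity $strLenBytes reports.
function utf8ByteLength(str) {
	// TextEncoder always encodes to UTF-8.
	return new TextEncoder().encode(str).length;
}

// Hypothetical helper: counts Unicode codepoints,
// the same quantity $strLenCP reports.
function codePointLength(str) {
	// Spreading a string iterates by codepoint, not by UTF-16 code unit.
	return [...str].length;
}

const text = "Hi\u263A!"; // "Hi☺!", with ☺ written as U+263A
const smiley = "\u263A"; // "☺"

console.log(utf8ByteLength(text)); // 6
console.log(codePointLength(text)); // 4
console.log(utf8ByteLength(smiley)); // 3
console.log(codePointLength(smiley)); // 1
```

The numbers match the aggregation output above: 2 single-byte characters, a 3-byte smiley, and 1 more single-byte character add up to 6 bytes across 4 codepoints.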