As with substring operations, we must be careful when getting the index of a character. In the example below, the smiley in the with_emoji text is the 3rd code point according to $indexOfCP (the returned value 2 is zero-based). $indexOfBytes returns the same value, 2 (also zero-based), because the two characters before the smiley, H and i, occupy one UTF-8 byte each.
However, for multi_byte_chars, smileyIndexCP and smileyIndexBytes differ. This is because é occupies 2 bytes in UTF-8, so the bytes preceding the smiley add up as follows:
H = 1
é = 2
l = 1
l = 1
o = 1
total = 6
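This is why $indexOfBytes returns 6 for the smiley: six bytes precede it, so it starts at zero-based byte index 6 (the 7th byte), while its code point index is only 5.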
// test data
db.getCollection("unicode_demo").insertMany([
  { label: "with_emoji", text: "Hi☺!" }, // emoji is multi-byte
  { label: "multi_byte_chars", text: "Héllo☺!" }, // é and the emoji are multi-byte
]);
db.getCollection("unicode_demo").aggregate([
  {
    $project: {
      label: 1,
      text: 1,
      smileyIndexCP: { $indexOfCP: ["$text", "☺"] },
      smileyIndexBytes: { $indexOfBytes: ["$text", "☺"] },
    },
  },
]);
Output:
[
  {
    _id: ObjectId('69b40f4dee717862bd955950'),
    label: 'with_emoji',
    text: 'Hi☺!',
    smileyIndexCP: 2,
    smileyIndexBytes: 2
  },
  {
    _id: ObjectId('69b40f4dee717862bd955951'),
    label: 'multi_byte_chars',
    text: 'Héllo☺!',
    smileyIndexCP: 5,
    smileyIndexBytes: 6
  }
]
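To cross-check these numbers outside MongoDB, the same two indexes can be reproduced in plain JavaScript. This is only a minimal sketch, assuming a Node.js runtime where TextEncoder is available as a global. The helpers indexOfCP and indexOfBytes are hypothetical functions named after the operators, not MongoDB APIs, and they ignore the operators' optional start and end arguments. The strings use \u escapes and assume é is the precomposed character U+00E9.

```javascript
// Hypothetical helpers for illustration only; they are not part of MongoDB.

// Zero-based code point index of the first match, or -1 if not found.
function indexOfCP(text, needle) {
  const unitIndex = text.indexOf(needle); // index in UTF-16 code units
  if (unitIndex === -1) return -1;
  // Spreading a string iterates by code point, so this counts
  // the code points that precede the match.
  return [...text.slice(0, unitIndex)].length;
}

// Zero-based UTF-8 byte index of the first match, or -1 if not found.
function indexOfBytes(text, needle) {
  const unitIndex = text.indexOf(needle);
  if (unitIndex === -1) return -1;
  // Encode the prefix as UTF-8 and count its bytes.
  return new TextEncoder().encode(text.slice(0, unitIndex)).length;
}

const smiley = "\u263a"; // ☺
const samples = [
  { label: "with_emoji", text: "Hi\u263a!" },
  { label: "multi_byte_chars", text: "H\u00e9llo\u263a!" }, // é as U+00E9
];

const results = samples.map(({ label, text }) => ({
  label,
  smileyIndexCP: indexOfCP(text, smiley),
  smileyIndexBytes: indexOfBytes(text, smiley),
}));

console.log(results);
```

Under these assumptions, the sketch reports 2 and 2 for with_emoji and 5 and 6 for multi_byte_chars, matching the aggregation output above. If é were stored in decomposed form (e followed by a combining accent), both indexes would come out larger, so the result depends on how the text was normalized before it was inserted.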