What if I want to extract the emoji ☺ from the string value? In the example below, I tried to extract it by starting at byte index 2 and taking 1 byte, on the assumption that every character occupies exactly one byte. It turns out that some characters, including this emoji, are multi-byte. The output reports an invalid range because ☺ is encoded in UTF-8 as 3 bytes, occupying byte indices 2 through 4. If the requested range cuts through a multi-byte character, as it does in the example below, MongoDB returns an error.
// test data
db.getCollection("unicode_demo").insertMany([
  { label: "with_emoji", text: "Hi☺!" }, // emoji is multi-byte
]);

db.getCollection("unicode_demo").aggregate([
  {
    $project: {
      label: 1,
      text: 1,
      substr_bytes: { $substrBytes: ["$text", 2, 1] },
    },
  },
]);
Output:
PlanExecutor error during aggregation :: caused by :: $substrBytes: Invalid range, ending index is in the middle of a UTF-8 character.
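To see why byte index 2 with a length of 1 is rejected, it can help to list where each character starts and how many bytes it takes. The sketch below is plain JavaScript, not a MongoDB feature: `utf8ByteMap` is a hypothetical helper name, and it assumes a runtime that provides the standard `TextEncoder` (Node.js does).

```javascript
// Hypothetical helper (not a MongoDB API): lists each character of a string
// with its starting byte index and its UTF-8 size in bytes.
function utf8ByteMap(text) {
  const encoder = new TextEncoder();
  const map = [];
  let start = 0;
  // for...of iterates by code point, so a multi-byte character stays whole.
  for (const char of text) {
    const bytes = encoder.encode(char).length;
    map.push({ char, start, bytes });
    start += bytes;
  }
  return map;
}

console.log(utf8ByteMap("Hi☺!"));
```

For "Hi☺!", the map shows ☺ starting at byte 2 with a size of 3 bytes and "!" starting at byte 5, so a 1-byte range beginning at index 2 ends inside the emoji.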
Next, I extracted the whole emoji by starting at byte index 2 and taking 3 bytes.
db.getCollection("unicode_demo").aggregate([
  {
    $project: {
      label: 1,
      text: 1,
      substr_bytes: { $substrBytes: ["$text", 2, 3] },
    },
  },
]);
Output:
[
  {
    _id: ObjectId('69b2a5af3fe6a77c90fb269d'),
    label: 'with_emoji',
    text: 'Hi☺!',
    substr_bytes: '☺'
  }
]
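Rather than counting bytes by hand, the start index and length for a given character could be computed on the client before building the pipeline. This is only a sketch under the same assumptions as before: `byteRangeOf` is a hypothetical helper, not part of MongoDB, and it relies on the standard `TextEncoder`.

```javascript
// Hypothetical helper (not a MongoDB API): returns [start, length] in bytes
// for the first occurrence of `target` in `text`, or null if it is absent.
function byteRangeOf(text, target) {
  const encoder = new TextEncoder();
  const at = text.indexOf(target); // index in UTF-16 code units, not bytes
  if (at === -1) return null;
  // The byte offset is the UTF-8 size of everything before the match.
  const start = encoder.encode(text.slice(0, at)).length;
  const length = encoder.encode(target).length;
  return [start, length];
}

const range = byteRangeOf("Hi☺!", "☺");
console.log(range); // [ 2, 3 ]
```

The two numbers match the second and third arguments passed to `$substrBytes` in the working example above.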