Alvin Lucillo

Invalid range when getting a substring


What if I want to extract the emoji from a string value? In the example below, I tried to extract it with $substrBytes, starting at byte index 2 with a length of 1 byte. This assumes that every character is exactly one byte. It turns out there are multi-byte characters, and the emoji is one of them: in UTF-8, ☺ takes 3 bytes, occupying byte indexes 2 through 4. A 1-byte substring starting at index 2 therefore ends in the middle of the character. When a substring boundary lands inside a multi-byte character like this, MongoDB rejects the range and returns an error.

// test data
db.getCollection("unicode_demo").insertMany([
	{ label: "with_emoji", text: "Hi☺!" }, // emoji is multi-byte
]);

db.getCollection("unicode_demo").aggregate([
	{
		$project: {
			label: 1,
			text: 1,
			substr_bytes: { $substrBytes: ["$text", 2, 1] },
		},
	},
]);

Output:

PlanExecutor error during aggregation :: caused by :: $substrBytes:  Invalid range, ending index is in the middle of a UTF-8 character.
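
To see why that range is rejected, it can help to print where each character sits in the UTF-8 encoding. The following is a minimal sketch in plain JavaScript (Node.js), not something MongoDB provides: `utf8Layout` is a hypothetical helper name, and it assumes the global `TextEncoder` is available.

```javascript
// Hypothetical helper: list the UTF-8 byte offset and size of each character.
function utf8Layout(text) {
  const encoder = new TextEncoder(); // always encodes to UTF-8
  const layout = [];
  let offset = 0;
  for (const ch of text) { // for...of iterates by code point
    const size = encoder.encode(ch).length;
    layout.push({ char: ch, start: offset, bytes: size });
    offset += size;
  }
  return layout;
}

// "Hi\u263A!" is the test string "Hi☺!" with the emoji written as an escape.
console.log(utf8Layout("Hi\u263A!"));
```

The emoji is reported with start 2 and bytes 3, so a 1-byte range starting at index 2 stops one byte into a 3-byte character.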

This time, I extracted the whole emoji by starting at byte index 2 and taking 3 bytes.

db.getCollection("unicode_demo").aggregate([
	{
		$project: {
			label: 1,
			text: 1,
			substr_bytes: { $substrBytes: ["$text", 2, 3] },
		},
	},
]);

Output:

[
  {
    _id: ObjectId('69b2a5af3fe6a77c90fb269d'),
    label: 'with_emoji',
    text: 'Hi☺!',
    substr_bytes: '☺'
  }
]
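
Counting bytes by hand gets error-prone with longer strings. As a rough sketch, the byte range for a given character position could be computed on the client before building the pipeline. `byteRange` is a hypothetical helper, not part of MongoDB or mongosh, and it assumes the global `TextEncoder` is available.

```javascript
// Hypothetical helper: convert a code point range into a UTF-8 byte range
// that can be passed to $substrBytes as [start, length].
function byteRange(text, cpIndex, cpCount) {
  const encoder = new TextEncoder(); // always encodes to UTF-8
  const chars = Array.from(text); // split by code point, not by UTF-16 unit
  const before = chars.slice(0, cpIndex).join("");
  const wanted = chars.slice(cpIndex, cpIndex + cpCount).join("");
  return {
    start: encoder.encode(before).length,
    length: encoder.encode(wanted).length,
  };
}

// The third character of "Hi☺!" (code point index 2, one character long).
console.log(byteRange("Hi\u263A!", 2, 1)); // { start: 2, length: 3 }
```

When the position is known in characters rather than bytes, MongoDB's $substrCP operator is another option: it indexes by code point, so { $substrCP: ["$text", 2, 1] } should select the emoji without any byte counting.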