Safe extraction of chars • Alvin Lucillo

In the previous entries, we learned how to get a character’s index and a string’s length in code points rather than bytes.

Below is the first part of reconstructing a string into a new value that omits a specified multi-byte character.

Taking out a specified character leaves two parts: the text before it and the text after it. To extract each part as a substring, we need the starting zero-based code point index and the number of code points to take.
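To make the "start index plus count" idea concrete outside the database, here is a minimal sketch in plain JavaScript. The helpers codePoints, substrByCodePoints, and utf8ByteLength are hypothetical names invented for this illustration; they only imitate the behavior described above and are not MongoDB APIs.

```javascript
// Illustration only: these helper names are hypothetical, not MongoDB operators.

// Split a string into code points. The string iterator used by Array.from
// walks code points rather than UTF-16 code units.
function codePoints(text) {
  return Array.from(text);
}

// Take `count` code points starting at zero-based code point index `start`.
function substrByCodePoints(text, start, count) {
  return codePoints(text).slice(start, start + count).join("");
}

// Number of bytes the string occupies when encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// "Héllo☺!" written with escapes so é and ☺ are each a single code point.
const sample = "H\u00e9llo\u263a!";

console.log(codePoints(sample).length);        // 7 code points
console.log(utf8ByteLength(sample));           // 10 bytes (é takes 2, ☺ takes 3)
console.log(substrByCodePoints(sample, 0, 5)); // Héllo
```

The gap between 7 code points and 10 bytes is the reason the start index and count have to be expressed in code points here.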

For example:

Hi☺! is 4 code points long, and we want to extract Hi and !, leaving out ☺.
For the first part of the text, Hi is 2 code points long. Since it sits to the left of the emoji, we start at code point index 0 and take 2 code points. The resulting expression should therefore evaluate to $substrCP: ["$text", 0, 2]
To compute the code point count, the 3rd argument of the substring expression, we use $$smileyIndex, the zero-based code point index of the emoji. This works because the substring starts at 0, so the number of code points to take equals the emoji index itself. The same holds for the 2nd example, Héllo☺!, where the emoji is at zero-based index 5: the substring starts at 0 and takes 5 code points, resulting in Héllo
Going back to the first example, to get the last part of the value, we need to start right after the emoji’s index and count the remaining code points. The emoji index is 2, and the number of code points to take after it is 1 (!).
The starting index is easy: we just add 1 to the emoji index, so we start at index 3. As for the last argument, we subtract the code point position after the smiley (its index + 1) from the total number of code points in the text. This gives 4 - 3 = 1, so we take one code point.
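The arithmetic above can be checked by hand with a small plain JavaScript sketch. The function splitAroundMarker and its field names are hypothetical and exist only for this illustration. It assumes the marker is a single code point that is present in the text, and it throws otherwise.

```javascript
// Illustration only: splitAroundMarker is a hypothetical helper, not a MongoDB operator.
// Assumption: `marker` is exactly one code point and occurs in `text`.
function splitAroundMarker(text, marker) {
  const cps = Array.from(text);           // one array entry per code point
  const markerIndex = cps.indexOf(marker); // zero-based code point index
  if (markerIndex === -1) {
    throw new Error("marker not found in text");
  }
  const textLen = cps.length;

  // First part: start at 0, and the count equals the marker index.
  const firstPart = cps.slice(0, markerIndex).join("");

  // Last part: start right after the marker, and take what remains.
  const lastStart = markerIndex + 1;
  const lastCount = textLen - (markerIndex + 1);
  const lastPart = cps.slice(lastStart, lastStart + lastCount).join("");

  return { markerIndex, textLen, firstPart, lastPart };
}

// "Hi☺!" gives index 2, length 4, so the last part has 4 - 3 = 1 code point.
console.log(splitAroundMarker("Hi\u263a!", "\u263a"));
// "Héllo☺!" gives index 5, length 7, so the last part has 7 - 6 = 1 code point.
console.log(splitAroundMarker("H\u00e9llo\u263a!", "\u263a"));
```

Both results follow the same two rules: the first count is the marker index, and the last count is the total length minus the marker index plus one.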

To do this, we need to declare variables so we can use them in the expressions. This is why we use $let, where vars holds the temporary variables and in is the expression that uses them. The projected cp_values field will contain the result of the in expression.

// test data
db.getCollection("unicode_demo").insertMany([
	{ label: "with_emoji", text: "Hi☺!" },
	{ label: "multi_byte_chars", text: "Héllo☺!" },
]);

db.getCollection("unicode_demo").aggregate([
	{
		$project: {
			label: 1,
			text: 1,
			cp_values: {
				$let: {
					vars: {
						smileyIndex: { $indexOfCP: ["$text", "☺"] },
						textLen: { $strLenCP: "$text" },
					},
					in: {
						smileyIndex: "$$smileyIndex",
						textLen: "$$textLen",
						firstPart: {
							$substrCP: ["$text", 0, "$$smileyIndex"],
						},
						lastPart: {
							$substrCP: [
								"$text",
								{ $add: ["$$smileyIndex", 1] },
								{ $subtract: ["$$textLen", { $add: ["$$smileyIndex", 1] }] },
							],
						},
					},
				},
			},
		},
	},
]);

Output:

[
  {
    _id: ObjectId('69b6987355ed455452544ca7'),
    label: 'with_emoji',
    text: 'Hi☺!',
    cp_values: { smileyIndex: 2, textLen: 4, firstPart: 'Hi', lastPart: '!' }
  },
  {
    _id: ObjectId('69b6987355ed455452544ca8'),
    label: 'multi_byte_chars',
    text: 'Héllo☺!',
    cp_values: { smileyIndex: 5, textLen: 7, firstPart: 'Héllo', lastPart: '!' }
  }
]