In the previous entries, we learned how to get the index of a character and string length based on the characters’ codepoint representation, not bytes.
Below is the first step toward rebuilding a string into another value that no longer contains a specified multi-byte character.
When we take out a specified character, two parts remain: the value before it and the value after it. To extract each part as a substring, we need the zero-based starting code point index and the number of code points to take.
For example:
Hi☺! is 4 codepoints long, and we want to extract Hi and !, taking out ☺.
- For the first part of the text, Hi is 2 codepoints long. Since the first part is to the left of the emoji, we start at codepoint index 0 and take 2 codepoints. This means the resulting expression should evaluate to $substrCP: ["$text", 0, 2].
- To compute the codepoint length, the 3rd argument in the substring expression, we use $$smileyIndex, the zero-based codepoint index of the emoji. This works because the substring starts at 0, so the number of codepoints to take equals the emoji index itself. It also works with the 2nd example, Héllo☺!, where the emoji is at zero-based index 5: the substring starts at 0 and takes only 5 codepoints from there, resulting in Héllo.
- Going back to the first example, to get the last part of the value, we need to start right after the index of the emoji and count the remaining codepoints. The emoji index is 2, and the number of codepoints to take after it is 1 (!).
- The starting index is easy; we just add 1 to the emoji index, so we start at index 3. As for the last argument, we subtract the codepoint position after the smiley (its index + 1) from the total number of codepoints in the text. This gives 4 - 3 = 1, so we take one codepoint.
To do this, we need to declare variables so we can reuse them in the expressions and calculations. This is why we have $let, where vars holds the temporary variables and in is where those variables are used. The projected cp_values field will contain the result of the in expression.
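// test data
db.getCollection("unicode_demo").insertMany([
  { label: "with_emoji", text: "Hi☺!" },
  { label: "multi_byte_chars", text: "Héllo☺!" },
]);

// Both test documents contain the emoji; $indexOfCP returns -1 when it is not found.
db.getCollection("unicode_demo").aggregate([
  {
    $project: {
      label: 1,
      text: 1,
      cp_values: {
        $let: {
          vars: {
            // zero-based codepoint index of the emoji
            smileyIndex: { $indexOfCP: ["$text", "☺"] },
            // total number of codepoints in the text
            textLen: { $strLenCP: "$text" },
          },
          in: {
            smileyIndex: "$$smileyIndex",
            textLen: "$$textLen",
            // everything before the emoji: start at 0, take smileyIndex codepoints
            firstPart: {
              $substrCP: ["$text", 0, "$$smileyIndex"],
            },
            // everything after the emoji: start at smileyIndex + 1, take the rest
            lastPart: {
              $substrCP: [
                "$text",
                { $add: ["$$smileyIndex", 1] },
                { $subtract: ["$$textLen", { $add: ["$$smileyIndex", 1] }] },
              ],
            },
          },
        },
      },
    },
  },
]);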
Output:
[
  {
    _id: ObjectId('69b6987355ed455452544ca7'),
    label: 'with_emoji',
    text: 'Hi☺!',
    cp_values: { smileyIndex: 2, textLen: 4, firstPart: 'Hi', lastPart: '!' }
  },
  {
    _id: ObjectId('69b6987355ed455452544ca8'),
    label: 'multi_byte_chars',
    text: 'Héllo☺!',
    cp_values: { smileyIndex: 5, textLen: 7, firstPart: 'Héllo', lastPart: '!' }
  }
]
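As a cross-check of the index arithmetic, the same steps can be sketched in plain JavaScript, outside MongoDB. This is only an analogue, not how the server evaluates the pipeline: splitAroundCodePoint is a hypothetical helper name, Array.from stands in for codepoint-aware indexing, and the behavior when the character is missing is an assumption made for this sketch.

```javascript
// Plain-JavaScript analogue of the pipeline's arithmetic (not MongoDB code).
// splitAroundCodePoint is a hypothetical helper used only for illustration.
function splitAroundCodePoint(text, target) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  const codePoints = Array.from(text);

  // Rough analogues of $indexOfCP and $strLenCP.
  // Note: this lookup only works when target is a single code point.
  const targetIndex = codePoints.indexOf(target);
  const textLen = codePoints.length;

  if (targetIndex === -1) {
    // Assumption for this sketch: leave the text untouched when target is absent.
    return { smileyIndex: -1, textLen, firstPart: text, lastPart: "" };
  }

  // Last part: start right after the target, take the remaining code points.
  const start = targetIndex + 1;
  const count = textLen - start;

  return {
    smileyIndex: targetIndex,
    textLen,
    // First part: start at 0, take targetIndex code points.
    firstPart: codePoints.slice(0, targetIndex).join(""),
    lastPart: codePoints.slice(start, start + count).join(""),
  };
}

const SMILEY = "\u263A"; // ☺
console.log(splitAroundCodePoint("Hi" + SMILEY + "!", SMILEY));
console.log(splitAroundCodePoint("H\u00E9llo" + SMILEY + "!", SMILEY));
```

For the two test strings, the returned objects should carry the same smileyIndex, textLen, firstPart, and lastPart values as the cp_values shown in the output.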