Alvin Lucillo

Deduplicate unioned documents

2 min read

In the last two journals, we saw how $unionWith returns duplicate documents. Now, how do we deduplicate the result? We do that with the $group and $replaceRoot pipeline stages. In the example below, the $group stage groups documents that share the same _id and stores the first document of each group in doc, which is a user-defined field name. We take the first document because, in this scenario, it doesn't matter which pipeline a document comes from: duplicates are grouped by _id, so they are copies of the same document. At this point, we still have the grouped output, with the actual document nested inside doc. We want to replace that grouped output with the document itself, so we use $replaceRoot with newRoot set to "$doc", which refers to the doc field of the grouped output.

[
  { $match: { gender: { $regex: ".*ale" } } },
  {
    $unionWith: {
      coll: "persons",
      pipeline: [
        { $match: { gender: { $regex: ".*le" } } }
      ]
    }
  },
  { $sort: { _id: 1, src: 1 } },
  {
    $group: {
      _id: "$_id",
      doc: { $first: "$$ROOT" }
    }
  },
  { $replaceRoot: { newRoot: "$doc" } }
]

Sample result without $replaceRoot. This is the grouped output.

[
	{
		"_id": {
			"$oid": "696e23bbbea06b125b46d431"
		},
		"doc": {
			"_id": {
				"$oid": "696e23bbbea06b125b46d431"
			},
			"id": 725,
			"first_name": "Cass",
			"last_name": "McGirl",
			"email": "cmcgirlk4@abc.net.au",
			"gender": "Male",
			"ip_address": "11.225.103.0"
		}
	}
]

Sample result with $replaceRoot. Here, each grouped result is replaced with the document stored in its doc field.

[
	{
		"_id": {
			"$oid": "696e23bbbea06b125b46d199"
		},
		"id": 61,
		"first_name": "Nikos",
		"last_name": "Mutimer",
		"email": "nmutimer1o@hubpages.com",
		"gender": "Male",
		"ip_address": "174.74.94.22"
	}
]
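
To make the two stages easier to picture, here is a minimal sketch in plain JavaScript that mimics what $group with $first and $replaceRoot do to a unioned result. It is not MongoDB code and not how the server implements these stages. The function names (groupFirstById, replaceRootWithDoc) and the sample documents are made up for illustration only.

```javascript
// Hypothetical unioned result: each person appears twice,
// once from the main pipeline and once from the $unionWith pipeline.
const unioned = [
  { _id: "a1", first_name: "Cass", gender: "Male" },
  { _id: "b2", first_name: "Nikos", gender: "Male" },
  { _id: "a1", first_name: "Cass", gender: "Male" },
  { _id: "b2", first_name: "Nikos", gender: "Male" },
];

// Mimics { $group: { _id: "$_id", doc: { $first: "$$ROOT" } } }:
// one entry per _id, keeping the first whole document seen for that _id.
function groupFirstById(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = String(doc._id);
    if (!groups.has(key)) {
      groups.set(key, { _id: doc._id, doc });
    }
  }
  // A Map keeps insertion order; MongoDB does not guarantee
  // the output order of $group.
  return [...groups.values()];
}

// Mimics { $replaceRoot: { newRoot: "$doc" } }:
// each grouped entry is replaced by the document stored in its doc field.
function replaceRootWithDoc(grouped) {
  return grouped.map((entry) => entry.doc);
}

const grouped = groupFirstById(unioned);
const deduped = replaceRootWithDoc(grouped);

console.log(grouped); // grouped shape: { _id, doc: { ...original document } }
console.log(deduped); // original document shape, one per _id
```

The grouped array has the same shape as the sample result without $replaceRoot, and deduped has the same shape as the sample result with it.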