Saturday, 23 June 2018

MongoDB - I

We introduced MongoDB in the last two posts here and here. In this post, we explore some use cases of Aggregation Framework with reference to arrays. For all the work in this post, we will be using the same version as in the last two posts: MongoDB version v3.6.5

Once we are logged into the Mongo Shell, run the below command to use merchandise database:

> use merchandise
switched to db merchandise

Let us add a collection called products to merchandise database and insert a few records as shown below:

db.products.insert(
   [
    { "_id" : 100, "item" : "Pullover", "sizes": [ "S", "M", "L"] },
    { "_id" : 200, "item" : "T-shirt", "sizes" : ["X", "XL", "XXL"] },
    { "_id" : 300, "item" : "Bermuda Shorts", "sizes": ["M"] },
    { "_id" : 400, "item" : "Hat", "sizes": "M" },
    { "_id" : 500, "item" : "Wrist band" },
    { "_id" : 600, "item" : "Sweat band", "sizes" : null },
    { "_id" : 700, "item" : "Cap", "sizes" : [ ] },
   ],
   { ordered: false }
)

The output is shown below:

> db.products.insert(
...    [
...     { "_id" : 100, "item" : "Pullover", "sizes": [ "S", "M", "L"] },
...     { "_id" : 200, "item" : "T-shirt", "sizes" : ["X", "XL", "XXL"] },
...     { "_id" : 300, "item" : "Bermuda Shorts", "sizes": ["M"] },
...     { "_id" : 400, "item" : "Hat", "sizes": "M" },
...     { "_id" : 500, "item" : "Wrist band" },
...     { "_id" : 600, "item" : "Sweat band", "sizes" : null },
...     { "_id" : 700, "item" : "Cap", "sizes" : [ ] },
...    ],
...    { ordered: false }
... )
BulkWriteResult({
        "writeErrors" : [ ],
        "writeConcernErrors" : [ ],
        "nInserted" : 7,
        "nUpserted" : 0,
        "nMatched" : 0,
        "nModified" : 0,
        "nRemoved" : 0,
        "upserted" : [ ]
})

We can validate the insertion by running below command:

> db.products.find()
{ "_id" : 100, "item" : "Pullover", "sizes" : [ "S", "M", "L" ] }
{ "_id" : 200, "item" : "T-shirt", "sizes" : [ "X", "XL", "XXL" ] }
{ "_id" : 300, "item" : "Bermuda Shorts", "sizes" : [ "M" ] }
{ "_id" : 400, "item" : "Hat", "sizes" : "M" }
{ "_id" : 500, "item" : "Wrist band" }
{ "_id" : 600, "item" : "Sweat band", "sizes" : null }
{ "_id" : 700, "item" : "Cap", "sizes" : [ ] }

Using $unwind, we can split the array into separate records as shown below:

> db.products.aggregate( [ { $unwind : "$sizes" } ] )
{ "_id" : 100, "item" : "Pullover", "sizes" : "S" }
{ "_id" : 100, "item" : "Pullover", "sizes" : "M" }
{ "_id" : 100, "item" : "Pullover", "sizes" : "L" }
{ "_id" : 200, "item" : "T-shirt", "sizes" : "X" }
{ "_id" : 200, "item" : "T-shirt", "sizes" : "XL" }
{ "_id" : 200, "item" : "T-shirt", "sizes" : "XXL" }
{ "_id" : 300, "item" : "Bermuda Shorts", "sizes" : "M" }
{ "_id" : 400, "item" : "Hat", "sizes" : "M" }

Note that only those records are returned that have a valid size. Now let us group using $group as shown below:

> db.products.aggregate(
...    [
...       { $unwind : "$sizes" },
...       {
...         $group : {_id :  "$_id",
...            sizes: { $push: "$sizes" }
...         }
...       }
...    ]
... )
{ "_id" : 400, "sizes" : [ "M" ] }
{ "_id" : 300, "sizes" : [ "M" ] }
{ "_id" : 200, "sizes" : [ "X", "XL", "XXL" ] }
{ "_id" : 100, "sizes" : [ "S", "M", "L" ] }

The records are returned with sizes grouped by id. But, note that the item field is missing. Let us try to add the item field as show below:

> db.products.aggregate(
...    [
...       { $unwind : "$sizes" },
...       {
...         $group : {_id :  "$_id", item: "$item",
...            sizes: { $push: "$sizes" }
...         }
...       }
...    ]
... )
assert: command failed: {
        "ok" : 0,
        "errmsg" : "The field 'item' must be an accumulator object",
        "code" : 40234,
        "codeName" : "Location40234"
} : aggregate failed
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:16:14
assert.commandWorked@src/mongo/shell/assert.js:403:5
DB.prototype._runAggregate@src/mongo/shell/db.js:260:9
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1212:12
@(shell):1:1

2018-06-23T18:11:39.668+0530 E QUERY    [thread1] Error: command failed: {
        "ok" : 0,
        "errmsg" : "The field 'item' must be an accumulator object",
        "code" : 40234,
        "codeName" : "Location40234"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:16:14
assert.commandWorked@src/mongo/shell/assert.js:403:5
DB.prototype._runAggregate@src/mongo/shell/db.js:260:9
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1212:12
@(shell):1:1

When we try to add item, we get an error. Unlike SQL where can do a GROUP BY on multiple fields, we can can group only on a single field in MongoDB. One work around is to project the item field as shown below:

db.products.aggregate(
   [
      { $unwind : "$sizes" },
      {
        $group : {_id :  "$_id",
           sizes: { $push: "$sizes" }
        }
      },
      {$project: {"_id": 1, "item":1,"sizes":1}}
   ]
)

The results of the output of above query is shown below:

 db.products.aggregate(
...    [
...       { $unwind : "$sizes" },
...       {
...         $group : {_id :  "$_id",
...            sizes: { $push: "$sizes" }
...         }
...       },
...       {$project: {"_id": 1, "item":1,"sizes":1}}
...    ]
... )
{ "_id" : 400, "sizes" : [ "M" ] }
{ "_id" : 300, "sizes" : [ "M" ] }
{ "_id" : 200, "sizes" : [ "X", "XL", "XXL" ] }
{ "_id" : 100, "sizes" : [ "S", "M", "L" ] }

Note that the results are the same as we got earlier with just the group and without the project clause. The item field is lost in translation. The work around is to subsume all fields to one field and then do a group by and then finally project the subsumed fields to separate fields to retain the original schema as shown below:

> db.products.aggregate(
...    [
...       { $unwind : "$sizes" },
...       {
...         $group : {_id :  { _id: "$_id", item: "$item" },
...            sizes: { $push: "$sizes" }
...         }
...       },
...       {$project: {"_id": "$_id._id", "item":"$_id.item","sizes":1}}
...    ]
... )
{ "sizes" : [ "M" ], "_id" : 400, "item" : "Hat" }
{ "sizes" : [ "M" ], "_id" : 300, "item" : "Bermuda Shorts" }
{ "sizes" : [ "X", "XL", "XXL" ], "_id" : 200, "item" : "T-shirt" }
{ "sizes" : [ "S", "M", "L" ], "_id" : 100, "item" : "Pullover" }

This is very similar to doing a GROUP BY in relational databases on multiple fields. Note that we have added the item field

This kind of unwind and later grouping to retain the original schema can be useful in a pipeline involving multiple stages

We can do a count on the size of arrays as shown below:

> db.products.aggregate(
...    [
...       { $unwind : "$sizes" },
...       {
...         $group : {_id :  { _id: "$_id", item: "$item" },
...            sizes: { $push: "$sizes" }
...         }
...       },
...       {$project: {"_id": "$_id._id", "item":"$_id.item","sizes":"$sizes",CountSizes: { $size: "$sizes" }}}
...    ]
... )
{ "_id" : 400, "item" : "Hat", "sizes" : [ "M" ], "CountSizes" : 1 }
{ "_id" : 300, "item" : "Bermuda Shorts", "sizes" : [ "M" ], "CountSizes" : 1 }
{ "_id" : 200, "item" : "T-shirt", "sizes" : [ "X", "XL", "XXL" ], "CountSizes" : 3 }
{ "_id" : 100, "item" : "Pullover", "sizes" : [ "S", "M", "L" ], "CountSizes" : 3 }

Adding a match can limit the results as shown below:

> db.products.aggregate(
...    [
...       { $unwind : "$sizes" },
...       {
...         $group : {_id :  { _id: "$_id", item: "$item" },
...            sizes: { $push: "$sizes" }
...         }
...       },
...       {$project: {"_id": "$_id._id", "item":"$_id.item","sizes":"$sizes",CountSizes: { $size: "$sizes" }}},
...       {$match: { CountSizes: { $gte: 2 } }}
...    ]
... )
{ "_id" : 200, "item" : "T-shirt", "sizes" : [ "X", "XL", "XXL" ], "CountSizes" : 3 }
{ "_id" : 100, "item" : "Pullover", "sizes" : [ "S", "M", "L" ], "CountSizes" : 3 }