Tuesday 31 July 2018

Python - I

In an earlier post, we wrote a MapReduce program in Python using Hadoop Streaming. I recently got a request from a friend on how files are read in Python, so this post is dedicated to that request. For all the work in this post, we will be using the CLI of Python 3.7.0 (the latest version of Python as of the time of writing).

The version is shown below:

F:\PythonPrograms\files>python --version
Python 3.7.0

Once we invoke python, we can start entering commands as shown below:

F:\PythonPrograms\files>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

We will be using a file called sample_file.txt at F:\PythonPrograms\files\ having the content below:

Python is a programming language that lets you work quickly and integrate systems more effectively.

Reading this file comprises three operations: opening the file, reading the file, and closing the file. We can open the file using the command below:

f = open('sample_file.txt')

To read the file, we use the command below:

f.read()

Finally, we have to close the file object to release any system resources it was using:

f.close()

The outputs are shown below:

>>> f = open('sample_file.txt')
>>> f.read()
'Python is a programming language that lets you work quickly and integrate systems more effectively.'
>>> f.close()
>>>
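If an exception occurs between open() and close(), the file could be left open. A common defensive pattern is try/finally, sketched below; the temporary directory and the copy of sample_file.txt are created here only so the snippet is self-contained:

```python
import os
import tempfile

# Scaffolding: write a copy of sample_file.txt into a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'sample_file.txt')
scaffold = open(path, 'w')
scaffold.write('Python is a programming language that lets you work quickly '
               'and integrate systems more effectively.')
scaffold.close()

# try/finally guarantees close() runs even if read() raises an exception
f = open(path)
try:
    content = f.read()
finally:
    f.close()
```

This is essentially what the with statement, used later in this post, does for us automatically.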

The open function takes another argument called mode. The default value for this argument is 'r', which stands for read-only. The mode can be one, two, or three characters long. The characters are described below:

1) r : read. The file is opened in read-only mode. If the file does not exist, FileNotFoundError is raised. If it exists, the file pointer points to the beginning of the file
2) w : write. The file is opened for writing. If the file does not exist, a new file is created for writing; if it exists, the existing contents are overwritten
3) a : append. The file is opened for appending. If the file does not exist, a new file is created for appending; if it exists, the file pointer points to the end of the file
4) + : This is not an independent mode. + is used in combination with any one of the three options r, w, or a. If the file is opened in read-only mode, adding + allows writing as well. If the file is opened in write-only or append-only mode, adding + allows reading as well
5) b : binary format. Like +, this is not an independent mode; it is used in combination with any one of r, w, or a. Without b, the file is opened in text mode

Apart from r, w, and a, we can have combinations containing one of r, w, or a together with b and/or +: r+, rb, rb+, w+, wb, wb+, a+, ab, and ab+. The meanings of these combinations follow from points 1 to 5 above.
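A quick sketch of two of these combinations on a throwaway temporary file (the path is created just for this demo): 'w+' truncates the file but allows reading back after rewinding with seek(), and 'a' always writes at the end:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w+' creates (or truncates) the file and opens it for writing and reading
f = open(path, 'w+')
f.write('hello')
f.seek(0)                # rewind to the beginning before reading back
first_read = f.read()    # 'hello'
f.close()

# 'a' appends: writes go to the end of the existing content
f = open(path, 'a')
f.write(' world')
f.close()

f = open(path)           # default mode is 'r'
final_read = f.read()    # 'hello world'
f.close()
```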

The file object returned by open function has below properties:

f.name returns the file name
f.mode returns the mode
f.closed returns a boolean indicating whether the file is closed
f.isatty() returns True if the stream is interactive
f.fileno() returns the integer file descriptor corresponding to the file
f.readable() returns True if the stream can be read from
f.seekable() returns True if the stream allows random access
f.writable() returns True if the stream can be written to

The usage is shown below:

>>> f = open('sample_file.txt')
>>> f.name
'sample_file.txt'
>>> f.mode
'r'
>>> f.closed
False
>>> f.isatty()
False
>>> f.fileno()
3
>>> f.readable()
True
>>> f.seekable()
True
>>> f.writable()
False
>>> f.close()
>>>

We can also use an absolute file path as shown below:

>>> with open('F:\\PythonPrograms\\files\\sample_file.txt', 'r') as f:
...     f.read()
...
'Python is a programming language that lets you work quickly and integrate systems more effectively.'
>>> f.closed
True
>>>

Here we have used with to open the file. If we use with, we do not need to explicitly close the file using close(). We can also read a file with a for loop, but then we do have to close the file explicitly, as shown below:

>>> f = open('sample_file.txt')
>>> for line in f:
...     print(line)
...
Python is a programming language that lets you work quickly and integrate systems more effectively.
>>> f.closed
False
>>> f.close()
>>>
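The two approaches can also be combined: iterating inside a with block reads line by line and closes the file automatically. A sketch, again creating its own copy of sample_file.txt in a temporary directory so it can run anywhere:

```python
import os
import tempfile

# Scaffolding: a one-line stand-in for sample_file.txt
path = os.path.join(tempfile.mkdtemp(), 'sample_file.txt')
with open(path, 'w') as out:
    out.write('Python is a programming language that lets you work quickly '
              'and integrate systems more effectively.')

# Iterate line by line; the file is closed on leaving the with block
lines_seen = []
with open(path) as f:
    for line in f:
        lines_seen.append(line)
```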

Let us use another file called blue_carbuncle.txt to show another usage of the for loop. The contents of blue_carbuncle.txt are shown below:

1. THE ADVENTURE OF THE BLUE CARBUNCLE
2.
3. I had called upon my friend Sherlock Holmes upon the second
4. morning after Christmas, with the intention of wishing him the
5. compliments of the season.

Running the for loop produces the result below:

>>> f = open('blue_carbuncle.txt')
>>> for line in f:
...     print(line)
...
1. THE ADVENTURE OF THE BLUE CARBUNCLE

2.

3. I had called upon my friend Sherlock Holmes upon the second

4. morning after Christmas, with the intention of wishing him the

5. compliments of the season.
>>> f.close()
>>>

Passing end='' to print suppresses the newline that print itself appends, so the file's lines are no longer double-spaced:

>>> f = open('blue_carbuncle.txt')
>>> for line in f:
...     print(line, end='')
...
1. THE ADVENTURE OF THE BLUE CARBUNCLE
2.
3. I had called upon my friend Sherlock Holmes upon the second
4. morning after Christmas, with the intention of wishing him the
5. compliments of the season.>>>
>>> f.close()
>>>

To read all the lines of the file into a list, we can use readlines(). Since sample_file.txt contains a single line, the list has a single element:

>>> f = open('sample_file.txt')
>>> f.readlines()
['Python is a programming language that lets you work quickly and integrate systems more effectively.']
>>> f.close()
>>>

The same output is seen when using the code below:

>>> f = open('sample_file.txt')
>>> file_in_list = list(f)
>>> f.close()
>>> print(file_in_list)
['Python is a programming language that lets you work quickly and integrate systems more effectively.']
>>>
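Note that readlines() (and list(f)) keep any trailing newline characters on each element. When stripped lines are wanted, one common alternative is read().splitlines(); sketched below with io.StringIO as an in-memory stand-in for a two-line file:

```python
import io

# io.StringIO behaves like a text file opened for reading
f = io.StringIO('first line\nsecond line\n')
stripped = f.read().splitlines()
print(stripped)  # ['first line', 'second line']
```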


To read a file line by line, we can use readline(). The code below calls readline() just once, so only the first line of the file is returned:

>>> with open('blue_carbuncle.txt', 'r') as f:
...     f.readline()
...
'1. THE ADVENTURE OF THE BLUE CARBUNCLE\n'
>>>

If we need to read more lines, we can call readline() as many times as we need:

>>> with open('blue_carbuncle.txt', 'r') as f:
...     f.readline()
...     f.readline()
...     f.readline()
...
'1. THE ADVENTURE OF THE BLUE CARBUNCLE\n'
'2.\n'
'3. I had called upon my friend Sherlock Holmes upon the second\n'
>>>

We can limit how much readline() reads by passing a size argument (a number of characters in text mode):

>>> with open('blue_carbuncle.txt', 'r') as f:
...     f.readline()
...     f.readline()
...     f.readline(15)
...
'1. THE ADVENTURE OF THE BLUE CARBUNCLE\n'
'2.\n'
'3. I had called'
>>>
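At end of file, readline() returns an empty string, which gives the classic read-until-EOF loop. A sketch using io.StringIO as an in-memory stand-in for the first three lines of blue_carbuncle.txt:

```python
import io

f = io.StringIO('1. THE ADVENTURE OF THE BLUE CARBUNCLE\n'
                '2.\n'
                '3. I had called upon my friend Sherlock Holmes upon the second\n')

lines = []
while True:
    line = f.readline()
    if line == '':          # readline() returns '' only at end of file
        break
    lines.append(line)
```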

We can also limit how much read() returns by passing its optional size argument, as shown below:

>>> with open('blue_carbuncle.txt', 'rb') as f:
...     f.read(16)
...
b'1. THE ADVENTURE'
>>>

The read mode above is binary. In text mode, the output in this case is the same, just as a str instead of bytes:

>>> with open('blue_carbuncle.txt', 'r') as f:
...     f.read(16)
...
'1. THE ADVENTURE'
>>>

tell() returns an integer giving the current position of the file pointer; in binary mode, this is the number of bytes from the beginning of the file:

>>> with open('blue_carbuncle.txt', 'rb') as f:
...     f.read(16)
...     f.tell()
...
b'1. THE ADVENTURE'
16
>>>

If we wish to read from a particular position in the file, we can use f.seek(offset, whence), which moves the file pointer to a position given by offset, measured relative to the reference point whence:

If whence is set to 0 (the default), the offset is measured from the beginning of the file
If whence is set to 1, the offset is measured from the current position of the file pointer
If whence is set to 2, the offset is measured from the end of the file

Let us see a few examples now:

>>> with open('blue_carbuncle.txt', 'rb') as f:
...     f.read(16)
...     f.tell()
...     f.read(16)
...     f.tell()
...
b'1. THE ADVENTURE'
16
b' OF THE BLUE CAR'
32
>>>

Let us now add seek(0) and reset the file pointer to the start of the file:

>>> with open('blue_carbuncle.txt', 'rb') as f:
...     f.read(16)
...     f.tell()
...     f.seek(0)
...     f.tell()
...     f.read(16)
...     f.tell()
...
b'1. THE ADVENTURE'
16
0
0
b'1. THE ADVENTURE'
16
>>>

seek() returns the new position of the file pointer; here seek(0) returns 0, which is also confirmed by tell(). In the next example, we set whence to 1 and the offset to 1. But if the file is not opened in binary mode, we get an error:

>>> with open('blue_carbuncle.txt', 'r') as f:
...     f.read(16)
...     f.tell()
...     f.seek(1,1)
...     f.tell()
...     f.read(16)
...     f.tell()
...
'1. THE ADVENTURE'
16
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
io.UnsupportedOperation: can't do nonzero cur-relative seeks
>>>

Let us now set the mode to 'rb':

>>> with open('blue_carbuncle.txt', 'rb') as f:
...     f.read(16)
...     f.tell()
...     f.seek(1,1)
...     f.tell()
...     f.read(16)
...     f.tell()
...
b'1. THE ADVENTURE'
16
17
17
b'OF THE BLUE CARB'
33
>>>

The offset of 1 skips the space. In the last example on seek, we set whence to 2:

>>> with open('blue_carbuncle.txt', 'rb') as f:
...     f.seek(-7,2)
...     f.tell()
...     f.read(6)
...     f.tell()
...
197
197
b'season'
203
>>>
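The three whence values can be exercised without a file on disk as well; below is a sketch using io.BytesIO, which supports the same seek()/read()/tell() interface in binary mode:

```python
import io

buf = io.BytesIO(b'1. THE ADVENTURE OF THE BLUE CARBUNCLE')

buf.seek(3)                   # whence 0 (default): 3 bytes from the beginning
from_start = buf.read(3)      # b'THE'

buf.seek(1, 1)                # whence 1: skip 1 byte from the current position
from_current = buf.read(9)    # b'ADVENTURE'

buf.seek(-9, 2)               # whence 2: 9 bytes back from the end
from_end = buf.read()         # b'CARBUNCLE'
```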

Using truncate(), we can resize the file. Let us truncate blue_carbuncle.txt to a size of 91 bytes as shown below:

>>> f = open('blue_carbuncle.txt','rb+')
>>> f.truncate(91)
91
>>> f.close()
>>>

After truncate, the content of blue_carbuncle.txt is shown below:

1. THE ADVENTURE OF THE BLUE CARBUNCLE
2.
3. I had called upon my friend Sherlock Holmes 

Let us restore blue_carbuncle.txt with the original content. If no argument is specified, the file is resized to the current position, as shown below:

>>> f = open('blue_carbuncle.txt','rb+')
>>> f.read(91)
b'1. THE ADVENTURE OF THE BLUE CARBUNCLE\r\n2.\r\n3. I had called upon my friend Sherlock Holmes '
>>> f.truncate()
91
>>> f.close()
>>>

After truncate, the content of blue_carbuncle.txt is shown below:

1. THE ADVENTURE OF THE BLUE CARBUNCLE
2.
3. I had called upon my friend Sherlock Holmes 

This is in line with our expectations. Lastly, we explore writing. Let us create a new file for writing as follows:

>>> f = open('sample_file_new.txt','wb')
>>> f.write(b'Python is a programming language that lets you work quickly and integrate systems more effectively.')
99
>>> f.close()
>>>

The contents of above file are shown below:

Python is a programming language that lets you work quickly and integrate systems more effectively.

Let us append one more sentence to the same file using mode 'a':

>>> f = open('sample_file_new.txt','ab')
>>> f.write(b'\r\nPython is a programming language that lets you work quickly and integrate systems more effectively.')
101
>>> f.close()
>>>

The contents of above file are shown below:

Python is a programming language that lets you work quickly and integrate systems more effectively.
Python is a programming language that lets you work quickly and integrate systems more effectively.

Let us now rewrite this file with mode 'w':

>>> f = open('sample_file_new.txt','wb')
>>> f.write(b'Python is a programming language')
32
>>> f.close()

Now, the contents of sample_file_new.txt are:

Python is a programming language
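In the examples above we wrote bytes in binary mode. In text mode, write() takes a str and returns the number of characters written; a sketch on a throwaway temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'text_write_demo.txt')

# Text-mode write: pass a str, get back the number of characters written
with open(path, 'w') as f:
    chars_written = f.write('Python is a programming language')  # 32

with open(path) as f:
    content = f.read()
```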

This concludes our discussion on reading and writing files in Python.

Sunday 8 July 2018

MongoDB - III

In an earlier post on MongoDB, we used the Mongo Shell to build an Aggregation Pipeline. In this post, we will take a look at the new Aggregation Pipeline Builder feature in MongoDB Compass. Folks who have used the Mongo Shell to build an Aggregation Pipeline will know that this task can be a tad tedious because of the syntax involved. In particular, when the pipeline is fairly lengthy, the developer may find it difficult to build and debug. The Aggregation Pipeline Builder has been added to MongoDB Compass to help reduce the complexity of building a pipeline. For all the work in this post, we will be using the latest versions of MongoDB and MongoDB Compass: MongoDB 4.0 and MongoDB Compass 1.14.5.

We will be using employees data that we have used in earlier posts. A few records of the data we will be working with are shown below:

Note that there is a new tab next to Documents called Aggregations. Click on that to navigate to the Aggregation Pipeline Builder that is shown below:

Let us quickly see the different sections and features of the Aggregation Pipeline Builder before we build a pipeline

The first section is the preview section. This section shows a preview of up to 20 documents in the collection; here it is showing the first 20 documents. Only 3 documents are visible in the frame; to see the rest, use the scroll bar to scroll right. If the collection has fewer than 20 documents, all the documents are shown. The second section is where we can add stages. A stage can be added either by selecting it from the dropdown, or by typing the stage name and then selecting it from the filtered options the dropdown shows based on what we typed. Once the stage is added and the necessary details are filled in, we can see a preview of the documents in the third section. If we wish to add a new stage, we can click the ADD STAGE button below the second section. This will bring up a new Stage Addition Section with a corresponding preview section on the right. Next, let us understand how we can enable certain features shown below:

If COMMENT MODE is enabled, then, when a stage is added in the Stage Addition Section, comments with information about that stage are shown to help the developer complete the section. If SAMPLE MODE is enabled, input documents are limited to 100000 before certain stages such as $group, $bucket, and $bucketAuto. If AUTO PREVIEW is enabled, we can see a preview of the resulting documents as a stage is added (the third section in the earlier schematic). We can now build a simple pipeline to understand the pipeline build process.

Let us add a match stage that will select only those employees whose DEPARTMENT_ID is less than 40. Click on Select and enter m in the field in the Stage Addition Section. This brings up three options. Select $match from them:

Once this is selected, comments and syntax corresponding to $match are added below:

In the place of <<query>>, enter D. Two options are shown:

Select DEPARTMENT_ID and enter :{ $lt: 40 }

Note that as you enter $, you are prompted with options. As soon as the whole expression, :{ $lt: 40 }, is entered, you will see on the right a preview of the documents that match this criterion at this stage:


Let us add another stage that calculates the sum of salaries grouped by JOB_ID. Click on the ADD STAGE button to add the next stage:


Add $group as next stage:

For the four placeholders, enter the values below:

<<expression>> = "$JOB_ID"
<<field1>> = SUM_SALARY
<<accumulator1>> = $sum
<<expression1>> = "$SALARY"

Note that for all placeholders except <<field1>>, options are shown as you start entering values. After the last entry, you will see the results on the right:
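For reference, the pipeline built above has the following structure, written here as a Python list of stage documents, the same shape a driver such as PyMongo accepts in Collection.aggregate() (the field names are the ones from the post):

```python
# The two-stage pipeline built in the Aggregation Pipeline Builder
pipeline = [
    # Stage 1: keep only employees with DEPARTMENT_ID less than 40
    {'$match': {'DEPARTMENT_ID': {'$lt': 40}}},
    # Stage 2: group by JOB_ID and sum the SALARY values in each group
    {'$group': {
        '_id': '$JOB_ID',
        'SUM_SALARY': {'$sum': '$SALARY'},
    }},
]
```

With PyMongo this could be run as db.employees.aggregate(pipeline), where the collection name employees is an assumption based on the post's data.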

The results compare well with the first query on this post.

One can copy the pipeline to clipboard for any further analysis as shown below:

You can also save the pipeline for future reference by entering a name in the field that says "Enter a pipeline name..."  and clicking on SAVE PIPELINE button as shown below:

Click on Toggle Saved Pipelines to see saved pipeline as shown below:

We can clone the existing pipeline as shown below:

We can now save this as another pipeline and continue our work while the original pipeline is preserved. After clicking the SAVE PIPELINE button, disable the $match stage by sliding the green enable switch to the left as shown below:

Note that the results of the next stage change, because the documents are no longer filtered by the match stage. We can enable it again by sliding the switch back to the right; knocking out a stage and bringing it back is as simple as that. If one wants to delete a stage from the pipeline, one can click the delete button in that section. But a deleted stage cannot be restored; it has to be rebuilt.

Since the $match stage is not operational, we can drag an entire stage, such as the $group stage, to be the first stage as shown below:

Lastly, for better readability, we can collapse a stage by using the collapse button as shown below:

All stages have been collapsed below:

To expand the stages, click on the Expand button.

This completes our post on understanding the Aggregation Pipeline Builder in MongoDB Compass.