Thursday, 2 August 2018

Python - II

In the last post, we looked at reading files in Python. Once the files are read, we may need to process the data in them. In our context that data will be strings, so it is only logical to look at how strings are handled in Python. This post covers some basic string operations. As before, all the work in this post uses the interactive command-line interpreter of Python 3.7.0 (the latest version of Python at the time of writing).

The code below defines two string variables: one in double quotes, and another in single quotes:

F:\PythonPrograms\files>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> string_1 = "Python is a programming language that lets you work quickly and integrate systems more effectively."
>>> string_2 = 'Python is a programming language that lets you work quickly and integrate systems more effectively.'
>>> type(string_1)
<class 'str'>
>>> type(string_2)
<class 'str'>

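We can use type() to check the type of the string variables. If a string enclosed in single quotes itself contains a single quote (or a string enclosed in double quotes contains a double quote), the interpreter raises a SyntaxError as shown below: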

>>> string_3 = 'Whether you're new to programming or an experienced developer, it's easy to learn and use Python.'
  File "<stdin>", line 1
    string_3 = 'Whether you're new to programming or an experienced developer, it's easy to learn and use Python.'
                             ^
SyntaxError: invalid syntax
>>>

In this case, we have to escape the quote with a backslash (\) as shown below:

>>> string_3 = 'Whether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.'
>>> string_3
"Whether you're new to programming or an experienced developer, it's easy to learn and use Python."
>>>

Alternatively, we can enclose strings containing single quotes in double quotes as shown below:

>>> string_4 = "Whether you're new to programming or an experienced developer, it's easy to learn and use Python."
>>> string_4
"Whether you're new to programming or an experienced developer, it's easy to learn and use Python."
>>>

Therefore, a string enclosed in double quotes can contain single quotes, and a string enclosed in single quotes can contain double quotes. We can also define a string enclosed in triple quotes, written as ''' or """. The content within triple quotes can span multiple lines, and the line breaks become part of the string:

>>> string_5 = """Whether you're new to programming or an experienced developer,
...         it's easy to learn and use Python."""
>>> string_5
"Whether you're new to programming or an experienced developer, \n        it's easy to learn and use Python."
>>>

We will quickly look at a few string operations that we may need in our text processing activities:

>>> "Hello," +  " " + 'Python!'
'Hello, Python!'
>>> 'Hello, Python!' * 3
'Hello, Python!Hello, Python!Hello, Python!'
>>> len('Hello, Python!')
14
>>> 'Hello, Python!'[0]
'H'
>>> 'Hello, Python!'[1]
'e'
>>> 'Hello, Python!'[-1]
'!'
>>> 'Hello, Python!'[-2]
'n'
>>> 'Hello, Python!'[0:5]
'Hello'
>>> 'Hello, Python!'[0:-1]
'Hello, Python'

+ concatenates strings. * repeats a string the given number of times. len() returns the length of a string. Using [ ] on a string is called indexing when it returns a single character and slicing when it returns a substring. [n] returns the character at index n, with index zero for the first character of the string. [-n] returns the nth character counting from the end, so [-1] is the last character. [m:n] returns the substring starting at index m and ending just before index n; the character at index n itself is not included. Slicing accepts an optional third value called step, which selects every step-th character from the range:

>>> 'Hello, Python!'[0:5:1]
'Hello'
>>> 'Hello, Python!'[0:5:2]
'Hlo'
>>> 'Hello, Python!'[0:5:3]
'Hl'
>>> 'Hello, Python!'[0:5:4]
'Ho'
>>>
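A few more slicing cases are worth keeping in mind. The following is a small sketch rather than an exhaustive list; the variable names are only illustrative:

```python
text = 'Hello, Python!'

# Slices are half-open: the start index is included, the stop index is excluded.
first_word = text[0:5]        # 'Hello'

# Omitted bounds default to the start and the end of the string.
from_seven = text[7:]         # 'Python!'
up_to_five = text[:5]         # 'Hello'

# A negative step walks the string backwards.
reversed_text = text[::-1]    # '!nohtyP ,olleH'

# Out-of-range slice bounds are clipped instead of raising an IndexError.
clipped = text[7:100]         # 'Python!'

print(first_word, from_seven, up_to_five, reversed_text, clipped, sep='\n')
```

Note that indexing a single character out of range, such as text[100], does raise an IndexError; only slices are clipped.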

We can obtain a sorted list of the characters using sorted(). The optional reverse argument sorts in descending order:

>>> sorted('Hello, Python!')
[' ', '!', ',', 'H', 'P', 'e', 'h', 'l', 'l', 'n', 'o', 'o', 't', 'y']
>>> sorted('Hello, Python!',reverse=False)
[' ', '!', ',', 'H', 'P', 'e', 'h', 'l', 'l', 'n', 'o', 'o', 't', 'y']
>>> sorted('Hello, Python!',reverse=True)
['y', 't', 'o', 'o', 'n', 'l', 'l', 'h', 'e', 'P', 'H', ',', '!', ' ']
>>>

set() returns a set of the unique characters. A set is unordered, so the order shown may differ from run to run:

>>> set('Hello, Python!')
{'y', 'l', 'H', 'h', ',', 't', ' ', 'P', 'n', '!', 'o', 'e'}

list() returns a list of characters:

>>> list('Hello, Python!')
['H', 'e', 'l', 'l', 'o', ',', ' ', 'P', 'y', 't', 'h', 'o', 'n', '!']

tuple() returns a tuple of characters:

>>> tuple('Hello, Python!')
('H', 'e', 'l', 'l', 'o', ',', ' ', 'P', 'y', 't', 'h', 'o', 'n', '!')

min() and max() return the characters with the lowest and the highest Unicode code point in the string:

>>> min('Hello, Python!')
' '
>>> max('Hello, Python!')
'y'
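Since min(), max() and sorted() compare characters by code point, the built-in ord() helps to see why a space comes first and why uppercase letters sort before lowercase ones. A minimal sketch:

```python
text = 'Hello, Python!'

# ord() gives the Unicode code point that the comparison is based on.
smallest = min(text)
largest = max(text)
print(repr(smallest), ord(smallest))   # ' ' 32
print(repr(largest), ord(largest))     # 'y' 121

# Uppercase letters have lower code points than lowercase ones,
# so 'P' sorts before 'e'.
print(ord('P') < ord('e'))             # True

# A case-insensitive sort can be obtained by passing a key function.
case_insensitive = sorted('bAc', key=str.lower)
print(case_insensitive)                # ['A', 'b', 'c']
```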

isinstance(object, class) returns a boolean indicating whether the object is an instance of the class. It can be used to check whether an object is a string:

>>> isinstance('Hello, Python!',str)
True

in and not in can be used to check whether a character (or any substring) is present in the string:

>>> 'H' in 'Hello, Python!'
True
>>> 'Z' in 'Hello, Python!'
False
>>> 'Z' not in 'Hello, Python!'
True
>>> 'H' not in 'Hello, Python!'
False
>>>
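The in operator is not limited to single characters. The short sketch below shows it with whole substrings and also shows that the test is case-sensitive:

```python
text = 'Hello, Python!'

# Substrings of any length can be tested.
has_word = 'Python' in text                 # True

# The test is case-sensitive.
has_lower = 'python' in text                # False

# Lowercasing first gives a case-insensitive test.
has_lower_ci = 'python' in text.lower()     # True

# The empty string is considered to be contained in every string.
has_empty = '' in text                      # True

print(has_word, has_lower, has_lower_ci, has_empty)
```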

Next, we look at the methods available on strings:

>>> 'hello, Python!'.capitalize()
'Hello, python!'
>>> 'hello, Python!'.center(20,'$')
'$$$hello, Python!$$$'
>>> 'hello, Python!'.count('o')
2
>>> 'hello, Python!'.count('o',0,5)
1

capitalize() returns the string with its first character capitalized and the rest lowercased, which is why 'Python' becomes 'python' above. center() centers the string in a result whose total length is the width given in the first argument, padding both sides with the fill character given in the second argument (a space by default). count() returns the number of non-overlapping occurrences of a substring in the string. Optionally, we can set a start index and an end index for the search:

>>> 'hello, Python!'.endswith("n!")
True
>>>
>>> 'hello, Python!'.endswith("lo",0,5)
True
>>> 'Hello, Python!'.find("n!")
12
>>>
>>> 'Hello, Python!'.find("z")
-1
>>>
>>> 'Hello, Python!'.find("lo",0,5)
3
>>>
>>> 'Hello, {}!'.format("Python")
'Hello, Python!'
>>>
>>> '{0}, {1}!'.format("Hello","Python")
'Hello, Python!'
>>>
>>> 'Hello, Python!'.index("n!")
12
>>>
>>> 'Hello, Python!'.index("z")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
>>>
>>> 'Hello, Python!'.index("lo",0,5)
3
>>>
>>> 'Hello, Python!'.isalnum()
False
>>>
>>> 'Hello, Python!'.isalpha()
False
>>>
>>> 'Hello, Python!'.isascii()
True
>>>
>>> 'Hello, Python!'.isdecimal()
False
>>>
>>> '55'.isdecimal()
True
>>>
>>> 'Hello, Python!'.isdigit()
False
>>>
>>> '55'.isdigit()
True
>>>
>>> 'string_1'.isidentifier()
True
>>>
>>> 'string_1'.islower()
True
>>>
>>> '55'.isnumeric()
True
>>>
>>> 'Hello, Python!'.isprintable()
True
>>>
>>> 'Hello, Python!'.isspace()
False
>>>
>>> '   '.isspace()
True
>>>

endswith() returns a boolean indicating whether the string ends with the specified suffix. It accepts optional start and end indices. find() returns the lowest index at which the substring occurs in the string, or -1 if it is not present. format() fills the placeholders written as {} in the string with its arguments. Placeholders can carry numbers like {0}, {1} that refer to the position of the argument. index() is similar to find() except that it raises a ValueError if the substring is not present. isalnum() returns True if every character is a letter or a digit. isalpha() returns True if every character is a letter. isascii() (new in Python 3.7) returns True if every character is an ASCII character. isdecimal() returns True if every character is a decimal digit, that is, one that can be used to write a base-10 number. isdigit() returns True if every character is a digit, which also covers characters such as superscript digits. isidentifier() returns True if the string is a valid Python identifier. islower() returns True if all the cased characters in the string are lowercase and there is at least one cased character. isnumeric() returns True if every character is numeric, which additionally covers characters such as fractions. isprintable() returns True if every character is printable. isspace() returns True if the string contains only whitespace characters.
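Two points from the paragraph above are easier to see in code: how find() and index() differ when the substring is missing, and how isdecimal(), isdigit() and isnumeric() differ on characters other than 0 to 9. The sample characters below (superscript two and the fraction one half) are chosen only for illustration:

```python
text = 'Hello, Python!'

# find() signals "not found" with -1 ...
position = text.find('z')

# ... while index() raises ValueError, which has to be caught.
try:
    text.index('z')
    found = True
except ValueError:
    found = False

print(position, found)

# Each row: (string, isdecimal, isdigit, isnumeric)
# '\u00b2' is superscript two, '\u00bd' is the fraction one half.
samples = ['55', '\u00b2', '\u00bd', '-5']
table = [(s, s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples]
for row in table:
    print(row)
```

None of the three checks accepts a sign or a decimal point, so they are not a general test for whether a string can be converted with int() or float().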

>>> 'Hello, Python!'.istitle()
True
>>>
>>> 'PYTHON!'.isupper()
True
>>>
>>> 'Hello, Python!'.join('1')
'1'
>>>
>>> 'Hello, Python!'.join('12')
'1Hello, Python!2'
>>>
>>> 'Hello, Python!'.join('123')
'1Hello, Python!2Hello, Python!3'
>>>
>>> 'Hello, Python!'.ljust(24)
'Hello, Python!          '
>>>
>>> 'Hello, Python!'.ljust(24,'$')
'Hello, Python!$$$$$$$$$$'
>>>
>>> 'Hello, Python!'.lower()
'hello, python!'
>>>
>>> '  Hello, Python!'.lstrip()
'Hello, Python!'
>>>
>>> 'Hello, Python!'.lstrip('Hello, ')
'Python!'
>>>
>>> 'Hello, Python!'.partition(' ')
('Hello,', ' ', 'Python!')
>>>
>>> 'Hello, Python!'.replace('l', 's')
'Hesso, Python!'
>>>
>>> 'Hello, Python!'.replace('l', 's', 1)
'Heslo, Python!'
>>>
>>> 'Hello, Python!'.rfind('o')
11
>>>
>>> 'Hello, Python!'.rfind('o',0,6)
4
>>>
>>> 'Hello, Python!'.rindex('o')
11
>>>
>>> 'Hello, Python!'.rindex('z')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
>>>
>>> 'Hello, Python!'.rindex('o',0,6)
4
>>>

istitle() returns True if the string is titlecased, that is, each word starts with an uppercase character and the remaining cased characters are lowercase. isupper() returns True if all the cased characters in the string are uppercase. join() takes an iterable of strings and concatenates its elements, placing the calling string between them as a separator. In the three examples above, the iterable is itself a string, so the separator is inserted between its characters. ljust() left-justifies the string within the given width. An optional fill character can be given for the padding on the right; the default is a space. lower() returns a copy of the string with all cased characters converted to lowercase. lstrip() strips leading whitespace from the string. If an argument is given, it is treated as a set of characters, and leading characters are removed for as long as they belong to that set; it is not treated as a prefix. partition() splits the string at the first occurrence of the separator and returns a tuple of three elements: the part before the separator, the separator itself, and the part after it. replace() replaces all occurrences of a substring with a replacement. An optional count argument limits the number of replacements, counting from the left. rfind() returns the highest index at which the substring is found, or -1 if it is not found. It accepts optional start and end indices. rindex() is similar to rfind() but raises a ValueError if the substring is not found.
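join() is most often called on a separator with a list of strings as the argument, and lstrip() with an argument is easy to mistake for prefix removal. The sketch below illustrates both; remove_prefix() is a hypothetical helper written for this post, not a built-in of Python 3.7:

```python
# join(): the calling string is the separator, the argument is the iterable.
words = ['Hello,', 'Python!']
sentence = ' '.join(words)              # 'Hello, Python!'
csv_line = ','.join(['a', 'b', 'c'])    # 'a,b,c'

# lstrip() removes leading characters that belong to the given set,
# so it can remove more than a literal prefix would.
stripped = 'lllama'.lstrip('ll')        # 'ama'


def remove_prefix(text, prefix):
    """Hypothetical helper: drop prefix from text only if text starts with it."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text


without_prefix = remove_prefix('lllama', 'll')   # 'lama'

print(sentence, csv_line, stripped, without_prefix, sep='\n')
```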

>>> 'Hello, Python!'.rjust(20)
'      Hello, Python!'
>>>
>>> 'Hello, Python!'.rjust(20,'$')
'$$$$$$Hello, Python!'
>>>
>>> 'Hello, Python !'.rpartition(' ')
('Hello, Python', ' ', '!')
>>>
>>> 'Hello, Python !'.rpartition('Z')
('', '', 'Hello, Python !')
>>>
>>> 'Hello, Python !'.rsplit(' ')
['Hello,', 'Python', '!']
>>>
>>> 'Hello, Python !'.rsplit(' ',1)
['Hello, Python', '!']
>>>
>>> 'Hello, Python !  '.rstrip()
'Hello, Python !'
>>>
>>> 'Hello, Python !  '.rstrip(' !  ')
'Hello, Python'
>>>
>>> 'Hello, Python !'.split()
['Hello,', 'Python', '!']
>>>
>>> 'Hello, Python !'.split(',')
['Hello', ' Python !']
>>>
>>> 'Hello, Python !'.split(' ',1)
['Hello,', 'Python !']
>>>
>>> 'Hello,\n Python!'.splitlines()
['Hello,', ' Python!']
>>>
>>> 'Hello,\n Python!'.splitlines(keepends=True)
['Hello,\n', ' Python!']
>>>
>>> 'Hello, Python!'.startswith('Hello')
True
>>>
>>> 'Hello, Python!'.startswith('Py',7,9)
True
>>>
>>> '  Hello, Python !  '.strip()
'Hello, Python !'
>>>
>>> '  Hello, Python!  '.strip(' !')
'Hello, Python'
>>>
>>> 'hello, python!'.swapcase()
'HELLO, PYTHON!'
>>>
>>> 'hello, python!'.title()
'Hello, Python!'
>>>
>>> 'Hello, Python!'.upper()
'HELLO, PYTHON!'
>>>
>>> '0.55'.zfill(6)
'000.55'
>>>
>>> '+0.55'.zfill(6)
'+00.55'
>>>

rjust() is similar to ljust() but the string is right-justified. rpartition() partitions the string at the last occurrence of the separator. If the separator is not found, a three-element tuple is still returned: two empty strings followed by the original string. rsplit() returns a list of the pieces obtained by splitting the string on a separator. If no separator is given, the string is split on runs of whitespace. The optional maxsplit argument limits the number of splits, which are counted from the right. rstrip() strips trailing whitespace from the string. If an argument is given, trailing characters are removed for as long as they belong to that set of characters. split() works like rsplit() except that with maxsplit the splits are counted from the left. splitlines() returns a list of the lines in the string, split at line boundaries like '\n'. With keepends=True, the line boundary characters are kept at the end of each line in the resulting list. startswith() is similar to endswith() but checks the start of the string. strip() strips both leading and trailing characters that belong to the set given in the argument. By default it strips whitespace. swapcase() inverts the case of each character in the string. title() returns a titlecased copy of the string, in which each word starts with an uppercase character and the remaining characters are lowercase. upper() returns a copy of the string in uppercase. Lastly, zfill() (zero fill) pads the string with leading zeroes until it reaches the width specified in the argument. If the string starts with a + or - sign, the zeroes are inserted after the sign.
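Two details of these methods tend to surprise in text processing: split() without an argument behaves differently from split(' '), and title() treats an apostrophe as a word boundary. A short sketch with made-up sample strings:

```python
spaced = 'Hello,   Python\t!'

# No argument: split on runs of any whitespace, no empty strings in the result.
by_whitespace = spaced.split()      # ['Hello,', 'Python', '!']

# Explicit ' ': every single space is a separator, so consecutive spaces
# produce empty strings, and the tab is not treated as a separator.
by_space = spaced.split(' ')        # ['Hello,', '', '', 'Python\t!']

# title() uppercases the first letter after any non-letter, including "'".
titled = "it's easy".title()        # "It'S Easy"

print(by_whitespace, by_space, titled, sep='\n')
```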

There are a few constants defined in the string module that can be useful when handling strings. They are listed below and become available after importing the module:

string.ascii_letters
string.ascii_lowercase
string.ascii_uppercase
string.digits
string.hexdigits
string.octdigits
string.punctuation
string.printable
string.whitespace

>>> import string
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.octdigits
'01234567'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> string.whitespace
' \t\n\r\x0b\x0c'
>>>
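As one possible use of these constants, the sketch below removes punctuation from a string and checks whether a string contains only hexadecimal digits. It relies only on join() and in from earlier in this post; the sample values are made up:

```python
import string

text = 'Hello, Python!'

# Keep only the characters that are not listed in string.punctuation.
no_punctuation = ''.join(ch for ch in text if ch not in string.punctuation)

# Check that every character of a candidate is a hexadecimal digit.
candidate = '1aF'
is_hex = all(ch in string.hexdigits for ch in candidate)

print(no_punctuation)   # Hello Python
print(is_hex)           # True
```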

This concludes our look at string handling in Python.