Thursday, 3 September 2020

Numpy - II

After the first post on NumPy, we explore more aspects of NumPy here. The idea is to cover the most basic of these as far as possible and thus lay the foundation for future work in AI, ML or Data Science.

As in the earlier post, we will be using Jupyter notebook for all the work in this article. The code is in blue font and the output is in green font below the code. The version details are given below:

import sys
print("Python version:", sys.version)

import numpy as np
print("NumPy version:", np.__version__)

Python version: 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
NumPy version: 1.18.5

Some of the basic statistical functions are shown below:

numpy_array11 = np.array([[ 1, 2, 3, 4,  5,  6,  7,  8],
                          [ 6, 7, 8, 9, 10, 11, 12, 13]])
numpy_array11


array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 6,  7,  8,  9, 10, 11, 12, 13]])


numpy_array11.sum()


112

numpy_array11.min()

1

numpy_array11.max()

13

numpy_array11.mean()

7.0

np.median(numpy_array11)

7.0

numpy_array11.std()

3.391164991562634


numpy_array11.var()

11.5

numpy_array11.max(axis=0)   ## max column wise

array([ 6,  7,  8,  9, 10, 11, 12, 13])


numpy_array11.max(axis=1)   ## max row wise


array([ 8, 13])

numpy_array11.cumsum(axis=0) ## cumulative sum along column

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 7,  9, 11, 13, 15, 17, 19, 21]], dtype=int32)


numpy_array11.cumsum(axis=1) ## cumulative sum along row

array([[ 1,  3,  6, 10, 15, 21, 28, 36],
       [ 6, 13, 21, 30, 40, 51, 63, 76]], dtype=int32)
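The figures above can be cross-checked by hand: the sum of 112 over 16 elements gives the mean of 7.0, and the standard deviation is the square root of the variance. The following is a minimal sketch of that check; the name `data` and the use of `ddof=1` for the sample variance are our additions for illustration and are not part of the notebook above.

```python
import numpy as np

data = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
                 [6, 7, 8, 9, 10, 11, 12, 13]])

# mean is the sum divided by the number of elements: 112 / 16
mean_by_hand = data.sum() / data.size

# var() is the population variance (ddof=0); std() is its square root
std_from_var = np.sqrt(data.var())

# ddof=1 gives the sample variance, dividing by (n - 1) instead of n
sample_var = data.var(ddof=1)

# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row)
col_means = data.mean(axis=0)
row_means = data.mean(axis=1)

print(mean_by_hand, std_from_var, sample_var)
print(col_means.shape, row_means.shape)
```

The same `axis` convention applies to `sum`, `min`, `max` and `cumsum`.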

Arrays can be copied in two ways: shallow copy and deep copy. Commands for both are shown below:

numpy_array11 = np.array([[ 1, 2, 3, 4,  5,  6,  7,  8],
                          [ 6, 7, 8, 9, 10, 11, 12, 13]])
numpy_array11


array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 6,  7,  8,  9, 10, 11, 12, 13]])


numpy_array11_view = numpy_array11.view() 
numpy_array11_view


array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 6,  7,  8,  9, 10, 11, 12, 13]])

numpy_array11_view is a new view of the array but shares the same data. This copying technique is called shallow copy.


numpy_array11_view is numpy_array11     

 
False

The test above shows that numpy_array11_view is not the same object as numpy_array11.

numpy_array11_view.base is numpy_array11

True 

The test above confirms that the data in numpy_array11_view is based on numpy_array11.


id(numpy_array11)       # identifier of numpy_array11


2674179372976


id(numpy_array11_view)  # identifier of numpy_array11_view


2674179374496

The identifier of numpy_array11_view is different from that of numpy_array11.


numpy_array11_deepcopy = numpy_array11.copy()


numpy_array11_deepcopy

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 6,  7,  8,  9, 10, 11, 12, 13]])

This manner of copying is called deep copy: a new array object is created with its own copy of the data, which is not shared with the original.

numpy_array11_deepcopy is numpy_array11      

False

The test above shows that numpy_array11_deepcopy is not numpy_array11


numpy_array11_deepcopy.base is numpy_array11 

False

The test above shows that the data in numpy_array11_deepcopy is not based on numpy_array11.
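The practical difference between the two shows up when we write to the result. The sketch below uses a small array of our own (the names `original`, `view` and `copy` are illustrative) and `np.shares_memory` to check whether two arrays use the same buffer.

```python
import numpy as np

original = np.array([[1, 2, 3], [4, 5, 6]])
view = original.view()   # new array object, same underlying buffer
copy = original.copy()   # new array object with its own buffer

view[0, 0] = 100         # writes through to `original`
copy[0, 1] = -5          # affects only `copy`

print(original)
print(np.shares_memory(original, view))
print(np.shares_memory(original, copy))
print(copy.base is None)
```

A change made through the view is visible in the original, while a change made to the copy is not.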

Examples of sorting arrays are shown below:

numpy_array16= np.random.randint(100, size=(5, 5))
numpy_array16


array([[38, 56, 80, 29, 32],
       [26, 55,  0, 47,  5],
       [29, 27, 50, 49, 35],
       [54, 50, 97, 56, 31],
       [63, 27, 38, 98, 84]])


np.sort(numpy_array16)        # sort along the last axis

array([[29, 32, 38, 56, 80],
       [ 0,  5, 26, 47, 55],
       [27, 29, 35, 49, 50],
       [31, 50, 54, 56, 97],
       [27, 38, 63, 84, 98]])


np.sort(numpy_array16, axis=0)   # sort along the first axis

array([[26, 27,  0, 29,  5],
       [29, 27, 38, 47, 31],
       [38, 50, 50, 49, 32],
       [54, 55, 80, 56, 35],
       [63, 56, 97, 98, 84]])


np.sort(numpy_array16, axis=None)  # sort the flattened array

array([ 0,  5, 26, 27, 27, 29, 29, 31, 32, 35, 38, 38, 47, 49, 50, 50, 54,
       55, 56, 56, 63, 80, 84, 97, 98])


dtype = [('name', 'S10'), ('salary', float), ('age', int)]
values = [('Nick', 5500, 41), ('Kyle', 6500, 44),
          ('Ken', 7500, 44)]


structured_array1 = np.array(values, dtype=dtype)       # create a structured array
np.sort(structured_array1, order='salary')              # sort by salary


array([(b'Nick', 5500., 41), (b'Kyle', 6500., 44), (b'Ken', 7500., 44)],
      dtype=[('name', 'S10'), ('salary', '<f8'), ('age', '<i4')])


np.sort(structured_array1, order=['age', 'salary'])     # sort by age, salary

array([(b'Nick', 5500., 41), (b'Kyle', 6500., 44), (b'Ken', 7500., 44)],
      dtype=[('name', 'S10'), ('salary', '<f8'), ('age', '<i4')])
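The records above happen to be in sorted order already, so both sorts return them unchanged. As a hedged illustration, the sketch below starts from a shuffled version of the same records (our own reordering, with a Unicode `'U10'` name field instead of `'S10'` to avoid the `b''` prefixes) so that the effect of `order` becomes visible.

```python
import numpy as np

dtype = [('name', 'U10'), ('salary', float), ('age', int)]
values = [('Ken', 7500, 44), ('Nick', 5500, 41), ('Kyle', 6500, 44)]
staff = np.array(values, dtype=dtype)

by_salary = np.sort(staff, order='salary')
by_age_then_salary = np.sort(staff, order=['age', 'salary'])

# np.sort has no descending option; reverse the ascending result instead
by_salary_desc = np.sort(staff, order='salary')[::-1]

print(by_salary['name'])
print(by_age_then_salary['name'])
print(by_salary_desc['name'])
```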

Extraction of elements from a NumPy array is one of the most important activities any developer will encounter. There are various techniques such as subsetting, slicing and indexing. Examples of these techniques are described below:

a) Subsetting: In this technique, a subset of the array is extracted. The subset can be a single element, a row or a column.

numpy_array17 = np.random.randint(100, size=(5, 5))
numpy_array17

array([[ 9, 12, 28, 68, 76],
       [74, 58, 37, 39, 46],
       [46, 15, 46, 24, 34],
       [33, 41, 53, 35, 30],
       [49, 78, 86, 57, 38]])

numpy_array17[0]          #Extracts first row

array([ 9, 12, 28, 68, 76])

numpy_array17[:,0]        #Extracts first column

array([ 9, 74, 46, 33, 49])

numpy_array17[2,2]       #Extracts single element

46

b) Slicing: In this technique, a slice consisting of one or more members is extracted

The syntax for slicing is [lower:upper:step], where the lower bound is included but the upper bound is not. step specifies the stride between elements and is 1 by default if unspecified. In a one-dimensional array, the first element has index 0; counting from the end, the last element has index -1. Some examples are shown below:

numpy_array18 = np.array([10,11,12,13,14])

numpy_array18[1:3]

array([11, 12])

numpy_array18[-4:3]

array([11, 12])

numpy_array18[:3]   # more like selecting head

array([10, 11, 12])

numpy_array18[-2:]  # more like selecting tail

array([13, 14])

numpy_array18[::2]

array([10, 12, 14])

numpy_array18[::-1]  # Reversing the array


array([14, 13, 12, 11, 10])
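The same [lower:upper:step] syntax applies to each axis of a multi-dimensional array. The sketch below uses a 5x5 array of our own (`grid` is an illustrative name) and also shows that a basic slice is a view, so writing to it changes the array it was taken from.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)

# rows 1 and 2, every second column
block = grid[1:3, ::2]

# a basic slice is a view: writing to it changes `grid` as well
block[0, 0] = -1

# slice(lower, upper, step) objects are what the colon syntax builds
same_rows = grid[slice(1, 3)]

print(block)
print(grid[1, 0])
print(np.array_equal(same_rows, grid[1:3]))
```

If an independent result is needed, calling `.copy()` on the slice avoids modifying the source array.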

c) Indexing using boolean indices: In this technique, a boolean array is used to select elements, as shown below:

numpy_array18 = np.array([10,11,12,13,14])

numpy_array18[numpy_array18 >= 13]

 array([13, 14])
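Conditions can be combined with the element-wise operators `&`, `|` and `~` (each condition needs its own parentheses), and a mask can also be used on the left-hand side of an assignment. A minimal sketch, using variable names of our own:

```python
import numpy as np

values = np.array([10, 11, 12, 13, 14])

mask = values >= 13                               # boolean array, same shape
in_range = values[(values > 10) & (values < 14)]  # element-wise AND
outside = values[(values < 11) | (values > 13)]   # element-wise OR
not_big = values[~mask]                           # element-wise NOT

capped = values.copy()
capped[capped >= 13] = 12                         # assignment through a mask

print(mask, in_range, outside, not_big, capped)
```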

d) Fancy indexing: Lastly, we have the fancy indexing where we have the capability to select complex subsets and also modify them using assignment

rand = np.random.RandomState(1)
numpy_array19 = rand.randint(100, size=10)
print(numpy_array19)

[37 12 72  9 75  5 79 64 16  1]

[numpy_array19[3], numpy_array19[7], numpy_array19[2]]

[9, 64, 72]

Alternatively, we can pass a single list or array of indices to select several elements in one step (note that the indices used here differ from those in the previous example):

ind = [3, 7, 4]

numpy_array19[ind]

array([ 9, 64, 75])

When using fancy indexing, the shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed:

indices = np.array([[3, 7],
                    [4, 5]])

numpy_array19[indices]


array([[ 9, 64],
       [75,  5]])


Fancy indexing also works in multiple dimensions. Consider the following array:

numpy_array20 = np.arange(12).reshape((3, 4))
numpy_array20


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])


As with standard indexing, the first index refers to the row, and the second to the column:

row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
numpy_array20[row, col]


array([ 2,  5, 11])

We can modify the array as shown below:

row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
numpy_array20[row, col] = -1
numpy_array20

array([[ 0,  1, -1,  3],
       [ 4, -1,  6,  7],
       [ 8,  9, 10, -1]])
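One subtlety when modifying through fancy indexing is repeated indices: an in-place operation such as `+=` is applied only once per target position, whereas `np.add.at` accumulates for every occurrence. The sketch below uses arrays of our own (`counts` and `idx` are illustrative names).

```python
import numpy as np

counts = np.zeros(5, dtype=int)
idx = np.array([0, 0, 2, 2, 2])

# with repeated indices, += updates each target position only once
counts[idx] += 1
once = counts.copy()

# np.add.at accumulates for every occurrence of an index
counts = np.zeros(5, dtype=int)
np.add.at(counts, idx, 1)

print(once)
print(counts)
```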

A few other operations are described below:

a) Changing array shape:

numpy_array21 = np.arange(24).reshape((2,2,2,3))
numpy_array21

array([[[[ 0,  1,  2],
         [ 3,  4,  5]],

        [[ 6,  7,  8],
         [ 9, 10, 11]]],


       [[[12, 13, 14],
         [15, 16, 17]],

        [[18, 19, 20],
         [21, 22, 23]]]])


numpy_array21 = numpy_array21.ravel()   # Flatten the array
numpy_array21


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])


numpy_array21.reshape((2,2,2,3))         # Reshape the array

array([[[[ 0,  1,  2],
         [ 3,  4,  5]],

        [[ 6,  7,  8],
         [ 9, 10, 11]]],


       [[[12, 13, 14],
         [15, 16, 17]],

        [[18, 19, 20],
         [21, 22, 23]]]])
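Two details of reshaping are worth a short sketch: passing -1 lets NumPy infer one dimension from the total size, and `ravel` returns a view where it can while `flatten` always returns a copy. The array `cube` below is our own example.

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)

# -1 asks NumPy to infer that dimension from the total size
inferred = cube.reshape(6, -1)

flat_view = cube.ravel()     # a view here, because `cube` is contiguous
flat_copy = cube.flatten()   # always a copy

flat_view[0] = 99            # visible through `cube`
flat_copy[1] = -1            # not visible through `cube`

print(inferred.shape)
print(cube[0, 0, 0], cube[0, 0, 1])
```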
 

b) Add and remove members:

numpy_array22=np.array([[0,1],[2,3]])
numpy_array22


array([[0, 1],
       [2, 3]])

np.resize(numpy_array22,(2,3))

array([[0, 1, 2],
       [3, 0, 1]])

The step above returns a new array with the specified shape. If the new array is larger than the original array, it is filled with repeated copies of the original array.

np.append(numpy_array22,numpy_array22)   # Append items to an array

array([0, 1, 2, 3, 0, 1, 2, 3])

np.insert(numpy_array22, 1, 5)   # No axis given: the array is flattened first, then 5 is inserted before index 1

array([0, 5, 1, 2, 3])


np.insert(numpy_array22, 1, 5, axis=1)

array([[0, 5, 1],
       [2, 5, 3]])


np.delete(numpy_array22, 1, 0)  # Return a new array with row 1 deleted (axis=0)

array([[0, 1]])


np.delete(numpy_array22, 1, 1)  # Return a new array with column 1 deleted (axis=1)

array([[0],
       [2]])
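None of these functions changes the array in place; each returns a new array. The sketch below, with arrays of our own (`base` and `wide` are illustrative names), also shows `np.append` with an explicit axis and `np.delete` with a list of indices.

```python
import numpy as np

base = np.array([[0, 1], [2, 3]])

# with axis given, the shapes must match on the other axes
appended_rows = np.append(base, [[4, 5]], axis=0)
appended_cols = np.append(base, [[9], [9]], axis=1)

# insert a whole row before index 1
with_row = np.insert(base, 1, [7, 8], axis=0)

# delete several columns at once by passing a list of indices
wide = np.arange(8).reshape(2, 4)
trimmed = np.delete(wide, [0, 2], axis=1)

print(appended_rows, appended_cols, with_row, trimmed, sep="\n")
print(base)   # unchanged: all of these functions return new arrays
```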
 

c) Combining arrays:

numpy_array23 = np.array([[1, 1], [2, 2], [3, 3]])
numpy_array23


array([[1, 1],
       [2, 2],
       [3, 3]]) 

np.concatenate((numpy_array23,numpy_array23),axis=0) 

array([[1, 1],
       [2, 2],
       [3, 3],
       [1, 1],
       [2, 2],
       [3, 3]])

The step above joins a sequence of arrays along an existing axis.

np.concatenate((numpy_array23,numpy_array23),axis=1)

array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3]])


np.concatenate((numpy_array23, numpy_array23), axis=None)

array([1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3])

numpy_array24 = np.array([1, 2, 3])
numpy_array25 = np.array([2, 3, 4])


np.vstack((numpy_array24, numpy_array25))     # Stack arrays vertically (row-wise)

array([[1, 2, 3],
       [2, 3, 4]])


np.hstack((numpy_array24, numpy_array25))   # stack arrays in sequence horizontally (column wise)

array([1, 2, 3, 2, 3, 4])

np.column_stack((numpy_array24, numpy_array25))   # Stack 1-D arrays as columns into a 2-D array

array([[1, 2],
       [2, 3],
       [3, 4]])
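Alongside these helpers, `np.stack` joins arrays along a new axis, whereas `np.concatenate` joins along an existing one. A minimal sketch with two 1-D arrays of our own:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 3, 4])

# np.stack joins along a NEW axis, so the inputs need identical shapes
stacked_rows = np.stack((a, b), axis=0)   # same result as np.vstack for 1-D input
stacked_cols = np.stack((a, b), axis=1)   # same result as np.column_stack for 1-D input

# np.concatenate joins along an EXISTING axis; 1-D inputs stay 1-D
joined = np.concatenate((a, b))

print(stacked_rows.shape, stacked_cols.shape, joined.shape)
```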

d) Splitting of arrays: Arrays can be split using hsplit (horizontal split) and vsplit (vertical split) as shown below:

numpy_array26 = np.arange(16.0).reshape(4, 4)
numpy_array26


array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.],
       [12., 13., 14., 15.]])


np.hsplit(numpy_array26, 2) # Split an array into multiple sub-arrays horizontally (column-wise)

[array([[ 0.,  1.],
        [ 4.,  5.],
        [ 8.,  9.],
        [12., 13.]]),
 array([[ 2.,  3.],
        [ 6.,  7.],
        [10., 11.],
        [14., 15.]])]


np.hsplit(numpy_array26, 4)

[array([[ 0.],
        [ 4.],
        [ 8.],
        [12.]]),
 array([[ 1.],
        [ 5.],
        [ 9.],
        [13.]]),
 array([[ 2.],
        [ 6.],
        [10.],
        [14.]]),
 array([[ 3.],
        [ 7.],
        [11.],
        [15.]])]


np.vsplit(numpy_array26, 2)   # Split an array into multiple sub-arrays vertically (row-wise)

[array([[0., 1., 2., 3.],
        [4., 5., 6., 7.]]),
 array([[ 8.,  9., 10., 11.],
        [12., 13., 14., 15.]])]


np.vsplit(numpy_array26, 4)

[array([[0., 1., 2., 3.]]),
 array([[4., 5., 6., 7.]]),
 array([[ 8.,  9., 10., 11.]]),
 array([[12., 13., 14., 15.]])]
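hsplit and vsplit with an integer require an equal division. As a hedged sketch of the alternatives, `np.split` accepts a list of cut positions and `np.array_split` tolerates sections of unequal size. The variable names below are our own.

```python
import numpy as np

grid = np.arange(16.0).reshape(4, 4)

# a list of indices marks where to cut: rows [0:1], [1:3] and [3:]
top, middle, bottom = np.split(grid, [1, 3], axis=0)

# array_split tolerates sections of unequal size
# (np.hsplit(grid, 3) would raise a ValueError here)
parts = np.array_split(grid, 3, axis=1)

print(top.shape, middle.shape, bottom.shape)
print([p.shape for p in parts])
```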

With this, we have covered nearly all the basic aspects of NumPy arrays. This concludes the posts on NumPy.

Monday, 17 August 2020

NumPy - I

Today I received a mail from a friend complaining that there were no new articles. Writing a new post for the blog had been on my mind for the last few months but, for lack of motivation, I was dilly-dallying and resorted to the easy way out: procrastinating. Voilà, a shot in the arm and we are back in business.

In this post, we will attempt to unveil some important aspects of NumPy. NumPy as we know it today was released as NumPy 1.0 in 2006. At the time of writing, the latest version is 1.19.0, which supports Python 3.6-3.8. NumPy is the fundamental package for scientific computing in Python. At the centre of this Python library is the ndarray object, an n-dimensional array of homogeneous elements.

We will be using Jupyter notebook for all the work in this article. The code is in blue font and output is in green font below the code. The version details are given below:

import sys
print("Python version:", sys.version)

import numpy as np
print("NumPy version:", np.__version__)

Python version: 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
NumPy version: 1.18.5

Let's start by creating a few NumPy arrays and displaying each one immediately:

numpy_array1 = np.array([1, 2, 3, 4, 5])
numpy_array1

array([1, 2, 3, 4, 5])

numpy_array2 = np.array([True, False, True, False, True], dtype = bool)
numpy_array2
 

array([ True, False,  True, False,  True])

numpy_array3 = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
numpy_array3

array([1.1, 2.2, 3.3, 4.4, 5.5])

numpy_array4 = np.array([1, 2, 3, 4, 5], dtype = np.uint8) ## dtype is Unsigned integer (0 to 255)
numpy_array4

array([1, 2, 3, 4, 5], dtype=uint8)

numpy_array5 = np.array(['NumPy',"is","the",'fundamental',"package",'for',"scientific",'computing'])
numpy_array5

array(['NumPy', 'is', 'the', 'fundamental', 'package', 'for',
       'scientific', 'computing'], dtype='<U11')

We can also create NumPy arrays from a list:

python_list = [6, 7, 8, 9, 10]
numpy_array6 = np.array(python_list)
numpy_array6

array([ 6,  7,  8,  9, 10])

A few more examples of NumPy array creation are shown below:

String = "1.1 2.2 3.3 4.4 5.5"
numpy_array7 = np.fromstring(String, dtype = np.double, sep = " ")
numpy_array7

array([1.1, 2.2, 3.3, 4.4, 5.5])

numpy_array8 = np.zeros((5,2), dtype = int) ## initializes with zeros
numpy_array8

array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

numpy_array9 = np.eye(3, dtype = float) ## initializes with zeros but with ones along the diagonal
numpy_array9

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])


numpy_array10 = np.full((5, 4), 10) ## fills an array of the given shape with the given value
numpy_array10

array([[10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]])
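A few other constructors come up often enough to mention alongside these. The sketch below is our own addition and uses `np.arange`, `np.linspace`, `np.ones` and `np.zeros_like`; the variable names are illustrative.

```python
import numpy as np

steps = np.arange(0, 10, 2)           # start, stop (excluded), step
points = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced values, stop included
ones = np.ones((2, 3))                # float64 by default
template = np.array([[1, 2], [3, 4]], dtype=np.int16)
blank = np.zeros_like(template)       # same shape and dtype as `template`

print(steps)
print(points)
print(ones.dtype, blank.dtype, blank.shape)
```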

To check the properties of arrays, we can use the following commands:

numpy_array11 = np.array([[ 1, 2, 3, 4,  5,  6,  7,  8],
                          [ 6, 7, 8, 9, 10, 11, 12, 13]])
numpy_array11

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 6,  7,  8,  9, 10, 11, 12, 13]])

numpy_array11.shape  ## shape

(2, 8)

numpy_array11.size  ## size

16

numpy_array11.ndim ## number of dimensions

2

numpy_array11.itemsize ## size of each item in bytes

4

numpy_array11.nbytes ##size in bytes of all elements


64

numpy_array11.dtype ## type of elements


dtype('int32')

len(numpy_array11) ## length of the first axis (number of rows)


2
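These properties are related: size is the product of the shape, and nbytes is size multiplied by itemsize. The itemsize of 4 above reflects the default integer type on the Windows build shown (int32); on many other platforms the default integer is 64-bit, so the sketch below sets the dtype explicitly. The names `table` and `wider` are our own.

```python
import numpy as np

table = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
                  [6, 7, 8, 9, 10, 11, 12, 13]], dtype=np.int32)

# size is the product of the shape; nbytes is size * itemsize
assert table.size == table.shape[0] * table.shape[1]
assert table.nbytes == table.size * table.itemsize

wider = table.astype(np.int64)   # same values, 8 bytes per element
print(table.itemsize, table.nbytes)
print(wider.itemsize, wider.nbytes)
```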

Operations on NumPy arrays are shown in the next few commands:

numpy_array_sum = numpy_array1 + numpy_array2 # Addition
numpy_array_sum


array([2, 2, 4, 4, 6])

numpy_array_difference = numpy_array1 - numpy_array2 # Subtraction
numpy_array_difference

array([0, 2, 2, 4, 4])

numpy_array_product = numpy_array1 * numpy_array2 # Multiplication
numpy_array_product


array([1, 0, 3, 0, 5])

numpy_array_quotient = numpy_array1 / numpy_array1 # Division
numpy_array_quotient


array([1., 1., 1., 1., 1.])

numpy_array_square_root = numpy_array1 ** 0.5 # square root
numpy_array_square_root


array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798])


numpy_array_raised_power = numpy_array1 ** 3 # Exponentiation
numpy_array_raised_power


array([  1,   8,  27,  64, 125], dtype=int32)


numpy_array_sin = np.sin(numpy_array1) # array value treated as radians
numpy_array_sin


array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 , -0.95892427])


numpy_array_log = np.log(numpy_array1) # natural logarithm
numpy_array_log

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791])
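A few more element-wise operations follow the same pattern. This is a small sketch of our own: `np.sqrt` as the named equivalent of `** 0.5`, floor division, modulo, and arithmetic between an integer array and a boolean array.

```python
import numpy as np

ints = np.array([1, 2, 3, 4, 5])
flags = np.array([True, False, True, False, True])

root = np.sqrt(ints)        # same values as ints ** 0.5
floor_div = ints // 2       # integer (floor) division
remainder = ints % 2        # element-wise modulo
mixed = ints + flags        # booleans are treated as 1 and 0

print(root)
print(floor_div, remainder, mixed)
```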

Some of the constants that we may encounter in our line of work:

np.e

2.718281828459045

np.pi

3.141592653589793

np.inf

inf

np.Inf

inf

np.NINF

-inf

np.NAN

nan

np.NZERO

-0.0

np.PZERO

0.0
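Special values need special tests: nan never compares equal to anything, including itself, so NumPy provides dedicated functions for detecting it. The sketch below uses only the lowercase spellings `np.nan` and `np.inf`, since capitalized aliases such as `np.Inf`, `np.NINF`, `np.NZERO` and `np.PZERO` were removed in NumPy 2.0. The array `samples` is our own.

```python
import math
import numpy as np

samples = np.array([1.0, np.nan, np.inf, -np.inf, -0.0])

print(np.nan == np.nan)          # False: nan never compares equal, even to itself
print(np.isnan(samples))
print(np.isinf(samples))
print(np.isfinite(samples))
print(np.signbit(samples))       # True for -inf and for negative zero
print(math.isclose(np.e, math.e), math.isclose(np.pi, math.pi))
```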

To compare two NumPy arrays, we can use the following commands:

np.equal(numpy_array2, numpy_array1)  #Element wise comparison

array([ True, False, False, False, False])


numpy_array1 == numpy_array2  #Element wise comparison


array([ True, False, False, False, False])


np.array_equal(numpy_array1, numpy_array1)  # Array Comparison

True

np.array_equal(numpy_array1, numpy_array2)  # Array Comparison


False
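For floating-point arrays, exact comparison can fail because of rounding error, so a tolerance-based comparison is often more appropriate. A minimal sketch with arrays of our own, using `np.allclose` and `np.isclose`:

```python
import numpy as np

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.1, 0.2, 0.1 + 0.2])   # 0.1 + 0.2 is not exactly 0.3

exact = np.array_equal(a, b)          # False because of rounding error
close = np.allclose(a, b)             # True within the default tolerances
per_element = np.isclose(a, b)        # element-wise version of allclose

any_equal = (a == b).any()            # reduce an element-wise comparison
all_equal = (a == b).all()

print(exact, close, per_element, any_equal, all_equal)
```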

Broadcasting in NumPy: For an operation between two arrays, NumPy first compares their shapes axis by axis, starting from the trailing (rightmost) axis. If the shapes are identical, no broadcasting is needed. If they differ, the operation succeeds only when, for each pair of axes compared, the sizes are equal or one of them is 1. Otherwise, a ValueError: operands could not be broadcast together with shapes ... exception is raised, indicating that the arrays have incompatible shapes and broadcasting could not be applied.

numpy_array12 = np.arange(1,16).reshape(3,5)
numpy_array12


array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])


numpy_array12 + numpy_array12 # same shape so, no problem


array([[ 2,  4,  6,  8, 10],
       [12, 14, 16, 18, 20],
       [22, 24, 26, 28, 30]])


numpy_array13 = np.array([1])
numpy_array13


array([1])


numpy_array12 + numpy_array13 # broadcasting is valid as size of trailing axis of numpy_array13 is 1

array([[ 2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16]])


numpy_array14 = np.array([1, 2, 3, 4, 5])
numpy_array14


array([1, 2, 3, 4, 5])

numpy_array12 + numpy_array14 # broadcasting is valid as the trailing axis of both arrays has the same size, 5

array([[ 2,  4,  6,  8, 10],
       [ 7,  9, 11, 13, 15],
       [12, 14, 16, 18, 20]])

numpy_array15 = np.array([1, 2])
numpy_array15

array([1, 2])

numpy_array12 + numpy_array15  # error will be thrown

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-ff4233b2f95d> in <module>
----> 1 numpy_array12 + numpy_array15  # error will be thrown

ValueError: operands could not be broadcast together with shapes (3,5) (2,) 
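An incompatible shape can often be made compatible by adding an axis of length 1. The sketch below uses arrays of our own (`matrix`, `row` and `col` are illustrative names): a length-3 array cannot be added to a (3, 5) array directly, but it can once it is reshaped to (3, 1) with `np.newaxis`.

```python
import numpy as np

matrix = np.arange(1, 16).reshape(3, 5)     # shape (3, 5)
row = np.array([1, 2, 3, 4, 5])             # shape (5,)
col = np.array([10, 20, 30])                # shape (3,)

# shapes are compared from the right: (3, 5) vs (5,) -> compatible
plus_row = matrix + row

# (3, 5) vs (3,) fails, but adding a trailing axis of length 1 gives
# (3, 5) vs (3, 1), which is compatible
plus_col = matrix + col[:, np.newaxis]

try:
    matrix + col
except ValueError as exc:
    message = str(exc)

print(plus_row.shape, plus_col.shape)
print(message)
```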
 
With this concept of broadcasting in NumPy, we come to the end of the first post on NumPy.

Tuesday, 29 October 2019

Cloud - VII

We did a word count using Apache Hive on Cloudera QuickStart VM 5.12 here. In this post, we will repeat the same word count using Apache Hive, but on AWS EMR. Amazon Elastic MapReduce, or AWS EMR, is a managed Hadoop framework that can be deployed swiftly to process large amounts of data across dynamically scalable Amazon EC2 instances. You can also run open source tools and popular distributed frameworks such as Apache Hive, Apache Spark, Apache HBase, Presto, and Apache Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3, Amazon DynamoDB, and Amazon Redshift.

We will use Amazon S3 to store our Hive queries, the input data, and the output from the Hive queries run on Amazon EMR. In S3, we have created a bucket called emr-example--bucket with three folders: code, input and output. The code folder will house all the Hive queries. The input folder will contain the data file, blue_carbuncle.txt, holding the text on which we will attempt the word count. Results of the Hive queries will be written to the output folder. The bucket and folders are shown below:













The first paragraph in the data file, blue_carbuncle.txt, is shown below:

 











The data file, blue_carbuncle.txt, is placed in the input folder:














The Hive queries are shown below:

create external table text (line string) location '${INPUT}/input';

insert overwrite directory '${OUTPUT}/result1/' select word, count(*) from(select explode(split(line,'\\s')) as word from text) z group by word;
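To make the logic of the second query easier to follow, here is a rough Python equivalent of what it computes. This is only an illustrative sketch, not something that runs on EMR: the function name `word_count` and the sample lines are made up, and we assume that splitting on the pattern \s treats each whitespace character as a separator, so consecutive spaces produce empty-string tokens.

```python
import re
from collections import Counter

def word_count(lines):
    """Count tokens the way the split / explode / group by query does."""
    counts = Counter()
    for line in lines:
        # split(line, '\\s'): split on each single whitespace character
        # explode(...): one row per resulting token
        for word in re.split(r"\s", line):
            counts[word] += 1      # group by word, count(*)
    return counts

# made-up lines for illustration, not taken from the data file
sample_lines = ["the blue carbuncle", "the  goose"]
result = word_count(sample_lines)

for word, count in sorted(result.items()):
    print(repr(word), count)
```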

These queries are the same as in the previous post except for the "location" part in the first query and the "insert overwrite directory" part in the second query. They are in a single file called hive1.q under the code folder:














The output folder is empty as no Hive queries have been run so far. Now, we can go ahead and run the Hive queries on the data file by spinning up an Amazon EMR cluster on the fly. Frankly, I enjoyed the experience as I never thought setting up a Hadoop cluster would be such a breeze. So, now onto Amazon EMR:











Click on the Create cluster button. In the next screen, add an EC2 key pair only if you already have one; otherwise, take the defaults. Click on Create cluster to launch a cluster comprising 1 m5.xlarge master node and 2 m5.xlarge core nodes:























The cluster will take a few minutes to launch. Once it has launched, click on the Steps tab and then the Add step button to add the details needed to run the Hive query:

 









In the Add step window, add the values as follows:
















After setting the values for the Script, Input and Output locations, click the Add button to kick off the Hive query on the cluster. Once the Hive query has run, the Status shows Completed:











Navigate to S3 to see the result:














Download the file, 000000_0, and see the contents:

























The contents are the same as in the last post. This concludes the post on Amazon EMR.