Wednesday, 19 April 2017

Analytic Functions in Oracle - IX

In this post, we will take a look at the MODEL keyword. The MODEL (or SPREADSHEET) keyword extends the functionality of SQL by letting us treat the result set of a query as a multidimensional array and apply calculations, or rules, to that array. These rules are quite flexible; for example, the value of a column can be calculated based on the values in other columns.
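
Just to get a feel for the mechanics before we move to real data, below is a minimal sketch that runs against dual and needs no schema objects; the column names x and y are purely illustrative:

select x, y
from (select 1 x, 10 y from dual)
model
  dimension by (x)
  measures (y)
  rules
  ( y[2] = y[1] * 2 );

The single source row is addressed as the cell y[1]; the rule creates a new cell y[2] holding twice that value, so the query returns two rows.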

We use Oracle 12c as the basis for all our queries. For this post, we will create a new view called sales_model that we will use for the queries related to this post. The sales_model view is based on tables in the sample SH schema. The statement for creating the sales_model view is given below:

create or replace view sales_model
as
select distinct extract(year from c.time_id) sale_year,
       a.prod_name product,
       b.channel_desc channel,
       sum(c.amount_sold) over (partition by extract(year from c.time_id), a.prod_name, b.channel_desc) amount
from sh.products a, sh.channels b, sh.sales c
where a.prod_id = c.prod_id
  and b.channel_id = c.channel_id
  and a.prod_name like '%6%MB Memory Card'
  and extract(year from c.time_id) in (1999, 2000)
order by a.prod_name;


returns

view SALES_MODEL created.

The below query gives an idea of the data that we will play with in the various queries using MODEL:
 
select * from sales_model;

There are twelve rows holding the sale amount for two products over two years via different channels.

Let us take up the first query on SALES_MODEL. The query is shown below:

select sale_year, product,channel, amount, new_amount
from sales_model
spreadsheet
partition by (sale_year, product)
dimension by (channel,amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet',any]  = 100)
order by sale_year, product;


The results returned are shown below: 

To understand the results of the above query, we need to understand the different parts of the query:

a) The SPREADSHEET keyword is used instead of MODEL; the two keywords are synonyms (see the sketch after this list).

b)  The query returns five columns: four from the original view and the fifth is calculated as an alias of amount.

c) The result set is partitioned by sale_year and product. The rules applied later operate only within each partition defined by this clause.

d) The columns mentioned in the dimension clause, channel and amount, are the ones used by the rules.

e) The measures clause contains the columns that will be used for the calculation. Here, it is column amount aliased to new_amount.

f) Rules use the columns in the dimension clause to act on the columns in the measures clause, performing the calculations defined in the rule. The rule here states that wherever the channel column has the value 'Internet', for any value in the amount column, the aliased amount column new_amount should be set to 100. So, rows 5, 8 and 12 contain 100 in the new_amount column.
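
Since SPREADSHEET is simply a synonym for MODEL, the same query can be written with the more commonly used MODEL keyword. A sketch against the same sales_model view:

select sale_year, product, channel, amount, new_amount
from sales_model
model
partition by (sale_year, product)
dimension by (channel, amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet', any] = 100)
order by sale_year, product;

The results are identical to those of the SPREADSHEET version.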

The second example has two rules as shown below:

select sale_year, product,channel, amount, new_amount
from sales_model
spreadsheet
partition by (sale_year, product)
dimension by (channel,amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet',any]  = 100,
new_amount['Partners',any]  = 50
)
order by sale_year, product;


The results returned are shown below:

The above query just extends the rules so that the 'Partners' channel is also covered, but with a value of 50, and the result is evident in the result set.

The next example shows how we can apply calculations to an existing column value by using the cv() function. "cv" stands for current value.

select sale_year, product,channel, amount, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year, product)
dimension by (channel,amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet',any]  = 2*new_amount['Internet',cv()]
)
order by sale_year, product;


The results returned are shown below:

Fewer records are returned because of the filter clause. The rule is defined such that, in the new_amount column, the current value is doubled only when the value in the channel column is 'Internet'.
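
cv() can also name a dimension column explicitly. As a sketch, the rule above could equally read back the current cell through cv(channel) and cv(amount), which makes the reference a little more self-describing:

select sale_year, product, channel, amount, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year, product)
dimension by (channel, amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet', any] = 2 * new_amount[cv(channel), cv(amount)])
order by sale_year, product;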

One can also use values in another row for replacement as shown in below query:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Partners']  = new_amount['64MB Memory Card','Internet'])
order by sale_year, product;


The results returned are shown below:

In the results above, when the product is '64MB Memory Card' and the channel is 'Partners', new_amount is set to the value for the product '64MB Memory Card' and channel 'Internet'. This is as per the rule that is set in the query.
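
The cell references used so far are positional: the literals are matched to the dimension columns by their position in the dimension by clause. A reference can also be written symbolically by naming the dimension columns, with the caveat that a symbolic reference on the left-hand side only updates existing cells and never creates new ones. A sketch of the same rule in symbolic form:

select sale_year, product, channel, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year)
dimension by (product, channel)
measures (amount new_amount) ignore nav
rules
(new_amount[product = '64MB Memory Card', channel = 'Partners'] = new_amount['64MB Memory Card', 'Internet'])
order by sale_year, product;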

Like the result above, we can also use calculations to arrive at a value for new_amount as shown below:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Partners']  = new_amount['64MB Memory Card','Internet'] + new_amount['256MB Memory Card','Internet'])
order by sale_year, product;


The results returned are shown below:

As per the rule, when the product is '64MB Memory Card' and the channel is 'Partners', the new_amount value is set to the sum of the new_amount values of the products '64MB Memory Card' and '256MB Memory Card' where the channel is 'Internet'. 4348.69 is the sum of 2814.4 and 1534.29.

The next query creates new records based on the rules as shown below:

select sale_year, product,channel, new_amount
from sales_model
model
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['1024MB Memory Card','Partners']  = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;


The results returned are shown below:

In the rules, we have created a new product called '1024MB Memory Card' with a channel of 'Partners'. Two records are created, corresponding to the two sale_year values, and the new_amount value is set to the sum of the new_amount values of the same channel for the two existing products.
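
The new rows appear because a positional reference on the left-hand side of a rule behaves as an upsert by default: if the addressed cell does not exist, it is created. If we only wanted existing cells to be changed, the rules can be declared with update; a sketch is below, and with it no '1024MB Memory Card' rows would be created:

select sale_year, product, channel, new_amount
from sales_model
model
partition by (sale_year)
dimension by (product, channel)
measures (amount new_amount) ignore nav
rules update
(new_amount['1024MB Memory Card','Partners'] = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;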

To return only the new rows, use return updated rows clause as shown below:

select sale_year, product,channel, new_amount
from sales_model
model return updated rows
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['1024MB Memory Card','Partners']  = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;


The results returned are shown below:

The return updated rows clause after the model keyword does the trick.

In the below query, we create a record based on a different channel:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Road Show']  = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;


The results returned are shown below:

The newly created record is seen in row 6, for the channel 'Road Show', with its new_amount value being the sum of the amounts of the two products for the 'Partners' channel.

We can also use aggregate functions across rows to set value on records as shown below:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Road Show']  = 2*max(new_amount)['256MB Memory Card',channel in ('Internet','Direct Sales')])
order by sale_year, product;


The results returned are shown below:

Row 6 contains 208919.82 in new_amount, which is twice the maximum value, 104459.91, among the new_amount values of the product '256MB Memory Card' in the two channels, 'Internet' and 'Direct Sales'.

The last example shows the use of the for clause:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card',for channel in ('Internet','Partners','Direct Sales')]  = 3*new_amount['64MB Memory Card',cv()])
order by sale_year, product;


The results returned are shown below:

The for clause has been used to set new_amount to three times the existing value, across the three listed channels, only for the product '64MB Memory Card'.
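
A for loop can also be placed on more than one dimension in the same cell reference. As a sketch against the same view, the rule below triples the amounts for both products across three channels in one go; cv(product) and cv(channel) pick up the coordinates of the cell currently being computed:

select sale_year, product, channel, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year)
dimension by (product, channel)
measures (amount new_amount) ignore nav
rules
(new_amount[for product in ('64MB Memory Card','256MB Memory Card'), for channel in ('Internet','Partners','Direct Sales')] = 3 * new_amount[cv(product), cv(channel)])
order by sale_year, product;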

This concludes the post on MODEL keyword.

Wednesday, 12 April 2017

Analytic Functions in Oracle - VIII

We embark on the next topic in the series, Pattern Matching using MATCH_RECOGNIZE. This powerful feature was introduced in Oracle 12c. We will not be using the familiar HR schema for the queries that explore this topic. Some preparatory work is needed before we start running the queries related to Pattern Matching. We will create a table based on the Dow Jones Industrial Average for the period 20th September 2016 to 27th October 2016. The data can be found at this link. The displayed data can be downloaded as a .csv file. To use this data for our queries, let us create a table using the below statement:

create table hr.djia_table (trade_date date, open number, high number, low number, close number, volume number, adj_close number);

returns

table HR.DJIA_TABLE created.

Let us now proceed to add data to the newly created table as follows:

1) First, click on the table in SQL Developer as shown below. This will bring up the table details on the right.

2) Click on the Action menu as shown below. This will bring up the Import Data ... menu item.

3) Click on Import Data ... menu item. This will let you choose the file to be imported. Navigate to the directory where the .csv file is located and click Open button.

4) Click OK button on the next dialog box that comes up.

5) Then, click on Next button.

6) Click Next button on the next dialog box.

7) Click on the Next button on the next dialog box.

8) On the next dialog box, enter YYYY-MM-DD in Format field as shown below.

9) Click on each Source Data Column on the left to make sure that the column on the right is correctly mapped. As an example, see that the High column in the file is mapped to the HIGH column in the table.

10) The Adj Close column in the file is mapped to TRADE_DATE by default. Map Adj Close to ADJ_CLOSE as shown below and click on the Next button.

11) On the last dialog box, click on Verify button to verify the parameters.

We get SUCCESS status for different tests carried out by SQL Developer.

12) Click on Finish button. We get below message.

13) Click OK button and then click on data tab to see the data loaded successfully.
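
Before moving on, it is worth a quick sanity check that the import worked; for the period above, the table should hold 28 trading days. A simple sketch:

select count(*) row_count, min(trade_date) first_day, max(trade_date) last_day
from hr.djia_table;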

Before we explain Pattern Matching, let us look at a plot of trade date versus adjusted close in Excel. 

Pattern Matching will help in recognizing patterns in the graph shown above, such as the V-shaped pattern formed by, say, the y-axis values 18392.46094, 18293.69922, 18261.44922, 18094.83008, 18339.24023.

Now that we have the data in DJIA_TABLE, we can start firing queries on pattern matching on this table.

As a first example, let us run below query:

select * from hr.djia_table
    match_recognize
        (order by trade_date
        measures
                    strt.trade_date as start_date,
                    last(down.trade_date) as bottom_date,
                    last(up.trade_date) as end_date
        after match skip to last up
        pattern (strt down+ up+)
        define
                     down as down.adj_close < prev(down.adj_close),
                     up as up.adj_close > prev(up.adj_close) ) mr
        order by mr.start_date;

The results returned are shown below: 

There are 9 rows returned, which means that there are 9 V patterns in the data set. Let us break the query down into parts to understand the MATCH_RECOGNIZE clause in more detail.

The SELECT at the start of the query will specify the data set that will be used for pattern matching. In the above example, we use DJIA_TABLE. 

MATCH_RECOGNIZE starts with PARTITION BY, an optional clause that we have not used here. It is followed by the ORDER BY clause, which orders the rows for pattern matching and should always be specified to produce consistent results.

The pattern, specified with the PATTERN keyword, applies a regular expression to a sequence of rows. For example, the + symbol specifies one or more iterations. In the above example, the pattern begins with strt, is followed by one or more down rows and then by one or more up rows: PATTERN (STRT DOWN+ UP+). STRT is the starting point; DOWN can occur one or more times because we do not know after how many points the bottom is reached, and UP can occur one or more times because we do not know after how many points the top is reached. Together they describe a V-shaped pattern comprising a start, one or more downs, and one or more ups.
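
The usual regular expression quantifiers (*, ?, {n}, {n,}, {n,m}) are also available in PATTERN, so the shape of the match can be tightened. As a sketch against the same table, the pattern below insists on at least two consecutive falls before the recovery:

select * from hr.djia_table
    match_recognize
        (order by trade_date
        measures
                    strt.trade_date as start_date,
                    last(down.trade_date) as bottom_date,
                    last(up.trade_date) as end_date
        after match skip to last up
        pattern (strt down{2,} up+)
        define
                     down as down.adj_close < prev(down.adj_close),
                     up as up.adj_close > prev(up.adj_close) ) mr
        order by mr.start_date;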

DEFINE is used to specify the conditions for the pattern variables that identify the individual events within the pattern. In the above example, DEFINE specifies the conditions for the down and up variables.

MEASURES specifies the values that will be passed to the select statement, such as source data points, pattern data points and aggregates related to a pattern. Here we can use the pattern navigation operations FIRST, LAST, NEXT and PREV. In the above example, we are using the LAST keyword to pick out the bottom and the end of the V pattern. Two special measures are available, CLASSIFIER() and MATCH_NUMBER(). CLASSIFIER() reports which variable from the PATTERN and DEFINE clauses each row matched, and MATCH_NUMBER() reports which match a row belongs to, starting from 1. We will use these later.

Lastly, we need to specify the output that will be passed to the select statement, whether at summary level or detail level, and also where the next search should start once a pattern match has been found.

For the level of information, there are three options: ONE ROW PER MATCH, ALL ROWS PER MATCH, and ALL ROWS PER MATCH WITH UNMATCHED ROWS. ONE ROW PER MATCH is the default and provides a summary, one row per match. Here we are effectively using ONE ROW PER MATCH; since we have not specified anything, the default applies.

To indicate where the next search should start after a pattern match, the options are AFTER MATCH SKIP PAST LAST ROW, SKIP TO NEXT ROW, SKIP TO <VARIABLE>, SKIP TO FIRST (<VARIABLE>), and SKIP TO LAST (<VARIABLE>). In the above example, we are using AFTER MATCH SKIP TO LAST UP.
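
To see the effect of the skip option, the first query can be rerun with the default, AFTER MATCH SKIP PAST LAST ROW; a sketch is below. Because the next search then begins after the final row of the previous match, rows are not shared between matches, so the reported start dates, and possibly the number of matches, can differ:

select * from hr.djia_table
    match_recognize
        (order by trade_date
        measures
                    strt.trade_date as start_date,
                    last(down.trade_date) as bottom_date,
                    last(up.trade_date) as end_date
        after match skip past last row
        pattern (strt down+ up+)
        define
                     down as down.adj_close < prev(down.adj_close),
                     up as up.adj_close > prev(up.adj_close) ) mr
        order by mr.start_date;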

In the second query, we find the number of peaks in the data set. The portions of the SQL that have changed are the MEASURES and the PATTERN, where the order of up and down is reversed. The changed query is shown below:

select * from hr.djia_table
    match_recognize
        (order by trade_date
        measures
                    strt.trade_date as start_date,
                    last(up.trade_date) as up_date,
                    last(down.trade_date) as end_date
        after match skip to last up
        pattern (strt up+ down+)
        define
                     down as down.adj_close < prev(down.adj_close),
                     up as up.adj_close > prev(up.adj_close) ) mr
        order by mr.start_date;


The results returned are shown below:

In this case, the query returns 10 rows as there are indeed 10 peaks.

In the next query, we aim for a more detailed result that also gives information about the matched patterns.

select * from hr.djia_table
    match_recognize
        (order by trade_date
        measures
                    strt.trade_date as start_date,
                    final last(up.trade_date) as up_date,
                    final last(down.trade_date) as end_date,
                    match_number() as match_num,
                    classifier() AS var_match
        all rows per match
        after match skip to last up
        pattern (strt up+ down+)
        define
                     down as down.adj_close < prev(down.adj_close),
                     up as up.adj_close > prev(up.adj_close) ) mr
        order by trade_date;



The results returned are shown below:

We see that though the source has only 28 records, 37 records are returned. This is because records are repeated when the last row of one match is also the start row of the next match.

The changes from the last script are: 

1) The FINAL keyword is used under MEASURES. It ensures that up_date and end_date show the final value for the completed match on every row, rather than a running value that starts out null and is filled in as the pattern progresses (see the sketch after this list).

2) ALL ROWS PER MATCH gives detailed results and covers all the records in the source data set corresponding to the pattern.

3) MATCH_NUMBER() returns the match number. In the above example, there are ten matches, so MATCH_NUMBER() increments accordingly. CLASSIFIER() labels each row with the pattern variable it matched from the PATTERN clause.
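
To see the difference the FINAL keyword makes, both semantics can be placed side by side in MEASURES. A sketch: running_end shows the last down date seen so far within the match (null until the first down row), while final_end shows the value for the completed match on every row of that match:

select * from hr.djia_table
    match_recognize
        (order by trade_date
        measures
                    match_number() as match_num,
                    classifier() as var_match,
                    running last(down.trade_date) as running_end,
                    final last(down.trade_date) as final_end
        all rows per match
        after match skip to last up
        pattern (strt up+ down+)
        define
                     down as down.adj_close < prev(down.adj_close),
                     up as up.adj_close > prev(up.adj_close) ) mr
        order by trade_date;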

This concludes the post on PATTERN MATCHING in Oracle 12c.