When Data Drives Your Business
  GO
Contact Us 888-828-8201

 
 The Aginity Blog

Posted by: Brian Weissler on 7/12/2010 | 0 Comments

Here's a short demonstration video of how to create, compile, register, and call a custom iClass Analytic Process. 

The Analytic Process is one of the most powerful features of the next Netezza release (v6 + iClass), currently in Beta.  This provides developers with extreme flexibility to create programs which execute on the NPS, taking advantage of the MPP architecture for distribution and parallelism.

 

CompileandRegister

Posted by: Oleg on 7/7/2010 | 0 Comments

This is the second part of the UDA creation article. The first part can be found here: Creating a Netezza User Defined Aggregate (UDA) – understanding the basics.

The most simple and intuitive tool for creating a UDA is Aginity Netezza Workbench. Well, it is the most simple and intuitive tool I am aware of :) The tool will be used for creating a sample UDA.

Sample UDA Description

We are going to create a user defined aggregate SUM_IGNORE_NULLS(). The UDA will behave exactly like SUM() does with the only exception that a NULL value would not cause the function to return NULL, e.g. NULLs will be considered zeroes.

Step by step instructions for creating SUM_IGNORE_NULLS aggregate

1.       Start the Workbench

2.       Click File / New UDX / Aggregate menu

3.       Specify SUM_IGNORE_NULLS as the SQL Name on the General page


 

4.       This sample function will accept a BIGING parameter and return a BIGINT value. Switch to the Arguments page and specify the function accepts single parameter called Summand and its data type is BIGINT as shown on the image below:


 

5.       The function should be able maintaining its stage between the calls (please see the first part of this article for more information.) Since we are calculating a sum of integer values, single BIGINT state value is sufficient. Switch to the State page and add an Accumulator state variable of data type BIGINT


 

6.       Specify the return value on the Returns page


 

7.       Click OK button to proceed. The tool will generate a CPP file that contains the following items:

a.       Include directives for importing required headers

b.      Using directives to start using appropriate namespaces

c.       CSUM_IGNORE_NULLS class interface that includes base virtual functions to be overridden as well as several inline helper functions for accessing function argument and state variables to make the code more readable

d.      A fully implemented instantiate() function that creates an instance of your class and returns it to the caller

e.      Empty stubs for the virtual functions that must be overridden by you

8.       Implement initializeState() function.
This function should set the state variables to initial values. Since our aggregate is going to ignore NULLs and treat them as zeroes, replace the second parameter passed to the
setStateNull() with false and set the state variable initial value to zero:

void CSUM_IGNORE_NULLS::initializeState()
{
      
setStateNull(0, false);
      
*get_state_Accumulator() = 0;
}

9.        Implement accumulate() function.
This function accepts single argument and, in accord to the logic, should add the incoming value to the state like this:

void CSUM_IGNORE_NULLS::accumulate()
{
      
// see if the argument is NULL and ignore it if so
      
if (!is_arg_null_Summand())
       {
             
// accumulate
             
*get_state_Accumulator() += get_arg_Summand();
       }
}

10.   Implement merge() function.
This function should merge two state variables – local and the one that is passed in via arguments. Our function will look like this:

void CSUM_IGNORE_NULLS::merge()
{
      
// the second state variable passed in via arguments is guaranteed to hold
       // no NULL value because the initializeState() explicitly sets it to zero,
      
// so an extra check can be avoided
      
*get_state_Accumulator() += get_arg_Summand();
}

11.   Implement the finalResult() function.
The state variable holds an aggregated sum of values from all SPUs, so simply return the value stored in the session variable. Here a strange thing happens. The session variables within the
finalResult() function have to be accessed using arguments accessor functions. E.g. state variable 0 should be accessed using int64Arg(0)as opposed to int64State(0) as you might had been expecting. So the function body is

ReturnValue CSUM_IGNORE_NULLS::finalResult()
{
      
NZ_UDX_RETURN_INT64(get_arg_Summand());
}

12.   The C++ part is ready now. Let’s define a sample SQL statement for testing. Switch to the Test Query tab and type the following query:

select category_id, sum_ignore_nulls(item_count)
 
from test_uda
group by category_id;

The script for creating the test table and populating it with test data can be found at the bottom of this article.

13.   We are done. Hit F5 for compiling the UDA, registering it in the target database and running the test SQL script. The following output is produced:

Posted by: Oleg on 7/7/2010 | 1 Comment

Exactly like a User Defined Function (UDF), a User Defined Aggregate (UDA) in Netezza is a piece of code that is compiled into a binary output and can be called from within a SQL statement. Some examples of an aggregate function are AVG(), SUM() and MIN(). UDAs are implemented in C++ at the moment, but it is planned that other languages (such as Java, Python, R and so on) can be used in the future.

The similarity between an UDF and UDA is that both return single scalar value.

The difference is that the former accepts a scalar set of arguments while the latter accepts zero-to-many rows of arguments and produce an aggregate value based on a set of rows rather than on single row of input parameters.

Terminology

A typical query that uses an aggregate function contains a group by statement, for instance

select store_id, item_id, max(price)

  from products

 group by store_id, item_id;

A term Distinct Combination (DC) hereafter is used to refer to a unique combination of values stored in a set of columns specified in group by clause. E.g. each unique combination of store_id and item_id returned by the query above is considered to be a Distinct Combination.

In other words, DC refers to single output row produced by a query that contains a GROUP BY clause. The output query contains as many DCs as there are distinct combinations of fields in the group by clause of the SQL statement.

The term DC is important and will be used extensively below to make it clear what is called under circumstances and what is not. And what is called where :)

An UDA in MPP world

Netezza box (NPS) consists of a central host and a bunch of SPU modules that can be thought of as separate computers connected via a high-speed network appliance. The central host features RedHat operating system, while SPUs use some kind of embedded Linux OS. A UDA can be executed on either central NPS host on or an SPU depending on how it is invoked. This is why it is compiled for two different target platforms and two different binaries are produced. The binaries are then copied over to every SPU as well as to a system location on the central hosts and registered with the database system to make them available for using in SQL statements.

The fact that data reside on many SPUs means that we must aggregate the data twice:

1.       Produce an aggregate value on each individual SPU for each DC

2.       Merge the aggregated values on the central host in order to produce a final aggregate result for each DC

We will see how it affects the UDA interface in the next sections.

Keeping the UDA state between the calls

A UDA is called for each row of data produced by a SQL statement. This means it must have means of storing certain pieces of information between the calls in order to produce an aggregated value for a particular DC. A UDA in Netezza uses state variables for that.

You, as a designer of the UDA, specify what state variables are required for your UDA and the data type of each state variable. Netezza creates a set of state variables for each DC produced by a SQL query and makes sure that your UDA has access to the same set of variables for a given DC.

A base class for a UDA

A UDA C++ class must be inherited from Uda class that defines several abstract functions to be overridden as follows:

static Uda *instantiate();
virtual void initializeState() = 0;
virtual void accumulate() = 0;
virtual void merge() = 0;
virtual ReturnValue finalResult() = 0;

The instantiate() function should simply create an instance of your class and return it to the caller.

initializeState() is called once for each DC and the UDA must initialize its state variables at this point. For instance, you can set the some or all state variables to NULL, zero or any other initial value that you think is appropriate.

accumulate() is called once for each row of data. This is the place where you update your state variables to take into account the values from a new set of input parameters.

merge() is called or it is not :) In the most generic case, it is called on a central host to merge two states received from two different SPUs into single set of state variables. This is done repeatedly for each individual SPU state until all states from all SPUs are merged into single one.

An important point here is that this function is not called under certain circumstances. Consider the situation when the column a table is distributed on is a part of the group by clause. In this case, every SPU holds all the data that belong to a given DC. As a result, there is no need to merge state variables produced by each SPU because their DCs never overlap.

finalResult() is the last function called by NPS. This function must return single scalar result using values stored in UDA state variables if necessary.

What’s next?

I am going to show you how to create a sample UDA in the next article. See you soon :)



 

1 2 3 4 5 6 7  Go to Page:  



    Syndicate    
     

    Recent Posts

    Archive

    Bloggers

    Category List

    Tag Cloud

       


    MapReduce Clickstream Response Attribution
    Java AP Basket
    MapReduce Keyword Tokenization
    Interactive Reporting Patterns

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now

    This Content Requires Adobe Flash Player | Download Now


    Privacy Statement  |  Terms Of Use  |  Copyright 2010 by Aginity, Inc. Register   |   Login