Exactly like a User Defined Function (UDF), a User Defined Aggregate (UDA) in Netezza is a piece of code that is compiled into a binary output and can be called from within a SQL statement. Some examples of an aggregate function are AVG(), SUM() and MIN(). UDAs are implemented in C++ at the moment, but it is planned that other languages (such as Java, Python, R and so on) can be used in the future.
The similarity between an UDF and UDA is that both return single scalar value.
The difference is that the former accepts a scalar set of arguments while the latter accepts zero-to-many rows of arguments and produce an aggregate value based on a set of rows rather than on single row of input parameters.
Terminology
A typical query that uses an aggregate function contains a group by statement, for instance
select store_id, item_id, max(price)
from products
group by store_id, item_id;
A term Distinct Combination (DC) hereafter is used to refer to a unique combination of values stored in a set of columns specified in group by clause. E.g. each unique combination of store_id and item_id returned by the query above is considered to be a Distinct Combination.
In other words, DC refers to single output row produced by a query that contains a GROUP BY clause. The output query contains as many DCs as there are distinct combinations of fields in the group by clause of the SQL statement.
The term DC is important and will be used extensively below to make it clear what is called under circumstances and what is not. And what is called where :)
An UDA in MPP world
Netezza box (NPS) consists of a central host and a bunch of SPU modules that can be thought of as separate computers connected via a high-speed network appliance. The central host features RedHat operating system, while SPUs use some kind of embedded Linux OS. A UDA can be executed on either central NPS host on or an SPU depending on how it is invoked. This is why it is compiled for two different target platforms and two different binaries are produced. The binaries are then copied over to every SPU as well as to a system location on the central hosts and registered with the database system to make them available for using in SQL statements.
The fact that data reside on many SPUs means that we must aggregate the data twice:
1. Produce an aggregate value on each individual SPU for each DC
2. Merge the aggregated values on the central host in order to produce a final aggregate result for each DC
We will see how it affects the UDA interface in the next sections.
Keeping the UDA state between the calls
A UDA is called for each row of data produced by a SQL statement. This means it must have means of storing certain pieces of information between the calls in order to produce an aggregated value for a particular DC. A UDA in Netezza uses state variables for that.
You, as a designer of the UDA, specify what state variables are required for your UDA and the data type of each state variable. Netezza creates a set of state variables for each DC produced by a SQL query and makes sure that your UDA has access to the same set of variables for a given DC.
A base class for a UDA
A UDA C++ class must be inherited from Uda class that defines several abstract functions to be overridden as follows:
static Uda *instantiate();
virtual void initializeState() = 0;
virtual void accumulate() = 0;
virtual void merge() = 0;
virtual ReturnValue finalResult() = 0;
The instantiate() function should simply create an instance of your class and return it to the caller.
initializeState() is called once for each DC and the UDA must initialize its state variables at this point. For instance, you can set the some or all state variables to NULL, zero or any other initial value that you think is appropriate.
accumulate() is called once for each row of data. This is the place where you update your state variables to take into account the values from a new set of input parameters.
merge() is called or it is not :) In the most generic case, it is called on a central host to merge two states received from two different SPUs into single set of state variables. This is done repeatedly for each individual SPU state until all states from all SPUs are merged into single one.
An important point here is that this function is not called under certain circumstances. Consider the situation when the column a table is distributed on is a part of the group by clause. In this case, every SPU holds all the data that belong to a given DC. As a result, there is no need to merge state variables produced by each SPU because their DCs never overlap.
finalResult() is the last function called by NPS. This function must return single scalar result using values stored in UDA state variables if necessary.
What’s next?
I am going to show you how to create a sample UDA in the next article. See you soon :)