Before you say anything, I know this isn't strictly SAQL (you can use SAQL in a Relative Compute node though and we will.)
A colleague came to me with an interesting use case where they needed to create a summary dataset of a larger base dataset to report from fo their client. The reasons for doing this are that the client wants to be able to use the dataset via the UI, they don't know or care for coding and as the real world dataset they are working with is so large, it would be a plus if the new dataset could be optimized in some way.
They sent me an example of the data and an illustration of the desired outcome.
So our aims for each Electricity Meter (Device Code) are:
- To record the date when the first reading was taken
- To record the date when the last reading was taken
- To count the number of readings that were taken
- Discard any unrequired rows and columns to optimize the dataset.
It sounds a lot of work, but with our tool box of dataflow nodes it's easier than you might think. And we'll break it down into three task areas.
Tip: Ensure that the columns are using the correct format (dimension, date, measure) before uploading the dataset.
If you want to see how the flow in your org create a new Dataflow using the data manager and upload this .json file.
Part 1 - Setting the start and end date
Create a new dataflow to house our new process. Save it as 'Reading Data Summary'. I tend to keep dataflows separate, especially if they have scheduling dependencies. We wouldn't want the main dataset to run after the summary for example.
Add an Edgemart node to the canvas, and give it a name.
Select the sample dataset as your source
Now we will add our first computeRelative node.
The computeRelative node enables us to create a window for a specified group of data, ordered according to a field, so that we can create new fields based on the values in that group. It's similar in some ways to the computeExpression node, but with the added power of being able to work within partitioned (or grouped) and ordered data. In our dataset we are going to use this node to set the Start_Date of the earliest reading taken for each device.
Add a new computeRelative node, give it a name and then select the parameters that will set the partition and order.
As our first aim is to 'stamp' the earliest reading date onto our dataset rows for each meter device, we will partition our data by the ["Device_Code"] field, and within that partition we want the rows to ordered by the ["Date"] of the reading, in 'Ascending' order. So in effect giving us something that would look like this:
Now click 'Add Field' and give the New field a name, label and description. I've called it 'Start_Date'.
Then we get down to business. We want to set a value based on a date, so we set the Source Field to ["Date"].
The offset function stipulates which value in our partition we will populate the field with. As our data is ordered with the earliest date 'first' this will become our Start Date. And last of all set a default value. I've set this using the current() funtion (read more about relative functions). Which would populate the new field with the value of the current row it is focussed on in the partition rather than leave the field empty.
Save the node.
Bacause there isn't a last() function to capture the last value in a partition, we have to do a bit of creative thinking and flip our partition on its head (which will also help us in part 2.)
Create a new computeRelative node with an appropriate name. Use the same parameters as before; partition our data by the ["Device_Code"] field and within that partition we want the rows ordered by the ["Date"] of the reading, this time in 'descending' order. So the data is ordered with the latest recorded reading date at first position in the partition.
Create a new field called 'End_Date'. And we'll use the same paremeters, because our device data is now in reverse date order, so the latest value is first.
So with these nodes added we can add a Register node to capture our dataset and make it available in the Analytics studio. Add a Register node, name it and set the source node to be our ["COMPREL_LATEST_READING"]
Save the register node, and then save the dataflow. It should look something like this:
Navigate to the Dataflows and Recipes page and start the dataflow.
If we look at the output of our dataset in the recipe view we can see that our new columns have been added with the appropriate start and end dates for each device as we required.
Part 2 - Count the number of readings per device
To count the number of readings we have for each device we can utilise the computeRelative node that we used to set the Start_Date field. I thought there might be a crafty way of calculating the occurence of rows but I couldn't find one in the documentation. I think in most cases this would be something we could use a lens for. However, the client needs the data summarised so here's how.
Open the dataflow and edit the COMPREL_EARLIEST_READING node.
Add a new field called 'Count_Of_Readings'. We can use this existing node because it already has the Partiton and Order values set up, with the data partioned by device and the order set by date.
We need to increment the value in our new 'Count_Of_Readings' field by 1 for each row in the partition. For this we need a dash of SAQL, a case statement to be precise.
Change the expression type to 'SAQL'. This allows us to use a SAQL expression rather than just one of the four standard fucntion values.
We will use the previous() function. The previous function is really useful, because it allows us to reference the new field we are creating, a self-reference. So with this we can check what the 'Count_Of_Readings' value is on the previous row and then add to it.
case when previous(Count_Of_Readings) == null then 1 when previous(Count_Of_Readings) > 0 then previous(Count_Of_Readings) +1 end
The first condition sets the 'Count_Of_Readings' field to 1, when there are no previous rows in the partition. So the top row only. The second condition checks that when our previous 'Count_Of_Readings' field is greater than 0, we will add 1 to the previous value and set the current row field with the new value. This case statement works on each row until the final row contains the total number of readings for a device.
The default field is set to 1 in this case, as we need to set a default and the scale field is set to 2.
Update and then run the dataflow. The results should be something like this.
Part 3 : Tidy up
Tidying up involves selecting only one row for each device, the final row if our data were in date order (the one with the greatest row count/total readings.) Which will also have the start and end date. We can then drop the following columns using a slice node: 'Reading', 'Date', 'Filter' (which we will create soon.)
Open the dataflow again and edit the COMPREL_LATEST_READING Relative Compute node. We want the most recent row which has the highest 'Count_Of_Readings' value, to be the first row in our node partition. So we can use the first() function again.
Add a new field called "Filter" and set the expression parameter to 'SAQL' and the type to "Text". we want to set a string.
In the SAQL field we use the following case statement:
case when current(Date_day_epoch) < first(Date_day_epoch) then "False" else "True" end
To simplify the date comparison, we use the Date_Day_Epoch field. This is a field which is generated automatically when a new dataset is imported or created. It Returns the number of seconds elapsed since January 1, 1970, 00:00:00.000 GMT for a given date. We don't have to worry about adding this to our dataset, it's generated when our Edgemart was imported.
Each time our case statement runs against a row, if the current(Date_Day_Epoch) value is less than the first(Date_Day_Epoch) in the partition then we will set the 'Filter' value to "False" else "True".
Then add a Filter node to remove any of our rows that do not have a 'Filter' value of "False". The logic for the filter node is different that we would normally see in the SAQL editor and it's worth reading about.
Name the filter node 'FILTER_FILTER_ROWS' and set the source to the 'COMPREL_LATEST_READING'.
We want to keep the rows that have a 'Filter' value of "True", so we use the filter syntax:
Then we need to remove any fields we no longer require. We do this by adding a Slice node. Add the Slice node and name it 'SLICE_UNWANTED_FIELDS' and use the picklist to set the fileds we want to drop.
Our dataflow process should look like this:
Rejig the Register node to come after our Slice node. You can do this by editing the Register node and setting the source to be the Slice node, or by dragging the > part of the Slice node to create a new connection to the Register node.
Now update the dataset and run the dataflow. This should give us a dataset which looks like the one below. Fulfilling our client's requirements.