SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

RETAIN statement anomaly on Hadoop Data

Accepted Solution Solved
Reply
Contributor
Posts: 47
Accepted Solution

RETAIN statement anomaly on Hadoop Data

[ Edited ]

Hello,

 

I have a dataset in Hadoop that records bank transactions for a set of customers (~300 vars, 12 mil rows).

This dataset was preprocessed by someone else and made available to me in SAS EG.

I extracted a subset (let's call it "have") containing the customer id, the date of transaction, the amount and the description.

Most of the transaction amounts are > 0, others are 0.

I used the following code:

 

data have;
	set have;
	retain t 0;
	select;
		when (amount NE 0) t=t+1;
		otherwise;
	end;
run;

Somewhere in the output I'm seeing this (Please note this is a replication of the output I'm getting):

 

 Capture.JPG

Basically, t jumps from 3616 without a reason.

Note 1: t has never used before in the code.

Note 2: this issue cannot be reproduced on a generated data

Note 3: the data I'm using is subject to security policy and I cannot publish it

Note 4: A simple t+1 statement with no conditions is also defective

 

I would like to know if anyone has an idea of what's causing the anomaly.

 

Solution: When subsetting the hadoop dataset, I output the resulted table back to Hadoop. As soon as I changed that and saved the subset table to the work library, the code worked fine.


Accepted Solutions
Solution
‎09-05-2017 07:45 AM
Super User
Posts: 9,932

Re: Help with RETAIN


Elle wrote:

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.


Then the problem is clearly with your data. If it can't be reproduced with a piece of code, it is simply not there where you're looking.

 

As others suggested, pull the dataset into SAS (WORK) so you get a SAS-only process, to make sure you don't suffer from a "Hadoop-Effect".

 

I also suspect that the code you showed is a very tiny part of the complete code, so you have ample opportunities of hidden side-effects there.

 

Start with simple code that creates t, like the code I gave you. If that does not work already, contact SAS technical support (they work under non-disclosure rules, so you can give them your data or access to your system).

If it works, expand the code until you either reach your intended functionality or the problem arises anew; then you'll know what's causing your problem.

 

One of the possible problem points can be SQL steps that don't specify a forced order for the output. SQL optimization may lead to unexpected reordering of data. See Maxim 31. And heed Maxim 34, so you can follow the development of your data through simply recreatable (and maintainable) steps.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code

View solution in original post


All Replies
Super User
Posts: 9,932

Re: Help with RETAIN

I'd rather write the code as

data have1;
set have;
retain t 0;
if amount ne 0 then t + 1;
run;

But this won't work if variable t is already on dataset have, because then the values read from the dataset will always overwrite the values retained from the previous observation.

And note that I don't overwrite the dataset in place. If something goes wrong, I'll still have the original dataset.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Contributor
Posts: 47

Re: Help with RETAIN

[ Edited ]
Posted in reply to KurtBremser

I have edited my original post accordingly

Super User
Posts: 9,932

Re: Help with RETAIN

Then please post your have dataset (at least enough observations so that the effect happens). Do so in a data step (use the macro from https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... to convert your dataset to datastep code).

 

Code with example data that shows that it works:

data have;
input amount;
cards;
0
0
1
0
0
1
0
1
0
;
run;

data have1;
set have;
retain t 0;
if amount ne 0 then t + 1;
run;

proc print data=have1 noobs;
run;

Result:

amount    t

   0      0
   0      0
   1      1
   0      1
   0      1
   1      2
   0      2
   1      3
   0      3
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Super Contributor
Posts: 331

Re: Help with RETAIN

Posted in reply to KurtBremser

As @KurtBremser suggests, the problem probably lies outside the elements you have given here.

Maybe the variable t is used somewhere else for another purpose (a counter for a loop ?).

Start by commenting all code sections that are not directly concerned. If the problem disappears

then progressively uncomment the code until it appears again to find the code section where it occurs.

Contributor
Posts: 47

Re: Help with RETAIN

Posted in reply to KurtBremser

I apologize, I'm not able to reproduce the problem with a mock dataset. As to the data I'm having, it's securely protected on a Hadoop server. I think the problem might be related to the way it's stored because even a simple t+1 jumps thousands of observations in multiple points.

Super User
Posts: 13,358

Re: Help with RETAIN

After your first pass through the data with the

data have;

   set have;

 

you no longer have a valid data set to work from more than likely. Retain will not work as expected after the first pass because you added the T variable to the data and that existing value will replace the "retained" version from the code.

 

 

And what is the result supposed to look like?

 

Super Contributor
Posts: 331

Re: Help with RETAIN

I may be wrong but i think the T variable is only added to the PDV after the first pass.

It is only after all input rows have been processed that the output dataset is created

(here with the same name so it overwrites the input).

 

This small example does seem to work :

data have;
	drop i;
	do i=1 to 10000;
		amount=rand("Uniform")>0.5;
		output;
	end;
run;

data have;
	set have;
	retain t 0;

	if amount then t=t+1;
run;

 

Super User
Super User
Posts: 7,944

Re: Help with RETAIN

RETAIN will not work properly for a variable that already exists on the input data set.  So if this is you input.

data have;
  infile datalines delimiter="!";
  input id date:ddmmyy. amount desc:$50. t;
  format date yymmdd10.;
cards;
1!01/01/2017!5!Commission ! 3615
1!01/01/2017!0!01/07/2017 2 USD ! 3615
1!02/01/2017!54!Walmart ! 3616
1!30/01/2017!1000!Rent ! 119401
1!30/01/2017!100!Tax ! 119402
;

Do you want to make a NEW variable?

data want ;
  set have ;
  new_t + (amount ne 0) ;
run;

image.png

Or do you want to conditionally add one to an existing variable?

data want ;
  set have ;
  new_t = t + (amount ne 0) ;
run;

image.png

Or perhaps reset the new variable when AMOUNT does equal 0?

data want ;
  set have ;
  if amount = 0 then new_t = t;
  else new_t + 1;
run;

image.png

 

 

 

 

 

Contributor
Posts: 47

Re: Help with RETAIN

Thank you for your response. I've edited my post and added some replies to clarify the situation. The problem is with the t that jumps from 3616 to 119000 although I'm using the select statement with the condition of t+1. Somehow I suspect the dataset has invisible rows that t accounts for. Does that make sense? 

Super User
Posts: 9,932

Re: Help with RETAIN

POST.EXAMPLE.DATA.AS.MENTIONED.

 

It's not rocket science.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Contributor
Posts: 47

Re: Help with RETAIN

[ Edited ]
Posted in reply to KurtBremser

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.

Solution
‎09-05-2017 07:45 AM
Super User
Posts: 9,932

Re: Help with RETAIN


Elle wrote:

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.


Then the problem is clearly with your data. If it can't be reproduced with a piece of code, it is simply not there where you're looking.

 

As others suggested, pull the dataset into SAS (WORK) so you get a SAS-only process, to make sure you don't suffer from a "Hadoop-Effect".

 

I also suspect that the code you showed is a very tiny part of the complete code, so you have ample opportunities of hidden side-effects there.

 

Start with simple code that creates t, like the code I gave you. If that does not work already, contact SAS technical support (they work under non-disclosure rules, so you can give them your data or access to your system).

If it works, expand the code until you either reach your intended functionality or the problem arises anew; then you'll know what's causing your problem.

 

One of the possible problem points can be SQL steps that don't specify a forced order for the output. SQL optimization may lead to unexpected reordering of data. See Maxim 31. And heed Maxim 34, so you can follow the development of your data through simply recreatable (and maintainable) steps.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Super User
Super User
Posts: 7,944

Re: Help with RETAIN


Elle wrote:

Thank you for your response. I've edited my post and added some replies to clarify the situation. The problem is with the t that jumps from 3616 to 119000 although I'm using the select statement with the condition of t+1. Somehow I suspect the dataset has invisible rows that t accounts for. Does that make sense? 


Are you sure you have not just sorted the data into a different order?

Or is it possible that SAS has somehow used multiple threads since you are using HADOOP ata and so it cannot really process the data in seqential order?

Occasional Contributor
Posts: 13

Re: RETAIN statement anomaly on Hadoop Data

[ Edited ]

Could you try following code:

 

data have2;
    set have;
    retain t 0;
    t = t + (amount > %sysevalf(10**-8));
run;

 Also, could you try to copy the dataset in SAS to ensure that it is stored in SAS native format like:

 

data new;
    set old;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 308 views
  • 5 likes
  • 6 in conversation