Solved: Re: Help with RETAIN

Elle · Posted 08-29-2017 08:32 AM

Hello,

I have a dataset in Hadoop that records bank transactions for a set of customers (~300 vars, 12 mil rows).

This dataset was preprocessed by someone else and made available to me in SAS EG.

I extracted a subset (let's call it "have") containing the customer id, the date of transaction, the amount and the description.

Most of the transaction amounts are > 0, others are 0.

I used the following code:

data have;
	set have;
	retain t 0;
	select;
		when (amount NE 0) t=t+1;
		otherwise;
	end;
run;

Somewhere in the output I'm seeing this (Please note this is a replication of the output I'm getting):

Basically, t jumps from 3616 without a reason.

Note 1: t has never used before in the code.

Note 2: this issue cannot be reproduced on a generated data

Note 3: the data I'm using is subject to security policy and I cannot publish it

Note 4: A simple t+1 statement with no conditions is also defective

I would like to know if anyone has an idea of what's causing the anomaly.

Solution: When subsetting the hadoop dataset, I output the resulted table back to Hadoop. As soon as I changed that and saved the subset table to the work library, the code worked fine.

Kurt_Bremser · Posted 08-30-2017 01:58 AM

@Elle wrote:

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.

Then the problem is clearly with your data. If it can't be reproduced with a piece of code, it is simply not there where you're looking.

As others suggested, pull the dataset into SAS (WORK) so you get a SAS-only process, to make sure you don't suffer from a "Hadoop-Effect".

I also suspect that the code you showed is a very tiny part of the complete code, so you have ample opportunities of hidden side-effects there.

Start with simple code that creates t, like the code I gave you. If that does not work already, contact SAS technical support (they work under non-disclosure rules, so you can give them your data or access to your system).

If it works, expand the code until you either reach your intended functionality or the problem arises anew; then you'll know what's causing your problem.

One of the possible problem points can be SQL steps that don't specify a forced order for the output. SQL optimization may lead to unexpected reordering of data. See Maxim 31. And heed Maxim 34, so you can follow the development of your data through simply recreatable (and maintainable) steps.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 08-29-2017 09:18 AM

I'd rather write the code as

data have1;
set have;
retain t 0;
if amount ne 0 then t + 1;
run;

But this won't work if variable t is already on dataset have, because then the values read from the dataset will always overwrite the values retained from the previous observation.

And note that I don't overwrite the dataset in place. If something goes wrong, I'll still have the original dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Elle · Posted 08-29-2017 09:58 AM

I have edited my original post accordingly

Kurt_Bremser · Posted 08-29-2017 10:05 AM

Then please post your have dataset (at least enough observations so that the effect happens). Do so in a data step (use the macro from https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... to convert your dataset to datastep code).

Code with example data that shows that it works:

data have;
input amount;
cards;
0
0
1
0
0
1
0
1
0
;
run;

data have1;
set have;
retain t 0;
if amount ne 0 then t + 1;
run;

proc print data=have1 noobs;
run;

Result:

amount    t

   0      0
   0      0
   1      1
   0      1
   0      1
   1      2
   0      2
   1      3
   0      3

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

gamotte · Posted 08-29-2017 10:24 AM

As @Kurt_Bremser suggests, the problem probably lies outside the elements you have given here.

Maybe the variable t is used somewhere else for another purpose (a counter for a loop ?).

Start by commenting all code sections that are not directly concerned. If the problem disappears

then progressively uncomment the code until it appears again to find the code section where it occurs.

Elle · Posted 08-29-2017 02:54 PM

I apologize, I'm not able to reproduce the problem with a mock dataset. As to the data I'm having, it's securely protected on a Hadoop server. I think the problem might be related to the way it's stored because even a simple t+1 jumps thousands of observations in multiple points.

ballardw · Posted 08-29-2017 10:41 AM

After your first pass through the data with the

data have;

set have;

you no longer have a valid data set to work from more than likely. Retain will not work as expected after the first pass because you added the T variable to the data and that existing value will replace the "retained" version from the code.

And what is the result supposed to look like?

gamotte · Posted 08-29-2017 11:01 AM

I may be wrong but i think the T variable is only added to the PDV after the first pass.

It is only after all input rows have been processed that the output dataset is created

(here with the same name so it overwrites the input).

This small example does seem to work :

data have;
	drop i;
	do i=1 to 10000;
		amount=rand("Uniform")>0.5;
		output;
	end;
run;

data have;
	set have;
	retain t 0;

	if amount then t=t+1;
run;

Tom · Posted 08-29-2017 12:22 PM

RETAIN will not work properly for a variable that already exists on the input data set. So if this is you input.

data have;
  infile datalines delimiter="!";
  input id date:ddmmyy. amount desc:$50. t;
  format date yymmdd10.;
cards;
1!01/01/2017!5!Commission ! 3615
1!01/01/2017!0!01/07/2017 2 USD ! 3615
1!02/01/2017!54!Walmart ! 3616
1!30/01/2017!1000!Rent ! 119401
1!30/01/2017!100!Tax ! 119402
;

Do you want to make a NEW variable?

data want ;
  set have ;
  new_t + (amount ne 0) ;
run;

Or do you want to conditionally add one to an existing variable?

data want ;
  set have ;
  new_t = t + (amount ne 0) ;
run;

Or perhaps reset the new variable when AMOUNT does equal 0?

data want ;
  set have ;
  if amount = 0 then new_t = t;
  else new_t + 1;
run;

Elle · Posted 08-29-2017 03:03 PM

Thank you for your response. I've edited my post and added some replies to clarify the situation. The problem is with the t that jumps from 3616 to 119000 although I'm using the select statement with the condition of t+1. Somehow I suspect the dataset has invisible rows that t accounts for. Does that make sense?

Kurt_Bremser · Posted 08-29-2017 03:58 PM

POST.EXAMPLE.DATA.AS.MENTIONED.

It's not rocket science.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Elle · Posted 08-29-2017 04:22 PM

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.

Kurt_Bremser · Posted 08-30-2017 01:58 AM

@Elle wrote:

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.

Then the problem is clearly with your data. If it can't be reproduced with a piece of code, it is simply not there where you're looking.

As others suggested, pull the dataset into SAS (WORK) so you get a SAS-only process, to make sure you don't suffer from a "Hadoop-Effect".

I also suspect that the code you showed is a very tiny part of the complete code, so you have ample opportunities of hidden side-effects there.

Start with simple code that creates t, like the code I gave you. If that does not work already, contact SAS technical support (they work under non-disclosure rules, so you can give them your data or access to your system).

If it works, expand the code until you either reach your intended functionality or the problem arises anew; then you'll know what's causing your problem.

One of the possible problem points can be SQL steps that don't specify a forced order for the output. SQL optimization may lead to unexpected reordering of data. See Maxim 31. And heed Maxim 34, so you can follow the development of your data through simply recreatable (and maintainable) steps.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 08-29-2017 08:02 PM

@Elle wrote:

Thank you for your response. I've edited my post and added some replies to clarify the situation. The problem is with the t that jumps from 3616 to 119000 although I'm using the select statement with the condition of t+1. Somehow I suspect the dataset has invisible rows that t accounts for. Does that make sense?

Are you sure you have not just sorted the data into a different order?

Or is it possible that SAS has somehow used multiple threads since you are using HADOOP ata and so it cannot really process the data in seqential order?

sspkmnd · Posted 08-29-2017 07:07 PM

Could you try following code:

data have2;
    set have;
    retain t 0;
    t = t + (amount > %sysevalf(10**-8));
run;

Also, could you try to copy the dataset in SAS to ensure that it is stored in SAS native format like:

data new;
    set old;
run;

RETAIN statement anomaly on Hadoop Data

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: Help with RETAIN

Re: RETAIN statement anomaly on Hadoop Data

SAS Innovate 2025: Call for Content