BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Elle
Quartz | Level 8

Hello,

 

I have a dataset in Hadoop that records bank transactions for a set of customers (~300 vars, 12 mil rows).

This dataset was preprocessed by someone else and made available to me in SAS EG.

I extracted a subset (let's call it "have") containing the customer id, the date of transaction, the amount and the description.

Most of the transaction amounts are > 0, others are 0.

I used the following code:

 

data have;
	set have;
	retain t 0;
	select;
		when (amount NE 0) t=t+1;
		otherwise;
	end;
run;

Somewhere in the output I'm seeing this (Please note this is a replication of the output I'm getting):

 

 Capture.JPG

Basically, t jumps from 3616 without a reason.

Note 1: t has never used before in the code.

Note 2: this issue cannot be reproduced on a generated data

Note 3: the data I'm using is subject to security policy and I cannot publish it

Note 4: A simple t+1 statement with no conditions is also defective

 

I would like to know if anyone has an idea of what's causing the anomaly.

 

Solution: When subsetting the hadoop dataset, I output the resulted table back to Hadoop. As soon as I changed that and saved the subset table to the work library, the code worked fine.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

@Elle wrote:

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.


Then the problem is clearly with your data. If it can't be reproduced with a piece of code, it is simply not there where you're looking.

 

As others suggested, pull the dataset into SAS (WORK) so you get a SAS-only process, to make sure you don't suffer from a "Hadoop-Effect".

 

I also suspect that the code you showed is a very tiny part of the complete code, so you have ample opportunities of hidden side-effects there.

 

Start with simple code that creates t, like the code I gave you. If that does not work already, contact SAS technical support (they work under non-disclosure rules, so you can give them your data or access to your system).

If it works, expand the code until you either reach your intended functionality or the problem arises anew; then you'll know what's causing your problem.

 

One of the possible problem points can be SQL steps that don't specify a forced order for the output. SQL optimization may lead to unexpected reordering of data. See Maxim 31. And heed Maxim 34, so you can follow the development of your data through simply recreatable (and maintainable) steps.

View solution in original post

15 REPLIES 15
Kurt_Bremser
Super User

I'd rather write the code as

data have1;
set have;
retain t 0;
if amount ne 0 then t + 1;
run;

But this won't work if variable t is already on dataset have, because then the values read from the dataset will always overwrite the values retained from the previous observation.

And note that I don't overwrite the dataset in place. If something goes wrong, I'll still have the original dataset.

Elle
Quartz | Level 8

I have edited my original post accordingly

Kurt_Bremser
Super User

Then please post your have dataset (at least enough observations so that the effect happens). Do so in a data step (use the macro from https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... to convert your dataset to datastep code).

 

Code with example data that shows that it works:

data have;
input amount;
cards;
0
0
1
0
0
1
0
1
0
;
run;

data have1;
set have;
retain t 0;
if amount ne 0 then t + 1;
run;

proc print data=have1 noobs;
run;

Result:

amount    t

   0      0
   0      0
   1      1
   0      1
   0      1
   1      2
   0      2
   1      3
   0      3
gamotte
Rhodochrosite | Level 12

As @Kurt_Bremser suggests, the problem probably lies outside the elements you have given here.

Maybe the variable t is used somewhere else for another purpose (a counter for a loop ?).

Start by commenting all code sections that are not directly concerned. If the problem disappears

then progressively uncomment the code until it appears again to find the code section where it occurs.

Elle
Quartz | Level 8

I apologize, I'm not able to reproduce the problem with a mock dataset. As to the data I'm having, it's securely protected on a Hadoop server. I think the problem might be related to the way it's stored because even a simple t+1 jumps thousands of observations in multiple points.

ballardw
Super User

After your first pass through the data with the

data have;

   set have;

 

you no longer have a valid data set to work from more than likely. Retain will not work as expected after the first pass because you added the T variable to the data and that existing value will replace the "retained" version from the code.

 

 

And what is the result supposed to look like?

 

gamotte
Rhodochrosite | Level 12

I may be wrong but i think the T variable is only added to the PDV after the first pass.

It is only after all input rows have been processed that the output dataset is created

(here with the same name so it overwrites the input).

 

This small example does seem to work :

data have;
	drop i;
	do i=1 to 10000;
		amount=rand("Uniform")>0.5;
		output;
	end;
run;

data have;
	set have;
	retain t 0;

	if amount then t=t+1;
run;

 

Tom
Super User Tom
Super User

RETAIN will not work properly for a variable that already exists on the input data set.  So if this is you input.

data have;
  infile datalines delimiter="!";
  input id date:ddmmyy. amount desc:$50. t;
  format date yymmdd10.;
cards;
1!01/01/2017!5!Commission ! 3615
1!01/01/2017!0!01/07/2017 2 USD ! 3615
1!02/01/2017!54!Walmart ! 3616
1!30/01/2017!1000!Rent ! 119401
1!30/01/2017!100!Tax ! 119402
;

Do you want to make a NEW variable?

data want ;
  set have ;
  new_t + (amount ne 0) ;
run;

image.png

Or do you want to conditionally add one to an existing variable?

data want ;
  set have ;
  new_t = t + (amount ne 0) ;
run;

image.png

Or perhaps reset the new variable when AMOUNT does equal 0?

data want ;
  set have ;
  if amount = 0 then new_t = t;
  else new_t + 1;
run;

image.png

 

 

 

 

 

Elle
Quartz | Level 8

Thank you for your response. I've edited my post and added some replies to clarify the situation. The problem is with the t that jumps from 3616 to 119000 although I'm using the select statement with the condition of t+1. Somehow I suspect the dataset has invisible rows that t accounts for. Does that make sense? 

Elle
Quartz | Level 8

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.

Kurt_Bremser
Super User

@Elle wrote:

Dear Kurt, I already said in my previous posts I cannot reproduce the issue with mock data (fake data generated by an algorithm). Not because I don't want to. It won't replicate. I also said I cannot provide the original data due to security issues. I simply don't have enabled the capability of Copy/Export/Send to ... I would be in violation of the company's policy.

Indeed, it's not rocket science to read through the replies before demanding things.


Then the problem is clearly with your data. If it can't be reproduced with a piece of code, it is simply not there where you're looking.

 

As others suggested, pull the dataset into SAS (WORK) so you get a SAS-only process, to make sure you don't suffer from a "Hadoop-Effect".

 

I also suspect that the code you showed is a very tiny part of the complete code, so you have ample opportunities of hidden side-effects there.

 

Start with simple code that creates t, like the code I gave you. If that does not work already, contact SAS technical support (they work under non-disclosure rules, so you can give them your data or access to your system).

If it works, expand the code until you either reach your intended functionality or the problem arises anew; then you'll know what's causing your problem.

 

One of the possible problem points can be SQL steps that don't specify a forced order for the output. SQL optimization may lead to unexpected reordering of data. See Maxim 31. And heed Maxim 34, so you can follow the development of your data through simply recreatable (and maintainable) steps.

Tom
Super User Tom
Super User

@Elle wrote:

Thank you for your response. I've edited my post and added some replies to clarify the situation. The problem is with the t that jumps from 3616 to 119000 although I'm using the select statement with the condition of t+1. Somehow I suspect the dataset has invisible rows that t accounts for. Does that make sense? 


Are you sure you have not just sorted the data into a different order?

Or is it possible that SAS has somehow used multiple threads since you are using HADOOP ata and so it cannot really process the data in seqential order?

sspkmnd
Obsidian | Level 7

Could you try following code:

 

data have2;
    set have;
    retain t 0;
    t = t + (amount > %sysevalf(10**-8));
run;

 Also, could you try to copy the dataset in SAS to ensure that it is stored in SAS native format like:

 

data new;
    set old;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 1688 views
  • 5 likes
  • 6 in conversation