Solved: Re: Issue combining datasets and modifying data in the same step

jhealthpolicy · Posted 05-04-2020 04:18 PM

HI, I am having an issue combining datasets and I am looking for help as to what is going on.

I have two datasets, all fields are character.

get_test1_nodup has the following data:

Spot_2	Spot_4	Spot_3
1	7	2
1	7	1
1	1	1
1	1	2

get_test2_nodup has the following data:

orig_identifier	identifier	Spot_2	Spot_4	Spot_3
		1	7	1
112	112
111	111
		1	1	2
		1	1	1

I combine the datasets with the following code:
data combine;

set get_test1_nodup

get_test2_nodup

;

if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;

run;

Here is the resulting dataset:

orig_identifier	identifier	Spot_2	Spot_4	Spot_3
	172	1	7	2
	172	1	7	1
	172	1	1	1
	172	1	1	2
	171	1	7	1
112	112
111	111
	112	1	1	2
	111	1	1	1

The identifier field did not compute as I intended. Take the second row for example, identifier should be 171, not 172. What is causing this? Furthermore, where is it getting the 172 value its putting in the field? I have resolved this by combining the datasets first and then creating another dataset with the if statement, but why can't it be done in one step? I'd like to understand what is going on here.

Tom · Posted 05-04-2020 04:52 PM

Because IDENTIFIER is coming from one of the input datasets it is RETAINed (that it it is NOT reset to missing at the start of the data step iteration). But it is NOT in TEST1. So when you read a new observation from TEST1 the value does not change. So the test in your IF statement fails after the first observation from TEST1 since IDENTIFIER is no longer blank.

You could create a NEW variable.

new_identifier = coalescec(identifier, cats(Spot_2,Spot_4,Spot_3));

Or use IN= dataset option variables to drive your IF/THEN statement.

data combine;
  set get_test1_nodup (in=in1)
      get_test2_nodup (in=in2)
  ;
  if in1 then identifier=Spot_2||Spot_4||Spot_3;
run;

View solution in original post

Tom · Posted 05-04-2020 04:52 PM

Because IDENTIFIER is coming from one of the input datasets it is RETAINed (that it it is NOT reset to missing at the start of the data step iteration). But it is NOT in TEST1. So when you read a new observation from TEST1 the value does not change. So the test in your IF statement fails after the first observation from TEST1 since IDENTIFIER is no longer blank.

You could create a NEW variable.

new_identifier = coalescec(identifier, cats(Spot_2,Spot_4,Spot_3));

Or use IN= dataset option variables to drive your IF/THEN statement.

data combine;
  set get_test1_nodup (in=in1)
      get_test2_nodup (in=in2)
  ;
  if in1 then identifier=Spot_2||Spot_4||Spot_3;
run;

ChrisNZ · Posted 05-04-2020 06:31 PM

In other words:

Because IDENTIFIER is coming from one of the input data sets it is RETAINed, and only reset when read again from a data set. Since it is not present in TEST1, it is never reset as long as TEST1 is read. And since it takes a value on the first iteration, it keeps this value until it is read from TEST2 and is not blank.

You can create a new variable, or you can test the source:

data COMBINE;
  set TEST1 (in=TEST1)
      TEST2 ;
  if TEST1 | IDENTIFIER = ' ' then IDENTIFIER=catt(SPOT_2,SPOT_4,SPOT_3) ;
run;

High-Performance SAS Coding - Third Edition

jhealthpolicy · Posted 05-04-2020 08:11 PM

I think i'm starting to understand. Why does it retain the value by default? I thought that was the job of the retain statement? Would you elaborate on why it satisfies the if condition on the first obs, but not on any subsequent records on the first dataset? I dont see how SAS sees it as blank for only the first record.

ChrisNZ · Posted 05-04-2020 08:47 PM

All data set variables are retained until overwritten by another data set value. That's the way a data step works.
Since TEST1 doesn't overwrite the value, IDENTIFIER stays untouched across iterations. IDENTIFIER is changed by your code when its value is blank. Thereafter it's not blank so the test fails and it remains as is.

High-Performance SAS Coding - Third Edition

jhealthpolicy · Posted 05-04-2020 10:44 PM

Why does it satisfy the if statement on the first observation?

ChrisNZ · Posted 05-04-2020 10:49 PM

Because identifier is blank.

High-Performance SAS Coding - Third Edition

jhealthpolicy · Posted 05-04-2020 11:05 PM

So on the first obs it is blank, but all lines after that they are missing (therefore not satisfying the if statement)? I guess i'm not understanding why the first one it evaluates as blank, but after that they are evaluated as missing.

ChrisNZ · Posted 05-04-2020 11:17 PM

> I guess i'm not understanding why the first one it evaluates as blank, but after that they are evaluated as missing.

blank and missing are the same thing for a string.

High-Performance SAS Coding - Third Edition

jhealthpolicy · Posted 05-05-2020 12:26 AM

I just read the first few pages of the sas paper on the program data vector and reread all of these responses 5 times.

My understanding of the dataset creation is as follows:

1. The data step initializes and creates a list of all variables from both test1 and test2.

2. In the PDV, all variables are initialized to missing(according to the sas document) and all variables are retained until another round of the by statement.

3. The first iteration of reading in values for each variable from test1 occurs. It pulls in all values for all matching variables, but because test1 does not have the identifier variable, it keeps its value of missing.

4. My if statement is processed, because it is missing, it combines spot 2, 3, 4.

5. The next iteration of the datastep is executed, because identifier is not on the first dataset, it retains the value from the previous step due to the automatic retain in step 2 above.

6. This loop continues until identifier has a value, in this case from the test2 dataset.

The paper i am referencing(page 3):

http://support.sas.com/resources/papers/proceedings13/125-2013.pdf

Do i understand the process correctly?

ChrisNZ · Posted 05-05-2020 04:43 AM

> Do i understand the process correctly?

Perfectly! 👍

High-Performance SAS Coding - Third Edition

Tom · Posted 05-04-2020 09:18 PM

That is how data steps work. You normally don't notice because most steps are in the form:

data new;
  set old;

So the first thing the iteration does is read in the data from the old dataset.

This is why one to many merges can work. The observation from the dataset with just one observation is read the first time and then the values for the variables it is contributing stay the same until the next set of BY values are reached.

mkeintz · Posted 05-04-2020 09:20 PM

Apropos of @ChrisNZ 's explanation, you could easiliy fix with an explict OUTPUT statement followed by a CALL MISSING.

data combine;
  set   get_test1_nodup
        get_test2_nodup   ;
  if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
  output;
  call missing(of _all_);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ChrisNZ · Posted 05-04-2020 10:09 PM

Or even

data combine;
  call missing(of _all_);
  set get_test1_nodup
      get_test2_nodup ;
  if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
run;

High-Performance SAS Coding - Third Edition

Tom · Posted 05-04-2020 10:22 PM

@ChrisNZ wrote:

Or even

data combine;
  call missing(of _all_);
  set get_test1_nodup
      get_test2_nodup ;
  if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
run;

That cannot work, because of the way that the _ALL_ keyword is evaluated at the point it is seen, not after all of the data step has been compiled. In fact it won't even run.

1193  data combine;
1194    call missing(of _all_);
             -------
             252
ERROR 252-185: The MISSING subroutine call does not have enough arguments.

1195    set test1 test2 ;
1196    if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
1197  run;

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away