BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jhealthpolicy
Fluorite | Level 6

HI, I am having an issue combining datasets and I am looking for help as to what is going on.

I have two datasets, all fields are character.

 

get_test1_nodup has the following data:

Spot_2

Spot_4

Spot_3

1

7

2

1

7

1

1

1

1

1

1

2

 

get_test2_nodup has the following data:

orig_identifier

identifier

Spot_2

Spot_4

Spot_3

 

 

1

7

1

112

112

 

 

 

111

111

 

 

 

 

 

1

1

2

 

 

1

1

1

 

I combine the datasets with the following code:
data combine;

set   get_test1_nodup

        get_test2_nodup

      ;

if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;

run;

 

Here is the resulting dataset:

orig_identifier

identifier

Spot_2

Spot_4

Spot_3

 

172

1

7

2

 

172

1

7

1

 

172

1

1

1

 

172

1

1

2

 

171

1

7

1

112

112

 

 

 

111

111

 

 

 

 

112

1

1

2

 

111

1

1

1

 

The identifier field did not compute as I intended. Take the second row for example, identifier should be 171, not 172.  What is causing this? Furthermore, where is it getting the 172 value its putting in the field? I have resolved this by combining the datasets first and then creating another dataset with the if statement, but why can't it be done in one step? I'd like to understand what is going on here.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Because IDENTIFIER is coming from one of the input datasets it is RETAINed (that it it is NOT reset to missing at the start of the data step iteration).  But it is NOT in TEST1. So when you read a new observation from TEST1 the value does not change.  So the test in your IF statement fails after the first observation from TEST1 since IDENTIFIER is no longer blank.

 

You could create a NEW variable.

new_identifier = coalescec(identifier, cats(Spot_2,Spot_4,Spot_3));

Or use IN= dataset option variables to drive your IF/THEN statement.

data combine;
  set get_test1_nodup (in=in1)
      get_test2_nodup (in=in2)
  ;
  if in1 then identifier=Spot_2||Spot_4||Spot_3;
run;

 

View solution in original post

17 REPLIES 17
Tom
Super User Tom
Super User

Because IDENTIFIER is coming from one of the input datasets it is RETAINed (that it it is NOT reset to missing at the start of the data step iteration).  But it is NOT in TEST1. So when you read a new observation from TEST1 the value does not change.  So the test in your IF statement fails after the first observation from TEST1 since IDENTIFIER is no longer blank.

 

You could create a NEW variable.

new_identifier = coalescec(identifier, cats(Spot_2,Spot_4,Spot_3));

Or use IN= dataset option variables to drive your IF/THEN statement.

data combine;
  set get_test1_nodup (in=in1)
      get_test2_nodup (in=in2)
  ;
  if in1 then identifier=Spot_2||Spot_4||Spot_3;
run;

 

ChrisNZ
Tourmaline | Level 20

In other words:

Because IDENTIFIER is coming from one of the input data sets it is RETAINed, and only reset when read again from a data set. Since it is not present in TEST1, it is never reset as long as TEST1 is read. And since it takes a value on the first iteration, it keeps this value until it is read from TEST2 and is not blank.

 

You can create a new variable, or you can test the source:

data COMBINE;
  set TEST1 (in=TEST1)
      TEST2 ;
  if TEST1 | IDENTIFIER = ' ' then IDENTIFIER=catt(SPOT_2,SPOT_4,SPOT_3) ;
run;

  

jhealthpolicy
Fluorite | Level 6

I think i'm starting to understand. Why does it retain the value by default?  I thought that was the job of the retain statement?  Would you elaborate on why it satisfies the if condition on the first obs, but not on any subsequent records on the first dataset? I dont see how SAS sees it as blank for only the first record.

ChrisNZ
Tourmaline | Level 20

All data set variables are retained until overwritten by another data set value. That's the way a data step works.
Since TEST1 doesn't overwrite the value, IDENTIFIER stays untouched across iterations. IDENTIFIER is changed by your code when its value is blank. Thereafter it's not blank so the test fails and it remains as is.

jhealthpolicy
Fluorite | Level 6

Why does it satisfy the if statement on the first observation?

jhealthpolicy
Fluorite | Level 6

So on the first obs it is blank, but all lines after that they are missing (therefore not satisfying the if statement)? I guess i'm not understanding why the first one it evaluates as blank, but after that they are evaluated as missing.

ChrisNZ
Tourmaline | Level 20

>  I guess i'm not understanding why the first one it evaluates as blank, but after that they are evaluated as missing.

blank and missing are the same thing for a string.

 

jhealthpolicy
Fluorite | Level 6

I just read the first few pages of the sas paper on the program data vector and reread all of these responses 5 times.

My understanding of the dataset creation is as follows:

1. The data step initializes and creates a list of all variables from both test1 and test2.

2. In the PDV, all variables are initialized to missing(according to the sas document) and all variables are retained until another round of the by statement.

3. The first iteration of reading in values for each variable from test1 occurs. It pulls in all values for all matching variables, but because test1 does not have the identifier variable, it keeps its value of missing.

4. My if statement is processed, because it is missing, it combines spot 2, 3, 4.

5. The next iteration of the datastep is executed, because identifier is not on the first dataset, it retains the value from the previous step due to the automatic retain in step 2 above.

6. This loop continues until identifier has a value, in this case from the test2 dataset.

 

The paper i am referencing(page 3):

http://support.sas.com/resources/papers/proceedings13/125-2013.pdf

 

Do i understand the process correctly?

ChrisNZ
Tourmaline | Level 20

> Do i understand the process correctly?

Perfectly! 👍

Tom
Super User Tom
Super User

That is how data steps work.  You normally don't notice because most steps are in the form:

data new;
  set old;

So the first thing the iteration does is read in the data from the old dataset.

 

This is why one to many merges can work.  The observation from the dataset with just one observation is read the first time and then the values for the variables it is contributing stay the same until the next set of BY values are reached.

 

mkeintz
PROC Star

Apropos of @ChrisNZ 's explanation, you could easiliy fix with an explict OUTPUT statement followed by a CALL MISSING.

 

data combine;
  set   get_test1_nodup
        get_test2_nodup   ;
  if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
  output;
  call missing(of _all_);
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20

Or even

data combine;
  call missing(of _all_);
  set get_test1_nodup
      get_test2_nodup ;
  if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
run;

 

Tom
Super User Tom
Super User

@ChrisNZ wrote:

Or even

data combine;
  call missing(of _all_);
  set get_test1_nodup
      get_test2_nodup ;
  if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
run;

 


That cannot work, because of the way that the _ALL_ keyword is evaluated at the point it is seen, not after all of the data step has been compiled.  In fact it won't even run.

1193  data combine;
1194    call missing(of _all_);
             -------
             252
ERROR 252-185: The MISSING subroutine call does not have enough arguments.

1195    set test1 test2 ;
1196    if identifier ='' then identifier=Spot_2||Spot_4||Spot_3;
1197  run;

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 1701 views
  • 3 likes
  • 4 in conversation