
Merging multiple data sets in a data step

Cruise — Wed, 15 Apr 2020 04:39:43 GMT

Hi Folks:

I'm merging 4 datasets and would like to keep only the observations that match main_attr1 (in=a). I expected the merged data set w_four to have N=250 observations, the same size as main_attr1. However, w_four ended up with 262 rows.

proc freq below shows that w_four has all levels of data in the other datasets which obviously is not true.

How do I do this merge correctly in a data step? Any suggestions appreciated.

Thank you very much for your time.

data w_four;  
merge 
outcome          (in=o) /*228*/
main_attr1(in=a) /*250*/
Korean_foreign1  (in=f) /*261*/
Korean_migration1(in=m); /*252*/
by id1name idname;
outcome=o;
map_attr=a;
foreign=f;
migration=m;
if a then output w_four; /*262*/ 
run; 
proc freq data=w_four;
tables map_attr*outcome*foreign*migration/list;
run;

Re: Merging multiple data sets in a data step

yabwon — Wed, 15 Apr 2020 05:37:24 GMT

Hi,

1) did you check the log for "NOTE: MERGE statement has more than one data set with repeats of BY values"

2) try to run:

proc sql;
  select id1name, idname, count(1) as i
  from w_four
  group by id1name, idname
  having count(1) > 1
  ;
quit;

to find out which observations are possibly duplicated.
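
The same query can also be pointed at each input data set, since duplicates found in w_four do not say which input they came from. A minimal sketch, assuming the data set and key names from the question (the column name n_rows is just illustrative):

```sas
/* Count repeated BY-key combinations in one input data set.     */
/* Change the FROM table to check outcome, main_attr1 and        */
/* Korean_foreign1 in the same way.                              */
proc sql;
  select id1name, idname, count(*) as n_rows
  from Korean_migration1
  group by id1name, idname
  having count(*) > 1
  ;
quit;
```

Any data set that returns rows here has more than one observation for at least one id1name/idname combination.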

All the best

Bart

Re: Merging multiple data sets in a data step

ChrisNZ — Wed, 15 Apr 2020 06:26:10 GMT

Always add this kind of check in your code when you expect unique keys (unless the data had previously been vetted):

if not(first.IDNAME and last.IDNAME) then putlog 'WARNING: duplicate keys in merge';

Also, always check the log. Always.
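
For illustration, here is roughly how that check could sit inside the merge from the question. This is only a sketch: it assumes all four data sets are sorted by id1name idname, and it adds the key values to the message so the offending rows can be found (that addition is not part of the one-liner above).

```sas
data w_four;
  merge
    outcome           (in=o)
    main_attr1        (in=a)
    Korean_foreign1   (in=f)
    Korean_migration1 (in=m);
  by id1name idname;
  /* idname is the last BY variable, so FIRST. and LAST. are both 1 */
  /* only when the id1name/idname combination produces a single row */
  if not (first.idname and last.idname) then
    putlog 'WARNING: duplicate keys in merge: ' id1name= idname=;
  if a;
run;
```

Placing the check before the subsetting IF means it runs for every merged row, not only for the rows that come from main_attr1.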

Re: Merging multiple data sets in a data step

Kurt_Bremser — Wed, 15 Apr 2020 06:45:27 GMT

@ChrisNZ wrote:

Also, always check the log. Always.

aka Maxim 2.

Re: Merging multiple data sets in a data step

Cruise — Wed, 15 Apr 2020 12:30:45 GMT

Thanks all. My log was:

NOTE: There were 228 observations read from the data set WORK.OUTCOME.
NOTE: There were 250 observations read from the data set WORK.MAIN_ATTR1.
NOTE: There were 261 observations read from the data set WORK.KOREAN_FOREIGN1.
NOTE: There were 252 observations read from the data set WORK.KOREAN_MIGRATION1.
NOTE: The data set WORK.W_FOUR has 262 observations and 32 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

But after I added Chris's warning check, the log became:

WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
WARNING: duplicate keys in merge
NOTE: There were 228 observations read from the data set WORK.OUTCOME.
NOTE: There were 250 observations read from the data set WORK.MAIN_ATTR1.
NOTE: There were 261 observations read from the data set WORK.KOREAN_FOREIGN1.
NOTE: There were 252 observations read from the data set WORK.KOREAN_MIGRATION1.
NOTE: The data set WORK.W_FOUR has 262 observations and 32 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Yabwon's SQL returned the following.

Re: Merging multiple data sets in a data step

Cruise — Wed, 15 Apr 2020 13:00:48 GMT

With your pointers, the problem is solved. Hurray.

First I investigated each data set using the proc sql check suggested by Yabwon and found that the 'foreign' and 'migration' datasets had duplicate keys, mostly the migration data set. I deduplicated them in a data step. Proc sort nodup did not deduplicate them, and I had no time to work out why, even though proc sort nodup had always done the job for me in the past.

But this article helped reassure me:

https://www.lexjansen.com/wuss/1998/WUSS98097.pdf

With the datasets deduplicated in a data step, I reran the merge, which output the N=250 dataset that I wanted. Awesome.
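
One common way to deduplicate by key in a data step looks something like the sketch below. The exact step used is not shown in this thread, so treat it as an assumption: it keeps the first row of each id1name/idname combination and assumes Korean_migration1 is already sorted by those variables.

```sas
/* Assumes Korean_migration1 is sorted by id1name idname. */
data Korean_migration2;
  set Korean_migration1;
  by id1name idname;
  if first.idname;  /* keep only the first row of each key combination */
run;
```

Which row of a duplicated key should survive depends on the data, so it is worth looking at the duplicates before dropping them.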

Also, without the putlog 'WARNING: duplicate keys in merge' check that Chris suggested, the log wouldn't have printed any warning about the duplicate keys.

Thanks all. I greatly greatly appreciate your time and insights. Helped a lot.

data w_four; /*250*/ 
merge 
outcome          (in=o) /*228*/
main_attr1(in=a) /*250*/
Korean_foreign2  (in=f) /*260*/
Korean_migration2(in=m); /*238*/
by id1name idname;
outcome=o;
map_attr=a;
foreign=f;
migration=m;
if a; /*250*/ 
if not(first.IDNAME and last.IDNAME) then putlog 'WARNING: duplicate keys in merge';
run;

Re: Merging multiple data sets in a data step

Kurt_Bremser — Wed, 15 Apr 2020 13:02:42 GMT

As long as only one of the incoming datasets has duplicate keys, SAS will not issue a NOTE, as a 1-to-many join is valid (and often done) in a data step merge. Only when two or more datasets have duplicates will SAS issue the NOTE, as the outcome may not be what the programmer desired.

Re: Merging multiple data sets in a data step

Cruise — Wed, 15 Apr 2020 13:05:43 GMT

But in this case, two of the datasets had duplicate keys: foreign had one and migration had several. Mostly migration.

Re: Merging multiple data sets in a data step

Tom — Wed, 15 Apr 2020 13:08:28 GMT

Note that the NODUP option on proc sort will only eliminate duplicate observations, not observations that have duplicate keys. Note that it also will only eliminate duplicate observations if they happen to be right next to each other (there isn't a different observation between them).

You can use the NODUPKEYS option to make sure there is only one observation per set of key (BY variable) values.
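
As a rough sketch of that approach with the data sets from this thread (written here with the NODUPKEY spelling of the option; the output name migration_dups is hypothetical and only there for illustration):

```sas
proc sort data=Korean_migration1
          out=Korean_migration2
          nodupkey
          dupout=migration_dups;  /* hypothetical name: receives the dropped rows */
  by id1name idname;
run;
```

The DUPOUT= data set makes it possible to review which rows were removed instead of discarding them silently.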

Re: Merging multiple data sets in a data step

Kurt_Bremser — Wed, 15 Apr 2020 13:11:00 GMT

As long as you do not have duplicates on the same key in more than one dataset, SAS will not flag it, as for each individual key it is still one-to-many. Run these two pieces of code and compare:

data one;
input key;
datalines;
1
2
2
3
;

data two;
input key;
datalines;
1
1
2
3
;

data mgd;
merge
  one
  two
;
by key;
run;

vs.

data one;
input key;
datalines;
1
2
2
3
;

data two;
input key;
datalines;
1
2
2
3
;

data mgd;
merge
  one
  two
;
by key;
run;

The second code will issue the NOTE, as there's a duplicate for 2 in both datasets.

Re: Merging multiple data sets in a data step

Cruise — Wed, 15 Apr 2020 13:11:56 GMT

That was exactly the case here: 'Note that the NODUP option on proc sort will only eliminate duplicate observations, not observations that have duplicate keys'. So NODUP and NODUPKEYS are different options? I thought the former was an abbreviation of NODUPKEYS. My ignorance then.

Re: Merging multiple data sets in a data step

Cruise — Wed, 15 Apr 2020 13:14:01 GMT

Ah, got it. I can clearly see now. Thank you!