Solved: Data Step merge gives more observations than expected

constliv · Posted 11-22-2017 10:14 AM

I am running the below code on SAS EG 5.1 .

***START preparing data for merging;
data _1_ACRONYM_HUBIDS_FOUND ;
	set ACRONYM_HUBIDS_FOUND ;
	customerno = LEFT(UPCASE(GROUP_NO_FOR_ACRONYM)) ; /*GROUP_NO_FOR_ACRONYM is copied as new variable customerno be the 'merge by' variable*/
run ;

data _1_cust_groups_FINAL ;
set cust_groups_FINAL ;
customerno = LEFT(UPCASE(customerno)) ;
run ;

proc sort data=_1_ACRONYM_HUBIDS_FOUND ; by customerno ; run;
proc sort data=_1_cust_groups_FINAL; by customerno; run ;

proc sql noprint ;
%let keeplist=;
select name into :keeplist separated by ' '
from dictionary.columns
where libname='WORK'
and memname= UPCASE('_1_ACRONYM_HUBIDS_FOUND') ;
;
quit;

OPTIONS MERGENOBY=WARN MSGLEVEL=I;

DATA ACRONYM_HUBIDS_with_custgroups (keep=&keeplist rank customergroup)
	acronyms_not_in_top50 (keep=&keeplist rank customerno customergroup) /*dataset to identify RAY customers not in top 50 customergroups*/
	EXCEPTION_1;

/*format statement to order variables in output datasets*/
format 	&keeplist
		rank customergroup
;

MERGE _1_ACRONYM_HUBIDS_FOUND(IN=In1) _1_cust_groups_FINAL (IN=In2);
BY customerno ;

IF (In1=1 and In2=1) then output ACRONYM_HUBIDS_with_custgroups ;
IF (In1 = 1 and  In2=0 ) then output acronyms_not_in_top50 ;
RUN;

My issue is with the results of the merge as reported where there seems to 1 record appearing in both output files of the data step merge when the output criteria (in=) are mutually exclusive.

Here's the extract from the log:

NOTE: There were 3894 observations read from the data set WORK._1_ACRONYM_HUBIDS_FOUND.

NOTE: There were 1374 observations read from the data set WORK._1_CUST_GROUPS_FINAL.

NOTE: The data set WORK.ACRONYM_HUBIDS_WITH_CUSTGROUPS has 3261 observations and 11 variables.

NOTE: The data set WORK.ACRONYMS_NOT_IN_TOP50 has 634 observations and 11 variables.

NOTE: The data set WORK.EXCEPTION_1 has 0 observations and 11 variables.

3261 + 634 = 3895 , but the "left join" left dataset WORK._1_ACRONYM_HUBIDS_FOUND had only 3894 observations.

PROC CONTENTS of both files used in the merge are provided as PDF attachment to this post.

art297 · Posted 11-22-2017 10:38 AM

Without see your data we can only guess. My guess would be that you have a duplicate record for one customer in your _1_cust_groups_FINAL dataset.

For example, the following code results in the unexpected number you encountered:

DATA _1_ACRONYM_HUBIDS_FOUND;
  input customerno;
  cards;
1
2
3
4
;

DATA _1_cust_groups_FINAL;
  input customerno;
  cards;
1
3
3
4
;

DATA ACRONYM_HUBIDS_with_custgroups
	acronyms_not_in_top50
	EXCEPTION_1;
  MERGE _1_ACRONYM_HUBIDS_FOUND(IN=In1) _1_cust_groups_FINAL (IN=In2);
  BY customerno ;

  IF (In1=1 and In2=1) then output ACRONYM_HUBIDS_with_custgroups ;
  IF (In1 = 1 and  In2=0 ) then output acronyms_not_in_top50 ;
RUN;

Also, you create a file called EXCEPTION_1, but don't include a statement to output any records to it.

Art, CEO, AnalystFinder.com

View solution in original post

RW9 · Posted 11-22-2017 10:23 AM

If you provide test data which illustrates the issue, then it would be easier to help. I can suggest using a proc sql join to simplify the code:

proc sql;
  create table ACRONYM_HUBIDS_with_custgroups as 
  select A.*,
         B.....
  from   _1_ACRONYM_HUBIDS_FOUND A
  inner join _1_CUST_GROUPS_FINAL B
  on     A.CUSTOMERNO=B.CUSTOMER_NO;
  create table ACRONYMS_NOT_IN_TOP50 as 
  select ...
  from   ...
  where  CUSTOMERNO not in (select CUSTOMERNO from _1_CUST_GROUPS_FINAL);
quit;

art297 · Posted 11-22-2017 10:38 AM

Without see your data we can only guess. My guess would be that you have a duplicate record for one customer in your _1_cust_groups_FINAL dataset.

For example, the following code results in the unexpected number you encountered:

DATA _1_ACRONYM_HUBIDS_FOUND;
  input customerno;
  cards;
1
2
3
4
;

DATA _1_cust_groups_FINAL;
  input customerno;
  cards;
1
3
3
4
;

DATA ACRONYM_HUBIDS_with_custgroups
	acronyms_not_in_top50
	EXCEPTION_1;
  MERGE _1_ACRONYM_HUBIDS_FOUND(IN=In1) _1_cust_groups_FINAL (IN=In2);
  BY customerno ;

  IF (In1=1 and In2=1) then output ACRONYM_HUBIDS_with_custgroups ;
  IF (In1 = 1 and  In2=0 ) then output acronyms_not_in_top50 ;
RUN;

Also, you create a file called EXCEPTION_1, but don't include a statement to output any records to it.

Art, CEO, AnalystFinder.com

constliv · Posted 11-23-2017 02:09 AM

Art,

You guessed right! Indeed there are 2 records with same customerno in my dataset _1_cust_groups_FINAL !

However, I am curious to know why this did not get reported in the log with a message such as 'multiple values for BY variable" (Log provided as attachment).

art297 · Posted 11-23-2017 09:24 AM

You didn't get a note because multiple records for a by variable in one data set is quite normal in a merge. The note is only included when you have multiple records for a by variable in both datasets.

Art, CEO, AnalystFinder.com

Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

Re: Data Step merge gives more observations than expected

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away