Solved: Re: Match Merge not lining up

GBL__ · Posted 03-05-2021 12:57 PM

Hello,

I need to merge three files together using the 'account' variable from the first data set:

i.e. "10130147190001"

However, the second data set has an 'accountnumber' variable, but in this data set it includes additional info:

i.e. "10010130147190001"

And the third data set has an 'acctid' variable, but in this data set it is excluding info at the end:

i.e. "1013014719"

So, in the second and third data set I am creating the 'account' variable to use to merge on:

account = SUBSTR(STRIP(accountnumber), 4) ;

And

account = CATS(STRIP(acctid), "0001") ;

Each 'account' variable is defined in the LENGTH statement for each data set when importing as Character with length of 30

However, when merging these three variables do NOT align in the final merged data set

PROC SORT DATA=data1 ;
  BY account ;
RUN ;

PROC SORT DATA=data2 ;
  BY account ;
RUN ;

DATA merged1 ;
  MERGE data1 data2 ;
  BY account ;
RUN ;

PROC SORT DATA=merged1 ;
  BY account ;
RUN ;

PROC SORT DATA=data3 ;
  BY account ;
RUN ;

DATA final_merged ;
  MERGE merged1 data3 ;
  BY account ;
RUN ;

And the Log:

NOTE: There were 9379 observations read from the data set WORK.DATA1.
NOTE: SAS sort was used.
NOTE: The data set WORK.DATA1 has 9379 observations and 46 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.04 seconds
      user cpu time       0.03 seconds
      system cpu time     0.01 seconds
      memory              7061.37k
      OS Memory           28328.00k
      Timestamp           03/05/2021 11:29:13 AM
      Step Count                        160  Switch Count  45


NOTE: There were 767 observations read from the data set WORK.DATA2.
NOTE: SAS sort was used.
NOTE: The data set WORK.DATA2 has 767 observations and 54 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time 0.03 seconds
      user cpu time 0.01 seconds
      system cpu time 0.01 seconds
      memory 1905.25k
      OS Memory 23188.00k
      Timestamp 03/05/2021 11:29:13 AM
      Step Count 161 Switch Count 41


NOTE: There were 9379 observations read from the data set WORK.DATA1.
NOTE: There were 767 observations read from the data set WORK.DATA2.
NOTE: The data set WORK.MERGED1 has 10146 observations and 99 variables.
NOTE: DATA statement used (Total process time):
      real time 0.07 seconds
      user cpu time 0.01 seconds
      system cpu time 0.06 seconds
      memory 1258.34k
      OS Memory 22156.00k
      Timestamp 03/05/2021 11:29:13 AM
      Step Count 162 Switch Count 43


NOTE: There were 10146 observations read from the data set WORK.MERGED1.
NOTE: SAS sort was used.
NOTE: The data set WORK.MERGED1 has 10146 observations and 99 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time 0.07 seconds
      user cpu time 0.01 seconds
      system cpu time 0.06 seconds
      memory 15542.37k
      OS Memory 37072.00k
      Timestamp 03/05/2021 11:29:13 AM
      Step Count 163 Switch Count 45


NOTE: There were 185 observations read from the data set WORK.DATA3.
NOTE: SAS sort was used.
NOTE: The data set WORK.DATA3 has 185 observations and 41 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time 0.03 seconds
      user cpu time 0.00 seconds
      system cpu time 0.03 seconds
      memory 810.18k
      OS Memory 22156.00k
      Timestamp 03/05/2021 11:29:13 AM
      Step Count 164 Switch Count 41

NOTE: There were 10146 observations read from the data set WORK.MERGED1.
NOTE: There were 185 observations read from the data set WORK.DATA3.
NOTE: The data set WORK.FINAL_MERGED has 10331 observations and 139 variables.
NOTE: DATA statement used (Total process time):
      real time 0.08 seconds
      user cpu time 0.03 seconds
      system cpu time 0.04 seconds
      memory 1520.84k
      OS Memory 22464.00k
      Timestamp 03/05/2021 11:29:14 AM
      Step Count 165 Switch Count 43

Each of these three account variables are the same type and length. I have tried COMPRESS(), STRIP() etc., I know I am an idiot, but what am I doing wrong here?

Thank you!

FreelanceReinh · Posted 03-05-2021 03:15 PM

Hello @GBL__,

Assuming that your screenshot shows a portion of dataset final_merged, I suspect that invisible characters (such as tabs, '09'x) have hampered the match merge. To check this assumption I would display the seemingly equal values in $HEX60. format:

proc report data=final_merged(firstobs=... obs=...);
column account=acchex account;
define acchex / format=$hex60.;
run;

Use suitable observation numbers (in the firstobs= and obs= dataset options) so that the observations from the screenshot are included.

The result might look like this (without highlighting, though):

  account                                                       account
  313031333031343731363030303109202020202020202020202020202020  10130147160001
  313031333031343731363030303120202020202020202020202020202020  10130147160001

Then you would know the culprit (here: trailing tab character in the first observation). The COMPRESS function can remove all sorts of unwanted characters and just keep digits:

cleanacc=compress(dirtyacc,,'kd');

This would be applied to the relevant variables in the process of creating the account variables (and, if necessary, to account from the first dataset as well) before the sort and merge steps.

View solution in original post

Kurt_Bremser · Posted 03-05-2021 01:25 PM

If the first 10 characters are unique, use SUBSTR with 1,10 on the longer strings.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

GBL__ · Posted 03-05-2021 02:24 PM

Hey Kurt,

Thank you for the reply, I updated the program to include a "merge variable" in each data set, but the merge is still not lining up as I am expecting:

In DATA1 "10130147190001" :
merge_var = SUBSTR(STRIP(account), 1, 10) ;  <-- "1013014719"

In DATA2 "10010130147190001" :
merge_var = SUBSTR(STRIP(accountnumber), 4, 10) ;  <-- "1013014719"

In DATA3 "1013014719" :
merge_var = STRIP(acctid) ;  <-- "1013014719"

I guess I will have to use PROC SQL for this then? :'(

FreelanceReinh · Posted 03-05-2021 03:15 PM

Hello @GBL__,

Assuming that your screenshot shows a portion of dataset final_merged, I suspect that invisible characters (such as tabs, '09'x) have hampered the match merge. To check this assumption I would display the seemingly equal values in $HEX60. format:

proc report data=final_merged(firstobs=... obs=...);
column account=acchex account;
define acchex / format=$hex60.;
run;

Use suitable observation numbers (in the firstobs= and obs= dataset options) so that the observations from the screenshot are included.

The result might look like this (without highlighting, though):

  account                                                       account
  313031333031343731363030303109202020202020202020202020202020  10130147160001
  313031333031343731363030303120202020202020202020202020202020  10130147160001

Then you would know the culprit (here: trailing tab character in the first observation). The COMPRESS function can remove all sorts of unwanted characters and just keep digits:

cleanacc=compress(dirtyacc,,'kd');

This would be applied to the relevant variables in the process of creating the account variables (and, if necessary, to account from the first dataset as well) before the sort and merge steps.

GBL__ · Posted 03-05-2021 03:29 PM

Hey @FreelanceReinh

Thank you very much for the reply! I did try this earlier before creating this post (searched for awhile). And using your recommendation here is some output (though not the exact same from the screenshot, but still):

I didnt notice any "bad" characters earlier, or just now when I ran the above :'(

FreelanceReinh · Posted 03-05-2021 04:06 PM

Thanks for checking. The "HEX ACCOUNT" covers only 20 of the 30 characters in account, so there's a remote possibility of differences in the remaining characters.

If nothing suspicious was found there, I would probably create small subsets (with only a few relevant observations and variables) of the datasets to be merged and run the MERGE step on these samples in a new SAS session.

AMSAS · Posted 03-05-2021 03:23 PM

I think this is what you are looking for.
Basically get your account numbers into the same format, then do a SQL merge:


/* Create Sample Data */
data table1 ;
	do i = 1 to 10 ;
		account=putn(i,"z5.")!!"aaa" ;
		output ;
	end ;
run ;

data table2 ;
	do i = 1 to 10 by 2;
		accountNumber="bbb"!!putn(i,"z5.")!!"aaa" ;
		output ;
	end ;
run ;

data table3 ;
	do i = 1 to 10 by 3;
		acctid=putn(i,"z5.") ;
		output ;
	end ;
run ;

/* Standardize account numbers to acctId format on table3 */
data table1Std ;
	set table1 ;
	matchId=substr(account,1,5) ;
run ;

data table2Std ;
	set table2 ;
	matchId=substr(accountNumber,4,5) ;
run ;

/* Merge the tables */
proc sql ;
	create table merged as
	select table1Std.account, table2Std.accountNumber, table3.acctId
	from
		table1Std, table2Std, table3
	where
		table1Std.matchId=table2Std.matchid=table3.acctId
	;
quit ;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!