BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
GBL__
Quartz | Level 8

Hello,

 

I need to merge three files together using the 'account' variable from the first data set:

i.e. "10130147190001"

 

However, the second data set has an 'accountnumber' variable, but in this data set it includes additional info:

i.e. "10010130147190001"

 

And the third data set has an 'acctid' variable, but in this data set it is excluding info at the end:

i.e. "1013014719"

 

So, in the second and third data set I am creating the 'account' variable to use to merge on:

account = SUBSTR(STRIP(accountnumber), 4) ;

 And

account = CATS(STRIP(acctid), "0001") ;

Each 'account' variable is defined in the LENGTH statement for each data set when importing as Character with length of 30

 

However, when merging these three variables do NOT align in the final merged data set

PROC SORT DATA=data1 ;
  BY account ;
RUN ;

PROC SORT DATA=data2 ;
  BY account ;
RUN ;

DATA merged1 ;
  MERGE data1 data2 ;
  BY account ;
RUN ;

PROC SORT DATA=merged1 ;
  BY account ;
RUN ;

PROC SORT DATA=data3 ;
  BY account ;
RUN ;

DATA final_merged ;
  MERGE merged1 data3 ;
  BY account ;
RUN ;

 

And the Log:

NOTE: There were 9379 observations read from the data set WORK.DATA1.
NOTE: SAS sort was used.
NOTE: The data set WORK.DATA1 has 9379 observations and 46 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.04 seconds
      user cpu time       0.03 seconds
      system cpu time     0.01 seconds
      memory              7061.37k
      OS Memory           28328.00k
      Timestamp           03/05/2021 11:29:13 AM
      Step Count                        160  Switch Count  45


NOTE: There were 767 observations read from the data set WORK.DATA2.
NOTE: SAS sort was used.
NOTE: The data set WORK.DATA2 has 767 observations and 54 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
user cpu time 0.01 seconds
system cpu time 0.01 seconds
memory 1905.25k
OS Memory 23188.00k
Timestamp 03/05/2021 11:29:13 AM
Step Count 161 Switch Count 41


NOTE: There were 9379 observations read from the data set WORK.DATA1.
NOTE: There were 767 observations read from the data set WORK.DATA2.
NOTE: The data set WORK.MERGED1 has 10146 observations and 99 variables.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
user cpu time 0.01 seconds
system cpu time 0.06 seconds
memory 1258.34k
OS Memory 22156.00k
Timestamp 03/05/2021 11:29:13 AM
Step Count 162 Switch Count 43


NOTE: There were 10146 observations read from the data set WORK.MERGED1.
NOTE: SAS sort was used.
NOTE: The data set WORK.MERGED1 has 10146 observations and 99 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.07 seconds
user cpu time 0.01 seconds
system cpu time 0.06 seconds
memory 15542.37k
OS Memory 37072.00k
Timestamp 03/05/2021 11:29:13 AM
Step Count 163 Switch Count 45


NOTE: There were 185 observations read from the data set WORK.DATA3.
NOTE: SAS sort was used.
NOTE: The data set WORK.DATA3 has 185 observations and 41 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
user cpu time 0.00 seconds
system cpu time 0.03 seconds
memory 810.18k
OS Memory 22156.00k
Timestamp 03/05/2021 11:29:13 AM
Step Count 164 Switch Count 41

NOTE: There were 10146 observations read from the data set WORK.MERGED1.
NOTE: There were 185 observations read from the data set WORK.DATA3.
NOTE: The data set WORK.FINAL_MERGED has 10331 observations and 139 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
user cpu time 0.03 seconds
system cpu time 0.04 seconds
memory 1520.84k
OS Memory 22464.00k
Timestamp 03/05/2021 11:29:14 AM
Step Count 165 Switch Count 43

 

NOT MATCH MERGING.JPG

 

Each of these three account variables are the same type and length.  I have tried COMPRESS(), STRIP() etc., I know I am an idiot, but what am I doing wrong here?

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @GBL__,

 

Assuming that your screenshot shows a portion of dataset final_merged, I suspect that invisible characters (such as tabs, '09'x) have hampered the match merge. To check this assumption I would display the seemingly equal values in $HEX60. format:

 

proc report data=final_merged(firstobs=... obs=...);
column account=acchex account;
define acchex / format=$hex60.;
run;

Use suitable observation numbers (in the firstobs= and obs= dataset options) so that the observations from the screenshot are included.

 

The result might look like this (without highlighting, though):

  account                                                       account
  313031333031343731363030303109202020202020202020202020202020  10130147160001
  313031333031343731363030303120202020202020202020202020202020  10130147160001

Then you would know the culprit (here: trailing tab character in the first observation). The COMPRESS function can remove all sorts of unwanted characters and just keep digits:

cleanacc=compress(dirtyacc,,'kd');

This would be applied to the relevant variables in the process of creating the account variables (and, if necessary, to account from the first dataset as well) before the sort and merge steps.

 

View solution in original post

6 REPLIES 6
GBL__
Quartz | Level 8

Hey Kurt,

 

Thank you for the reply, I updated the program to include a "merge variable" in each data set, but the merge is still not lining up as I am expecting:

 

In DATA1 "10130147190001" :
merge_var = SUBSTR(STRIP(account), 1, 10) ;  <-- "1013014719"
In DATA2 "10010130147190001" :
merge_var = SUBSTR(STRIP(accountnumber), 4, 10) ;  <-- "1013014719"
In DATA3 "1013014719" :
merge_var = STRIP(acctid) ;  <-- "1013014719"

 

I guess I will have to use PROC SQL for this then?  :'(

FreelanceReinh
Jade | Level 19

Hello @GBL__,

 

Assuming that your screenshot shows a portion of dataset final_merged, I suspect that invisible characters (such as tabs, '09'x) have hampered the match merge. To check this assumption I would display the seemingly equal values in $HEX60. format:

 

proc report data=final_merged(firstobs=... obs=...);
column account=acchex account;
define acchex / format=$hex60.;
run;

Use suitable observation numbers (in the firstobs= and obs= dataset options) so that the observations from the screenshot are included.

 

The result might look like this (without highlighting, though):

  account                                                       account
  313031333031343731363030303109202020202020202020202020202020  10130147160001
  313031333031343731363030303120202020202020202020202020202020  10130147160001

Then you would know the culprit (here: trailing tab character in the first observation). The COMPRESS function can remove all sorts of unwanted characters and just keep digits:

cleanacc=compress(dirtyacc,,'kd');

This would be applied to the relevant variables in the process of creating the account variables (and, if necessary, to account from the first dataset as well) before the sort and merge steps.

 

GBL__
Quartz | Level 8

Hey @FreelanceReinh 

 

Thank you very much for the reply!  I did try this earlier before creating this post (searched for awhile).  And using your recommendation here is some output (though not the exact same from the screenshot, but still):

 

HEX_ACCOUNT.JPG

 

I didnt notice any "bad" characters earlier, or just now when I ran the above  :'(

FreelanceReinh
Jade | Level 19

Thanks for checking. The "HEX ACCOUNT" covers only 20 of the 30 characters in account, so there's a remote possibility of differences in the remaining characters.

 

If nothing suspicious was found there, I would probably create small subsets (with only a few relevant observations and variables) of the datasets to be merged and run the MERGE step on these samples in a new SAS session.

AMSAS
SAS Super FREQ

I think this is what you are looking for.
Basically get your account numbers into the same format, then do a SQL merge:

 


/* Create Sample Data */
data table1 ;
	do i = 1 to 10 ;
		account=putn(i,"z5.")!!"aaa" ;
		output ;
	end ;
run ;

data table2 ;
	do i = 1 to 10 by 2;
		accountNumber="bbb"!!putn(i,"z5.")!!"aaa" ;
		output ;
	end ;
run ;

data table3 ;
	do i = 1 to 10 by 3;
		acctid=putn(i,"z5.") ;
		output ;
	end ;
run ;

/* Standardize account numbers to acctId format on table3 */
data table1Std ;
	set table1 ;
	matchId=substr(account,1,5) ;
run ;

data table2Std ;
	set table2 ;
	matchId=substr(accountNumber,4,5) ;
run ;

/* Merge the tables */
proc sql ;
	create table merged as
	select table1Std.account, table2Std.accountNumber, table3.acctId
	from
		table1Std, table2Std, table3
	where
		table1Std.matchId=table2Std.matchid=table3.acctId
	;
quit ;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1155 views
  • 1 like
  • 4 in conversation