Hi ,
I have two files . I want to join these two files using variable .
Data firstfile;
Var1 Var2 Var3 Var4
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
Data Second file;
Var1 Var2 Var3
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
Desired Output :
Data firstfile;
Var1 Var2 Var3 Var4 Var5
TEST1 AAAA 1234 YES NAME1
TEST1 BBBB 4567 YES NAME1
TEST2 ZZZZ 1234 YES NAME2
TEST2 YYYY 4567 NO NAME2
TEST3 BBBB 2345 YES NAME3
TEST3 DDDD 1234 NO NAME3
Basically I want to compare Var1 and Var2 in both files and if they are same then read the Var3 from 2nd and append it next to the record in the first file .
Can this be achieved using Proc SQL or we need to perform a merge .
Thanks .
Yes, you can do it with PROC SQL like this
data First;
input Var1 $ Var2 $ Var3 Var4 $;
datalines;
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
;
data Second;
input Var1 $ Var2 $ Var3 $;
datalines;
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
;
proc sql;
create table want as
select Second.Var1
,Second.Var2
,First.Var3
,First.Var4
,Second.Var3 as Var5
from First, Second
where First.Var1=Second.Var1 and First.Var2=Second.Var2;
quit;
@sams54156 wrote:
Can this be achieved using Proc SQL or we need to perform a merge .
Thanks .
You can do it either using SQL or a merge, assuming your key/joining variables are VAR1 & VAR2. Your best bet is to rename variables with the same name though so you don't end up with weird issues.
Yes, you can do it with PROC SQL like this
data First;
input Var1 $ Var2 $ Var3 Var4 $;
datalines;
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
;
data Second;
input Var1 $ Var2 $ Var3 $;
datalines;
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
;
proc sql;
create table want as
select Second.Var1
,Second.Var2
,First.Var3
,First.Var4
,Second.Var3 as Var5
from First, Second
where First.Var1=Second.Var1 and First.Var2=Second.Var2;
quit;
A simple hash find method=0 is all you need if you choose a datastep using var1 and var2 as keys
Hello,
You can use PROC SQL or Data Merge , you can use any method based on number of observation in datasets.
If there are huge observation you can use PROC Format or hash table to join.
Thanks...
@singhsahab HI, Can you post the Proc format solution for the sample provided in this thread plz if you can lend a few mins? @PeterClemmensen has got the samples in the form of a datastep for you to work on. The reason I am asking is I would like to see how efficiently that can be done and learn. Thank you!
@novinosrin wrote:
@singhsahab HI, Can you post the Proc format solution for the sample provided in this thread plz if you can lend a few mins? @PeterClemmensen has got the samples in the form of a datastep for you to work on. The reason I am asking is I would like to see how efficiently that can be done and learn. Thank you!
If the VAR1 is one-to-one with Var3 (or at least no VAR1 values map to 2 or more Var3) as shown in that very small example SECOND data set this creates a format for that mapping:
proc sql; create table temp as select distinct var1, var3 from second ; quit; data fmtcntlin; set temp; fmtname = '$test2name'; type='char'; start=var1; label=var3; drop var1 var3;1 run; proc format library=work cntlin=fmtcntlin; run;
The want set would be
data want; set first; var5=put(var1,$test2name.); run;
Though of course there may not actually be a need for the added variable with the format.
Thank you Sir @ballardw However, don;t you need to concatenate var1 and var2 before you create the format with cntlin option to meet OP's requirement. That's the reason I deliberately raised this question as this makes it a full extra pass of the dataset second like
data fmt;
set second;
start=catx(' ',var1,var2);
rename var3=label;
drop var1 var2;
retain Fmtname '$fmt' Type 'C';
run;
proc format cntlin=fmt fmtlib;
run;
And then of course I agree the applying the same using put function. So should the second dataset be large, me thinks the pass is perhaps much too costly. Even so, there has to be an extra conditional check when applying the char format as for those records that doesn't have format associated, the var1 and var2 value would be written as is in the var5 variable.
While in hash approach, the key is anyway unique combination of var1 and var2 of both tables. And the look up seems so simple to go straight to the address, pick and deliver.
Best Regards,
PS I saw you on the call and wanted to say Hello, however my microphone wasn't working.
@novinosrin wrote:
Thank you Sir @ballardw However, don;t you need to concatenate var1 and var2 before you create the format with cntlin option to meet OP's requirement. That's the reason I deliberately raised this question as this makes it a full extra pass of the dataset second like
data fmt;
set second;
start=catx(' ',var1,var2);
rename var3=label;
drop var1 var2;
retain Fmtname '$fmt' Type 'C';
run;
proc format cntlin=fmt fmtlib;
run;
And then of course I agree the applying the same using put function. So should the second dataset be large, me thinks the pass is perhaps much too costly. Even so, there has to be an extra conditional check when applying the char format as for those records that doesn't have format associated, the var1 and var2 value would be written as is in the var5 variable.
While in hash approach, the key is anyway unique combination of var1 and var2 of both tables. And the look up seems so simple to go straight to the address, pick and deliver.
Best Regards,
PS I saw you on the call and wanted to say Hello, however my microphone wasn't working.
The example data provided does not show VAR2 having an impact on the map. If so, the data should show so. Also since a format takes a single value as input to get the result using both variables would require creating another variable to do the mapping. Which is a poor idea adding complication for no real gain. When two values are actually needed to match then format is not an appropriate choice, which I suspect it really is not in this case. Hash, SQL or possibly even a data step Merge or modify may be better.
Now you could do an ordinary DATA step with MERGE ... BY, but first you'd need to sort both datasets by VAR1 VAR2. If sorting is expensive and if:
data want (drop=_:);
set second (rename=(var3=var5));
do until (_var1=var1 and _var2=var2);
set first (rename=(var1=_var1 var2=_var2));
end;
run;
Unlike the proc sql, and unlike the SORT-with_MERGE-BY solutions, this program preserves original order, and takes only one pass through the data.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.