BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sams54156
Calcite | Level 5

Hi ,

 

I have two files . I want to join these two files using variable .

 

Data firstfile;

Var1                       Var2                       Var3                       Var4

TEST1                    AAAA                    1234                       YES

TEST1                    BBBB                     4567                       YES

TEST1                    CCCC                     6789                       NO                        

TEST2                    ZZZZ                       1234                       YES

TEST2                    YYYY                       4567                       NO        

TEST2                    AAAA                    6789                       YES

TEST3                    XXXX                     3456                       NO        

TEST3                    BBBB                     2345                       YES

TEST3                    DDDD                    1234                       NO

               

 

Data Second file;

Var1                       Var2                       Var3                      

TEST1                    AAAA                    NAME1                

TEST1                    BBBB                     NAME1                

TEST2                    ZZZZ                       NAME2                

TEST2                    YYYY                       NAME2                

TEST3                    BBBB                     NAME3                

TEST3                    DDDD                    NAME3                

 

Desired Output :

 

Data firstfile;

Var1                       Var2                       Var3                       Var4       Var5

TEST1                    AAAA                    1234                       YES         NAME1

TEST1                    BBBB                     4567                       YES         NAME1

TEST2                    ZZZZ                       1234                       YES         NAME2

TEST2                    YYYY                       4567                       NO         NAME2

TEST3                    BBBB                     2345                       YES         NAME3

TEST3                    DDDD                    1234                       NO         NAME3

 

Basically I want to compare Var1 and Var2 in both files and if they are same then read the Var3 from 2nd and append it next to the record in the first file .

 

Can this be achieved using Proc SQL or we need to perform a merge .

 

Thanks .

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Yes, you can do it with PROC SQL like this

 

data First;
input Var1 $ Var2 $ Var3 Var4 $;
datalines;
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO 
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO 
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO 
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
;

data Second;
input Var1 $ Var2 $ Var3 $;
datalines;
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
;

proc sql;
   create table want as
   select Second.Var1
         ,Second.Var2
         ,First.Var3
         ,First.Var4
         ,Second.Var3 as Var5
   from First, Second
   where First.Var1=Second.Var1 and First.Var2=Second.Var2;
quit;

View solution in original post

10 REPLIES 10
Reeza
Super User

@sams54156 wrote:

 

 

Can this be achieved using Proc SQL or we need to perform a merge .

 

Thanks .


You can do it either using SQL or a merge, assuming your key/joining variables are VAR1 & VAR2. Your best bet is to rename variables with the same name though so you don't end up with weird issues. 

PeterClemmensen
Tourmaline | Level 20

Yes, you can do it with PROC SQL like this

 

data First;
input Var1 $ Var2 $ Var3 Var4 $;
datalines;
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO 
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO 
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO 
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
;

data Second;
input Var1 $ Var2 $ Var3 $;
datalines;
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
;

proc sql;
   create table want as
   select Second.Var1
         ,Second.Var2
         ,First.Var3
         ,First.Var4
         ,Second.Var3 as Var5
   from First, Second
   where First.Var1=Second.Var1 and First.Var2=Second.Var2;
quit;
novinosrin
Tourmaline | Level 20

A simple hash find method=0 is all you need if you choose a datastep using var1 and var2 as keys

singhsahab
Lapis Lazuli | Level 10

Hello,

 

You can use PROC SQL or Data Merge , you can use any method based on number of observation in datasets.

 

If there are huge observation you can use PROC Format or hash table to join.

 

Thanks...

novinosrin
Tourmaline | Level 20

@singhsahab HI, Can you post the Proc format solution for the sample provided in this thread plz if you can lend a few mins? @PeterClemmensen has got the samples in the form of a datastep for you to work on. The reason I am asking is I would like to see how efficiently that can be done and learn. Thank you!

ballardw
Super User

@novinosrin wrote:

@singhsahab HI, Can you post the Proc format solution for the sample provided in this thread plz if you can lend a few mins? @PeterClemmensen has got the samples in the form of a datastep for you to work on. The reason I am asking is I would like to see how efficiently that can be done and learn. Thank you!


If the VAR1 is one-to-one with Var3 (or at least no VAR1 values map to 2 or more Var3) as shown in that very small example SECOND data set this creates a format for that mapping:

proc sql;
   create table temp as
   select distinct var1, var3
   from second
   ;
quit;

data fmtcntlin;
  set temp;
  fmtname = '$test2name';
  type='char';
  start=var1;
  label=var3;
  drop var1 var3;1
run;

proc format library=work cntlin=fmtcntlin;
run;

The want set would be

 

data want;
   set  first;
   var5=put(var1,$test2name.);
run;

Though of course there may not actually  be a need for the added variable with the format.

 

novinosrin
Tourmaline | Level 20

Thank you Sir @ballardw However, don;t you need to concatenate var1 and var2 before you create the format with cntlin option to meet OP's requirement. That's the reason I deliberately raised this question as this makes it a full extra pass of the dataset second like

 

data fmt;
set second;
start=catx(' ',var1,var2);
rename var3=label;
drop var1 var2;
retain Fmtname '$fmt' Type 'C';
run;


proc format cntlin=fmt fmtlib;
run;

 

And then of course I agree the applying the same using put function. So should the second dataset be large, me thinks the pass is perhaps much too costly. Even so, there has to be an extra conditional check when applying the char format  as for those records that doesn't have format associated, the var1 and var2 value would be written as is in the var5 variable. 

 

While in hash approach, the key is anyway unique combination of var1 and var2 of both tables. And the look up seems so simple to go straight to the address, pick and deliver. 

 

Best Regards,

PS I saw you on the call and wanted to say Hello, however my microphone wasn't working. 

 

 

 

 

Reeza
Super User
I don't think this is a case where a format would be significantly more efficient, especially since more than one column is being added - too many passes of the data sets.

If you want an example of where a HASH vs FORMAT would be appropriate this one is likely a good test: https://stackoverflow.com/questions/52745225/can-proc-format-be-used-to-sum-within-groups-across-man...
ballardw
Super User

@novinosrin wrote:

Thank you Sir @ballardw However, don;t you need to concatenate var1 and var2 before you create the format with cntlin option to meet OP's requirement. That's the reason I deliberately raised this question as this makes it a full extra pass of the dataset second like

 

data fmt;
set second;
start=catx(' ',var1,var2);
rename var3=label;
drop var1 var2;
retain Fmtname '$fmt' Type 'C';
run;


proc format cntlin=fmt fmtlib;
run;

 

And then of course I agree the applying the same using put function. So should the second dataset be large, me thinks the pass is perhaps much too costly. Even so, there has to be an extra conditional check when applying the char format  as for those records that doesn't have format associated, the var1 and var2 value would be written as is in the var5 variable. 

 

While in hash approach, the key is anyway unique combination of var1 and var2 of both tables. And the look up seems so simple to go straight to the address, pick and deliver. 

 

Best Regards,

PS I saw you on the call and wanted to say Hello, however my microphone wasn't working. 

 

 

 

 


The example data provided does not show VAR2 having an impact on the map. If so, the data should show so. Also since a format takes a single value as input to get the result using both variables would require creating another variable to do the mapping. Which is a poor idea adding complication for no real gain. When two values are actually needed to match then format is not an appropriate choice, which I suspect it really is not in this case. Hash, SQL or possibly even a data step Merge or modify may be better.

mkeintz
PROC Star

Now you could do an ordinary DATA step with MERGE ... BY, but first you'd need to sort both datasets by VAR1 VAR2. If sorting is expensive and if:

  1. Every observation in dataset2 has a match in dataset1
  2. And each of those matching observations are in the same order (no matter how many or where extra non-matching records appear in dataset1), then you can do a pair of set statements:
data want (drop=_:);
  set second (rename=(var3=var5));
  do until (_var1=var1 and _var2=var2);
    set first (rename=(var1=_var1 var2=_var2));
  end;
run;

 

Unlike the proc sql, and unlike the SORT-with_MERGE-BY solutions, this program preserves original order, and takes only one pass through the data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1504 views
  • 2 likes
  • 7 in conversation