Solved: Re: how to concatenate two different files comparing common variables

sams54156 · Posted 10-10-2018 11:12 AM

Hi ,

I have two files . I want to join these two files using variable .

Data firstfile;

Var1 Var2 Var3 Var4

TEST1 AAAA 1234 YES

TEST1 BBBB 4567 YES

TEST1 CCCC 6789 NO

TEST2 ZZZZ 1234 YES

TEST2 YYYY 4567 NO

TEST2 AAAA 6789 YES

TEST3 XXXX 3456 NO

TEST3 BBBB 2345 YES

TEST3 DDDD 1234 NO

Data Second file;

Var1 Var2 Var3

TEST1 AAAA NAME1

TEST1 BBBB NAME1

TEST2 ZZZZ NAME2

TEST2 YYYY NAME2

TEST3 BBBB NAME3

TEST3 DDDD NAME3

Desired Output :

Data firstfile;

Var1 Var2 Var3 Var4 Var5

TEST1 AAAA 1234 YES NAME1

TEST1 BBBB 4567 YES NAME1

TEST2 ZZZZ 1234 YES NAME2

TEST2 YYYY 4567 NO NAME2

TEST3 BBBB 2345 YES NAME3

TEST3 DDDD 1234 NO NAME3

Basically I want to compare Var1 and Var2 in both files and if they are same then read the Var3 from 2nd and append it next to the record in the first file .

Can this be achieved using Proc SQL or we need to perform a merge .

Thanks .

PeterClemmensen · Posted 10-10-2018 11:22 AM

Yes, you can do it with PROC SQL like this

data First;
input Var1 $ Var2 $ Var3 Var4 $;
datalines;
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO 
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO 
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO 
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
;

data Second;
input Var1 $ Var2 $ Var3 $;
datalines;
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
;

proc sql;
   create table want as
   select Second.Var1
         ,Second.Var2
         ,First.Var3
         ,First.Var4
         ,Second.Var3 as Var5
   from First, Second
   where First.Var1=Second.Var1 and First.Var2=Second.Var2;
quit;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

Reeza · Posted 10-10-2018 11:16 AM

@sams54156 wrote:

Can this be achieved using Proc SQL or we need to perform a merge .

Thanks .

You can do it either using SQL or a merge, assuming your key/joining variables are VAR1 & VAR2. Your best bet is to rename variables with the same name though so you don't end up with weird issues.

PeterClemmensen · Posted 10-10-2018 11:22 AM

Yes, you can do it with PROC SQL like this

data First;
input Var1 $ Var2 $ Var3 Var4 $;
datalines;
TEST1 AAAA 1234 YES
TEST1 BBBB 4567 YES
TEST1 CCCC 6789 NO 
TEST2 ZZZZ 1234 YES
TEST2 YYYY 4567 NO 
TEST2 AAAA 6789 YES
TEST3 XXXX 3456 NO 
TEST3 BBBB 2345 YES
TEST3 DDDD 1234 NO
;

data Second;
input Var1 $ Var2 $ Var3 $;
datalines;
TEST1 AAAA NAME1
TEST1 BBBB NAME1
TEST2 ZZZZ NAME2
TEST2 YYYY NAME2
TEST3 BBBB NAME3
TEST3 DDDD NAME3
;

proc sql;
   create table want as
   select Second.Var1
         ,Second.Var2
         ,First.Var3
         ,First.Var4
         ,Second.Var3 as Var5
   from First, Second
   where First.Var1=Second.Var1 and First.Var2=Second.Var2;
quit;

The DATA to DATA Step Macro
Blog: SASnrd

novinosrin · Posted 10-10-2018 11:29 AM

A simple hash find method=0 is all you need if you choose a datastep using var1 and var2 as keys

singhsahab · Posted 10-10-2018 12:34 PM

Hello,

You can use PROC SQL or Data Merge , you can use any method based on number of observation in datasets.

If there are huge observation you can use PROC Format or hash table to join.

Thanks...

novinosrin · Posted 10-10-2018 12:43 PM

@singhsahab HI, Can you post the Proc format solution for the sample provided in this thread plz if you can lend a few mins? @PeterClemmensen has got the samples in the form of a datastep for you to work on. The reason I am asking is I would like to see how efficiently that can be done and learn. Thank you!

ballardw · Posted 10-10-2018 02:55 PM

@novinosrin wrote:

@singhsahab HI, Can you post the Proc format solution for the sample provided in this thread plz if you can lend a few mins? @PeterClemmensen has got the samples in the form of a datastep for you to work on. The reason I am asking is I would like to see how efficiently that can be done and learn. Thank you!

If the VAR1 is one-to-one with Var3 (or at least no VAR1 values map to 2 or more Var3) as shown in that very small example SECOND data set this creates a format for that mapping:

proc sql;
   create table temp as
   select distinct var1, var3
   from second
   ;
quit;

data fmtcntlin;
  set temp;
  fmtname = '$test2name';
  type='char';
  start=var1;
  label=var3;
  drop var1 var3;1
run;

proc format library=work cntlin=fmtcntlin;
run;

The want set would be

data want;
   set  first;
   var5=put(var1,$test2name.);
run;

Though of course there may not actually be a need for the added variable with the format.

novinosrin · Posted 10-10-2018 03:14 PM

Thank you Sir @ballardw However, don;t you need to concatenate var1 and var2 before you create the format with cntlin option to meet OP's requirement. That's the reason I deliberately raised this question as this makes it a full extra pass of the dataset second like

data fmt;
set second;
start=catx(' ',var1,var2);
rename var3=label;
drop var1 var2;
retain Fmtname '$fmt' Type 'C';
run;

proc format cntlin=fmt fmtlib;
run;

And then of course I agree the applying the same using put function. So should the second dataset be large, me thinks the pass is perhaps much too costly. Even so, there has to be an extra conditional check when applying the char format as for those records that doesn't have format associated, the var1 and var2 value would be written as is in the var5 variable.

While in hash approach, the key is anyway unique combination of var1 and var2 of both tables. And the look up seems so simple to go straight to the address, pick and deliver.

Best Regards,

PS I saw you on the call and wanted to say Hello, however my microphone wasn't working.

Reeza · Posted 10-10-2018 04:03 PM

I don't think this is a case where a format would be significantly more efficient, especially since more than one column is being added - too many passes of the data sets.

If you want an example of where a HASH vs FORMAT would be appropriate this one is likely a good test: https://stackoverflow.com/questions/52745225/can-proc-format-be-used-to-sum-within-groups-across-man...

ballardw · Posted 10-10-2018 04:07 PM

@novinosrin wrote:

Thank you Sir @ballardw However, don;t you need to concatenate var1 and var2 before you create the format with cntlin option to meet OP's requirement. That's the reason I deliberately raised this question as this makes it a full extra pass of the dataset second like

data fmt;
set second;
start=catx(' ',var1,var2);
rename var3=label;
drop var1 var2;
retain Fmtname '$fmt' Type 'C';
run;

proc format cntlin=fmt fmtlib;
run;

And then of course I agree the applying the same using put function. So should the second dataset be large, me thinks the pass is perhaps much too costly. Even so, there has to be an extra conditional check when applying the char format as for those records that doesn't have format associated, the var1 and var2 value would be written as is in the var5 variable.

While in hash approach, the key is anyway unique combination of var1 and var2 of both tables. And the look up seems so simple to go straight to the address, pick and deliver.

Best Regards,

PS I saw you on the call and wanted to say Hello, however my microphone wasn't working.

The example data provided does not show VAR2 having an impact on the map. If so, the data should show so. Also since a format takes a single value as input to get the result using both variables would require creating another variable to do the mapping. Which is a poor idea adding complication for no real gain. When two values are actually needed to match then format is not an appropriate choice, which I suspect it really is not in this case. Hash, SQL or possibly even a data step Merge or modify may be better.

mkeintz · Posted 10-10-2018 04:09 PM

Now you could do an ordinary DATA step with MERGE ... BY, but first you'd need to sort both datasets by VAR1 VAR2. If sorting is expensive and if:

Every observation in dataset2 has a match in dataset1
And each of those matching observations are in the same order (no matter how many or where extra non-matching records appear in dataset1), then you can do a pair of set statements:

data want (drop=_:);
  set second (rename=(var3=var5));
  do until (_var1=var1 and _var2=var2);
    set first (rename=(var1=_var1 var2=_var2));
  end;
run;

Unlike the proc sql, and unlike the SORT-with_MERGE-BY solutions, this program preserves original order, and takes only one pass through the data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away