Solved: Merging and keeping only common variables

arielcslin · Posted 12-06-2018 02:57 PM

Hi

I am looking to merge datasets together however I only want to keep the common variables between the 2 datasets. Is there a way to do this without manually writing out all the variables to be dropped? For example (below) I am trying to merge Dataset 1 and Dataset 2 to make Dataset 3. I am aware that Dataset 2's data will overwrite Dataset 1 and I am ok with this because Dataset 1 is acting like a template.

Dataset 1

A B C

0 0 0

Dataset 2

B A

4 1

5 2

6 3

Dataset 3

A B

1 4

2 5

3 6

novinosrin · Posted 12-06-2018 03:10 PM

safer with

union corr all



 data one;
 input A  B  C;
 cards;
0   0   0
;

data two;
input B  A ;
cards;
4   1
5   2
6   3
;

proc sql;
create table want as
select * from one(obs=0)
union corr all
select * from two;
quit;

View solution in original post

r_behata · Posted 12-06-2018 03:03 PM

Are you looking to reorder the Variables in the final dataset ? If that is the only case then there are better options than merging. Merging the two data sets does not make any sense in your example.

data have;
input B A;
cards;
4 1
5 2
6 3
;
run;

data want;
retain a b;
set have;
run;

arielcslin · Posted 12-06-2018 03:08 PM

Yes the point is to order Dataset 2 however Dataset 2 may not have all the variables listed in the template(?) So I would like to put Dataset 2 in order but drop all variables that are not populated with any data

r_behata · Posted 12-06-2018 03:16 PM

Can you give a better example that reflect the scenario more accurately.

novinosrin · Posted 12-06-2018 03:06 PM


 data one;
 input A  B  C;
 cards;
0   0   0
;

data two;
input B  A ;
cards;
4   1
5   2
6   3
;

proc sql;
create table want as
select * from one(obs=0)
union corr
select * from two;
quit;

Tom · Posted 12-06-2018 07:24 PM

The UNION CORR ALL method is probably the best. (without the ALL keyword UNION will remove duplicate rows).

If you try to do it using a data step then you can get metadata attributes like FORMAT settings overriding the attributes set in the template.

data template;
 input A B C;
cards;
;

data sample;
  input B  A ;
  format a 4.  ;
cards;
4   1
5   2
6   3
;

proc transpose data=sample (obs=0) out=names ; var _all_; run;
proc sql noprint ;
  select _name_ into :names separated by ' ' from names ;
quit;

data almost;
  set template(keep=&names) sample;
run;

proc sql noprint;
create table want as
  select * from template
  union corr all
  select * from sample
;
describe table almost ;
describe table want ;
quit;

144  describe table almost;
NOTE: SQL table WORK.ALMOST was created like:

create table WORK.ALMOST( bufsize=65536 )
  (
   A num format=4.,
   B num
  );

145  describe table want ;
NOTE: SQL table WORK.WANT was created like:

create table WORK.WANT( bufsize=65536 )
  (
   A num,
   B num
  );

novinosrin · Posted 12-06-2018 03:10 PM

safer with

union corr all



 data one;
 input A  B  C;
 cards;
0   0   0
;

data two;
input B  A ;
cards;
4   1
5   2
6   3
;

proc sql;
create table want as
select * from one(obs=0)
union corr all
select * from two;
quit;

mkeintz · Posted 12-06-2018 05:59 PM

If you know that the vars in one of the datasets constitute a proper subset of vars in the other, then you just need to drop all vars not in that "narrow" dataset. In this case, you want to drop all vars not in datset2. But you want to merge such that dataset2 VALUES prevail:

data Dataset1;
  input A  B  C;
datalines;
0   0   0
run;

data Dataset2;
  input  B  A ;
datalines;
 4   1
 5   2
 6   3
run;

data dataset3;
  if 0 then set dataset2;      /*Start the PDV with the dataset2 vars*/
  retain _sentinel_between .;  /*Append this var to the PDV */
  merge dataset1 dataset2;     /*Append any vars exclusively in dataset1 to PDV*/
  retain _sentinel_after .;    /*Append a final var to the PDV*/
  drop _sentinel_between -- _sentinel_after;
run;

Note the PDV (program data vector) is the set of variables that the sas compiler establishes by parsing your sas code, prior to processing any data. At each statement above the PDV is expanded when any new variables are encountered. So by putting in _sentinel_between and _sentinel_after variables in strategic locations in the PDV, it becomes easy to specify a list of variables to drop.

The double dash in drop _sentinel_between -- _sentinel_after; tells sas to include all vars between the two sentinels, including the end points.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Re: Merging and keeping only common variables

Registration is open

SAS Training: Just a Click Away