DATA Step, Macro, Functions and more

KEEP without enumerating dozens of variables

Accepted Solution Solved
Reply
Contributor
Posts: 22
Accepted Solution

KEEP without enumerating dozens of variables

Hello.

 

A dataset_1 has 100 variables

A dataset_2 has 1500 variables   (50 are common to dataset_1).

 

I want to append the observations of dataset_2 to dataset_1, but choosing only the 50 common variables, without writing 50 KEEP's, one by one.

 

Is this possible using some kind of KEEP?

 

DATA dataset_1;
     KEEP "50common_variables)

     SET dataset_2;

RUN;

 

Thank you!


Accepted Solutions
Solution
‎01-11-2017 06:48 AM
Trusted Advisor
Posts: 1,022

Re: KEEP without enumerating dozens of variables

[ Edited ]

This is a case where one of the special features of PROC SQL, namely the ability to access metadata via the DICTIONARY container name, is very handy:

 

proc sql noprint;
   select a.name into :keeplist separated by ' '
   from
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_1") as a
   inner join
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_2") as b
on a.name=b.name;
quit;

%put &=keeplist;
data want; 
  set d1 (keep=&keeplist)
      d2 (keep=&keeplist) ;
run;

 

 

There are a LOT of dictionaries available via PROC SQL.   Dictionary COLUMNS is the one suitable for your task. Do a "select * from dictionary.dictionaries" to see them all.

 

Now, if your answer to @Astounding's question is that you don't want the common variables, but rather all the variables from one of the data sets  (i.e. exclude vars only in the 2nd dataset), you don't even need metadata access.  Instead, you can just manipulate the program-data-vector constructed by the sas compiler in a data step:

 

data want (drop=_pre _post);
  retain _pre .;
  if 0 then set dataset_1;
  retain _post .;
  set dataset_1 dataset_2;
  keep _pre -- _post ;
run;

 

 Edit entered to initialize the variables (to missing values) in the two RETAIN statements.

 This all depends on how the pdv (program data vector) is enlarged statement-by-statement step-by-step by the sas compiler:

  1. The leading "retain _pre" statement makes the variable _PRE the leftmost in the PDV.  It's the first var seen by the compiler.
  2. The "if 0 then set" statement reads no data, but tells the compiler to make provision for all the variables in dataset_1, just to the right of _PRE.
  3. The "retain _post" put the corresponding variable in the next position, i.e. to the right of the last variable in dataset_1.
  4. the "SET dataset_1 dataset_2"  statement not only reads in the data, but forces the compiler to include all the as-yet-unreferenced variables - namely those in dataset_2 that are not in dataset_1.  They will all be to the right of _POST.
  5. The keep statement uses the double dash convention to specify a list of vars starting with _PRE and ending with _POST.
  6. Final cleanup: the "drop=" parameter tells the compiler to not allow those vars to go to dataset want.

View solution in original post


All Replies
Super User
Posts: 5,516

Re: KEEP without enumerating dozens of variables

Just to clarify ...

 

Should the final result include only the 50 common variables, or should it include all 100 variables from dataset 1?

Solution
‎01-11-2017 06:48 AM
Trusted Advisor
Posts: 1,022

Re: KEEP without enumerating dozens of variables

[ Edited ]

This is a case where one of the special features of PROC SQL, namely the ability to access metadata via the DICTIONARY container name, is very handy:

 

proc sql noprint;
   select a.name into :keeplist separated by ' '
   from
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_1") as a
   inner join
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_2") as b
on a.name=b.name;
quit;

%put &=keeplist;
data want; 
  set d1 (keep=&keeplist)
      d2 (keep=&keeplist) ;
run;

 

 

There are a LOT of dictionaries available via PROC SQL.   Dictionary COLUMNS is the one suitable for your task. Do a "select * from dictionary.dictionaries" to see them all.

 

Now, if your answer to @Astounding's question is that you don't want the common variables, but rather all the variables from one of the data sets  (i.e. exclude vars only in the 2nd dataset), you don't even need metadata access.  Instead, you can just manipulate the program-data-vector constructed by the sas compiler in a data step:

 

data want (drop=_pre _post);
  retain _pre .;
  if 0 then set dataset_1;
  retain _post .;
  set dataset_1 dataset_2;
  keep _pre -- _post ;
run;

 

 Edit entered to initialize the variables (to missing values) in the two RETAIN statements.

 This all depends on how the pdv (program data vector) is enlarged statement-by-statement step-by-step by the sas compiler:

  1. The leading "retain _pre" statement makes the variable _PRE the leftmost in the PDV.  It's the first var seen by the compiler.
  2. The "if 0 then set" statement reads no data, but tells the compiler to make provision for all the variables in dataset_1, just to the right of _PRE.
  3. The "retain _post" put the corresponding variable in the next position, i.e. to the right of the last variable in dataset_1.
  4. the "SET dataset_1 dataset_2"  statement not only reads in the data, but forces the compiler to include all the as-yet-unreferenced variables - namely those in dataset_2 that are not in dataset_1.  They will all be to the right of _POST.
  5. The keep statement uses the double dash convention to specify a list of vars starting with _PRE and ending with _POST.
  6. Final cleanup: the "drop=" parameter tells the compiler to not allow those vars to go to dataset want.
Super User
Posts: 11,343

Re: KEEP without enumerating dozens of variables

Proc append will append a data set and only have common variables in the result though you need to use FORCE option. So you would only need to Keep the values for dataset_1

 

Proc append base=dataset_1 (keep= list to keep) data=dataset_2 Force;

run;

 

HOWEVER as with a data step variables of the same name will have to have the same type and if character variables in dataset_2 are longer than in dataset_1 they may be truncated.

 

If the variables you want from dataset_1 have "nice" names like ABC1, ABC2  (a common prefix or start of the variable name) you can use : notation for a list, keep = ABC:  would keep all variables that start with the letters ABC, or you could use a range list: Keep = ABC1-ABC9 would keep all the ABC variables with suffixes between 1 and 9. Or the -- (two dashes) gets "sequential" varibles

Keep a--x   would keep the variables a to x in order (if a was the 10th variable column and x was the 15th variable column you would get 10th through 15th variables. This could be modified to request either the numeric or character variables in those columns: a-numeric-x or a-character-x

or you may be able to use _numeric_ or _character_ if you some of what you want is ALL numeric or ALL character variables.

Super User
Posts: 10,041

Re: KEEP without enumerating dozens of variables

It is easy for SQL.



data class1;
 set sashelp.class;
 keep name sex age;
run;
data class2;
 set sashelp.class;
 keep sex age weight;
run;
proc sql;
create table want as
select * from class1
union corresponding all
select * from class2;
quit;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 389 views
  • 3 likes
  • 5 in conversation