BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rmlmrmlm
Obsidian | Level 7

Hello.

 

A dataset_1 has 100 variables

A dataset_2 has 1500 variables   (50 are common to dataset_1).

 

I want to append the observations of dataset_2 to dataset_1, but choosing only the 50 common variables, without writing 50 KEEP's, one by one.

 

Is this possible using some kind of KEEP?

 

DATA dataset_1;
     KEEP "50common_variables)

     SET dataset_2;

RUN;

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

This is a case where one of the special features of PROC SQL, namely the ability to access metadata via the DICTIONARY container name, is very handy:

 

proc sql noprint;
   select a.name into :keeplist separated by ' '
   from
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_1") as a
   inner join
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_2") as b
on a.name=b.name;
quit;

%put &=keeplist;
data want; 
  set d1 (keep=&keeplist)
      d2 (keep=&keeplist) ;
run;

 

 

There are a LOT of dictionaries available via PROC SQL.   Dictionary COLUMNS is the one suitable for your task. Do a "select * from dictionary.dictionaries" to see them all.

 

Now, if your answer to @Astounding's question is that you don't want the common variables, but rather all the variables from one of the data sets  (i.e. exclude vars only in the 2nd dataset), you don't even need metadata access.  Instead, you can just manipulate the program-data-vector constructed by the sas compiler in a data step:

 

data want (drop=_pre _post);
  retain _pre .;
  if 0 then set dataset_1;
  retain _post .;
  set dataset_1 dataset_2;
  keep _pre -- _post ;
run;

 

 Edit entered to initialize the variables (to missing values) in the two RETAIN statements.

 This all depends on how the pdv (program data vector) is enlarged statement-by-statement step-by-step by the sas compiler:

  1. The leading "retain _pre" statement makes the variable _PRE the leftmost in the PDV.  It's the first var seen by the compiler.
  2. The "if 0 then set" statement reads no data, but tells the compiler to make provision for all the variables in dataset_1, just to the right of _PRE.
  3. The "retain _post" put the corresponding variable in the next position, i.e. to the right of the last variable in dataset_1.
  4. the "SET dataset_1 dataset_2"  statement not only reads in the data, but forces the compiler to include all the as-yet-unreferenced variables - namely those in dataset_2 that are not in dataset_1.  They will all be to the right of _POST.
  5. The keep statement uses the double dash convention to specify a list of vars starting with _PRE and ending with _POST.
  6. Final cleanup: the "drop=" parameter tells the compiler to not allow those vars to go to dataset want.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
Astounding
PROC Star

Just to clarify ...

 

Should the final result include only the 50 common variables, or should it include all 100 variables from dataset 1?

mkeintz
PROC Star

This is a case where one of the special features of PROC SQL, namely the ability to access metadata via the DICTIONARY container name, is very handy:

 

proc sql noprint;
   select a.name into :keeplist separated by ' '
   from
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_1") as a
   inner join
     (select name from dictionary.columns where libname="WORK" and memname="DATASET_2") as b
on a.name=b.name;
quit;

%put &=keeplist;
data want; 
  set d1 (keep=&keeplist)
      d2 (keep=&keeplist) ;
run;

 

 

There are a LOT of dictionaries available via PROC SQL.   Dictionary COLUMNS is the one suitable for your task. Do a "select * from dictionary.dictionaries" to see them all.

 

Now, if your answer to @Astounding's question is that you don't want the common variables, but rather all the variables from one of the data sets  (i.e. exclude vars only in the 2nd dataset), you don't even need metadata access.  Instead, you can just manipulate the program-data-vector constructed by the sas compiler in a data step:

 

data want (drop=_pre _post);
  retain _pre .;
  if 0 then set dataset_1;
  retain _post .;
  set dataset_1 dataset_2;
  keep _pre -- _post ;
run;

 

 Edit entered to initialize the variables (to missing values) in the two RETAIN statements.

 This all depends on how the pdv (program data vector) is enlarged statement-by-statement step-by-step by the sas compiler:

  1. The leading "retain _pre" statement makes the variable _PRE the leftmost in the PDV.  It's the first var seen by the compiler.
  2. The "if 0 then set" statement reads no data, but tells the compiler to make provision for all the variables in dataset_1, just to the right of _PRE.
  3. The "retain _post" put the corresponding variable in the next position, i.e. to the right of the last variable in dataset_1.
  4. the "SET dataset_1 dataset_2"  statement not only reads in the data, but forces the compiler to include all the as-yet-unreferenced variables - namely those in dataset_2 that are not in dataset_1.  They will all be to the right of _POST.
  5. The keep statement uses the double dash convention to specify a list of vars starting with _PRE and ending with _POST.
  6. Final cleanup: the "drop=" parameter tells the compiler to not allow those vars to go to dataset want.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

Proc append will append a data set and only have common variables in the result though you need to use FORCE option. So you would only need to Keep the values for dataset_1

 

Proc append base=dataset_1 (keep= list to keep) data=dataset_2 Force;

run;

 

HOWEVER as with a data step variables of the same name will have to have the same type and if character variables in dataset_2 are longer than in dataset_1 they may be truncated.

 

If the variables you want from dataset_1 have "nice" names like ABC1, ABC2  (a common prefix or start of the variable name) you can use : notation for a list, keep = ABC:  would keep all variables that start with the letters ABC, or you could use a range list: Keep = ABC1-ABC9 would keep all the ABC variables with suffixes between 1 and 9. Or the -- (two dashes) gets "sequential" varibles

Keep a--x   would keep the variables a to x in order (if a was the 10th variable column and x was the 15th variable column you would get 10th through 15th variables. This could be modified to request either the numeric or character variables in those columns: a-numeric-x or a-character-x

or you may be able to use _numeric_ or _character_ if you some of what you want is ALL numeric or ALL character variables.

Ksharp
Super User
It is easy for SQL.



data class1;
 set sashelp.class;
 keep name sex age;
run;
data class2;
 set sashelp.class;
 keep sex age weight;
run;
proc sql;
create table want as
select * from class1
union corresponding all
select * from class2;
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 2267 views
  • 3 likes
  • 5 in conversation