08-03-2015 11:18 PM
I have 10 SAS data set with different columns (Some columns are same, and some are different). All these data sets are without identical IDs. My goal is to combine these 10 SAS data sets to 1 single master data set with identical IDs. By doing this, I will be able to retrieve any records with their original columns from master data set by select the unique IDs.
Like say, these 10 data set are from year 1990 to 2000.(e.g. DSN90 : DSN00), and I already created IDs '9000001','0900002'...for data set DSN90, and "9100001','9100002' for data set DSN91 and so on.
What I want is:
If I select ID='9000001' from master data set, I will get the first record of data set DSN90 with the its columns.
If I select ID='9100001' from master data set, I will get the first record of data set DSN91 with the its columns.
However, the columns from DSN90 is different with DSN91. What my master data set should look like? Please advise. Thanks in advance.
08-04-2015 11:48 AM
If you combine datasets as described for your example ID='9000001' selection result will have the columns associated with all of the other data sets as well but with all missing values for those that weren't in DSN90. If you don't want those extra columns you would have to post process the data to remove any column with all missing data. If you don't mind the blank columns then something like
set dsn90 dsn91 <etc>;
Either that or write a very complex extraction program that would require knowing which original data set an id is from.
I have a very hard time with the "advantage" of this structure may be.
I would be very tempted to consider a data set of ID and Source data set and use that for a control to pull records from the original
08-04-2015 01:41 PM
Some columns are same, and some are different)
That is your main issue. How do you want the output to look like for the column names that are the same? Where the columns are different?
If you already have ID's, you can use a Data Step to combine your 10 datasets relatively easily as well as to identify the source data set.
See the INDSNAME option highlighted by Rick yesterday:
08-04-2015 01:53 PM
this is where macro/macro variables com in handy. You can change following code to suit your needs but it should help you subset data and variables.
data all; ** will include both variables var1+var2;
set dsn90 dsn91;
/* macro variable for yr */
/* macro variable for list of variables in original data for that year*/
proc sql noprint;
select distinct(name) into: _varlst separated by " "
where upcase(libname)="WORK" and upcase(memname)="DSN&_YR.";
options symbolgen mlogic mprint;