BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ANKH1
Pyrite | Level 9

Hi, I would like to merge multiple datasets by ID. Each dataset has different number of rows. Some datasets have more than one variable of interest, in this case is age, that is, the same dataset might have age1, age2, age3. The end goal is to be able to report the frequencies of variable age per ID and be able to know where each data point comes from (depicted by the variable ds). These are the sample datasets:

data ds1;
input ID age1;
datalines;
1 12
3 11
4 10
4 15
;
run;

data ds2;
input ID age2 age3 age4;
datalines;
1 13 13 .
1 23 14 .
2 25 . 21
;
run;

data ds3;
input ID age5;
datalines;
3 22
3 32
3 33
4 12
;
run;

The output that we are aiming for is this before using proc freq by ID:

 

IDageds
1121
1132
1132
1232
1142
2252
2212
3111
3223
3323
3333
4101
4151
4123

 

 

I tried this code but it omits the observation from ds2, ID1, age3=13:

data ages_all;
  /* First dataset */
  set ds1(rename=(age1=age) in=in1 keep=ID age1);
  
  /* Second dataset */
  set ds2(rename=(age2=age age3=age_renamed1 age4=age_renamed2) in=in2 keep=ID age2 age3 age4);
  
  /* Rename age_renamed1 and age_renamed2 to age */
  if in2 then age = coalesce(age_renamed1, age_renamed2);
  
  /* Third dataset */
  set ds3(rename=(age5=age) in=in3 keep=ID age5);

  by ID;
  
  ds = in1 + 2*in2 + 3*in3;
  drop age_renamed1 age_renamed2; /* Drop the temporary variables */
run;
1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

It would be easier to just use an array.

data want;
  set ds1 ds2 ds3 ;
  by id;
  array x age1-age5;
  do index=1 to dim(x);
     age=x[index];
     if not missing(age) then output;
  end;
  drop age1-age5;
run;

If you really did need to know which value came from which dataset you use the INDEX variable for that.  INDEX=1 is from DS1 and INDEX=5 is from DS3 and the other three are all from DS2.

View solution in original post

8 REPLIES 8
Tom
Super User Tom
Super User

It would be easier to just use an array.

data want;
  set ds1 ds2 ds3 ;
  by id;
  array x age1-age5;
  do index=1 to dim(x);
     age=x[index];
     if not missing(age) then output;
  end;
  drop age1-age5;
run;

If you really did need to know which value came from which dataset you use the INDEX variable for that.  INDEX=1 is from DS1 and INDEX=5 is from DS3 and the other three are all from DS2.

ANKH1
Pyrite | Level 9

Thanks! This is very helpful. Is there a way to say that age2,age3 and age4 belong to index 2? The problem is that we have more than 20 datasets and each one might or might have multiple age variables. 

Tom
Super User Tom
Super User

You could do it the same way as in your original program using the IN= dataset option.

For example to make a numeric variable:

set ds1(in=in1) ds2(in=in2) ds3(in=in3);
....
ds=in1 + 2*in2 +3*in3;
...

Or you could use the INDSNAME= option of the SET statement can create a character variable.

set ds1 ds2 ds3 indsname=indsname;
...
dsname=indsname;
...
ANKH1
Pyrite | Level 9
Thanks! I used your code with indsname, it worked! How can I accept two answers as solutions?
ballardw
Super User

Multiple SET statements behave in a way you likely do not expect.

You really need to watch what the observations notes in the log shows.

Brief example:

data one;
   input x;
datalines;
1
2
3
;

data two;
   input y;
datalines;
11
22
;

data sets;
   set one;
   set two;
run;

How many observations do you expect?

Now, read the LOG:

16   data sets;
17      set one;
18      set two;
19   run;

NOTE: There were 3 observations read from the data set WORK.ONE.
NOTE: There were 2 observations read from the data set WORK.TWO.
NOTE: The data set WORK.SETS has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

3 observations from the first set but the final output data only has 2 observations.

Likely your output only includes the number of observations from the set with the smallest number.

 

 

MERGE is a very specific concept in SAS and what you did is not a merge. Especially when you want to combine records according to the value of one or more variables. SAS uses a BY statement with MERGE to align observations on a common value of one or more variables. Caution: Merge does not like, meaning you get unexpected results, if more than one data set contributes multiple values to a Merge.

ANKH1
Pyrite | Level 9
Ok, thanks for the explanation. I shouldn't say merge.
ballardw
Super User

@ANKH1 wrote:
Ok, thanks for the explanation. I shouldn't say merge.

AND realize that multiple SET statements is a much more complex issue than stacking (appending) data.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 1171 views
  • 3 likes
  • 4 in conversation