BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gmejia
Calcite | Level 5

Hello everyone and thanks for your help.

I am combining multiple datasets, I've created a flag for each dataset - e.g., d1, d2, d3 etc.

I have multiple observations per person and each observation can come from a different dataset.

In this particular case, there are 3 possible datasets, but I've limited my sample to only those whose data comes from two different datasets.

The data looks something like:

ID   dataset

1          d1

1          d1

1          d1

1          d2

1          d1

2          d1

2          d3

2          d1

3          d2

3          d3

I want to find out how many people have data from d1/d2, how many from d1/d3, and how many from d2/d3.

(I am not interested in knowing how many observations come from each dataset - for example, for person 1 I do not want to know they have 4 observations from d1 and 1 observation from d2 -  I just want to know their data is from d1/d2).

I hope that makes sense...

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3.  If that's the case then the program below will scale nicely.  Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.

 

Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.  

 

 

data have;
  input id  dataset :$2.  @@;
datalines;
1 d1  1 d2  1 d3
2 d3  2 d2
3 d3  3 d1
4 d2  4 d1
5 d3
6 d2
7 d1
run;

data want (drop=dataset _:);
  merge have (keep=id dataset where=(dataset='d1') in=in1)
        have (keep=id dataset where=(dataset='d2') in=in2)
        have (keep=id dataset where=(dataset='d3') in=in3) ;
  by id;
  array in {3} in1-in3;
  if first.id and sum(of in{*})>=2;

  array dd {3,3} _dum d1d2 d1d3
                 _dum _dum d2d3
                 _dum _dum _dum  ;
  do _row=1 to dim(in)-1;
    do _col=_row+1 to dim(in);
      if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
    end;
  end;
run;

proc means data=want n missing;
  var d1d2--d2d3;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
error_prone
Barite | Level 11

Try this one:

 

data work.want(keep= Id d1d2 d1d3 d2d3);
   set work.have;
   by id;
   
   length inD1 inD2 inD3 d1d2 d1d3 d2d3 i 3;
   array in inD:;
   retain inD:;
   
   if first.id then do;
      call missing(of inD:);
   end;
   
   i = input(substr(dataset, 2), 1.);
   in{i} = 1;
   
   if last.id then do;
      d1d2 = inD1 and inD2;
      d1d3 = inD1 and inD3;
      d2d3 = inD2 and inD3;
      output;
   end;
run;
mkeintz
PROC Star

Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3.  If that's the case then the program below will scale nicely.  Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.

 

Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.  

 

 

data have;
  input id  dataset :$2.  @@;
datalines;
1 d1  1 d2  1 d3
2 d3  2 d2
3 d3  3 d1
4 d2  4 d1
5 d3
6 d2
7 d1
run;

data want (drop=dataset _:);
  merge have (keep=id dataset where=(dataset='d1') in=in1)
        have (keep=id dataset where=(dataset='d2') in=in2)
        have (keep=id dataset where=(dataset='d3') in=in3) ;
  by id;
  array in {3} in1-in3;
  if first.id and sum(of in{*})>=2;

  array dd {3,3} _dum d1d2 d1d3
                 _dum _dum d2d3
                 _dum _dum _dum  ;
  do _row=1 to dim(in)-1;
    do _col=_row+1 to dim(in);
      if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
    end;
  end;
run;

proc means data=want n missing;
  var d1d2--d2d3;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
gmejia
Calcite | Level 5

It worked beautifully, thank you!

Astounding
PROC Star

Is this a better mousetrap?  For convenience, I will assume you have up to 9 sources (so the maximum value is "d9") and the length of DATASET is $ 2.  If the actual data is different, adjustments can be made.

 

data want;

length all_sources $ 27;

do until (last.ID);

   set have;

   by id;

   start = input(substr(dataset, 2), 1.);

   substr(all_sources, start*3-2, 2) = dataset;

end;

drop start;

run;

 

proc freq data=want;

   tables all_sources;

run;

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1378 views
  • 0 likes
  • 4 in conversation