BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gmejia
Calcite | Level 5

Hello everyone and thanks for your help.

I am combining multiple datasets, I've created a flag for each dataset - e.g., d1, d2, d3 etc.

I have multiple observations per person and each observation can come from a different dataset.

In this particular case, there are 3 possible datasets, but I've limited my sample to only those whose data comes from two different datasets.

The data looks something like:

ID   dataset

1          d1

1          d1

1          d1

1          d2

1          d1

2          d1

2          d3

2          d1

3          d2

3          d3

I want to find out how many people have data from d1/d2, how many from d1/d3, and how many from d2/d3.

(I am not interested in knowing how many observations come from each dataset - for example, for person 1 I do not want to know they have 4 observations from d1 and 1 observation from d2 -  I just want to know their data is from d1/d2).

I hope that makes sense...

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3.  If that's the case then the program below will scale nicely.  Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.

 

Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.  

 

 

data have;
  input id  dataset :$2.  @@;
datalines;
1 d1  1 d2  1 d3
2 d3  2 d2
3 d3  3 d1
4 d2  4 d1
5 d3
6 d2
7 d1
run;

data want (drop=dataset _:);
  merge have (keep=id dataset where=(dataset='d1') in=in1)
        have (keep=id dataset where=(dataset='d2') in=in2)
        have (keep=id dataset where=(dataset='d3') in=in3) ;
  by id;
  array in {3} in1-in3;
  if first.id and sum(of in{*})>=2;

  array dd {3,3} _dum d1d2 d1d3
                 _dum _dum d2d3
                 _dum _dum _dum  ;
  do _row=1 to dim(in)-1;
    do _col=_row+1 to dim(in);
      if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
    end;
  end;
run;

proc means data=want n missing;
  var d1d2--d2d3;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
error_prone
Barite | Level 11

Try this one:

 

data work.want(keep= Id d1d2 d1d3 d2d3);
   set work.have;
   by id;
   
   length inD1 inD2 inD3 d1d2 d1d3 d2d3 i 3;
   array in inD:;
   retain inD:;
   
   if first.id then do;
      call missing(of inD:);
   end;
   
   i = input(substr(dataset, 2), 1.);
   in{i} = 1;
   
   if last.id then do;
      d1d2 = inD1 and inD2;
      d1d3 = inD1 and inD3;
      d2d3 = inD2 and inD3;
      output;
   end;
run;
mkeintz
PROC Star

Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3.  If that's the case then the program below will scale nicely.  Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.

 

Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.  

 

 

data have;
  input id  dataset :$2.  @@;
datalines;
1 d1  1 d2  1 d3
2 d3  2 d2
3 d3  3 d1
4 d2  4 d1
5 d3
6 d2
7 d1
run;

data want (drop=dataset _:);
  merge have (keep=id dataset where=(dataset='d1') in=in1)
        have (keep=id dataset where=(dataset='d2') in=in2)
        have (keep=id dataset where=(dataset='d3') in=in3) ;
  by id;
  array in {3} in1-in3;
  if first.id and sum(of in{*})>=2;

  array dd {3,3} _dum d1d2 d1d3
                 _dum _dum d2d3
                 _dum _dum _dum  ;
  do _row=1 to dim(in)-1;
    do _col=_row+1 to dim(in);
      if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
    end;
  end;
run;

proc means data=want n missing;
  var d1d2--d2d3;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
gmejia
Calcite | Level 5

It worked beautifully, thank you!

Astounding
PROC Star

Is this a better mousetrap?  For convenience, I will assume you have up to 9 sources (so the maximum value is "d9") and the length of DATASET is $ 2.  If the actual data is different, adjustments can be made.

 

data want;

length all_sources $ 27;

do until (last.ID);

   set have;

   by id;

   start = input(substr(dataset, 2), 1.);

   substr(all_sources, start*3-2, 2) = dataset;

end;

drop start;

run;

 

proc freq data=want;

   tables all_sources;

run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1020 views
  • 0 likes
  • 4 in conversation