BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gmejia
Calcite | Level 5

Hello everyone and thanks for your help.

I am combining multiple datasets, I've created a flag for each dataset - e.g., d1, d2, d3 etc.

I have multiple observations per person and each observation can come from a different dataset.

In this particular case, there are 3 possible datasets, but I've limited my sample to only those whose data comes from two different datasets.

The data looks something like:

ID   dataset

1          d1

1          d1

1          d1

1          d2

1          d1

2          d1

2          d3

2          d1

3          d2

3          d3

I want to find out how many people have data from d1/d2, how many from d1/d3, and how many from d2/d3.

(I am not interested in knowing how many observations come from each dataset - for example, for person 1 I do not want to know they have 4 observations from d1 and 1 observation from d2 -  I just want to know their data is from d1/d2).

I hope that makes sense...

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3.  If that's the case then the program below will scale nicely.  Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.

 

Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.  

 

 

data have;
  input id  dataset :$2.  @@;
datalines;
1 d1  1 d2  1 d3
2 d3  2 d2
3 d3  3 d1
4 d2  4 d1
5 d3
6 d2
7 d1
run;

data want (drop=dataset _:);
  merge have (keep=id dataset where=(dataset='d1') in=in1)
        have (keep=id dataset where=(dataset='d2') in=in2)
        have (keep=id dataset where=(dataset='d3') in=in3) ;
  by id;
  array in {3} in1-in3;
  if first.id and sum(of in{*})>=2;

  array dd {3,3} _dum d1d2 d1d3
                 _dum _dum d2d3
                 _dum _dum _dum  ;
  do _row=1 to dim(in)-1;
    do _col=_row+1 to dim(in);
      if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
    end;
  end;
run;

proc means data=want n missing;
  var d1d2--d2d3;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

4 REPLIES 4
error_prone
Barite | Level 11

Try this one:

 

data work.want(keep= Id d1d2 d1d3 d2d3);
   set work.have;
   by id;
   
   length inD1 inD2 inD3 d1d2 d1d3 d2d3 i 3;
   array in inD:;
   retain inD:;
   
   if first.id then do;
      call missing(of inD:);
   end;
   
   i = input(substr(dataset, 2), 1.);
   in{i} = 1;
   
   if last.id then do;
      d1d2 = inD1 and inD2;
      d1d3 = inD1 and inD3;
      d2d3 = inD2 and inD3;
      output;
   end;
run;
mkeintz
PROC Star

Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3.  If that's the case then the program below will scale nicely.  Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.

 

Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.  

 

 

data have;
  input id  dataset :$2.  @@;
datalines;
1 d1  1 d2  1 d3
2 d3  2 d2
3 d3  3 d1
4 d2  4 d1
5 d3
6 d2
7 d1
run;

data want (drop=dataset _:);
  merge have (keep=id dataset where=(dataset='d1') in=in1)
        have (keep=id dataset where=(dataset='d2') in=in2)
        have (keep=id dataset where=(dataset='d3') in=in3) ;
  by id;
  array in {3} in1-in3;
  if first.id and sum(of in{*})>=2;

  array dd {3,3} _dum d1d2 d1d3
                 _dum _dum d2d3
                 _dum _dum _dum  ;
  do _row=1 to dim(in)-1;
    do _col=_row+1 to dim(in);
      if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
    end;
  end;
run;

proc means data=want n missing;
  var d1d2--d2d3;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
gmejia
Calcite | Level 5

It worked beautifully, thank you!

Astounding
PROC Star

Is this a better mousetrap?  For convenience, I will assume you have up to 9 sources (so the maximum value is "d9") and the length of DATASET is $ 2.  If the actual data is different, adjustments can be made.

 

data want;

length all_sources $ 27;

do until (last.ID);

   set have;

   by id;

   start = input(substr(dataset, 2), 1.);

   substr(all_sources, start*3-2, 2) = dataset;

end;

drop start;

run;

 

proc freq data=want;

   tables all_sources;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 740 views
  • 0 likes
  • 4 in conversation