Hi, I am trying to 1) count the number of RBCs before the first unit of plasma for each id and 2) calculate the plasma ratio (plasma/(plasma+rbc)) at the time of the first unit of plasma for each id.
here is what my data looks like:
id | product | time |
1 | rbc | 1 |
1 | rbc | 2 |
1 | plasma | 3 |
1 | rbc | 4 |
1 | plasma | 5 |
2 | plasma | 1 |
2 | rbc | 2 |
2 | rbc | 3 |
But this is what I want it to look like:
id | # of rbcs before plasma | ratio | time |
1 | 2 | .333 | 3 |
2 | 0 | 1 | 1 |
Are there ever any other values of product besides "rbc" and "plasma"? If so, you need to provide an example and the rules of how they might affect the results you want.
This works for your example data. Please note use of the data step to create a working data set to run code against.
data have; input id $ product $ time; datalines; 1 rbc 1 1 rbc 2 1 plasma 3 1 rbc 4 1 plasma 5 2 plasma 1 2 rbc 2 2 rbc 3 ; data want; set have; by id; retain rbccount plasmacount; if first.id then do ; rbccount=0; plasmacount=0; end; if product='rbc' and plasmacount=0 then rbccount+1; if product='plasma' and plasmacount=0 then plasmacount+1; if plasmacount=1 then do; ratio=plasmacount/(plasmacount+rbccount); output; plasmacount=.; end; keep id rbccount ratio time; label rbccount = '# of rbc before first plasma'; run;
The BY Id expects your ID values to be sorted. If they aren't you can add the NOTSORTED option to the By statement. That would then treat different blocks of a repeated ID as a different sequence with another output row.
The BY statement in a data step creates automatic variables FIRST (and LAST) that indicate whether the current observation is the first or last of a group. The values are 1/0 (true / false) and can be tested as such to set or reset values when occurring as shown.
The RETAIN instruction means that the values are kept across the boundary of a data step so are available for use with later observations in the set. Caveat: Retain with a variable that exists in the source data set gets reset to the value read in from the data set so not useful to retain an already existing variable.
The OUTPUT statement, when used, will only write to the output set when executed. So the above code will only write the first time 'plasma' is encountered per Id. It spends some time spinning wheels without output so can be inefficient with very large data sets with many more records after the first plasma record.
Are there ever any other values of product besides "rbc" and "plasma"? If so, you need to provide an example and the rules of how they might affect the results you want.
This works for your example data. Please note use of the data step to create a working data set to run code against.
data have; input id $ product $ time; datalines; 1 rbc 1 1 rbc 2 1 plasma 3 1 rbc 4 1 plasma 5 2 plasma 1 2 rbc 2 2 rbc 3 ; data want; set have; by id; retain rbccount plasmacount; if first.id then do ; rbccount=0; plasmacount=0; end; if product='rbc' and plasmacount=0 then rbccount+1; if product='plasma' and plasmacount=0 then plasmacount+1; if plasmacount=1 then do; ratio=plasmacount/(plasmacount+rbccount); output; plasmacount=.; end; keep id rbccount ratio time; label rbccount = '# of rbc before first plasma'; run;
The BY Id expects your ID values to be sorted. If they aren't you can add the NOTSORTED option to the By statement. That would then treat different blocks of a repeated ID as a different sequence with another output row.
The BY statement in a data step creates automatic variables FIRST (and LAST) that indicate whether the current observation is the first or last of a group. The values are 1/0 (true / false) and can be tested as such to set or reset values when occurring as shown.
The RETAIN instruction means that the values are kept across the boundary of a data step so are available for use with later observations in the set. Caveat: Retain with a variable that exists in the source data set gets reset to the value read in from the data set so not useful to retain an already existing variable.
The OUTPUT statement, when used, will only write to the output set when executed. So the above code will only write the first time 'plasma' is encountered per Id. It spends some time spinning wheels without output so can be inefficient with very large data sets with many more records after the first plasma record.
/*Just have some fun.*/
data have;
input id $ product $ time;
datalines;
1 rbc 1
1 rbc 2
1 plasma 3
1 rbc 4
1 plasma 5
2 plasma 1
2 rbc 2
2 rbc 3
;
data want;
do until(last.id);
set have;
by id;
if product='plasma' and not found then do;found=1;_time=time;end;
end;
rbccount=_time-1;
ratio=divide(1,_time);
keep id rbccount ratio _time;
label rbccount = '# of rbc before first plasma';
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.