BookmarkSubscribeRSS Feed
katemarieb
Calcite | Level 5

Hello all,

I am working with a dataset in which I care mostly about 3 variables: pos_test1--pos_test50 (positive test result), complete_date and collect_date1--collect_date50. All variables are unique to the individual, and &do_n is a macro currently set to 50 (collect_date1-collect_date50, pos_test1-pos_test50). I want to identify the closest date of the collect_date variable range with a positive test result to to the complete_date. Currently, it is accurately selecting that ANY positive test linked with a collect_date variable exists, but not matching the date correctly. Here is the code that I am using: 

 

data want;
data have;

array pos_test [&do_n] pos_test1-pos_test&do_n;
array collect_date [&do_n] collect_date1-collect_date&do_n;

do i = 1 to &do_n;
if pos_test[i] =1 then do; closest=collect_date[1]; diff = (collect_date[1]-complete_date); if collect_date[i] > complete_date then do; if . < (collect_date[i]-complete_date) < diff then closest=collect_date[i]; end; end; end; format closest mmddyy10.; run;

 

It returns the correct individual cases, however, it does not return the correct date. It always returns collect_date1. My intention is to select the matching pair of pos_test and collect_date. For example a test on 5/12/21 had a positive result, but 5/20/21 had a negative. This code is identifying that the 5/12/21 positive test exists, and including that case in the selected output, but is still outputting the "closest" variable as 5/20/21. 

 

I hope this makes sense. Please feel free to ask clarifying questions and thank you in advance for the help. 

4 REPLIES 4
Patrick
Opal | Level 21

It's hard to really investigate such questions without sample data and fully functional code to actually execute and replicate what you describe.

From the looks of it and your description may-be the issue is caused by you using a value of 1 instead of variable i as array index in below code section.

closest=collect_date[1];
diff = (collect_date[1]-complete_date);
ballardw
Super User

@Patrick wrote:

It's hard to really investigate such questions without sample data and fully functional code to actually execute and replicate what you describe.

From the looks of it and your description may-be the issue is caused by you using a value of 1 instead of variable i as array index in below code section.

closest=collect_date[1];
diff = (collect_date[1]-complete_date);

I've never made that mistake in 30+ years of programming. Honest! 😈

 

katemarieb
Calcite | Level 5

I completely understand the difficulty, apologies! It is a protected dataset, but it looks something like this once in SAS and cleaned:

 

 

 Pos_test1Pos_test2Pos_test3Complete_dateClosestCollect_date1Collect_date2Collect_date3
ID111.5/21/20216/20/20216/20/20216/10/2021.
ID21..3/2/20215/10/20215/10/2021..
ID30016/4/20217/10/20217/10/20216/20/20216/22/2021

 

Maybe this helps to visualize. Each individual can have up to 50 tests, but many have only 1, or just a few. 

 

Thank you for the tip about the 1, an easy oversight! However, that was intentional. Oftentimes the collect_date1 is the date desired, so I set the baseline at collect_date1. Then, in later steps, I attempted to write that IF any other collect_date[2-50] are closer in difference from the complete_date than collect_date1, that should be chosen as "closest" instead. 

 

The issue with the current code is that for ID1, it chooses collect_date1 as the closest date to the complete_date, instead of collect_date2. ID2 is correct by default, but then the issue returns with ID3. It should ignore collect_date1, because pos_test1 is 0. However, it accurately recognizes that collect_date3 and pos_test3 should be included, and then it returns the value of collect_date1 as the closest value. 

 

I'm not sure why it isn't overwriting the original closest variable when it finds a closest date. Maybe I don't need the placeholder of the first 2 collect_date1 lines? I will try that to see if it fixes it. 

 

 

 

 

ballardw
Super User

Here is a data set of your "example" data for others to use as a start.

data have;
  informat id $5. pos_test1-Pos_test3 f1. 
           Complete_date	Closest	Collect_date1	Collect_date2	Collect_date3 mmddyy10.;
  format Complete_date	Closest	Collect_date1	Collect_date2	Collect_date3 mmddyy10.;
  input id 	Pos_test1	Pos_test2	Pos_test3	Complete_date	Closest	Collect_date1	Collect_date2	Collect_date3;
datalines;
ID1	1	1	.	5/21/2021	6/20/2021	6/20/2021	6/10/2021	.
ID2	1	.	.	3/2/2021	5/10/2021	5/10/2021	.	        .
ID3	0	0	1	6/4/2021	7/10/2021	7/10/2021	6/20/2021	6/22/2021
;

Now a question/comment: Why is the Collectdate_1 later then Collect_date2 but Collect_date_3 is later than 2 but before 1?

I think that if your collect_dates were in chronologic order (with the positive values correlating) this would be an extremely simple exercise.

 

As it is, you could use another array that holds the differences and search that.

If I understand your requirement this may be one way.

data want;
   set have;
   array pt (*) pos_test: ;
   array cd (*) collect_date: ;
   /* to hold the differences, since new variables need explicit size*/
   array dd (3) ; 
   do i=1 to dim(pt);
      if pt[i]=1 then dd[i]=(abs(cd[i] - complete_date));
   end;
   if min(of dd(*))> 0 then pos = whichn(min(of dd(*)),of dd(*));
   if pos>0 then new_closest= cd[pos];
format new_closest mmddyy10.; run;

The dd array holds the date differences only when the test is positive. Absolute values are used to get "closest".

The Whichn, and character version Whichc, returns the first position that the value in the first position is found occurs in a list of following values. The "if min(of " bit of code is to only try to find a minimum position when there is at least one positive test.

I also create a new_closest so you can actually compare with the current value.

 

I really didn't understand why you were restricting

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 870 views
  • 1 like
  • 3 in conversation