Solved: Selection based on pairs of records within BY group where test occurre...

aldreilly · Posted 03-06-2017 10:46 AM

Hi,

I am processing blood lead tests. Below is a sample of the data:

Record_id	Collected_date	Pb_result	Days
LEAD1000	9/14/2007	9	.
LEAD1005	4/30/2007	5	.
LEAD1005	1/22/2008	7	267
LEAD1007	3/9/2007	6	.
LEAD1009	7/11/2007	17	.
LEAD1009	8/31/2007	22	51
LEAD1009	9/21/2007	10	21
LEAD1009	2/22/2008	15	154
LEAD1011	1/22/2007	6	.
LEAD1011	7/9/2007	7	168
LEAD1011	9/17/2007	10	70
LEAD1012	10/22/2007	8	.

The record_id is the unique id for each child in the dataset, collected_date is the date of collection/test, Pb_result is the result of the test, and Days is the number of days between tests for the same child. I have sorted the data by record_id and collected_date. I need to look for confirmatory pairs - tests where the the same child (record_id) received a blood lead test within 1 to 84 days of a previous test, and then select the test with the highest result. Record_id LEAD1009 had 2 tests within 51 days of each other and the second test, where the record for the 8/31/2007 test with Pb_result=22, should be output. The next pair would be the test performed on 8/31/2007 and 9/21/2007, where the 8/31/2007 test should again be output because it was the higher test result of the two.The third and fourth tests for LEAD1009 should not result in any output because the time between tests was greater than 84 days. If the child had only one test then I'd like to output those records to a different dataset.

I have attached an additional schematic of what I am trying to accomplish for further clarification.

I just can't seem to wrap my head around how to successfully code to achieve the results I am looking for. Thanks in advance for any help.

mkeintz · Posted 03-06-2017 01:36 PM

Are you saying that you want LEAD1009 8/31/2007 output 2 times?

Once because it is within 84 days of 7/11/2007 and has a higher pb?
Once because it is within 84 days of 9/21/2007 and has a higher pb?

If so, then this works:

data have;
  input Record_id :$8. Collected_date :mmddyy10. Pb_result Days ;
  format collected_date yymmddn8.;
datalines;
LEAD1000 9/14/2007 9 . 
LEAD1005 4/30/2007 5 . 
LEAD1005 1/22/2008 7 267 
LEAD1007 3/9/2007 6 . 
LEAD1009 7/11/2007 17 . 
LEAD1009 8/31/2007 22 51     
LEAD1009 9/21/2007 10 21 
LEAD1009 2/22/2008 15 154 
LEAD1011 1/22/2007 6 . 
LEAD1011 7/9/2007 7 168 
LEAD1011 9/17/2007 10 70     
LEAD1012 10/22/2007 8 . 
run;

data want (drop=nxt_:);
  set have ;
  by record_id;

  /* Compare to predecessor */
  if pb_result>lag(pb_result) and first.record_id=0 and days<=84 then output;

  /* ReCompare, this time to successor */
  merge have (keep=record_id)
        have (firstobs=2 keep=pb_result days rename=(pb_result=nxt_pb days=nxt_days));
  if pb_result>nxt_pb and last.record_id=0 and nxt_days<=84 then output;

run;

Notes;

Comparing to predessor is pretty straightforward, but one caveat In the IF statement make sure that the "pb_result>lag(pb_result) condition precedes the other conditions. Because all the conditions are connected by "and", sas is smart enough to stop checking once a failed condition is detected. But since we always want the lag queue (this queue have size 1) updated, it should be the first condition.
SAS has no lead functino. To the compare to successor uses the MERGE statement (notice no accompanying BY statement), where the data set is merged with itself, but the second HAVE has "firstobs=2" meaning that it reads one observation ahead of the first have. The "rename=" parameters is needed to avoid two records trying to populate the same variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 03-06-2017 01:36 PM

Are you saying that you want LEAD1009 8/31/2007 output 2 times?

Once because it is within 84 days of 7/11/2007 and has a higher pb?
Once because it is within 84 days of 9/21/2007 and has a higher pb?

If so, then this works:

data have;
  input Record_id :$8. Collected_date :mmddyy10. Pb_result Days ;
  format collected_date yymmddn8.;
datalines;
LEAD1000 9/14/2007 9 . 
LEAD1005 4/30/2007 5 . 
LEAD1005 1/22/2008 7 267 
LEAD1007 3/9/2007 6 . 
LEAD1009 7/11/2007 17 . 
LEAD1009 8/31/2007 22 51     
LEAD1009 9/21/2007 10 21 
LEAD1009 2/22/2008 15 154 
LEAD1011 1/22/2007 6 . 
LEAD1011 7/9/2007 7 168 
LEAD1011 9/17/2007 10 70     
LEAD1012 10/22/2007 8 . 
run;

data want (drop=nxt_:);
  set have ;
  by record_id;

  /* Compare to predecessor */
  if pb_result>lag(pb_result) and first.record_id=0 and days<=84 then output;

  /* ReCompare, this time to successor */
  merge have (keep=record_id)
        have (firstobs=2 keep=pb_result days rename=(pb_result=nxt_pb days=nxt_days));
  if pb_result>nxt_pb and last.record_id=0 and nxt_days<=84 then output;

run;

Notes;

Comparing to predessor is pretty straightforward, but one caveat In the IF statement make sure that the "pb_result>lag(pb_result) condition precedes the other conditions. Because all the conditions are connected by "and", sas is smart enough to stop checking once a failed condition is detected. But since we always want the lag queue (this queue have size 1) updated, it should be the first condition.
SAS has no lead functino. To the compare to successor uses the MERGE statement (notice no accompanying BY statement), where the data set is merged with itself, but the second HAVE has "firstobs=2" meaning that it reads one observation ahead of the first have. The "rename=" parameters is needed to avoid two records trying to populate the same variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Selection based on pairs of records within BY group where test occurred within defined time period

Re: Selection based on pairs of records within BY group where test occurred within defined time peri

Re: Selection based on pairs of records within BY group where test occurred within defined time peri

Catch up on SAS Innovate 2026

Selection based on pairs of records within BY group where test occurred within defined time period

Re: Selection based on pairs of records within BY group where test occurred within defined time peri

Re: Selection based on pairs of records within BY group where test occurred within defined time peri

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away