Solved: Re: Filter records after a certain date found on another record

loishaggard · Posted 12-08-2020 06:43 PM

I have test result data for patients. Once a patient tests positive (result=1) any future tests for that patient should be deleted from the file.

So I know I need to identify the record with result=1 and grab that date, but I'm at a loss for how to apply that date within patient ID to delete records with a higher date value.

Thanks in advance.

PATIENT_ID	_SPECIMEN_COLL_DT	RESULT
1007215	23-Sep-20	2
1007215	25-Sep-20	2
1007215	27-Sep-20	1
1007215	1-Nov-20	2	delete
1007215	10-Nov-20	2	delete
1007215	14-Nov-20	2	delete
1007215	30-Nov-20	2	delete
1007215	6-Dec-20	9	delete
1342677	19-Oct-20	2
1342677	7-Nov-20	2
1345898	25-Aug-20	9
1345898	27-Sep-20	2
1345898	11-Oct-20	2
1345898	23-Oct-20	1
1345898	30-Oct-20	2	delete
1345898	6-Nov-20	2	delete
1345898	22-Nov-20	2	delete
1345898	29-Nov-20	2	delete
1345898	6-Dec-20	2	delete
1408252	21-Sep-20	2
1439221	25-Oct-20	2
1439245	8-Oct-20	2
1442703	21-Oct-20	2
1451575	24-Aug-20	2
1451575	24-Nov-20	2
1462769	9-Sep-20	2
1469679	15-Sep-20	2
1469679	3-Oct-20	2
1469679	7-Oct-20	1
1469679	3-Nov-20	2	delete
1502202	17-Aug-20	2
1502202	29-Aug-20	2
1502202	19-Sep-20	2
1502202	28-Sep-20	2
1502202	13-Nov-20	2
1502202	21-Nov-20	2
1502202	6-Dec-20	2
1557673	20-Aug-20	2
1560593	30-Aug-20	2
1569074	11-Sep-20	2
1571807	15-Sep-20	2
1602114	13-Sep-20	2
1743463	7-Jul-20	2
1751405	7-Jun-20	2
1751737	14-Jul-20	2
1760386	27-Jul-20	2
1760410	28-Jul-20	2
2022188	8-Nov-20	2
2076478	11-Nov-20	2
2077550	6-Oct-20	3
2078033	30-Nov-20	2
2079325	5-Dec-20	9
2081383	26-Nov-20	2
2085676	24-Oct-20	2
2086149	6-Dec-20	2
2086958	4-Dec-20	2
2088387	18-Sep-20	9
2088387	17-Nov-20	1
2088387	8-Dec-20	2	delete
2093450	8-Dec-20	2
2094033	28-Nov-20	2
2094452	21-Oct-20	2
2097616	29-Oct-20	2
2111066	20-Nov-20	2
2112099	18-Sep-20	2
2114632	6-Oct-20	2
2118460	25-Sep-20	2
2118460	29-Sep-20	1
2118460	10-Nov-20	2	delete
2118460	20-Nov-20	2	delete
2118460	26-Nov-20	2	delete
2118460	4-Dec-20	2	delete
2118460	6-Dec-20	2	delete
2133328	30-Nov-20	2
2133571	13-Oct-20	2
2134005	8-Dec-20	2
2134473	29-Nov-20	2
2153388	11-Nov-20	2
2166498	20-Nov-20	2
2167076	5-Dec-20	2
2168477	8-Dec-20	2
2168481	8-Nov-20	2
2168517	8-Nov-20	2
2168548	20-Nov-20	2
2168564	2-Nov-20	2
2168574	15-Oct-20	2
2168601	5-Dec-20	2
2168638	4-Dec-20	2
2168642	6-Oct-20	2
2169406	8-Dec-20	2
2172274	21-Oct-20	2
2172274	20-Nov-20	1
2172274	4-Dec-20	9	delete
2174225	6-Dec-20	2
2176363	14-Nov-20	2
2177143	5-Oct-20	2
2177326	8-Dec-20	2
2178406	8-Dec-20	2
2178412	5-Oct-20	2
2178416	13-Nov-20	2

hhinohar · Posted 12-08-2020 07:46 PM

There was a similar case.
https://communities.sas.com/t5/SAS-Programming/How-to-delete-observations-based-on-dates-in-a-by-gro...

data have;
	infile datalines dlm="09"x;
	input PATIENT_ID _SPECIMEN_COLL_DT:anydtdte. RESULT;
	format _SPECIMEN_COLL_DT yymmdds10.;
	datalines;
1007215	23-Sep-20	2
1007215	25-Sep-20	2
1007215	27-Sep-20	1
1007215	1-Nov-20	2
1007215	10-Nov-20	2
1007215	14-Nov-20	2
1007215	30-Nov-20	2
1007215	6-Dec-20	9
1342677	19-Oct-20	2
1342677	7-Nov-20	2
1345898	25-Aug-20	9
1345898	27-Sep-20	2
1345898	11-Oct-20	2
1345898	23-Oct-20	1
1345898	30-Oct-20	2
1345898	6-Nov-20	2
1345898	22-Nov-20	2
1345898	29-Nov-20	2
1345898	6-Dec-20	2
1408252	21-Sep-20	2
1439221	25-Oct-20	2
1439245	8-Oct-20	2
1442703	21-Oct-20	2
1451575	24-Aug-20	2
1451575	24-Nov-20	2
1462769	9-Sep-20	2
1469679	15-Sep-20	2
1469679	3-Oct-20	2
1469679	7-Oct-20	1
1469679	3-Nov-20	2
1502202	17-Aug-20	2
1502202	29-Aug-20	2
1502202	19-Sep-20	2
1502202	28-Sep-20	2
1502202	13-Nov-20	2
1502202	21-Nov-20	2
1502202	6-Dec-20	2
1557673	20-Aug-20	2
1560593	30-Aug-20	2
1569074	11-Sep-20	2
1571807	15-Sep-20	2
1602114	13-Sep-20	2
1743463	7-Jul-20	2
1751405	7-Jun-20	2
1751737	14-Jul-20	2
1760386	27-Jul-20	2
1760410	28-Jul-20	2
2022188	8-Nov-20	2
2076478	11-Nov-20	2
2077550	6-Oct-20	3
2078033	30-Nov-20	2
2079325	5-Dec-20	9
2081383	26-Nov-20	2
2085676	24-Oct-20	2
2086149	6-Dec-20	2
2086958	4-Dec-20	2
2088387	18-Sep-20	9
2088387	17-Nov-20	1
2088387	8-Dec-20	2
2093450	8-Dec-20	2
2094033	28-Nov-20	2
2094452	21-Oct-20	2
2097616	29-Oct-20	2
2111066	20-Nov-20	2
2112099	18-Sep-20	2
2114632	6-Oct-20	2
2118460	25-Sep-20	2
2118460	29-Sep-20	1
2118460	10-Nov-20	2
2118460	20-Nov-20	2
2118460	26-Nov-20	2
2118460	4-Dec-20	2
2118460	6-Dec-20	2
2133328	30-Nov-20	2
2133571	13-Oct-20	2
2134005	8-Dec-20	2
2134473	29-Nov-20	2
2153388	11-Nov-20	2
2166498	20-Nov-20	2
2167076	5-Dec-20	2
2168477	8-Dec-20	2
2168481	8-Nov-20	2
2168517	8-Nov-20	2
2168548	20-Nov-20	2
2168564	2-Nov-20	2
2168574	15-Oct-20	2
2168601	5-Dec-20	2
2168638	4-Dec-20	2
2168642	6-Oct-20	2
2169406	8-Dec-20	2
2172274	21-Oct-20	2
2172274	20-Nov-20	1
2172274	4-Dec-20	9
2174225	6-Dec-20	2
2176363	14-Nov-20	2
2177143	5-Oct-20	2
2177326	8-Dec-20	2
2178406	8-Dec-20	2
2178412	5-Oct-20	2
2178416	13-Nov-20	2
;
run;

data want;
	*condition is set when result hits 1 or last patient;
	do until(last.patient_id | result=1);
		set have;
		by patient_id;
		*set _N_ as flag;
		if first.patient_id then _N_=0;
		*output observation until flag hits the condition;
		if ^_N_ then output;
	end;
run;

View solution in original post

ballardw · Posted 12-08-2020 06:54 PM

See if this gets you started;

data want;
    set have;
    by patient_id;
   retain deleteflag;
   if first.patient_id then call missing(deleteflag);
   if deleteflag=1 then delete;
   if result=1 then deleteflag=1;
   drop deleteflag;
run;

Assumes your data is sorted by Patient_id and the dates.

Retain will keep a value across iterations of a data step.

When you use a By statement then SAS creates automatic variables First. and Last. for each variable on the by statement that indicate whether the current record is the first or last value for that variable within a by group. So you reset values for the first record of the Patient_id with "If first.patient_id" (do pay attention to the presence of the dot between the key word and the variable name).

So we set a flag the first time a value of 1 is encountered. The next record has the flag set so the Delete removes the record.

loishaggard · Posted 12-08-2020 07:06 PM

Thanks. I'll give it a try!

mkeintz · Posted 12-08-2020 07:20 PM

Assuming that data are sorted by _SPECIMENT_COOL_DT for each PATIENT_ID, and that no PATIENT_ID has more than one result=1 record, this self-match-merge with a rename will do what you want:

data want (drop=_cutoff_date);
  merge have (where=(result=1) rename=(_SPECIMEN_COLL_DT=_cutoff_date))
        have;
  by patient_id;
  if _SPECIMEN_COLL_DT<=_cutoff_date;
run;

And if a patient_id has no result=1 records, then the entire patient's history will be kept.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hhinohar · Posted 12-08-2020 07:46 PM

There was a similar case.
https://communities.sas.com/t5/SAS-Programming/How-to-delete-observations-based-on-dates-in-a-by-gro...

data have;
	infile datalines dlm="09"x;
	input PATIENT_ID _SPECIMEN_COLL_DT:anydtdte. RESULT;
	format _SPECIMEN_COLL_DT yymmdds10.;
	datalines;
1007215	23-Sep-20	2
1007215	25-Sep-20	2
1007215	27-Sep-20	1
1007215	1-Nov-20	2
1007215	10-Nov-20	2
1007215	14-Nov-20	2
1007215	30-Nov-20	2
1007215	6-Dec-20	9
1342677	19-Oct-20	2
1342677	7-Nov-20	2
1345898	25-Aug-20	9
1345898	27-Sep-20	2
1345898	11-Oct-20	2
1345898	23-Oct-20	1
1345898	30-Oct-20	2
1345898	6-Nov-20	2
1345898	22-Nov-20	2
1345898	29-Nov-20	2
1345898	6-Dec-20	2
1408252	21-Sep-20	2
1439221	25-Oct-20	2
1439245	8-Oct-20	2
1442703	21-Oct-20	2
1451575	24-Aug-20	2
1451575	24-Nov-20	2
1462769	9-Sep-20	2
1469679	15-Sep-20	2
1469679	3-Oct-20	2
1469679	7-Oct-20	1
1469679	3-Nov-20	2
1502202	17-Aug-20	2
1502202	29-Aug-20	2
1502202	19-Sep-20	2
1502202	28-Sep-20	2
1502202	13-Nov-20	2
1502202	21-Nov-20	2
1502202	6-Dec-20	2
1557673	20-Aug-20	2
1560593	30-Aug-20	2
1569074	11-Sep-20	2
1571807	15-Sep-20	2
1602114	13-Sep-20	2
1743463	7-Jul-20	2
1751405	7-Jun-20	2
1751737	14-Jul-20	2
1760386	27-Jul-20	2
1760410	28-Jul-20	2
2022188	8-Nov-20	2
2076478	11-Nov-20	2
2077550	6-Oct-20	3
2078033	30-Nov-20	2
2079325	5-Dec-20	9
2081383	26-Nov-20	2
2085676	24-Oct-20	2
2086149	6-Dec-20	2
2086958	4-Dec-20	2
2088387	18-Sep-20	9
2088387	17-Nov-20	1
2088387	8-Dec-20	2
2093450	8-Dec-20	2
2094033	28-Nov-20	2
2094452	21-Oct-20	2
2097616	29-Oct-20	2
2111066	20-Nov-20	2
2112099	18-Sep-20	2
2114632	6-Oct-20	2
2118460	25-Sep-20	2
2118460	29-Sep-20	1
2118460	10-Nov-20	2
2118460	20-Nov-20	2
2118460	26-Nov-20	2
2118460	4-Dec-20	2
2118460	6-Dec-20	2
2133328	30-Nov-20	2
2133571	13-Oct-20	2
2134005	8-Dec-20	2
2134473	29-Nov-20	2
2153388	11-Nov-20	2
2166498	20-Nov-20	2
2167076	5-Dec-20	2
2168477	8-Dec-20	2
2168481	8-Nov-20	2
2168517	8-Nov-20	2
2168548	20-Nov-20	2
2168564	2-Nov-20	2
2168574	15-Oct-20	2
2168601	5-Dec-20	2
2168638	4-Dec-20	2
2168642	6-Oct-20	2
2169406	8-Dec-20	2
2172274	21-Oct-20	2
2172274	20-Nov-20	1
2172274	4-Dec-20	9
2174225	6-Dec-20	2
2176363	14-Nov-20	2
2177143	5-Oct-20	2
2177326	8-Dec-20	2
2178406	8-Dec-20	2
2178412	5-Oct-20	2
2178416	13-Nov-20	2
;
run;

data want;
	*condition is set when result hits 1 or last patient;
	do until(last.patient_id | result=1);
		set have;
		by patient_id;
		*set _N_ as flag;
		if first.patient_id then _N_=0;
		*output observation until flag hits the condition;
		if ^_N_ then output;
	end;
run;

novinosrin · Posted 12-08-2020 09:44 PM


data have;
	input PATIENT_ID _SPECIMEN_COLL_DT :date9. RESULT;
	format _SPECIMEN_COLL_DT date9.;
	datalines;
1007215	23-Sep-20	2
1007215	25-Sep-20	2
1007215	27-Sep-20	1
1007215	1-Nov-20	2
1007215	10-Nov-20	2
1007215	14-Nov-20	2
1007215	30-Nov-20	2
1007215	6-Dec-20	9
1342677	19-Oct-20	2
1342677	7-Nov-20	2
1345898	25-Aug-20	9
1345898	27-Sep-20	2
1345898	11-Oct-20	2
1345898	23-Oct-20	1
1345898	30-Oct-20	2
1345898	6-Nov-20	2
1345898	22-Nov-20	2
1345898	29-Nov-20	2
1345898	6-Dec-20	2
1408252	21-Sep-20	2
1439221	25-Oct-20	2
1439245	8-Oct-20	2
1442703	21-Oct-20	2
1451575	24-Aug-20	2
1451575	24-Nov-20	2
1462769	9-Sep-20	2
1469679	15-Sep-20	2
1469679	3-Oct-20	2
1469679	7-Oct-20	1
1469679	3-Nov-20	2
1502202	17-Aug-20	2
1502202	29-Aug-20	2
1502202	19-Sep-20	2
1502202	28-Sep-20	2
1502202	13-Nov-20	2
1502202	21-Nov-20	2
1502202	6-Dec-20	2
1557673	20-Aug-20	2
1560593	30-Aug-20	2
1569074	11-Sep-20	2
1571807	15-Sep-20	2
1602114	13-Sep-20	2
1743463	7-Jul-20	2
1751405	7-Jun-20	2
1751737	14-Jul-20	2
1760386	27-Jul-20	2
1760410	28-Jul-20	2
2022188	8-Nov-20	2
2076478	11-Nov-20	2
2077550	6-Oct-20	3
2078033	30-Nov-20	2
2079325	5-Dec-20	9
2081383	26-Nov-20	2
2085676	24-Oct-20	2
2086149	6-Dec-20	2
2086958	4-Dec-20	2
2088387	18-Sep-20	9
2088387	17-Nov-20	1
2088387	8-Dec-20	2
2093450	8-Dec-20	2
2094033	28-Nov-20	2
2094452	21-Oct-20	2
2097616	29-Oct-20	2
2111066	20-Nov-20	2
2112099	18-Sep-20	2
2114632	6-Oct-20	2
2118460	25-Sep-20	2
2118460	29-Sep-20	1
2118460	10-Nov-20	2
2118460	20-Nov-20	2
2118460	26-Nov-20	2
2118460	4-Dec-20	2
2118460	6-Dec-20	2
2133328	30-Nov-20	2
2133571	13-Oct-20	2
2134005	8-Dec-20	2
2134473	29-Nov-20	2
2153388	11-Nov-20	2
2166498	20-Nov-20	2
2167076	5-Dec-20	2
2168477	8-Dec-20	2
2168481	8-Nov-20	2
2168517	8-Nov-20	2
2168548	20-Nov-20	2
2168564	2-Nov-20	2
2168574	15-Oct-20	2
2168601	5-Dec-20	2
2168638	4-Dec-20	2
2168642	6-Oct-20	2
2169406	8-Dec-20	2
2172274	21-Oct-20	2
2172274	20-Nov-20	1
2172274	4-Dec-20	9
2174225	6-Dec-20	2
2176363	14-Nov-20	2
2177143	5-Oct-20	2
2177326	8-Dec-20	2
2178406	8-Dec-20	2
2178412	5-Oct-20	2
2178416	13-Nov-20	2
;
run;

proc sql;
 create table want as
 select * 
 from have
 group by PATIENT_ID
 having _SPECIMEN_COLL_DT<=max((result=1)*_SPECIMEN_COLL_DT)
 order by PATIENT_ID, _SPECIMEN_COLL_DT;
quit;

loishaggard · Posted 12-09-2020 12:35 PM

Thanks so much, everyone, But so far, no go.

The mkeintz novinosrin solutions deleted appropriate records for any patient who had had a positive test, and also all records for patients who had not had any positive test. Those latter records need to be maintained in the dataset.

The hhinohar solution yielded a dataset with 33 records with missing values for every variable on every record. There was a problem reading in PATIENT_ID. I'll look more into it.

The solution, using the data provided, should delete the 18 records marked with "delete," yielding a dataset with 82 records. It should maintain all the records for patient ID that never had a positive result, as well as the records with the positive results, and the records for patients with a positive result for which the specimen was collected before the date of the positive result.

Thanks again for you efforts!

Classroom Training Available!