BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
cheroij
Fluorite | Level 6

Hello SAS experts,

I have a dataset and would like to identify an individual's true disease type based on the following:

 

  1. If an individual has Type A or B, select the record closest to the reference date based on service start or end date.
  2. If an individual only has "Unspecified" and no valid A or B, retain the "Unspecified" record.
  3. The reference date is the same for each individual.
  4. Each individual should appear only once in the final output dataset (ie one line per individual ID)

 

Below is the dataset I have and the expected output.

 

 

data WORK.SUBTYPE_SAMPLE;
infile datalines dsd truncover;
input ID:BEST12. Type:$12. Reference_date:DATE9. service_start:DATE9. service_end:DATE9.;
format ID BEST12. Reference_date DATE9. service_start DATE9. service_end DATE9.;
datalines;
1 A 04JAN2016 10JAN2016 21JAN2016
1 B 04JAN2016 09JUL2018 09NOV2019
1 Unspecified 04JAN2016 06JAN2016 10FEB2016
2 B 08JUN2019 08DEC2019 19DEC2019
2 Unspecified 08JUN2019 22OCT2019 09AUG2019
3 Unspecified 02FEB2017 02APR2017 15APR2017
4 A 01JAN2020 03MAR2020 24MAR2020
4 A 01JAN2020 05MAY2018 10MAY2018
4 Unnspecified 01JAN2020 02JAN2020 03JAN2020
5 A 09SEP2016 11NOV2016 15NOV2016
5 B 09SEP2016 09SEP2016 10NOV2016
6 A 03MAR2016 30AUG2016 02NOV2016
6 A 03MAR2016 14OCT2016 19OCT2016
6 A 03MAR2016 26MAR2016 19DEC2016
6 Unspecified 03MAR2016 20OCT2016 21OCT2016
6 Unspecified 03MAR2016 12DEC2016 28DEC2016
6 B 03MAR2016 28JUN2016 15AUG2016
7 B 10OCT2022 11OCT2022 14NOV2022
8 Unspecified 01JAN2019 05MAY2019 06MAY2019
8 Unspecified 01JAN2019 07MAY2019 08MAY2019
;;;;

 

 

Want:

Idtrue_type
1A
2B
3Unspecified
4A
5B
6A
7B
8Unspecified

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
quickbluefish
Barite | Level 11

The first part is just a slight re-work of your input dataset - it was producing all sorts of errors trying to read as it was:

data WORK.SUBTYPE_SAMPLE;
infile cards dsd truncover firstobs=1 dlm=',';
length ID $12 type $12 reference_date service_start service_end 4;
informat reference_date service_start service_end date9.;
format Reference_date DATE9. service_start DATE9. service_end DATE9.;
input ID Type Reference_date service_start service_end;
cards;
1,A,04JAN2016,10JAN2016,21JAN2016
1,B,04JAN2016,09JUL2018,09NOV2019
1,Unspecified,04JAN2016,06JAN2016,10FEB2016
2,B,08JUN2019,08DEC2019,19DEC2019
2,Unspecified,08JUN2019,22OCT2019,09AUG2019
3,Unspecified,02FEB2017,02APR2017,15APR2017
4,A,01JAN2020,03MAR2020,24MAR2020
4,A,01JAN2020,05MAY2018,10MAY2018
4,Unnspecified,01JAN2020,02JAN2020,03JAN2020
5,A,09SEP2016,11NOV2016,15NOV2016
5,B,09SEP2016,09SEP2016,10NOV2016
6,A,03MAR2016,30AUG2016,02NOV2016
6,A,03MAR2016,14OCT2016,19OCT2016
6,A,03MAR2016,26MAR2016,19DEC2016
6,Unspecified,03MAR2016,20OCT2016,21OCT2016
6,Unspecified,03MAR2016,12DEC2016,28DEC2016
6,B,03MAR2016,28JUN2016,15AUG2016
7,B,10OCT2022,11OCT2022,14NOV2022
8,Unspecified,01JAN2019,05MAY2019,06MAY2019
8,Unspecified,01JAN2019,07MAY2019,08MAY2019
;
run;

proc sort data=subtype_sample; by id; run;

data want;
set subtype_sample;
by ID;
length true_type $12 closest 4 anyAB 3;
retain true_type closest anyAB;
if first.ID then do;
	closest=10000;
	true_type='';
	anyAB=0;
end;
dist=min(
	abs(service_start-reference_date), 
	abs(service_end-reference_end)
	);
if type in ('A', 'B') then do;
	anyAB=1;
	if dist<closest then do;
		true_type=type;
		closest=dist;
	end;
end;
else if anyAB=0 then true_type='Unspecified';
if last.ID then output;
keep ID true_type closest;
run;

proc print data=want; run;

quickbluefish_0-1741309032284.png

 

View solution in original post

4 REPLIES 4
quickbluefish
Barite | Level 11

The first part is just a slight re-work of your input dataset - it was producing all sorts of errors trying to read as it was:

data WORK.SUBTYPE_SAMPLE;
infile cards dsd truncover firstobs=1 dlm=',';
length ID $12 type $12 reference_date service_start service_end 4;
informat reference_date service_start service_end date9.;
format Reference_date DATE9. service_start DATE9. service_end DATE9.;
input ID Type Reference_date service_start service_end;
cards;
1,A,04JAN2016,10JAN2016,21JAN2016
1,B,04JAN2016,09JUL2018,09NOV2019
1,Unspecified,04JAN2016,06JAN2016,10FEB2016
2,B,08JUN2019,08DEC2019,19DEC2019
2,Unspecified,08JUN2019,22OCT2019,09AUG2019
3,Unspecified,02FEB2017,02APR2017,15APR2017
4,A,01JAN2020,03MAR2020,24MAR2020
4,A,01JAN2020,05MAY2018,10MAY2018
4,Unnspecified,01JAN2020,02JAN2020,03JAN2020
5,A,09SEP2016,11NOV2016,15NOV2016
5,B,09SEP2016,09SEP2016,10NOV2016
6,A,03MAR2016,30AUG2016,02NOV2016
6,A,03MAR2016,14OCT2016,19OCT2016
6,A,03MAR2016,26MAR2016,19DEC2016
6,Unspecified,03MAR2016,20OCT2016,21OCT2016
6,Unspecified,03MAR2016,12DEC2016,28DEC2016
6,B,03MAR2016,28JUN2016,15AUG2016
7,B,10OCT2022,11OCT2022,14NOV2022
8,Unspecified,01JAN2019,05MAY2019,06MAY2019
8,Unspecified,01JAN2019,07MAY2019,08MAY2019
;
run;

proc sort data=subtype_sample; by id; run;

data want;
set subtype_sample;
by ID;
length true_type $12 closest 4 anyAB 3;
retain true_type closest anyAB;
if first.ID then do;
	closest=10000;
	true_type='';
	anyAB=0;
end;
dist=min(
	abs(service_start-reference_date), 
	abs(service_end-reference_end)
	);
if type in ('A', 'B') then do;
	anyAB=1;
	if dist<closest then do;
		true_type=type;
		closest=dist;
	end;
end;
else if anyAB=0 then true_type='Unspecified';
if last.ID then output;
keep ID true_type closest;
run;

proc print data=want; run;

quickbluefish_0-1741309032284.png

 

Ksharp
Super User
data WORK.SUBTYPE_SAMPLE;
infile cards dsd truncover firstobs=1 dlm=',';
length ID $12 type $12 reference_date service_start service_end 4;
informat reference_date service_start service_end date9.;
format Reference_date DATE9. service_start DATE9. service_end DATE9.;
input ID Type Reference_date service_start service_end;
cards;
1,A,04JAN2016,10JAN2016,21JAN2016,
1,B,04JAN2016,09JUL2018,09NOV2019,
1,Unspecified,04JAN2016,06JAN2016,10FEB2016
2,B,08JUN2019,08DEC2019,19DEC2019,
2,Unspecified,08JUN2019,22OCT2019,09AUG2019
3,Unspecified,02FEB2017,02APR2017,15APR2017
4,A,01JAN2020,03MAR2020,24MAR2020,
4,A,01JAN2020,05MAY2018,10MAY2018,
4,Unnspecified,01JAN2020,02JAN2020,03JAN2020
5,A,09SEP2016,11NOV2016,15NOV2016,
5,B,09SEP2016,09SEP2016,10NOV2016,
6,A,03MAR2016,30AUG2016,02NOV2016,
6,A,03MAR2016,14OCT2016,19OCT2016,
6,A,03MAR2016,26MAR2016,19DEC2016,
6,Unspecified,03MAR2016,20OCT2016,21OCT2016
6,Unspecified,03MAR2016,12DEC2016,28DEC2016
6,B,03MAR2016,28JUN2016,15AUG2016,
7,B,10OCT2022,11OCT2022,14NOV2022,
8,Unspecified,01JAN2019,05MAY2019,06MAY2019
8,Unspecified,01JAN2019,07MAY2019,08MAY2019
;
run;
proc sql;
create table want as
select distinct ID,type
 from SUBTYPE_SAMPLE
  where type in ('A' 'B')
   group by ID
    having abs(reference_date-service_start)=min(abs(reference_date-service_start))
union
select distinct ID,type
from (select * from SUBTYPE_SAMPLE group by ID having sum(type='Unspecified')=count(*))
group by ID
having abs(reference_date-service_start)=min(abs(reference_date-service_start))

order by ID
;

quit;
mkeintz
PROC Star

Given that the data are already sorted by ID, then you can read each ID group with all the 'A' and 'B' types preceding all 'Unspecified' types, via

 

data want;
  set have (where=(type in ('A','B')))
      have (where=(type ='Unspecified')) ;
  ...
;

Then:

 

data want;
  set have (where=(type in ('A','B')))
      have (where=(type ='Unspecified')) ;
  by id;
  retain min_dist .  true_type '           ' ;

  _dist=min(abs(service_start-reference_date)
           ,abs(service_end-reference_date));

  if first.id=1 then do;
    min_dist=_dist;
    true_type=type;
  end;
  else if type in ('A','B') and _dist<min_dist then do;
    true_type=type;
    min_dist=_dist;
  end;

  if last.id;
  keep id true_type ;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 828 views
  • 3 likes
  • 4 in conversation