DATA Step, Macro, Functions and more

Grouping and manipulating data

Reply
New Contributor
Posts: 4

Grouping and manipulating data

[ Edited ]
Hi all,
I am trying to do a couple of things with the sample below and have search online a lot for it but it
isn't working. Will it be possible to do using data step?

1. I want to find the top two diagnosis(diag) with a table showing the count from the entire data set
per patient or simply from diag1 if not possible. Something like AA|3, BB|4.
Instead of a full list of all the diagnosis which will be thousands in a large data set.
Tried various freq and count commands but it doesn't work.

2. Couldn't figure this out but is it possible to find associated diagnosis per patient?
For example patients diagnosed with 'AA' also have 'JI' and patients with 'BB' also have 'GE'.

3. In a large data set will it be possible to link agegroups to diag1 , to see which diagnosis is common to each agegroup?

Thank you



data sample;
input MYDOB ADMIN$ SEX$ ID DIAG1$ DIAG2$ DIAG3$ DIAG4$ ;
DATALINES;
111924 KL M 1 AA JI GE .
121926 OH M 2 BB AA . GE
121982 KL F 3 BB . . AA
101980 KL M 4 AA . . .
111979 OH F 5 JI . AA .
101959 KL M 6 GE . . .
121999 OH F 7 BB . . .
102008 KL M 8 AA JI . .
112001 KL F 9 JI . . AA
102016 OH M 10 BB . GE .
RUN;


data age2;
set SAMPLE;
newdate1=mdy(12,31,2016);
format newdate1 mmyyn.;
run;


/* Calculate age*/
data cleaned (keep= ADMIN SEX ID DIAG1 diag2 diag3 diag4 AGE agegroup);
retain ID AGE ADMIN SEX DIAG1 DIAG2 DIAG3 DIAG4 agegroup; /* rearrange*/
set age2;
OPTIONS YEARCUTOFF=1917;
a = put(MYDOB,6.);
NEWDOB = input (a, mmddyy6.);
format NEWDOB MMYYN.;
age = INT(YRDIF(NEWDOB, newdate1,'ACTUAL'));
/*create age group*/
if age<=18 then agegroup ='Young';
else if 19<=age<64 then agegroup='Intemediate';
else if age >64 then agegroup= 'Retired';
run;

 

Super User
Posts: 22,823

Re: Grouping and manipulating data

Posted in reply to exbalterate

All of that's possible. What exactly is your question here though? 

 

Spoiler

exbalterate wrote:
Hi all,
I am trying to do a couple of things with the sample below and have search online a lot for it but it
isn't working. Will it be possible to do using data step?

1. I want to find the top two diagnosis(diag) with a table showing the count from the entire data set
per patient or simply from diag1 if not possible. Something like AA|3, BB|4.
Instead of a full list of all the diagnosis which will be thousands in a large data set.
Tried various freq and count commands but it doesn't work.

2. Couldn't figure this out but is it possible to find associated diagnosis per patient?
For example patients diagnosed with 'AA' also have 'JI' and patients with 'BB' also have 'GE'.

3. In a large data set will it be possible to link agegroups to diag1 , to see which diagnosis is common to each agegroup?

Thank you



data sample;
input MYDOB ADMIN$ SEX$ ID DIAG1$ DIAG2$ DIAG3$ DIAG4$ ;
DATALINES;
111924 KL M 1 AA JI GE .
121926 OH M 2 BB AA . GE
121982 KL F 3 BB . . AA
101980 KL M 4 AA . . .
111979 OH F 5 JI . AA .
101959 KL M 6 GE . . .
121999 OH F 7 BB . . .
102008 KL M 8 AA JI . .
112001 KL F 9 JI . . AA
102016 OH M 10 BB . GE .
RUN;


data age2;
set SAMPLE;
newdate1=mdy(12,31,2016);
format newdate1 mmyyn.;
run;


/* Calculate age*/
data cleaned (keep= ADMIN SEX ID DIAG1 diag2 diag3 diag4 AGE agegroup);
retain ID AGE ADMIN SEX DIAG1 DIAG2 DIAG3 DIAG4 agegroup; /* rearrange*/
set age2;
OPTIONS YEARCUTOFF=1917;
a = put(MYDOB,6.);
NEWDOB = input (a, mmddyy6.);
format NEWDOB MMYYN.;
age = INT(YRDIF(NEWDOB, newdate1,'ACTUAL'));
/*create age group*/
if age<=18 then agegroup ='Young';
else if 19<=age<64 then agegroup='Intemediate';
else if age >64 then agegroup= 'Retired';
run;

 


New Contributor
Posts: 4

Re: Grouping and manipulating data

How to do it? I cannot figure it out

 

Thanks

Super User
Posts: 22,823

Re: Grouping and manipulating data

Posted in reply to exbalterate

I would suggest simplifying it in several ways then. 

Please only include one question at a time. 

 

So it looks like you have medical data and are looking for the top diagnosis. Based on the sample data you've provided initially, and thank you for that, what would you expect as output?


It would also help if you include anything you've tried so far. We're happy to help but when you post your requirements and data but no specific questions on what you need help with it comes across as someone asking others to do the work. 

Trusted Advisor
Posts: 1,825

Re: Grouping and manipulating data

[ Edited ]

You asked for:

1) I want to find the top two diagnosis(diag) ... per patient

2) find associated diagnosis per patient

It is not clear for me:

  1)  what do you mean by top two diagnosis(diag) ... per patient ? there is only one observation (in the sample) per ID !

  2)  do you mean by associated diagnosis  as most frequent couple of diagnosis in patiants

 

I assume next code may give you a hint to achieve your target:

%let nwdate = 31DEC2016;
data sample; retain newdate1 "&nwdate"d; format newdate1 mmyyn.; input MYDOB ADMIN$ SEX$ ID DIAG1$ DIAG2$ DIAG3$ DIAG4$ ; DATALINES; 111924 KL M 1 AA JI GE . 121926 OH M 2 BB AA . GE 121982 KL F 3 BB . . AA 101980 KL M 4 AA . . . 111979 OH F 5 JI . AA . 101959 KL M 6 GE . . . 121999 OH F 7 BB . . . 102008 KL M 8 AA JI . . 112001 KL F 9 JI . . AA 102016 OH M 10 BB . GE . RUN; data cleaned (keep= ADMIN SEX ID DIAG AGE agegroup); retain ID AGE ADMIN SEX DIAG agegroup; /* rearrange*/ set sample; OPTIONS YEARCUTOFF=1917; a = put(MYDOB,6.); NEWDOB = input (a, mmddyy6.); format NEWDOB MMYYN.; age = INT(YRDIF(NEWDOB, newdate1,'ACTUAL')); /*create age group*/ if age<=18 then agegroup ='Young'; else if 19<=age<64 then agegroup='Intemediate'; else if age >64 then agegroup= 'Retired'; array dg $ diag1-diag4; do i=1 to dim(dg); diag = dg(i); if not missing(diag) then output; end; run; proc sort data=cleaned; by sex agegroup id; run; proc freq data=cleaned; by sex; /* optional */ table agegroup * diag; run;

pay attenstion:

  - I have combined your first two steps into one, while creating the sample dataset within the newdate1 variable.

  - The newdate is given as a macro variable assigned, preceeding  thr code.

  - I made the wide dataset into a long one, each observation with one DIAG field (instead diag1-diag4)

  - I believe there are better procedurs then PROC FREQ that I have used to calculate frequency and find

    couples of associated diagnosis, but I am not familiar with those procedures.

Ask a Question
Discussion stats
  • 4 replies
  • 165 views
  • 3 likes
  • 3 in conversation