Hi everyone,
I have been struggling with some code. I have clinical data where each row represents one admission. I would like to restructure the data so that each row represents one patients. However, there is a number of tasks I would like to conduct during the process:
1. Patients specific information is mostly, but now always contained in the first admission. For example gender is always on the first admission, but the date of death (deathdate) can be in the row of any admission. That's why I would like to retain the value of deathdate when not missing !
2. I would like to keep some specific data from the first admission always such as tumor stage.
3. I would like to conduct some operations between values from different admissions for example: calculate the difference in duration between the two admission dates.
How can I do this most efficiently with the lowest number of new variables created, smallest number of DATA, DROP and RENAME statements?
Please find examples of the data I have and what I want further.
I used to use DO loop like this
data want; do until (Last.ID); set have; by ID; select (admission); when ('1') do; GenderNew = Gender; StageNew = stage; deathdate1 = deathdate; admission_date1 = admission_date; end; when ('2') do; deathdate2 = deathdate; admission_date2 = admission_date; end; otherwise; end; end; drop admission gender stage deathdate admission_date; run; data want; set want; format deathdate ddmmyy10.; rename GenderNew = Gender StageNew = Stage; Duration = admission_date2 - admission_date1; deathdate = max(deathdate2, deathdate1); drop admission_date1 admission_date2 deathdate1 deathdate2; run;
However, my method is annoying. Specially, that I need to create many new variables from the first observation instead of retaining them somehow. I have about 100 variables that I need to keep and it does not makes sense to make 100 new variables.
Is there a more efficient way?
Thanks in forward.
Data example:
data have;
input id admission gender $ stage admission_date deathdate;
format deathdate ddmmyy10.;
cards;
1 1 m 2 5000 .
1 2 . . 5100 6500
2 1 f 1 5600 6600
2 2 . . 5900 .
3 1 f 4 5627 .
3 2 . 3 5830 7000
3 3 . 1 6000 .
;
run;
data want;
input id gender stage Duration deathdate;
format deathdate ddmmyy10.;
cards;
1 m 2 100 6500
2 f 1 200 6600
3 f 4 373 7000
;
run;
If you have that nice ADMISSION variable to make it easy to find the first observation per ID then you can probably use UPDATE statement.
For example using your HAVE dataset we can do something like this:
data want;
update have(where=(admission=1))
have(rename=(stage=stage_last admission_date=admit_last))
;
by id;
run;
The UPDATE statement needs an original dataset that can have at most one observation per BY group. The transaction dataset is then read observation by observation and any non-missing value overwrites the current value.
By renaming STAGE and ADMISSION_DATE in the transaction dataset they will become new variables in the output and the original variable pulled from the original dataset (admission=1 record) will not be modified.
Results:
admission_ stage_ Obs id admission gender stage date deathdate last admit_last 1 1 2 M 2 1973-09-09 1977-10-18 2 1973-12-18 2 2 2 F 1 1975-05-02 1978-01-26 1 1976-02-26 3 3 2 F 4 1975-05-29 . 3 1975-12-18
if I have more than two admissions per patient? I mostly have one observation, but sometimes up to 7.
@Ubai wrote:
if I have more than two admissions per patient? I mostly have one observation, but sometimes up to 7.
It doesn't matter to the UPDATE statement. The end result is the one observation per by group with all of the updates applied.
Hello @Ubai
While there could be many approaches to solving your issue, I would consider applying database normalization techniques. Have a look at this good article from Microsoft https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description this will give a fairly good idea of this approach to your case.
Hi Tom,
I love your approach. It is very efficient. However, I have missed pointing out something in my data. I have multiple admissions up to seven per id and not just only two. I need to extract the date from each admission and create for example: admit_1 admit_2 admit_3.
Is there a way to add a where statement and extract those variables?
That's why I used the select and when statements in the code I am using.
Best
data have; input id admission gender $ stage admission_date deathdate; format deathdate ddmmyy10.; cards; 1 1 m 2 5000 . 1 2 . . 5100 6500 2 1 f 1 5600 6600 2 2 . . 5900 . 3 1 f 4 5627 . 3 2 . 3 5830 . ; run; proc stdize data=have out=temp missing=0 reponly;run; proc sql; create table want as select id, (select gender from temp where id=a.id and admission=1) as gender, (select stage from temp where id=a.id and admission=1) as stage, range(admission_date) as admission_date, range(deathdate) as deathdate from temp as a group by id; quit;
What do you think of this code?
data between; set have;
if admission = 2 then admission_2 = admission_date;
if admission = 3 then admission_3 = admission_date;
data want;
update between(where=(admission=1))
between(rename=(admission_date=admit_last))
;
by id;
Dur1 = admission_2 - admission_date;
Dur2 = admission_3 - admission_date;
drop admission_date admission_2 admission_3 admit_last;
run;
I wouldn't do that.
Just add an OUTPUT statement and you will get multiple observations instead.
Hello @Ubai
I would prefer dividing the original dataset into three parts (Normalization).
1.The first table say "Patients" contains the patient information with patient_id as the primary key.
2.Second Table say "Admissions" table with the other information. It has Admission_id as its primary key and patiend_id as the foreign key. It has details related to admissions and no patient information.
The only patient information would be the patient_id. It will have information about admission and anything connected to it.
3.A third table say adverse_event table, that records information of the patients death. It will have only two columns namely patient_id and date_of_death.
Thus patients who have died only will be listed here.
Thus you will be performing the analysis only on the much smaller second table (Admissions) and should be simpler.
In the final report the patient information can be incorporated to the extent needed.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.