BookmarkSubscribeRSS Feed
Ubai
Quartz | Level 8

Hi everyone,

 

I have been struggling with some code. I have clinical data where each row represents one admission. I would like to restructure the data so that each row represents one patients. However, there is a number of tasks I would like to conduct during the process:

1. Patients specific information is mostly, but now always contained in the first admission. For example gender is always on the first admission, but the date of death (deathdate) can be in the row of any admission. That's why I would like to retain the value of deathdate when not missing !

2. I would like to keep some specific data from the first admission always such as tumor stage.

3. I would like to conduct some operations between values from different admissions for example: calculate the difference in duration between the two admission dates.

 

How can I do this most efficiently with the lowest number of new variables created, smallest number of DATA, DROP and RENAME statements?

Please find examples of the data I have and what I want further.

 

I used to use DO loop like this

 

data want;
	do until (Last.ID);
set have;
by ID;
	select (admission);
		when ('1') do; GenderNew = Gender; StageNew = stage; deathdate1 = deathdate; admission_date1 = admission_date; end;
		when ('2') do; deathdate2 = deathdate; admission_date2 = admission_date; end;
	otherwise;
	end;
end;
drop admission gender stage deathdate admission_date;
run;

data want; set want;
format deathdate ddmmyy10.;
rename GenderNew = Gender StageNew = Stage;
Duration = admission_date2 - admission_date1;
deathdate = max(deathdate2, deathdate1);
drop admission_date1 admission_date2 deathdate1 deathdate2;
run;

 

 

However, my method is annoying. Specially, that I need to create many new variables from the first observation instead of retaining them somehow. I have about 100 variables that I need to keep and it does not makes sense to make 100 new variables. 

 

Is there a more efficient way?

 

Thanks in forward.

 

 

Data example:

data have;
input id admission gender $ stage admission_date deathdate;
format deathdate ddmmyy10.;
cards;
1 1 m 2 5000 .
1 2 . . 5100 6500
2 1 f 1 5600 6600
2 2 . . 5900 .
3 1 f 4 5627 .
3 2 . 3 5830 7000
3 3 . 1 6000 .
;
run;
data want;
input id gender stage Duration deathdate;
format deathdate ddmmyy10.;
cards;
1 m 2 100 6500
2 f 1 200 6600
3 f 4 373 7000
;
run;

 

9 REPLIES 9
Tom
Super User Tom
Super User

If you have that nice ADMISSION variable to make it easy to find the first observation per ID then you can probably use UPDATE statement.

For example using your HAVE dataset we can do something like this:

data want;
  update have(where=(admission=1))
         have(rename=(stage=stage_last admission_date=admit_last))
  ;
  by id;
run;

The UPDATE statement needs an original dataset that can have at most one observation per BY group.  The transaction dataset is then read observation by observation and any non-missing value overwrites the current value.

 

By renaming STAGE and ADMISSION_DATE in the transaction dataset they will become new variables in the output and the original variable pulled from the original dataset (admission=1 record) will not be modified.

 

Results:

                                             admission_                  stage_
Obs    id    admission    gender    stage       date        deathdate     last     admit_last

 1      1        2           M        2      1973-09-09    1977-10-18       2      1973-12-18
 2      2        2           F        1      1975-05-02    1978-01-26       1      1976-02-26
 3      3        2           F        4      1975-05-29             .       3      1975-12-18

Ubai
Quartz | Level 8

 if I have more than two admissions per patient? I mostly have one observation, but sometimes up to 7.

Tom
Super User Tom
Super User

@Ubai wrote:

 if I have more than two admissions per patient? I mostly have one observation, but sometimes up to 7.


It doesn't matter to the UPDATE statement.  The end result is the one observation per by group with all of the updates applied.

 

 

Sajid01
Meteorite | Level 14

Hello @Ubai 
While there could be many approaches to solving your issue, I would consider applying database normalization techniques. Have a look at this good article from Microsoft https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description  this will give a fairly good idea of this approach to your case.

 

Ubai
Quartz | Level 8

Hi Tom,

 

I love your approach. It is very efficient. However, I have missed pointing out something in my data. I have multiple admissions up to seven per id and not just only two. I need to extract the date from each admission and create for example: admit_1 admit_2 admit_3.

Is there a way to add a where statement and extract those variables?
That's why I used the select and when statements in the code I am using.

Best

Ksharp
Super User
data have;
input id admission gender $ stage admission_date deathdate;
format deathdate ddmmyy10.;
cards;
1 1 m 2 5000 .
1 2 . . 5100 6500
2 1 f 1 5600 6600
2 2 . . 5900 .
3 1 f 4 5627 .
3 2 . 3 5830 .
;
run;

proc stdize data=have out=temp missing=0 reponly;run;

proc sql;
create table want as
select id,
(select gender from temp where id=a.id and admission=1) as gender,
(select stage from temp where id=a.id and admission=1) as stage,
range(admission_date) as admission_date,
range(deathdate) as deathdate
 from temp as a
  group by id;
quit;
Ubai
Quartz | Level 8

What do you think of this code?

 

data between; set have;
if admission = 2 then admission_2 = admission_date;
if admission = 3 then admission_3 = admission_date;

data want;
  update between(where=(admission=1))
         between(rename=(admission_date=admit_last))
  ;
  by id;
Dur1 = admission_2 - admission_date;
Dur2 = admission_3 - admission_date;
drop admission_date admission_2 admission_3 admit_last;
run;
Tom
Super User Tom
Super User

I wouldn't do that.

Just add an OUTPUT statement and you will get multiple observations instead.

Sajid01
Meteorite | Level 14

Hello @Ubai 

I would prefer dividing the original dataset into three parts (Normalization).
1.The first table say "Patients" contains the patient information with patient_id as the primary key.
2.Second Table say "Admissions" table with the other information. It has Admission_id as its primary key and patiend_id as the foreign key. It has details related to admissions and no patient information.
The only patient information would be the patient_id. It will have information about admission and anything connected to it.
3.A third table say adverse_event table, that records information of the patients death. It will have only two columns namely patient_id and date_of_death.
Thus patients who have died only will be listed here.
Thus you will be performing the analysis only on the much smaller second table (Admissions) and should be simpler.
In the final report the patient information can be incorporated to the extent needed.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1134 views
  • 4 likes
  • 4 in conversation