Hi,
In order to do some data preparation, I need to run a program like the following, where I can use a SQL SELECT statement within a data step. I am not sure whether this is something doable is SAS or I need to use a higher-level language (e.g., Java) for that.
I appreciate if you could help me out.
//update: I want to assign different values to each hospital visit of a patient based on the time it has passed between their last discharge and the current visit. If they visit much earlier than the hitherto data set average, they will get a larger number showing they are riskier.
I have attached a sample data with desired output.
Thanks for your great support!
data in;
set out;
by patient_sk;
if first.patient_sk then prob = (select avg(readmission_time) from out where discharged_dt_tm <= date1); // date1 is the value of the date_var in the current record
else prob = (admitted_dt_tm - lag(discharged_dt_tm)) / (select avg(readmission_time) from out where discharged_dt_tm <= date1);
run;
Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.
You should be able to just calculate the mean value BEFORE starting the data step. If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups. Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).
proc sql ;
create table middle as
select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
from have a
order by x, date_var
;
quit;
Now you can use a data step to calculate your new Y variable. Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.
data want ;
set middle;
by x;
y = divide(z - lag(z),avg_z );
if first.x then y = avg_z ;
run;
My SQL isn't strong enough to do this, but I think I can give you an approach, based on the strengths of SQL vs. a DATA step.
First, append the average z value (respecting the date conditions) to every observation. PROC SQL should be able to do that.
Then go through a DATA step to compute y.
Hi,
Thanks for helping. I like your idea, but I think that approach will still need a combination of Proc and data steps. The select statement is not one statement; it varies based on where in the data set the current processing is.
Post sample data and expected output please. This is straightforward in multiple steps, but I assume your purpose here is to do it in one step?
Thank you Reeza. I just updated the question with a sample.
I don't care in how many steps the job is done. If I can create the final data set, I will be happy!
Is the average of z for date_ var <= date1 supposed to be taken over the whole datsaset or only within the current x group?
and are x groups sorted by date_var ?
Average of z should be taken over the whole data set (in fact a subset of the whole data set that matches with the WHERE clause).
Yes, x groups are sorted by date_var (in ascending order).
Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.
You should be able to just calculate the mean value BEFORE starting the data step. If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups. Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).
proc sql ;
create table middle as
select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
from have a
order by x, date_var
;
quit;
Now you can use a data step to calculate your new Y variable. Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.
data want ;
set middle;
by x;
y = divide(z - lag(z),avg_z );
if first.x then y = avg_z ;
run;
Awesome! Thanks for sharing your knowledge.
Here is the sample data for those who prefer csv data
ENCOUNTER_ID,PATIENT_SK,DISCHARGED_DT_TM,ADMITTED_DT_TM,readmission_time,prob(desired output) 410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18 410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16 205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28 51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17 50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07 51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
No simple way to do this in a single step. My closest approximation is to define a view:
data out;
infile datalines dsd;
informat DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT.;
format DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT19.;
input ENCOUNTER_ID PATIENT_SK DISCHARGED_DT_TM ADMITTED_DT_TM readmission_time;
z = 10*_n_; /* Some value */
datalines;
410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18
410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16
205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28
51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17
50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07
51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
;
proc sql;
create view outView as
select
a.PATIENT_SK,
a.ADMITTED_DT_TM,
mean(b.z) as avgz
from
out as a inner join
out as b on b.ADMITTED_DT_TM <= a.ADMITTED_DT_TM
group by a.PATIENT_SK, a.ADMITTED_DT_TM;
quit;
data in;
merge out outview;
by PATIENT_SK ADMITTED_DT_TM;
lagz = lag(z);
if first.PATIENT_SK
then y = avgz;
else y = (z-lagz) / avgz;
drop lagz avgz;
run;
proc sql; drop view outView; quit;
Thank you so much PGStats. I really appreciate your help and sharing of your knowledge.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.