Hi,
In order to do some data preparation, I need to run a program like the following, where I can use a SQL SELECT statement within a data step. I am not sure whether this is something doable is SAS or I need to use a higher-level language (e.g., Java) for that.
I appreciate if you could help me out.
//update: I want to assign different values to each hospital visit of a patient based on the time it has passed between their last discharge and the current visit. If they visit much earlier than the hitherto data set average, they will get a larger number showing they are riskier.
I have attached a sample data with desired output.
Thanks for your great support!
data in;
set out;
by patient_sk;
if first.patient_sk then prob = (select avg(readmission_time) from out where discharged_dt_tm <= date1); // date1 is the value of the date_var in the current record
else prob = (admitted_dt_tm - lag(discharged_dt_tm)) / (select avg(readmission_time) from out where discharged_dt_tm <= date1);
run;
Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.
You should be able to just calculate the mean value BEFORE starting the data step. If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups. Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).
proc sql ;
create table middle as
select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
from have a
order by x, date_var
;
quit;
Now you can use a data step to calculate your new Y variable. Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.
data want ;
set middle;
by x;
y = divide(z - lag(z),avg_z );
if first.x then y = avg_z ;
run;
My SQL isn't strong enough to do this, but I think I can give you an approach, based on the strengths of SQL vs. a DATA step.
First, append the average z value (respecting the date conditions) to every observation. PROC SQL should be able to do that.
Then go through a DATA step to compute y.
Hi,
Thanks for helping. I like your idea, but I think that approach will still need a combination of Proc and data steps. The select statement is not one statement; it varies based on where in the data set the current processing is.
Post sample data and expected output please. This is straightforward in multiple steps, but I assume your purpose here is to do it in one step?
Thank you Reeza. I just updated the question with a sample.
I don't care in how many steps the job is done. If I can create the final data set, I will be happy!
Is the average of z for date_ var <= date1 supposed to be taken over the whole datsaset or only within the current x group?
and are x groups sorted by date_var ?
Average of z should be taken over the whole data set (in fact a subset of the whole data set that matches with the WHERE clause).
Yes, x groups are sorted by date_var (in ascending order).
Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.
You should be able to just calculate the mean value BEFORE starting the data step. If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups. Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).
proc sql ;
create table middle as
select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
from have a
order by x, date_var
;
quit;
Now you can use a data step to calculate your new Y variable. Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.
data want ;
set middle;
by x;
y = divide(z - lag(z),avg_z );
if first.x then y = avg_z ;
run;
Awesome! Thanks for sharing your knowledge.
Here is the sample data for those who prefer csv data
ENCOUNTER_ID,PATIENT_SK,DISCHARGED_DT_TM,ADMITTED_DT_TM,readmission_time,prob(desired output) 410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18 410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16 205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28 51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17 50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07 51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
No simple way to do this in a single step. My closest approximation is to define a view:
data out;
infile datalines dsd;
informat DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT.;
format DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT19.;
input ENCOUNTER_ID PATIENT_SK DISCHARGED_DT_TM ADMITTED_DT_TM readmission_time;
z = 10*_n_; /* Some value */
datalines;
410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18
410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16
205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28
51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17
50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07
51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
;
proc sql;
create view outView as
select
a.PATIENT_SK,
a.ADMITTED_DT_TM,
mean(b.z) as avgz
from
out as a inner join
out as b on b.ADMITTED_DT_TM <= a.ADMITTED_DT_TM
group by a.PATIENT_SK, a.ADMITTED_DT_TM;
quit;
data in;
merge out outview;
by PATIENT_SK ADMITTED_DT_TM;
lagz = lag(z);
if first.PATIENT_SK
then y = avgz;
else y = (z-lagz) / avgz;
drop lagz avgz;
run;
proc sql; drop view outView; quit;
Thank you so much PGStats. I really appreciate your help and sharing of your knowledge.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.