Solved: Re: Need to call a SQL query in a data step

hamed_majidi_gmail_com · Posted 08-24-2017 08:21 PM

Hi,

In order to do some data preparation, I need to run a program like the following, where I can use a SQL SELECT statement within a data step. I am not sure whether this is something doable is SAS or I need to use a higher-level language (e.g., Java) for that.

I appreciate if you could help me out.

//update: I want to assign different values to each hospital visit of a patient based on the time it has passed between their last discharge and the current visit. If they visit much earlier than the hitherto data set average, they will get a larger number showing they are riskier.

I have attached a sample data with desired output.

Thanks for your great support!

data in;
set out;
by patient_sk;

if first.patient_sk then prob = (select avg(readmission_time) from out where discharged_dt_tm <= date1); // date1 is the value of the date_var in the current record
else prob = (admitted_dt_tm - lag(discharged_dt_tm)) / (select avg(readmission_time) from out where discharged_dt_tm <= date1);
run;

Tom · Posted 08-24-2017 10:57 PM

Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.

You should be able to just calculate the mean value BEFORE starting the data step. If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups. Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).

proc sql ;
create table middle as
  select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
  from have a
  order by x, date_var
;
quit;

Now you can use a data step to calculate your new Y variable. Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.

data want ;
   set middle;
   by x;
   y = divide(z - lag(z),avg_z );
   if first.x then y = avg_z ;
run;

View solution in original post

Astounding · Posted 08-24-2017 08:50 PM

My SQL isn't strong enough to do this, but I think I can give you an approach, based on the strengths of SQL vs. a DATA step.

First, append the average z value (respecting the date conditions) to every observation. PROC SQL should be able to do that.

Then go through a DATA step to compute y.

hamed_majidi_gmail_com · Posted 08-24-2017 10:41 PM

Hi,

Thanks for helping. I like your idea, but I think that approach will still need a combination of Proc and data steps. The select statement is not one statement; it varies based on where in the data set the current processing is.

Reeza · Posted 08-24-2017 10:45 PM

Post sample data and expected output please. This is straightforward in multiple steps, but I assume your purpose here is to do it in one step?

hamed_majidi_gmail_com · Posted 08-24-2017 11:13 PM

Thank you Reeza. I just updated the question with a sample.

I don't care in how many steps the job is done. If I can create the final data set, I will be happy!

PGStats · Posted 08-24-2017 10:03 PM

Is the average of z for date_ var <= date1 supposed to be taken over the whole datsaset or only within the current x group?

and are x groups sorted by date_var ?

PG

hamed_majidi_gmail_com · Posted 08-24-2017 10:38 PM

Average of z should be taken over the whole data set (in fact a subset of the whole data set that matches with the WHERE clause).

Yes, x groups are sorted by date_var (in ascending order).

Tom · Posted 08-24-2017 10:57 PM

Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.

You should be able to just calculate the mean value BEFORE starting the data step. If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups. Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).

proc sql ;
create table middle as
  select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
  from have a
  order by x, date_var
;
quit;

Now you can use a data step to calculate your new Y variable. Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.

data want ;
   set middle;
   by x;
   y = divide(z - lag(z),avg_z );
   if first.x then y = avg_z ;
run;

hamed_majidi_gmail_com · Posted 08-25-2017 03:39 PM

Awesome! Thanks for sharing your knowledge.

PGStats · Posted 08-24-2017 11:23 PM

Here is the sample data for those who prefer csv data

ENCOUNTER_ID,PATIENT_SK,DISCHARGED_DT_TM,ADMITTED_DT_TM,readmission_time,prob(desired output)
410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18
410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16
205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28
51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17
50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07
51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19

PG

PGStats · Posted 08-24-2017 11:50 PM

No simple way to do this in a single step. My closest approximation is to define a view:

data out;
infile datalines dsd;
informat DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT.;
format DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT19.;
input ENCOUNTER_ID PATIENT_SK DISCHARGED_DT_TM ADMITTED_DT_TM readmission_time;
z = 10*_n_; /* Some value */
datalines;
410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18
410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16
205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28
51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17
50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07
51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
;

proc sql;
create view outView as
select 
	a.PATIENT_SK,
	a.ADMITTED_DT_TM,
	mean(b.z) as avgz
from 
	out as a inner join
	out as b on b.ADMITTED_DT_TM <= a.ADMITTED_DT_TM
group by a.PATIENT_SK, a.ADMITTED_DT_TM;
quit;

data in;
merge out outview;
by PATIENT_SK ADMITTED_DT_TM;
lagz = lag(z);
if first.PATIENT_SK 
	then y = avgz;
	else y = (z-lagz) / avgz;
drop lagz avgz;
run;

proc sql; drop view outView; quit;

PG

hamed_majidi_gmail_com · Posted 08-25-2017 03:41 PM

Thank you so much PGStats. I really appreciate your help and sharing of your knowledge.

Registration is open

SAS Training: Just a Click Away