BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hi,

In order to do some data preparation, I need to run a program like the following, where I can use a SQL SELECT statement within a data step. I am not sure whether this is something doable is SAS or I need to use a higher-level language (e.g., Java) for that.

I appreciate if you could help me out.

 

//update: I want to assign different values to each hospital visit of a patient based on the time it has passed between their last discharge and the current visit. If they visit much earlier than the hitherto data set average, they will get a larger number showing they are riskier.

I have attached a sample data with desired output.

 

Thanks for your great support! 

 

data in;
   set out;
   by patient_sk;

   if first.patient_sk then prob = (select avg(readmission_time) from out where discharged_dt_tm <= date1); // date1 is the value of the date_var in the current record
   else prob = (admitted_dt_tm - lag(discharged_dt_tm)) / (select avg(readmission_time) from out where discharged_dt_tm <= date1);
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.

You should be able to just calculate the mean value BEFORE starting the data step.  If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups.  Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).

 

proc sql ;
create table middle as
  select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
  from have a
  order by x, date_var
;
quit;

Now you can use a data step to calculate your new Y variable.   Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.

data want ;
   set middle;
   by x;
   y = divide(z - lag(z),avg_z );
   if first.x then y = avg_z ;
run;

View solution in original post

11 REPLIES 11
Astounding
PROC Star

My SQL isn't strong enough to do this, but I think I can give you an approach, based on the strengths of SQL vs. a DATA step.

 

First, append the average z value (respecting the date conditions) to every observation.  PROC SQL should be able to do that.

 

Then go through a DATA step to compute y.

hamed_majidi_gmail_com
Calcite | Level 5

Hi,

 

Thanks for helping. I like your idea, but I think that approach will still need a combination of Proc and data steps. The select statement is not one statement; it varies based on where in the data set the current processing is.

Reeza
Super User

Post sample data and expected output please. This is straightforward in multiple steps, but I assume your purpose here is to do it in one step?

hamed_majidi_gmail_com
Calcite | Level 5

Thank you Reeza. I just updated the question with a sample.

 

I don't care in how many steps the job is done. If I can create the final data set, I will be happy!

PGStats
Opal | Level 21

Is the average of z for date_ var <= date1 supposed to be taken over the whole datsaset or only within the current x group?

 

and are x groups sorted by date_var ?

PG
hamed_majidi_gmail_com
Calcite | Level 5

Average of z should be taken over the whole data set (in fact a subset of the whole data set that matches with the WHERE clause).

 

Yes, x groups are sorted by date_var (in ascending order).

Tom
Super User Tom
Super User

Let's change the names of the dataset to HAVE and WANT since your example code is using IN and OUT in a confusing way.

You should be able to just calculate the mean value BEFORE starting the data step.  If you really want to compare the value of Z to value of Z on the previous record then you need have a variable that orders the rows within the X groups.  Let's assume that you want to order them by DATE_VAR.(and that DATE_VAR is unique within values of X).

 

proc sql ;
create table middle as
  select a.*,(select mean(b.z) from have b where b.date_var <= a.date_var) as avg_z
  from have a
  order by x, date_var
;
quit;

Now you can use a data step to calculate your new Y variable.   Note that you should NOT use the LAG() function in conditionally executed code as it messes up how it keeps track of the previous values.

data want ;
   set middle;
   by x;
   y = divide(z - lag(z),avg_z );
   if first.x then y = avg_z ;
run;
hamed_majidi_gmail_com
Calcite | Level 5

Awesome! Thanks for sharing your knowledge.

PGStats
Opal | Level 21

Here is the sample data for those who prefer csv data

 

ENCOUNTER_ID,PATIENT_SK,DISCHARGED_DT_TM,ADMITTED_DT_TM,readmission_time,prob(desired output)
410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18
410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16
205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28
51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17
50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07
51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
PG
PGStats
Opal | Level 21

No simple way to do this in a single step. My closest approximation is to define a view:

 

data out;
infile datalines dsd;
informat DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT.;
format DISCHARGED_DT_TM ADMITTED_DT_TM E8601DT19.;
input ENCOUNTER_ID PATIENT_SK DISCHARGED_DT_TM ADMITTED_DT_TM readmission_time;
z = 10*_n_; /* Some value */
datalines;
410338226,10000174,2015-04-18T20:30:00,2015-04-13T13:42:00,,avg() of whole data set before 2015-04-18
410730615,10000174,2015-07-16T15:49:00,2015-07-13T15:43:00,85.80069444,85 / avg() before 2015-07-16
205494260,10000214,2012-07-28T18:17:00,2012-07-26T19:43:00,,avg() of whole data set before 2012-07-28
51023730,10000388,2007-02-17T19:18:00,2007-01-29T11:44:00,,avg() of whole data set before 2007-02-17
50974455,10000388,2007-03-07T19:35:00,2007-03-04T16:11:00,14.87013889,14.87 / avg() before 2007-03-07
51216578,10000388,2007-04-19T16:31:00,2007-04-09T06:12:00,32.44236111,32.44 / avg() before 2007-04-19
;

proc sql;
create view outView as
select 
	a.PATIENT_SK,
	a.ADMITTED_DT_TM,
	mean(b.z) as avgz
from 
	out as a inner join
	out as b on b.ADMITTED_DT_TM <= a.ADMITTED_DT_TM
group by a.PATIENT_SK, a.ADMITTED_DT_TM;
quit;

data in;
merge out outview;
by PATIENT_SK ADMITTED_DT_TM;
lagz = lag(z);
if first.PATIENT_SK 
	then y = avgz;
	else y = (z-lagz) / avgz;
drop lagz avgz;
run;

proc sql; drop view outView; quit; 
PG
hamed_majidi_gmail_com
Calcite | Level 5

Thank you so much PGStats. I really appreciate your help and sharing of your knowledge.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2156 views
  • 1 like
  • 5 in conversation