Re: How to create variable based on nested for + if with proc sql

catkat96 · Posted 02-12-2021 09:08 AM

I'm trying to create a table from information on another table. So, on the first one, let's call it Data1, I have panel data (many observations/individuals for many days).

On the second one, which I called "DataOutcome", I am trying to get one row per observation, and summarize something on the other table: if the last date of an observation is later than some day, then a variable will take value 0, or 1 in any other case.

This is my code so far, it just gets the id's from another table and creates the date that would define this new variable's outcome.

Apologies if this is a very bad question, I'm quite new at SAS/SQL.

Here's a data snippet:

data OriginalData;
   infile datalines delimiter=',';
   input date $ agreement_number $ outcome;
   datalines;
05/11/2020,1000,Answered
06/11/2020,1001,Answered
08/11/2020,1002,Not Answered
08/11/2020,1003,Answered
;
data Data1;
   infile datalines delimiter=',';
   input agreement_number $ date;
   datalines;
1000, 06/11/2020
1000, 07/11/2020
1001, 06/11/2020
1001, 07/11/2020
1001, 08/11/2020
1001, 09/11/2020
1002, 08/11/2020
1003, 08/11/2020
1003, 08/11/2020
;

proc SQL;
create table DataOutcome as
select Agreement_Number,
	Date,
    Outcome,
	intnx("day",Date,2) as new_date format=ddmmyy11.

from work.OriginalData
where outcome = "Answered";

/* for each Agreement_Number
       if the last date in Data1 is between Date and new_date
           New_var = 1
*/
Group by 1;
Quit;

maguiremq · Posted 02-12-2021 09:12 AM

Please do provide a data snippet using DATALINES. This seems like it would be done more efficiently in a DATA step, but I can't really tell. I don't see a join condition anywhere in your code, either.

catkat96 · Posted 02-12-2021 09:35 AM

Included a datalines snippet on the original post.

I'm not using joins because I don't really need to join anything (I believe), I'm creating a completely new table... but perhaps there is a use for JOIN here that I'm not aware of.

utrocketeng · Posted 02-12-2021 09:57 AM

if i understand your objective correctly, i believe this code should do the trick:

data OriginalData;
   infile datalines delimiter=',';
   input date :ddmmyy10. agreement_number $ outcome $ ;
   format date ddmmyy10.;
   datalines;
05/11/2020,1000,Answered
06/11/2020,1001,Answered
08/11/2020,1002,Not Answered
08/11/2020,1003,Answered
;

data Data1;
   infile datalines delimiter=',';
   input agreement_number $ date :ddmmyy10.;
   format date ddmmyy10.;
   datalines;
1000, 06/11/2020
1000, 07/11/2020
1001, 06/11/2020
1001, 07/11/2020
1001, 08/11/2020
1001, 09/11/2020
1002, 08/11/2020
1003, 08/11/2020
1003, 08/11/2020
;

PROC SQL;
CREATE TABLE work.dataMaxDate AS SELECT agreement_number, Max(Date) as MostRecentDate FROM work.data1 GROUP BY agreement_number;
CREATE TABLE work.final AS
SELECT 
	a.*
	,CASE when b.MostRecentDate is not null then 1 else 0 end as New_Var
FROM 
	originaldata a
	LEFT JOIN work.dataMaxDate b on a.agreement_number = b.agreement_number and b.MostRecentDate >= a.date and b.MostRecentDate < a.date+2
;
QUIT;

note that you want to look at the work.final table.

good luck

catkat96 · Posted 02-15-2021 07:08 AM

Hi, thanks for your reply!

I think it's kinda doing what I want sans the filtering by "Answered". I tried adding that but I can't get it to work, can you tell me where am I mistaken here?

PROC SQL;
CREATE TABLE work.dataMaxDate AS SELECT agreement_number, Max(Date) as MostRecentDate FROM work.data1 GROUP BY agreement_number;
CREATE TABLE work.final AS
SELECT 
	a.*
	,CASE when b.MostRecentDate is not null then 1 else 0 end as New_Var
FROM 
	originaldata a
	LEFT JOIN work.dataMaxDate (WHERE = (outcome = "Answered")) b on a.agreement_number = b.agreement_number and b.MostRecentDate >= a.date and b.MostRecentDate < a.date+2
;
QUIT;

maguiremq · Posted 02-12-2021 02:04 PM

I still think we're missing something. This code doesn't capture exactly what you put in your "OriginalData", but I don't know if you intend to count duplicate dates as "Answered". Some clarification there might help. You could also use PROC TRANSPOSE, I think. Nonetheless, PROC SQL:

proc sql;
	create table 	want as
		select		
					agreement_number,
					min(date) as min_date
						format = mmddyy10.,
					case
						when calculated min_date = max(date) then "Not Answered"
						else "Answered"
					end as outcome
		from		
					data1
		group by
					agreement_number
		having
					agreement_number = min(agreement_number);
quit;

I highly recommend that you format your code. It helps spot errors more easily, like your semicolon after your WHERE clause in your query. Some people probably think my formatting is a little overkill, but consistency is key. Edit: it looks as though the "OriginalData" and "Data1" don't have the same dates. I hope I'm not missing something obvious.

Thank you providing some example data.

catkat96 · Posted 02-15-2021 10:23 AM

Hi, thanks for the reply!

I think I might have not explained my problem properly. I have both OriginalData and Data1 already, I want to create a new table mixing info from both: I want to

-check OriginalData to identify those agreements that are "Answered"

-identify the date when that happened

-identify what is two days after that date (date + 2)

-check on the Data1 if those agreements' last date is larger than date + 2 (then assign value 0) or is between date and date + 2 (then assign value 1).

Is there anything I'm forgetting to make it clear?

How to create variable based on nested for + if with proc sql