Proc Sql Create Table Error

pkantak · Posted 12-27-2016 09:40 PM

Hi SAS Experts,

I'm trying to create a table which has various columns including a column (Patients_Enrolled_Per_Week) that consists of a count of another column that is constrained between two date macrovariables (datebegin & dateend). I'm getting an error which I'll post below. I've been trying to figure out why this is happening but I'm stumped. I'm attaching my code, the error, and an example of what I want my data to look like. Please let me know if you have answers! I'm at my wit's end!!

Thanks much!!

Pete

/* program for creating enrollment table*/
%let date=20161218;
%let datebegin=12/19/2016;
%let dateend=12/22/2016;
/*import csv*/
 proc import out=work.enrollmentbysubject datafile="Z:\ADAPTABLE\Mytrus\Pickups from Mytrus\&date.\enrollmentbysubject.csv" 
		dbms=csv replace;
	getnames=yes;
	datarow=2;
	guessingrows=20000;
run;
/*delete space in CDRN column*/
data work.enrollmentbysubject;
	set work.enrollmentbysubject;
	if compress(cats(CDRN), '.')=' ' then
		delete;
run;
/*convert datetimes in column "Randomized" to SAS datetimes*/
data work.temp; set work.enrollmentbysubject;
Randomized_num=input(Randomized,anydtdtm23.);
Format Randomized_num DateTime23.3;
run;
/*separate SAS datetime to SAS date with format MM/DD/YYYY*/
data work.temp2; set work.temp;
Randomized_Date=datepart(Randomized_num);
Format Randomized_date MMDDYYS10.;
run;
/*create table that I want*/
proc sql noprint feedback;
create table work.dailyenrollmenttable_&date. AS
	select distinct cdrn, site,
/*count of Golden Tickets entered*/
(count(Invitation_Code_Used)) AS Golden_Tickets_Entered,
/*count of Randomized Patients*/
(count(Randomized)) AS Enrolled,
/*percentage enrolled per golden ticket entered*/
(Count(Randomized)) / (Count(Invitation_Code_Used)) AS Percent_Enrolled format percent8.2,
/*number of enrolled patients per week*/
(count(Randomized_date)) AS Patients_Enrolled_Per_week
from Work.temp2
where Randomized_date between "&datebegin." and "&dateend."
Group By site;
quit;

Error Message:

690 proc sql noprint feedback;
691 create table work.dailyenrollmenttable_&date. AS
692 select distinct cdrn, site,
693 /*count of Golden Tickets entered*/
694 (count(Invitation_Code_Used)) AS Golden_Tickets_Entered,
695 /*count of Randomized Patients*/
696 (count(Randomized)) AS Enrolled,
697 /*percentage enrolled per golden ticket entered*/
698 (Count(Randomized)) / (Count(Invitation_Code_Used)) AS Percent_Enrolled format percent8.2,
699 /*number of enrolled patients per week*/
700 (count(Randomized_date)) AS Patients_Enrolled_Per_week
701 from Work.temp2
702 where Randomized_date between "&datebegin." and "&dateend."
703 Group By site;
ERROR: Expression using IN has components that are of different data types.
NOTE: The IN referred to may have been transformed from an OR to an IN at some point during
PROC SQL WHERE clause optimization.
NOTE: Statement transforms to:

select distinct TEMP2.CDRN, TEMP2.Site, COUNT(TEMP2.Invitation_Code_Used) as
Golden_Tickets_Entered, COUNT(TEMP2.Randomized) as Enrolled, COUNT(TEMP2.Randomized) /
COUNT(TEMP2.Invitation_Code_Used) as Percent_Enrolled format=PERCENT8.2,
COUNT(TEMP2.Randomized_Date) as Patients_Enrolled_Per_week
from WORK.TEMP2
where (TEMP2.Randomized_Date between '12/19/2016' and '12/22/2016')
group by TEMP2.Site;

704 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

Example Desired Output:

Site ID	CDRN	Site	Golden Tickets Entered	Enrolled	% Enrolled Per Golden Ticket Entered	# Enrolled (12/19/16-12/27/16)
A1	CDRN1	Site1	1000	700	70%	6
A2	CDRN2	Site2	2000	1000	50%	2

mkeintz · Posted 12-27-2016 11:06 PM

Your sql has generated the expression

where (TEMP2.Randomized_Date between '12/19/2016' and '12/22/2016')

But randommised_date is numeric and the limits are character.

So try this:

Redefine your macrovars to:

%let datebegin=19dec2016;
%let dateend=22dec2016;

Then change your code from

where Randomized_date between "&datebegin." and "&dateend."

to

where Randomized_date between "&datebegin"d and "&dateend"d

This will filter randomized_date to a range between two date literals (i.e. like "19dec2016"d and "22dec2016"d). You'll now be comparing randomized_date to numeric values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

pkantak · Posted 12-28-2016 07:16 PM

Hi Mkeintz,

Thanks much for your reply! I truly appreciate it. Unfortunately, your suggestions didn't work for me. It was much the same as Laurie's suggestion below (i.e.: returned empty datasets). I responded to her message with some example data. I really hope that can help in answering my question!

Thanks so much,

Pete

LaurieF · Posted 12-27-2016 11:27 PM

Without seeing your CSV file (if you could attach it, that would be great!), it's hard to be precise. But I know immediately what the ERROR condition relates to - you haven't told the where clause that you're comparing randomized_date with two date constants.

Change the top of your code to:

%let datebegin = %sysevalf('19dec2016'd);
%let dateend = %sysevalf('22dec2016'd);

and the where clause to:

where Randomized_date between &datebegin and &dateend

(note - no quotes!)

and Bob's your auntie.

That's how I'm skinning this particular cat.

PS: Change the data steps to this, just to save unnecessary processing)

data work.enrollmentbysubject;
	set work.enrollmentbysubject;
	if compress(cats(CDRN), '.')=' ' then
		delete;
	Randomized_date = datepart(input(Randomized, anydtdtm23.));
    format Randomized_date yymmdds10.;
run;

Also, may I ask why you've got a distinct keyword in your select clause? I'm always a bit concerned when I see that, because it always has such a tendency to cover up data irregularities (and slows the processing down). It'd be better to remove that and append cdrn to your group by clause.

This is how I'd do it:

proc sql noprint feedback;
create table work.dailyenrollmenttable_&date. AS
	select cdrn, site,
           (count(Invitation_Code_Used)) AS Golden_Tickets_Entered,
           (count(Randomized)) AS Enrolled,
           calculated golden_tickets_entered / calculated enrolled as percent_enrolled format=percent8.2,
           (count(Randomized_date)) AS Patients_Enrolled_Per_week
      from enrollmentbysubject
     where Randomized_date between &datebegin and &dateend
    Group By cdrn, site;
quit;

But because I can't see the data, I can't be sure it'll give you what you want.

pkantak · Posted 12-28-2016 07:14 PM

Hi Laurie,

Thanks so much for your reply, I really appreciate it! I tried your suggestions, and unfortunately it didn't work out for me. It keeps returning empty datasets (i.e.: It creates all the columns I want, but there are no values for the different variables). I can't attach the data, as there is sensitive patient info, but I can add example data. It is listed below. I only want to count the randomized column if it has a datetime variable and is between my set macrovariables. Please keep in mind, the data below is raw (with some hidden columns that contained the sensitive info) and before using datepart on the randomized column. I'm also pasting in the log when I ran it.

Example data:

CDRN	Site	Randomized
C1 NYC	Cornell	12/19/2016 17:18
C1 NYC	Columbia	12/20/2016 9:37
C2 Mid Atlantic	Vanderbilt University	12/20/2016 19:51
C2 Mid Atlantic	Vanderbilt University	12/19/2016 16:47
C3 Florida	UF
C3 Florida	UF	12/22/2016 18:52
C4 California	UC Irvine	12/22/2016 9:48
C4 California	UC Irvine

247 /*convert datetimes in column "Randomized" to SAS datetimes*/
248 data work.enrollmentbysubject;
249 set work.enrollmentbysubject;
250 if compress(cats(CDRN), '.')=' ' then
251 delete;
252 Randomized_date = datepart(input(Randomized, anydtdtm23.));
253 format Randomized_date yymmdds10.;
254 run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
883 at 252:23
NOTE: There were 1578 observations read from the data set WORK.ENROLLMENTBYSUBJECT.
NOTE: The data set WORK.ENROLLMENTBYSUBJECT has 1577 observations and 28 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

255 /*create table that I want*/
256 proc sql noprint feedback;
257 create table work.dailyenrollmenttable_&date. AS
258 select cdrn, site,
259 (count(Invitation_Code_Used)) AS Golden_Tickets_Entered,
260 (count(Randomized)) AS Enrolled,
261 calculated golden_tickets_entered / calculated enrolled as percent_enrolled
261! format=percent8.2,
262 (count(Randomized_date)) AS Patients_Enrolled_Per_week
263 from enrollmentbysubject
264 where Randomized_date between &datebegin and &dateend
265 Group By cdrn, site;
NOTE: Statement transforms to:

select ENROLLMENTBYSUBJECT.CDRN, ENROLLMENTBYSUBJECT.Site,
COUNT(ENROLLMENTBYSUBJECT.Invitation_Code_Used) as Golden_Tickets_Entered,
COUNT(ENROLLMENTBYSUBJECT.Randomized) as Enrolled, calculated Golden_Tickets_Entered /
calculated Enrolled as percent_enrolled format=PERCENT8.2,
COUNT(ENROLLMENTBYSUBJECT.Randomized_date) as Patients_Enrolled_Per_week
from WORK.ENROLLMENTBYSUBJECT
where (ENROLLMENTBYSUBJECT.Randomized_date between 20807 and 20810)
group by ENROLLMENTBYSUBJECT.CDRN, ENROLLMENTBYSUBJECT.Site;

NOTE: Table WORK.DAILYENROLLMENTTABLE_20161218 created, with 0 rows and 6 columns.

266 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

Does that help any?

Thanks again for your help!

Pete

mkeintz · Posted 12-28-2016 07:30 PM

You have the note:

NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):Column).
883 at 252:23

referring to this statement:

252 Randomized_date = datepart(input(Randomized, anydtdtm23.));

So out of 1,578 total records read, this happens 883 times. Is that to be expected? If not, I'd solve that problem first, as it may be the reason your final data set is empty. After all, you have eliminated the data type error, and you now have a logic error - i.e. your filter, which counts and evaluates randomized_date, is unintentionally removing cases you presumably know to be in the data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

LaurieF · Posted 12-28-2016 07:31 PM

Hmm - try this - it works for me:

%let date=20161218;
%let datebegin = %sysevalf('19dec2016'd);
%let dateend = %sysevalf('22dec2016'd);

data enrollmentbysubject;
infile cards dsd dlm=',' missover;
attrib cdrn length=$ 20;
attrib site length=$ 30;
attrib randomized length=8. informat=anydtdtm23. format=datetime23.;
input cdrn
      site
      randomized;
cards;
C1 NYC,Cornell,12/19/2016 17:18
C1 NYC,Columbia,12/20/2016 9:37
C2 Mid Atlantic,Vanderbilt University,12/20/2016 19:51
C2 Mid Atlantic,Vanderbilt University,12/19/2016 16:47
C3 Florida,UF	 
C3 Florida,UF,12/22/2016 18:52
C4 California,UC Irvine,12/22/2016 9:48
C4 California,UC Irvine	 
;
run;

data enrollmentbysubject;
set enrollmentbysubject;
attrib randomized_date length=4 format=yymmdds10.;
if strip(cdrn) notin('.', ' '); 
Randomized_date = datepart(Randomized);
if ranuni(225465114) < .5 then
   invitation_code_used = 'Y';
run;

proc sql noprint feedback;
create table work.dailyenrollmenttable_&date. AS
	select cdrn, site,
           (count(Invitation_Code_Used)) AS Golden_Tickets_Entered,
           (count(Randomized)) AS Enrolled,
           calculated golden_tickets_entered / calculated enrolled as percent_enrolled format=percent8.2,
           (count(Randomized_date)) AS Patients_Enrolled_Per_week
      from enrollmentbysubject
     where Randomized_date between &datebegin and &dateend
    group by cdrn, site;
quit;

Note that I've randomly assigned 'Y' to invitation_code_used.

Additionally, noting we are two countries divided by a common language, Laurie is a male name…

pkantak · Posted 01-04-2017 11:11 AM

Hi Laurie,

I'm sorry for the late reply, with holiday travel etc I got a bit behind on things. Hope you had a great New Year!

I'm SO SORRY. Thank you so much for correcting me.

Based on your code, I formulated the code below, which works for me. However, there are some small problems:

/* program for creating enrollment table*/
%let date=20170102;
%let datebegin = %sysevalf('26dec2016'd);
%let dateend = %sysevalf('02jan2017'd);
/*import csv*/
 proc import out=work.enrollmentbysubject datafile="Z:\ADAPTABLE\Mytrus\Pickups from Mytrus\&date.\enrollmentbysubject.csv" 
		dbms=csv replace;
	getnames=yes;
	datarow=2;
	guessingrows=20000;
run;

data work.enrollmentbysubject;
	set work.enrollmentbysubject;
	if compress(cats(CDRN), '.')=' ' then
		delete;
run;
/*convert datetimes in column "Randomized" to SAS datetimes*/
data work.enrollmentbysubject;
	set work.enrollmentbysubject;
	Randomized_date = datepart(input(Randomized, anydtdtm23.));
    format Randomized_date yymmdds10.;
run;
/*create table that I want*/
proc sql noprint feedback;
create table work.dailyenrollmenttable_&date. AS
	select distinct cdrn, site,
           count(Invitation_Code_Used) AS Golden_Tickets_Entered,
           count(Randomized) AS Enrolled,
           (calculated enrolled / calculated golden_tickets_entered) as percent_enrolled format=percent8.2
	from work.enrollmentbysubject
	 Group by site;
quit;
proc sql noprint feedback;
create table work.patientsenrolledperweek_&date. AS
select distinct cdrn, site, count(Randomized_date) AS Patients_Enrolled_Per_week
from work.enrollmentbysubject
where Randomized_date between &datebegin and &dateend
group by site;
quit;

1) if I move the below statements into my first proc sql statement, then the table that is returned just has the patients enrolled per week. Meaning, for example Penn St has 88 Golden Tickets Entered, 28 enrolled, and 31.82% enrolled. But if I add the Patients_Enrolled_Per Week column it will put that value in all of the columns. Do you know why this is? So the way I got around that was just by making 2 separate tables and than manually inputting the "patients enrolled per week" into the other table. Doesn't seem like the most efficient thing in the world...

count(Randomized_date) AS Patients_Enrolled_Per_week

where Randomized_date between &datebegin and &dateend

2) Lets say for discussion purposes, that I have 15 sites that I'm gathering enrollment data for (meaning 15 rows in my table). If one site, say Montefiore Medical Center, does not enroll any patients that week, how do I populate the Patients_Enrolled_Per_Week table with a "0". Right now it only is showing me sites that do have a positive count.

3) In your code below, I don't really understand the point of your

if ranuni(225465114) < .5 then
   invitation_code_used = 'Y';

in your data step.

Thanks again for helping me out with this, you've been a lifesaver!

Pete

LaurieF · Posted 01-05-2017 05:06 PM

Don't apologise. Nobody expect anybody to do anything over this period. Especially in New Zealand. It's New Year doesn't officially (not really) end until about the 20th - I've had people say "Happy New Year" when I meet them for the first time in February! (Idiots)

I'm going to cover the two easy ones first:

coalesce(count(Randomized_date), 0) AS Patients_Enrolled_Per_week

The coalesce (in this context) doesn't kick in until after the group by has been processed.

if ranuni(225465114) < .5 then
   invitation_code_used = 'Y';

I didn't have access to your invitation_code_used, so I threw a coin to see if it came down Y. 225465114 is an old (UK) phone number of mine - I use it as a random seed when I want to reproduce results.

As for your other problem: if I interpret what you're asking correctly, the first block of code woudl be getting all Patients_Enrolled_Per_Week over all enrollments for all time per site; the second block is only calculating them for that week.

What you could do, if you don't need dailyenrollmenttable_&date for anything else, is to create it as a view.

I'm still a little concerned about the distinct keyword - I don't think it's doing you any favours.

pkantak · Posted 01-06-2017 09:50 AM

Oh wow that's interesting. I didn't know that! And I definitely didn't get anything done in that time frame.

Regarding the "if ranuni" I understand. For some reason it didn't click in my head that you were creating data for SAS since you didn't have my excel file.......sorry about that!

Ok so I changed the "distinct" to "group by cdrn, site" and the works fine.

As far as:

coalesce(count(Randomized_date), 0) AS Patients_Enrolled_Per_week

That didn't change anything in the patientsenrolledperweek_&date. table. If I understand the function of "coalesce" it makes a priority the count of the Randomized_date variable, and then if there is a null value (i.e.: they don't have any patients ) it will choose from the second argument (i.e.: in this case 0). Your logic makes sense to me, but I'm not sure why its still only returning non-zero values.

You're correct. The dailyenrollmenttable_&date is a table that essentially shows a snapshot of how many patients were randomized by each site up until the day that I pull the data. And then the patients_enrolled_per_week is just a snapshot of that week for all sites.

Unfortunately the dailyenrollmenttable_&date is actually the main file that I need. I just need the patientsenrolledperweek_&date ammended as the last column. My brute force method would just be to create two tables with the same number of rows and just merge them before I export to an excel file. But that only works if I can get the coalesce to work.

Any thoughts would as always, be greatly appreciated!

Thanks,

Pete

LaurieF · Posted 01-06-2017 05:06 PM

The coalesce probably made no difference because randomized_date has no missing values. If in fact there are, I'm puzzled.

I think making the perweek dataset a subset of the master is the best way to go - set up all the data correctly in one, then do the fancy work in the subset.

Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Re: Proc Sql Create Table Error

Registration is open

SAS Training: Just a Click Away