After 35+ years of using data step programming I am trying to learn Proc SQL. I have something I've written and works but believe it can be streamlined.
The code below reads in a CSV file containing a list of tape volumes in a virtual tape library and the status of the volumes (ie: 'Cataloged' or 'Scratch'). The end goal is to merge that with a text file coming from the mainframe listing dataset names and tape volume and building control cards to delete the z/OS catalog entry for any tape volume in that mainframe listing that the virtual tape library says are already in 'scratch' status. I also had an interest in knowing how well compressed the data is on the virtual tape system.
I have not included the data step processing that follows this for getting that mainframe list and the merge. Again, it all works. The entire program is also attached.
Can you suggest how to streamline the four PROC SQL steps?
This is SAS V9.4.
filename CATLIST "z:\mainframe\GPSE\Secure Agent\Reports\LISTCAT_HLQ_GPSEBKP"; filename OUTTAPES "z:\Mainframe\GPSE\Secure Agent\Reports\LISTCAT_TAPES.txt"; filename BLDNSCR "z:\Mainframe\GPSE\Secure Agent\IDCAMS_NSCR.txt"; /* ---------------------------------------------------------------------------- */ /* Part-1: Get the tape volumes and status from a SecureAgent generated report. */ /* The SecureAgent generated report is read in by the PROC IMPORT. */ /* ---------------------------------------------------------------------------- */ /* Import the All Volumes report. Note it is a CSV with column headings on line-1 */ /* ---------------------------------------------------------------------------- */ Proc Import datafile = "z:\mainframe\GPSE\Secure Agent\Reports\All Volumes_20230203.csv" out = Work.DSPL_CSV dbms = csv replace; getnames = yes; run; /* ---------------------------------------------------------------------------- */ /* Read all fields. Sum the CompressedSize and UncompressedSize fields */ /* Also determines the compression ratio */ /* ---------------------------------------------------------------------------- */ Proc SQL; create table Work.DSPL_TBL as select * from Work.DSPL_CSV; quit; /* ---------------------------------------------------------------------------- */ /* Determine the mean(Ratio) which is the overall compression ratio. */ /* ---------------------------------------------------------------------------- */ Proc SQL; title 'Sum of compression ratio'; create table Work.Compressed_Ratio as select sum(CompressedSize) as All_CompressedSize, sum(UncompressedSize) as All_UncompressedSize, TapeStatus from Work.DSPL_TBL where CompressedSize > 0 group by TapeStatus order by TapeStatus; quit; proc SQL; title 'Mean of Compressed and Uncompressed Sizes and Compression Ratios'; create table Work.Ratio as select All_CompressedSize, All_UncompressedSize, TapeStatus, (All_UncompressedSize / All_CompressedSize) as Ratio, mean(calculated Ratio) as Overall_Ratio from Work.Compressed_Ratio group by TapeStatus order by TapeStatus ; Quit; proc SQL; title 'Mean of Compressed and Uncompressed Sizes and Compression Ratios'; create table Work.Overall_Ratio as select All_CompressedSize, All_UncompressedSize, TapeStatus, Overall_Ratio, mean (Ratio) as Total_Ratio from Work.Ratio ; Quit;
Note that the above code works but I suspect it can all be done in one or two procs rather than the four I have here.
I'm unfamiliar what your end result looks like, and I kind of whipped this one up quickly - I'm not sure if it will throw errors on some of this. Please let me know if this gets you closer - I think using some subqueries would generally help in this situation.
proc sql;
create table single_query as
select distinct
a.All_CompressedSize
, a.All_UncompressedSize
, a.TapeStatus
, a.Overall_Ratio
from
(
select
sum(CompressedSize) as All_CompressedSize
, sum(UncompressedSize) as All_UncompressedSize
, (calculated All_UncompressedSize / calculated All_CompressedSize) as Ratio
, TapeStatus
, mean(calculated Ratio) as Overall_Ratio
from
work.DSPL_TBL
where
CompressedSize > 0
group by
TapeStatus
) a
order by
a.TapeStatus
;
quit;
create table single_query as select distinct a.ALL_CompressedSize , a.All_UncompressedSize , a.tapeStatus , a.Overall_Ratio from ( select sum(CompressedSize) as All_CompressedSize , sum(UncompressedSize) as All_UncompressedSize , (Calculated All_UncompressedSize / Calculated All_CompressedSize) as Ratio , TapeStatus , mean(calculated Ratio) as Overall_Ratio from Work.DSPL_CSV where CompressedSize > 0 group by TapeStatus ) a order by TapeStatus ; quit;
Results in:
ERROR: Summary functions nested in this way are not supported.
ERROR: Summary functions nested in this way are not supported.
In my original code I wind up with Work.Overall_Ratio which had two records containing five fields: All_CompressedSize, All_UncompressedSize, TapeStatus, Overall_Ratio, and Total_Ratio.
The two records are summary records based on the value of TapeStatus.
I've tried something else:
Proc SQL; create table Work.DSPL_TBL1 as select *, sum(CompressedSize) as All_CompressedSize, sum(UncompressedSize) as All_UncompressedSize, mean(calculated All_CompressedSize, mean(calculated All_UncompressedSize) from Work.DSPL_CSV where CompressedSize > 0 group by TapeStatus order by TapeStatus; quit; When this runs I get the same error from above: ERROR: Summary functions nested in this way are not supported. ERROR: Summary functions nested in this way are not supported.
I've seen that error in many tickets out here but I'm not understanding how to do the nesting.
If you want to combine detail and summarised data then do it as a join on a sub-query:
Proc SQL;
create table Work.DSPL_TBL1 as
select *
from Work.DSPL_CSV as A
left join
(select TapeStatus,
sum(CompressedSize) as All_CompressedSize,
sum(UncompressedSize) as All_UncompressedSize,
mean(calculated All_CompressedSize) as as All_CompressedSize_Mean,
mean(calculated All_UncompressedSize) as All_UncompressedSize_Mean
from Work.DSPL_CSV
where CompressedSize > 0
group by TapeStatus
) as B
on A.TapeStatus = B.TapeStatus
order by A.TapeStatus;
quit;
Means of ratios when each is calculated such as
All_UncompressedSize / All_CompressedSize)
are unreliable and often just plain wrong unless the denominators are the same, or in the case of large numbers of values at least pretty similar. See if the output of your shown code makes sense when this is the input data set.
data Work.DSPL_TBL; input tapestatus Compressedsize uncompressedsize; datalines; 1 999 1000 1 999 1000 1 999 1000 1 999 1000 1 999 1000 1 999 1000 1 999 1000 1 999 1000 1 999 1000 2 1 2 ;
My gut feeling is that the TOTAL_Ratio that results is way low.
What you might look at:
Proc summary data= Work.DSPL_TBL; class tapestatus; var Compressedsize uncompressedsize; output out=demo sum= ; run;
This sums your variables, leaving the same variable name. There will be a variable _type_ that has values of 0 and 1. A short lokk will show that Type=0 has a summary of ALL the records and the _type_=1 has just each tape status.
Then calculate the ratios. This will show the components used for the ratio.
data ratios; set demo; ratio = uncompressedsize / Compressedsize; run;
Note: You would be better off writing a data step instead of the Proc Import. A minor change to a file can result in your imported data having issues such as variable name changes, variable types, lengths of character variables and with real headache generating changes things like date variables become datetime (or vice versa) without notice until something later fails.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.