Howdy,
I work at a data company and pull data all day :-). Some of our tables are large and the initial data pull can take a while. Is there a more efficient way than PROC SQL?
Here is a typical data pull.
Jeff
PROC SQL;
CREATE TABLE WORK.TECH_01 AS
SELECT t1.industrysegment AS ProdLvl_1,
t1.categorygroup AS ProdLvl_2,
t1.category AS ProdLvl_3,
t1.subcategory AS ProdLvl_4,
t1.itemnumberunsuppressed AS iItem_uns,
t1.outletfamily,
t1.outlet_storeid,
t1.ppmonth,
t1.ppweek,
/* Dollars */
(SUM(t1.totalvalue)) AS Dollars,
/* Units */
(SUM(t1.unitssold)) AS Units
FROM TECSWP2.vw_tecmlwk_fact_nc_uns t1
WHERE t1.subcategory IN
(
60638,
60640
) AND t1.totalvalue >= 1 AND t1.unitssold >= 1 AND t1.ppmonth BETWEEN &Month-23 and &Month
GROUP BY t1.industrysegment,
t1.categorygroup,
t1.category,
t1.subcategory,
t1.itemnumberunsuppressed,
t1.outletfamily,
t1.outlet_storeid,
t1.ppmonth,
t1.ppweek;
QUIT;
It is possible to redo your SQL as a data step. It may even be faster, if your database server is slow at calculating sums, and the connection between the server and your SAS machine is fast:
%let byvars=industrysegment categorygroup category subcategory itemnumberunsuppressed outletfamily outlet_storeid ppmonth ppweek;
data WORK.TECH_01;
set TECSWP2.vw_tecmlwk_fact_nc_uns(keep=&byvars totalvalue unitssold);
WHERE subcategory IN (60638,60640)
AND totalvalue >= 1 AND unitssold >= 1 AND ppmonth BETWEEN &Month-23 and &Month;
by &byvars;
if first.%scan(&byvars,-1) then do;
Dollars=totalvalue;
Units=unitssold;
end;
else do;
Dollars+totalvalue;
Units+unitssold;
end;
if last.%scan(&byvars,-1);
drop totalvalue unitssold;
run;
But the data server still has to sort the data before delivering it to SAS, so there is no guarantee of a performance gain.
This wouldn't be a data step, at best it could be a proc means but you SQL would be as efficient especially if its pulling from an RDBMS system.
I'm curious as to how you're storing month such that this works....
&Month-23
proc means data=tecswp2.vw_tecmlwk_fact_nc_uns noprint NWAY;
where subcategory in (60638, 60640) and totalvalue>=1 and unitssold>=1 and ppmonth between &Month-23 and &Month;
class [list your group by variables here];
var totalvalue unitssold;
output out=tech_01 sum(totalValue) = Dollars sum(unitssold) = Units;
run;
You can rename the variables in the same step on the OUT statement as well:
output out=tech_01 (rename=(industrySegment = Prodlvl_1 categoryGroup=ProdLvl2)
sum(totalValue) = Dollars sum(unitssold) = Units;
That should be enough to get you going if you want to test out the performance difference.
From the look of your query you are reading from an external database. There is probably no point in converting this to a DATA step as behind the scenes SAS has to convert it back to SQL for external databases anyway. You could try using SQL passthru instead of the implicit SQL you are using.
For large data extracts you will be limited by the network bandwidth between the database and SAS anyway. One way to prove exactly how long the database takes is to run your query without transmitting lots of data is to turn it into a simple row count:
PROC SQL;
select count(*) from
(SELECT t1.industrysegment AS ProdLvl_1,
t1.categorygroup AS ProdLvl_2,
t1.category AS ProdLvl_3,
t1.subcategory AS ProdLvl_4,
t1.itemnumberunsuppressed AS iItem_uns,
t1.outletfamily,
t1.outlet_storeid,
t1.ppmonth,
t1.ppweek,
/* Dollars */
(SUM(t1.totalvalue)) AS Dollars,
/* Units */
(SUM(t1.unitssold)) AS Units
FROM TECSWP2.vw_tecmlwk_fact_nc_uns t1
WHERE t1.subcategory IN
(
60638,
60640
) AND t1.totalvalue >= 1 AND t1.unitssold >= 1 AND t1.ppmonth BETWEEN &Month-23 and &Month
GROUP BY t1.industrysegment,
t1.categorygroup,
t1.category,
t1.subcategory,
t1.itemnumberunsuppressed,
t1.outletfamily,
t1.outlet_storeid,
t1.ppmonth,
t1.ppweek
);
QUIT;
It is possible to redo your SQL as a data step. It may even be faster, if your database server is slow at calculating sums, and the connection between the server and your SAS machine is fast:
%let byvars=industrysegment categorygroup category subcategory itemnumberunsuppressed outletfamily outlet_storeid ppmonth ppweek;
data WORK.TECH_01;
set TECSWP2.vw_tecmlwk_fact_nc_uns(keep=&byvars totalvalue unitssold);
WHERE subcategory IN (60638,60640)
AND totalvalue >= 1 AND unitssold >= 1 AND ppmonth BETWEEN &Month-23 and &Month;
by &byvars;
if first.%scan(&byvars,-1) then do;
Dollars=totalvalue;
Units=unitssold;
end;
else do;
Dollars+totalvalue;
Units+unitssold;
end;
if last.%scan(&byvars,-1);
drop totalvalue unitssold;
run;
But the data server still has to sort the data before delivering it to SAS, so there is no guarantee of a performance gain.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.