BookmarkSubscribeRSS Feed
maaya88
Calcite | Level 5

I have a couple questions that I googled but cant really get a solid answer from my findings.

My Questions:

When should I use DATA step and when should I use PROC SQL? Almost certain I will get the answer "It depends ...", I would also like to know "on what?".

Would converting my current PROC SQL to DATA Step improve the runtime of my below code

What I am trying to accomplish: Create a table, details on the table are something like this:

Select tb1.v1, tb1.v2 ,... tb1.v8, tb2.v1, Case ... when ... end, tb3.v1, TRIM(tb4.v1) , ... (tb4.V2||tb3.v2) FROM tb1 Inner join tb2 on ... , left outer join tb3 on ..., left outer join tb4 on ... where ...

each table is 20 million rows.

process used to take 4 hours, before simplified, now it takes about 45 mins (I still think this is too long, maybe I am wrong)

3 REPLIES 3
Kurt_Bremser
Super User

If your SQL does a many-to-many join, you can not (easily) replicate it with a DATA step.

 

Replacing a multi-table join with a series of PROC SORT and DATA steps can improve performance. If your joins are lookups, you can combine those in a single DATA step through the use of hash objects (where the lookup tables are loaded into memory).

 

For more detailed help, we need to see the whole code as is, know the relationships between the tables with regard to the join keys, and know the sizes of the tables and involved variables.

 

Finally, if your tables reside in a remote, non-SAS relational database, you might be able to push processing into the DB.

LinusH
Tourmaline | Level 20

If you want help with optimising this specific query, please attach the full log, with OPTIONS FULLSTIMER; and _method option in PROC SQL.

SQL does lot of internal sorting so knowing your MEMSIZE and SORTSIZE options is also valuable,a s well as your systems actual available memory.

Data never sleeps
ballardw
Super User

One place that the data step really beats just writing the code, compared with SQL, is if you are doing the same step to multiple variables. The data step allows ARRAY declarations or variable lists to define a group of variables and then you can loop over them or use some of the functions such as SUM, MEAN and other functions by using notation such as: x= Sum( of arrayname(*) )  which would sum all of the variables in the array for that observation. If you have ever attempted to write such with 50 variable in sql you might appreciate that short hand. If your variables have nice names, such as sequentially numbered suffixes like Var1 to Var100 you can use lists in either array definition or many function calls: y  = sum (of Var1-Var100) for example adds all of those variables.

Or if the variables are adjacent columns in a data set use two --    between the variables to indicate a list.

Or if the variables have a common stem in the name you can use a colon to indicate "use all variables that start with" such as z = mean( of Meas_:  ).

Similar lists can also be used with data set names on a SET or Merge statement (though I would be very careful with merge). So if you want to stack all the data sets whose names start with ABC in a library it would look something like:

data want;
   set mylib.ABC:  ;
run;

If you've ever written SQL to stack (UNION) multiple data sets this may be of interest though if your ABC sets aren't built consistently you find out quickly.

 

None of the variable information functions work in SQL, nor LAG and DIF functions. I expect it to take a while before you need the variable information functions but it seems like Lag and/or Dif are often encountered relatively early.

 

BY group processing in a data step adds automatic variables so you can test if the value of a By variable is the first or last of a group. No real equivalent in SQL because there isn't a standard order records are processed.

 

JOINS are something the Data step does not do. There are ways to combine data, SET will stack records, Merge side by side (sort of), UPDATE and MODIFY can align values but have pretty stringent restrictions on matching records and results.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 448 views
  • 0 likes
  • 4 in conversation