Ask The Expert - Why Choose Between SAS Data Step and PROC SQL When You Can Have Both? On-Demand Recording, Q&A and Slides

2 Likes

Want a review, or missed out due to scheduling? You can still watch this Ask the Expert session to learn where the SAS DATA step has a distinct advantage over SQL and where you just can’t beat SQL.

Watch the Webinar

You will learn:

How to choose SAS programming techniques that will optimize your code – and your time.
How to combine PROC SQL and the SAS DATA step to get the best of both.
How easy it is to switch between the two languages to get what you want.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

Is there a benefit in using WHERE vs ON within the SQL join?

I might share my code so that we can talk about the WHERE and the ON. What the user is asking is about when you craft a join. I use the WHERE clause, but you could also use an ON clause. If you use an ON clause, you need to specify something like this is the first table, then specify inner joint syntax. On the ON clause, you are going to specify the joint condition, and it is certainly possible. The difference has to do with the logical query processing order and the ON is part of the FROM. The FROM is the first thing that SQL is going to tackle in order of execution and then the WHERE. So, you have micro efficiency when you stack the ON with the FROM.

I use the dictionary tables often when I need to search for things. It often takes a VERY long time to come back. It can be frustrating.

Absolutely, and that is why the “compare and contrast” in the dictionary tables. I would try not to lean on the SAS help library to look at dictionary tables but go to PROC SQL to look at the dictionary tables. In addition, anytime you are querying dictionary tables, you saw in my demo how huge it is, I would ask you to stick WHERE clauses in there so that you limit the search to exclusively what you are looking for. Feel free to cheat and borrow this piece of code that you are seeing on my screen to find scene name columns or use this piece of code. I also have a paper called Know thy data that I wrote for SAS Global Forum and goes through a lot of detail on dictionary tables for maximum efficiency.

In your experience, do indexed tables work better for joins (SQL) or merges (Data Steps)?

Indexing is always going to be efficient before you join or merge. In my experience, whether a merge or a join is more efficient is up to your benchmarking statistics. I would strongly encourage you to use the options fullstimer and take our Programming III class because there is a whole segment dedicated to how we do benchmarking. How many runs you must do, at what time of day you must do the benchmarking, and so on. Also, a sort before indexing is beneficial to reduce I/O cycles.

What about cartesian joins in the Data Step?

Can be done, but not as straight forward as Data Step.

Your advantages of SQL are great. But I often work with matching data that is 'dirty,' meaning there can be failed matching or duplicate matches that, in theory, should not be there. Thus, I appreciate the Data Step's ability to closely track matches (with "IN = " I can see where each match was found and where not).

I completely agree! These are all tools, and you use the best one for the job. Cleaning data is much easier in the DATA step. Joining is easier in SQL in my opinion.

Inequality joins are a huge Proc SQL advantage!

I completely agree! Those are fantastic in SQL. 

What would be the exact equivalent of a DATA step MERGE variable1(IN=variable1_1) variable2(IN=variable2_1) in PROC SQL?

check this blog for lots of comparison between SAS data step merge & SQL https://blogs.sas.com/content/sastraining/2015/05/27/life-saver-tip-for-comparing-proc-sql-join-with...

I would like to see equivalent code in PROC SQL for one to many, one to one.

Equivalent of one-to-many in PROC SQL could be LEFT JOIN or RIGHT JOIN depending on the structure of the data source.  One-to-one would be INNER JOIN.

How do you get to a dictionary table?

Go to the SASHELP library and look for tables starting with "V".

I disagree on the clear advantage of Proc SQL on joining tables. It depends on the specifics of the case. The Data Step gains an advantage in cases where the join might be dependent on more complex logic, where the specificity of the Data Step becomes an advantage in both controlling and debugging the process.

Yes, you are right, I pointed this out, the data step typically outperforms whenever it comes to complex data manipulation. So, if you are considering combining tables horizontally and want to use complex logic, head over the data step.

Is there a list of dictionary tables?

Here is a nice paper on that: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/070-30.pdf

What would be equivalent of DATA MERGE with variable1 (IN=Variable1_1) variable2 (IN=variable2_1)?

Here you are simply creating temporary variables. The IF Variable1_1 and Variable2_1 will will merge the data.

Does the data have to be sorted by "visit"?

If using the DATA step to merge data, you need to presort the data, -since the DATA step works row by row. Data does not need to be explicitly sorted for SQL.

Do you have PROC SORT before this?

PROC SQL data does not need to be explicitly sorted. With the DATA step, if you are joining or doing and group analysis, data typically needs to be passed explicitly through a PROC SORT

When you use SQL to join tables, can you use either INNER JOIN or LEFT JOIN for multiple tables at the same time?

If I understand this correctly, you want to know if you can do an INNER JOIN and a LEFT JOIN in the same query. The answer is yes, you can join more than two tables with SQL.

Does PROC SQL do a full outer join if you list 2 tables in the 'from' statement and no 'join' statement?

If you do not have a “join” statement but reference two tables in the FROM clause, a cartesian product will be produced, which is typically something you do not want. That means all rows will join with all rows.

Did you get the same number of results in both the DATA STEP AND PROC SQL?

If you perform the same “join,” the results should be the same.

In the PROC SQL code, how do we know what type of "Join" is being used?

This is in reference to merging with the DATA step. In this example, the IF statement determines the join. Here, if it is on both tables, it will join, similarly to an inner join.

In the ON clause you will specify the join: INNER, LEFT, RIGHT, OUTER. 

How about if you join variables where one is character and another one is numeric? Which is the winner?

Key Variable being joined must be the same type. Use the PUT or INPUT function to convert type.

Aren’t dictionary tables like PROC CONTENT?

Kind of. Dictionary tables give a lot of additional information. PROC CONTENTS simply gives you information about a table or library.

Would it be fair to say the advantage that the DATA Step has is that you can accomplish many different tasks/data transformations with a single DATA Step, but SQL is oriented toward a single type of task? Would you agree?

The DATA Step gives you the utmost control on what you are doing. PROC SQL uses an optimizer to return results. A lot depends on your main objectives. I find if I want to join or aggregate data, I typically use SQL. If I want to prepare data with the utmost control, I use the DATA Step.

Can SQL be used in every DBMS (Database Management System)?

Most databases use SQL.

Can I join 2 variables with different type2 using SQL?

No, you would need to have the same data type. You can do that by creating a new column.

Can we use PROC SQL to implement overly complex SQL queries?

Of course! It can be as complex as you want.

Can you transpose data with SQL or is PROC TRANSPOSE better?

Proc Transpose is simpler to code.

What is the difference between Proc Append and just including two or more datasets in the DATA Step set statement? 

PROC APPEND only works with 2 datasets at a time, while the Data Step can concatenate multiple datasets in one step.

Does the content of the SASHELP library change over time?

SAS may add additional sample tables.

Why did you use the sashelp.vcolumn for the DATA Step and dictionary.column for PROC SQL?

Both provide the same information. However, PROC SQL Is more efficient. Here's a paper that explains more, check section 2.3 https://www.pharmasug.org/proceedings/2018/BB/PharmaSUG-2018-BB11.pdf

Do you need a retain statement for that DATA Step?

That DATA step was using the sum statement which automatically retains the column: https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/n1dfiqj146yi2cn1maeju9wo7ijs.ht...

In the DATA Step merge example, the keep statement was placed at the end of the DATA Step (just before the run). Does it matter where you place it? Isn't it more efficient to put the keep directly as an option to the dataset itself on the merge? As in "merge a (in=… keep=…) b (in=… keep=);

The KEEP statement can be placed anywhere in the DATA Step. All it does is tell SAS what columns to NOT output to the new data set. But it allows you to use those columns in the DATA Step.

When you are dealing with exceptionally large datasets (millions of rows), is DATA Step or PROC SQL better?

This depends on the processing you are doing and where the data is coming from. Formal benchmarking is required to find the most efficient way. There are a variety of variables that need to be considered.

A couple of notes that might be good for other users: the dictionary tables are populated with any active data sets - SASHELP is automatically loaded for users (if in config), but data sets that are created by the user are available via a library assignment. Also, all the V functions are populated in the same way.

This is a great point! Thanks!

What about using a case when in PROC SQL vs if then else in Data Step? Will this reduce spool space errors and be more efficient?

This depends on the data used. Formal benchmarking is required to determine which is more efficient.

Is there a preferred way of working with macro variables in DATA Steps vs macro variables in PROC SQL?

To create macro variables in DATA Step, use CALL SYMPUTX.

How do you match more than one table to see how many of the same values are matching among more than two datasets?

Try using DATA Step merge.

What if you need to join 3 or more tables?

With the DATA Step, you would add another table in the MERGE statement, and continue the same process.

Regarding the efficiency of retrieving the sashelp.vcolumn with a datastep: Would it run faster when you place the where clause directly on the set statement (together with the keep)? set sashelp.vcolumn (where=(…) keep=);

Yes, WHERE= KEEP= DATA Step options are faster.

Can the DATA Step conduct many-to-many merges? I know SQL can.

You can do a many-to-many merge in the DATA Step, but it takes more work. I believe this paper does it: https://www.lexjansen.com/nesug/nesug97/sassolu/mcallast.pdf

Why does SAS require manual sorting before the merge? As in, why can't the merge statement auto sort for us? Would we ever NOT want to sort?

You can merge without sorting, but it is randomly joining the rows and you cannot validate your results. If you sort, you can validate your results.

In the PROC SQL select statement, do the variables need to be the same names in both datasets? Can you select certain variables from each dataset by us.varname?

You can select specific variables from each table by qualifying with the tablename e.g., select tablename.variable.

Is it only for the first or last of the group? Does the 2nd, 3rd... of the group work too?

You can use multiple columns in the BY statement to do multiple groups.

Do you need a RETAIN to accumulate the wgt4000 and wle2500 variables over multiple rows?

You do. However, the wgt2500 + 1 is called a sum statement, which automatically retains the column. https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/n1dfiqj146yi2cn1maeju9wo7ijs.ht...

What are the SAS options that provided the additional detail in the log?

We used a global option called OPTIONS FULLSTIMER.

Why can you use * in select *, monotonic() as row_id from xx group by visit, if table with many vars inside, not only visit var?

The monotonic() function does not accept any arguments. Here is more https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752

Do you need to create a new variable, such as totalsum= running_total%2Bb, or just running_total%2Bb, when doing sum in DATA Step?

Running totals in the data step can be created without redoing an assignment statement.

Here is an example of the Assignment statement

X=X+Y;

Here is an example of the SUM statement

X+Y;

In 4.2, are you trying to get the number of rows with "over average weight" by visit?

In 4.2, I demonstrated how to get both over average visit(weight > 4000) and under average visit(weight <=2500)

How can I see a list of other recorded training sessions you've done? What about using a hash table for joining tables?

Here are some of my past videos

How do I create a dictionary for a single table? I would like to have the column name, label, and qualitative variables categories.

*describe table prints all columns in the log; 
proc sql;  
   describe table dictionary.columns;

*use this info to craft a query to lookup columns in a single table, though

you could look up the metadata of a single table quite easily via PROC CONTENTS;

proc sql;  
   select name, type, length, label 
     from dictionary.columns 
        where libname='SASHELP' and upcase(memname)='CARS';

What are the benefits for creating a Library vs. a Dataset?

A SAS data library is a pointer to a folder(in the windowing environment) which can have many datasets in there. A dataset is a single table.

Why do I sometimes get a 1-to-1 merge when I do a 1-to-many merge using Data Merge? I then use PROQ SQL to achieve the 1-to-many merge. Is there a reason for this?

Not sure, we would need to see the code & data to advise further.

I did not understand why the DATA Step used the vcolumn table whereas the PROC SQL used the dictionary table. Can you please explain this?

PROC SQL uses dictionary tables. You can also access dictionary tables via the SASHELP library as they are SAS extension, a PROC SQL dictionary table called dictionary.tables will be referenced as SASHELP.vtable. At the end of the day, your best bet is to access dictionary tables via PROC SQL as its more efficient. Here is a paper explaining why. https://www.pharmasug.org/proceedings/2018/BB/PharmaSUG-2018-BB11.pdf

How about linking 3 or more tables in PROC SQL?

You can link them 2 at a time. Here is a simple code snippet for a quick look

proc sql; 
 select * from 
    SASHELP.BMIMEN as bm 
    INNER JOIN 
    SASHELP.CLASSFIT as cf 
      ON bm.age=cf.age 
    INNER JOIN 
    SASHELP.CLASS as c 
    ON cf.age=c.age 
;

How do I use PROC SQL "insert into" for append?

SAS help & documentation link https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/sqlproc/p014txvde9e60an1xyloas81o6by.htm

In the Data Step Merge slide - how do you know which Boolean value that the inprd and inus is looking for by using 'if inprd and inus;' line? I am new to DATA Steps (more of PROC SQL user).

In the data step merge, the statement IF INPRD AND INUS asks for matching rows (equivalent SQL inner join)

Which one would be better when using discriminant analysis for classification data?

There are many SAS Statistical procedures. Here is a list https://support.sas.com/rnd/app/stat/procedures/Procedures.html

If I have 3 tables to join, in your opinion, is it faster to do 1 SQL statement joining all 3 tables (all with millions of records) at one time, or do 2 separate SQL statements joining 2 tables at one time?

It really depends on how you want to craft it, because it is likely the SQL optimizer will do 2 at a time in the background. The best way to measure efficiency is to perform some benchmarking against the various techniques you are considering before deciding on which one is faster using your specific resources.

What version of SAS are you using?

8.2 Update 5 (8.2.5.1277) (32-bit)

In your Boolean example in SQL, how did you specify the variable that was being summed?

Here's the entire code for your reference

proc sql; 
   create table sqlboolean as 
   select visit,  
sum(weight > 4000 and married=1 and momsmoke=1) as wgt4000 'over average weight', 
sum(weight <=2500 and married=1 and momsmoke=1) as wle2500
'under average weight', 
from sashelp.bweight 
group by visit; 
quit;

In sequel language, there is the advantage of using create view, which makes things very fast in PROC SQL. Is there something similar in PROC Dataset?

You can build views in both PROC SQL & the data step.

IS SQL processed by SQL engine? Does the SAS engine have any role?

The Base SAS engine is what allows us to read SAS datasets. That engine is activated when you submit the libname statement.

For Data Governance, would it be possible to use PROC SQL or DATA Step to audit source tables that were used to generate a certain SAS file? Would this be possible using DICTIONARY? You would need to know database, schema, and table name.

Yes, you can review this paper for more on dictionary tables and how to use them. https://www.pharmasug.org/proceedings/2018/BB/PharmaSUG-2018-BB11.pdf

Recommended Resources

The Power of SAS SQL

SAS Tutorial | Combining Data in SAS: DATA Step Versus SQL

SAS Tutorial | Merging Data Sets in SAS using SQL

Please see additional resources in the attached slide deck.

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.

Ask The Expert - Why Choose Between SAS Data Step and PROC SQL When You Can Have Both? On-Demand Recording, Q&A and Slides

Q&A

Recommended Resources

Click image to register for webinar

Classroom Training Available!