New SAS book: Selected Papers on SAS

I have published a book: Selected Papers on SAS (by BeAuthor.com). Here is the preface:

Preface

There is an old saying in Chinese: "Indigo comes from blue but is darker than blue; ice comes from water but is colder than water." This means, "The master is surpassed by the apprentice." However, this saying also describes the situation of my two "daughters": Understanding SAS and Selected Papers on SAS. This new book (Selected Papers on SAS) is derived from the older sister (Understanding SAS), but she is prettier and more charming than the older sister because I have concentrated on important and interesting parts rather than discussing every point. Actually, I should say that papers in this book are the essence of the older sister.

In the book Understanding SAS I included many new ideas, that is, my own research results. I finished the book, but the labor and delivery process takes a long time. Meanwhile, I decided to write some papers. I have taken some ideas from the book and finished them as separate, independent papers. So now the younger sister makes her debut before her older sister.

The book has 17 papers.

Mainly, there are three kinds of content: basic, fundamental topics such as names, the order of statements and options, end of data set, and operators; hot topics such as CHKLOG, Page x of y, RTF files and special characters, and transferring data between SAS data sets and Excel files; and topics that are just for fun.

The SAS manual is a great resource for every SAS programmer. However, it is too general in some places. For example, when talking about variable names, it says
You do not assign the names of special SAS automatic variables (such as _N_ and _ERROR_) or variable list names (such as _NUMERIC_, _CHARACTER_, and _ALL_) to variables.
This is far from enough. As we know, many computer languages, such as SQL, have a fixed set of reserved words. But SAS is different. "The rules are more flexible for SAS variable names than for other language elements. . . . . SAS reserves a few names for automatic variables and variable lists, SAS data set, and librefs."

First, what are these "a few names"? Second, this gives users great flexibility. They can use almost any words freely. On the other hand, it brings users some inconvenience, because you don't know how SAS treats a word: Is it treated as a user-defined name or as a keyword? This depends on SAS's interest and SAS's understanding. Sometimes you may think it is OK to use a specified word as a name, but SAS says: No, it is a keyword. Then your programming will be messed up. The following program creates two printouts. Guess what these printouts are.


DATA s;
a=5;n=6;
PROC REPORT NOWD;
COLUMN a n;
RUN;
DATA s;
not=0;
yes=not+1;
PROC PRINT;RUN;

Therefore, if you don't know how SAS behaves, you may not get the results that you want. We need to know about individual names. We need to know what names are forbidden in SAS and what names and in what situations SAS has its own interpretation of their meanings, and then we can correctly use these names or just avoid them.

Another example is subsetting. We all know that we can use a WHERE statement and an IF statement (subsetting IF) to subset a data set. Also, we know that there are some differences between a WHERE statement and a subsetting IF statement. We know that they will produce different data sets in some situations as the SAS manual mentions in the following:

The WHERE statement can produce a different data set from the subsetting IF when a BY statement accompanies a SET, MERGE, or UPDATE statement.

Then we may ask the question: If there is no BY statement, are there any differences? Some books discuss this, but the discussions do not contain enough detail. We need a comprehensive comparison of the two statements. Several papers in this book discuss this topic.

END of data set is fundamental in the SAS language. Almost every programmer uses the END= option. In paper [IV] I discuss when we can use this option and how SAS processes it.

Relationships among the options KEEP (DROP), RENAME, and WHERE are basic. In paper [III] I discuss relationships among statements and options. You may not care about the relationship between the options OBS= and WHERE= because you never use them together. However, it is quite possible that you have used the options RENAME= and WHERE= together. The SAS online documentation talks about relationships among the options KEEP=, DROP=, and RENAME=, but not their relationship with the WHERE= option. So what is wrong with the following program?

DATA s;
a=3;
DATA t;
SET s(WHERE=(a>2) RENAME=(a=b));
RUN;

3 DATA t;
4 SET s(WHERE=(a>2) RENAME=(a=b));
ERROR: Variable a is not on file WORK.S.
5 RUN;

NOTE: The SAS System stopped processing this step because of errors.
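
A minimal sketch of one way to make this step run. It assumes, as the error message suggests, that on an input data set RENAME= is applied before WHERE=, so the WHERE= expression has to refer to the new name b. The data set name t2 is hypothetical and only there to show an alternative.

```sas
DATA s;
  a=3;
RUN;

/* Assumption: RENAME= takes effect before WHERE=, */
/* so the filter must use the new name b.          */
DATA t;
  SET s(RENAME=(a=b) WHERE=(b>2));
RUN;

/* Alternative: filter on the original name on input */
/* and rename on the output data set instead.        */
DATA t2(RENAME=(a=b));
  SET s(WHERE=(a>2));
RUN;
```

Either form keeps the filter and the rename in a single step; which one reads better depends on whether the rest of the step refers to a or to b.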

Some topics are just for fun, for example, the positive prefix operator. Do you know what the printout of the following program is?

DATA s;
a=3;OUTPUT;
a=-3;OUTPUT;
DATA t;
SET s;
WHERE +a>0;
PROC PRINT;
RUN;

This is a very simple and trivial program, but I believe that no one has ever tried it. If you are interested or curious, you can find the answer yourself (it is very simple), or you can read this book. This problem is discussed in one of the papers. Of course, it is not the main topic of that paper. Actually, that paper is my favorite. In it, I discuss operators in a DATA step (not in a WHERE statement or a WHERE option), in a WHERE statement, in PROC SQL, and in the macro facility, and I make a comprehensive comparison. These are fundamental parts of the SAS language, and the discussion can serve as a supplement to the SAS manual. They touch a basic and fundamental question: When we learn, use, analyze, and compare operators, what factors do we need to consider? I think this is not very clear to many SAS programmers or in many books. The following program creates three data sets s, t, and p. Do you know the difference between the data set t and the data set p?

DATA s;
INPUT a b c;
CARDS;
-3 5 2
2 5 2
;
DATA t;
SET s;
IF a=-3 MAX b MIN c;
PROC PRINT;
DATA p;
SET s;
WHERE a=-3 MAX b MIN c;
PROC PRINT;
RUN;

Let's see another example. Do you know what the printouts of the following program are?

%MACRO ss(company);
%IF &company EQ GE %THEN %PUT The company is GE;
%MEND;
%ss(GE)


%MACRO tt(state);
%IF &state EQ OR %THEN %PUT The state is Oregon;
%MEND;
%tt(OR)

We can see that the first macro is OK, but the second one is not. This problem has been discussed in many papers and books. There are two questions: Why does this happen, and how can we avoid the problem? Many books give the answer to the second question: Use quoting functions. In this book, I give my explanation for the first question.
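
A minimal sketch of the quoting-function approach mentioned above. The choice of %BQUOTE for the resolved parameter and %STR for the literal is my assumption of one common pattern, not necessarily the form used in the book.

```sas
%MACRO tt(state);
  /* %BQUOTE masks the resolved value and %STR masks the literal, */
  /* so OR is compared as text instead of being read as the       */
  /* logical operator OR.                                         */
  %IF %BQUOTE(&state) EQ %STR(OR) %THEN %PUT The state is Oregon;
%MEND tt;
%tt(OR)
```

The same masking applies to other values that coincide with operator mnemonics, such as NE, which is both an operator and a state abbreviation.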

Every programmer knows about variable lists and uses them to some degree. Now I ask the following question: Precisely which statement can use which variable list? I think many people may not be clear about this. This is the purpose of paper [VI]. Many people know statistical functions. You can guess that the function MEAN() returns the mean value of its arguments without reading any book. But do you know how to use variable lists in statistical functions? For example, in the following, which statements are OK, and which have problems? Assume that all variables and arrays are defined.

A=MEAN(OF x1-x6 p, OF q y1-y5, OF z1 z2,y1+y2);
A=MEAN(OF x1-x6 p q y1-y5 z1 z2);
A=MEAN(OF x1-x6 y1+y2);
A=MEAN(x1 x2 x3);
A=MEAN(OF x1-x5, y2 y3);
A=MEAN(OF x:);
A=MEAN(OF z(1) - z(3));
A=MEAN(OF z(*));

In [XIII], I discuss relationships between formats and the output of statistical procedures. Formats affect not only the appearance of printouts but also the results. For example, the following program creates two printouts. Do you know if they are different?

DATA s;
a=1.34; b=5;OUTPUT;
a=1.28;b=4;OUTPUT;
a=2;b=5;OUTPUT;
a=2;b=4;OUTPUT;
FORMAT a 5.1;
PROC MEANS;
CLASS a;
PROC SQL;
SELECT a,MEAN(b) FROM s GROUP BY a;
QUIT;

A variable has two kinds of values: original (unformatted) values and formatted values. Every SAS programmer knows that if there is a BY statement, the data set should be sorted, but should the data set be sorted by original values or by formatted values? The following program is supposed to produce two printouts. Is there any problem with the program? If there is no problem, what printouts do you expect? This problem is discussed in [XIV].

PROC FORMAT;
VALUE aa 1,3,5=odd 2,4=even;
DATA s;
INPUT a b @@;
FORMAT a aa.;
CARDS;
1 1 1 1 3 3 3 3 2 2 2 2 4 4 4 4
;
PROC MEANS;
BY a;VAR b;
PROC SORT;
BY a;
PROC MEANS;
BY a;VAR b;
RUN;

PROC SQL is an important part of SAS. Many people have done some work on this topic and advocate using PROC SQL. However, most SAS programmers still prefer the DATA step because they are used to it. In this book, I discuss PROC SQL in much detail. Moreover, I present two complete programs using PROC SQL to create two tables that are used in pharmaceutical companies, so you can compare these programs with those using DATA steps. Will you turn to PROC SQL after reading this book? Well, it is possible.

Is .1+.2 equal to .3? If you ask a math professor, he will say 'Yes'. But if you ask Dr. SAS (Dr. C, Dr. JAVA, or Dr. AWK), she will say 'No'. Moreover, she will tell you that the difference is 2**(-54), or approximately 5.55E-17. In [VIII] I discuss how SAS stores numeric values and does addition so that you can understand why Dr. SAS says 'No' and how SAS's behavior affects your programming.
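
A minimal sketch for checking this yourself. It assumes a platform that stores numerics as IEEE double-precision values (for example, SAS on Windows); the variable names x and d and the rounding unit 1E-9 are illustrative choices, not taken from the book.

```sas
DATA _NULL_;
  x = .1 + .2;
  d = x - .3;   /* rounding error left over from binary storage */
  IF x = .3 THEN PUT 'Equal';
  ELSE PUT 'Not equal: ' d= E12.;
  /* Rounding both sides to the same unit before comparing */
  /* is one common workaround.                             */
  IF ROUND(x, 1E-9) = ROUND(.3, 1E-9) THEN PUT 'Equal after rounding';
RUN;
```

If the platform assumption holds, the value of d written to the log should match the difference quoted above. The right rounding unit depends on the precision your data actually needs.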

Some parts of the book are state of the art, most advanced, such as Excel and RTF. I am sure that many SAS programmers will encounter them. This book, I think, could be a good reference.

There are so many ways to move data between SAS and Excel files. In paper [XV], not only do I collect more than thirty methods, but also I indicate how to categorize them and how to compare them. You can then choose the one that best fits your needs. For example, many people know that some methods require the installation of SAS/ACCESS, and some methods do not. Do you know the answers to the following questions? What methods need SAS/ACCESS, and what methods do not? What is the difference between these two kinds of methods? What is the advantage of installing SAS/ACCESS? If there is no difference between them, then why would people waste money purchasing SAS/ACCESS software? (SAS/ACCESS has other functionalities, but here we concentrate on Excel.)

I provide my programs for CHKLOG and Page x of y. My criteria for good programs are that they are easy to use, easy to understand, and easy to modify. I hope my programs meet these criteria.

Finally, I have to emphasize that these results were obtained on a Windows PC running SAS v9. Because of different settings, some conclusions may not apply to your situation. In many cases I will indicate the differences for v8, because so many SAS programmers are still using SAS v8.

No doubt, some conclusions may be inaccurate. The author welcomes readers' criticisms, comments, and suggestions.

I want to thank SAS Institute. Whenever I have questions, I can always get answers from them. I want to thank all the colleagues that I have worked with and that I am working with now. Our discussions greatly inspire my thinking and my research.

I want to express my sincere thanks to Mrs. Jenny Brain for her help.

It is my hope that everyone who sees my "daughter" will say, "Your daughter is so charming. I like her." In other words, "I enjoy reading this book. I have learned something from it." Then, I will be happy.