Hello, I have a question about creating a new data set from a raw data set.
This is part of my SAS code:
data newdata;
input name$ age maj$ score;
cards;
sara 25 nursing .9
kim 26 dance 1
charlie 21 psychology 4.3
anna 18 dance .45
run;
I have to make a new data set using IF/THEN statements and the SET statement, so I tried this:
data newdata;
set newdata_2;
if maj nursing psychology then newmaj = sci;
if maj dance then newmajo = art;
if score <1 then newscore = low;
if score <=1 then newscore = mid;
if score <=3 then newscore = high;
run;
I get errors and it does not work. Any help would be appreciated!
If you get errors in the log, show us the ENTIRE log from this code. Do not show us parts of the log. Please copy the log as text and paste it into the window that appears when you click on the </> icon.
If you get incorrect output, show us the incorrect output and explain or show us what the desired output will be.
163  data newdata;
164  input name$ age maj$ score;
165  cards;
NOTE: The data set WORK.NEWDATA has 4 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time 0.00 seconds
      cpu time 0.00 seconds
170  run;
171  data newdata;
172  set newdata_2;
ERROR: File WORK.NEWDATA_2.DATA does not exist.
173  if maj nursing psychology then newmaj = sci;
            -------
            388
            76
174  if maj dance then newmajo = art;
            -----
            388
            76
ERROR 388-185: Expecting an arithmetic operator.
ERROR 76-322: Syntax error, statement will be ignored.
175  if score <1 then newscore = low;
176  if score <=1 then newscore = mid;
177  if score <=3 then newscore = high;
178  run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.NEWDATA may be incomplete. When this step was stopped there were 0 observations and 5 variables.
WARNING: Data set WORK.NEWDATA was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time 0.00 seconds
      cpu time 0.00 seconds
data newdata;
input name$ age maj$ score;
cards;
sara 25 nursing .9
kim 26 dance 1
charlie 21 psychology 4.3
anna 18 dance .45
run;
data newdata;
set newdata_2;
if maj nursing psychology then newmaj = sci;
if maj dance then newmajo = art;
if score <1 then newscore = low;
if score <=1 then newscore = mid;
if score <=3 then newscore = high;
run;
Character values must appear inside quotes. Single or double quotes both work, but the opening and closing quotes must match; otherwise SAS thinks you mean the name of a variable.
If you want to see if a variable is equal to a single value:
if maj='nursing' then <do something>
If you want to see if a variable is one of a list of values, then the operator is IN:
if maj IN ( 'nursing' 'psychology') then newmaj = 'sci';
or instead of
if maj dance then newmajo = art;
I think you want the following (assuming the value should go into the same variable that 'sci' goes into):
if maj= 'dance' then newmaj = 'art';
When you use ranges of numeric values you quite often want to use an if/then/else and be pretty specific about the ranges. Any value that is <1 or <=1 is also <=3, so with your code every score up to 3 would end up as 'high'.
<1 and <=1 differ only in whether 1 itself is included. Do you want only a value of exactly 1 to be "mid"? If so,
you may be wanting the code below. But I doubt it, as you don't have anything that assigns a value to newscore for the 4.3 score. So I think you need to describe in words which range is for which value.
length newscore $ 4;
if score <1 then newscore = 'low';
else if score =1 then newscore = 'mid';
else if score <=3 then newscore = 'high';
SAS by default sets the length of a new character variable based on its first use. Since the first use of newscore would be the assignment of 'low', it would get a length of 3 characters, which means that 'high' would not fit.
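The truncation described above can be seen in a small sketch. The data set name DEMO and the two test scores are made up for illustration; they are not from the original post.

```sas
/* Hypothetical demo: no LENGTH statement, so the first assignment */
/* ('low') fixes NEWSCORE at 3 characters at compile time.         */
data demo;
  input score;
  if score < 1 then newscore = 'low';
  else newscore = 'high';   /* does not fit: stored as 'hig' */
  put score= newscore=;     /* write both values to the log  */
cards;
0.5
2
;
```

Adding length newscore $4; before the first assignment lets the full value 'high' be stored.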
Let's break down what your code asked SAS to do.
First you have a data step to create a work dataset named NEWDATA.
SAS does not care that you forgot to add a space after the end of the variable name MAJ. Since $ is not a valid character to include in a variable name, it was able to work out that you meant the variables NAME and MAJ to be defined as character instead of numeric. Since you did not set a length for NAME and MAJ, they will default to a length of $8. You also did not end the lines of data, but since you have a line with a semicolon on it, SAS will use that line to mark the end of the data and just ignore the other characters on it, like "RUN".
A more complete data step might look like this instead:
data newdata;
length name $7 age 8 maj $10 score 8;
input name age maj score;
cards;
sara 25 nursing .9
kim 26 dance 1
charlie 21 psychology 4.3
anna 18 dance .45
;
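To see why the explicit lengths matter, here is a small sketch of what the original INPUT statement does with the longest major. The data set name SHORT is made up for illustration.

```sas
/* Hypothetical demo: MAJ is read with list input and no LENGTH, */
/* so it gets the default length of 8.                            */
data short;
  input name $ age maj $ score;
  put maj=;   /* the log shows maj=psycholo, not psychology */
cards;
charlie 21 psychology 4.3
;
```

With the default length, a later test such as maj = 'psychology' would not match this row, which is one reason to define MAJ as $10 up front.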
You then tried to run another data step to replace NEWDATA with a new dataset of the same name. That step tries to read from a dataset named NEWDATA_2 that you never created. Does NEWDATA_2 exist? Does it have the same variables as NEWDATA?
Perhaps you meant instead to create NEWDATA_2 by reading in NEWDATA?
data newdata_2;
set newdata;
The rest of the data step appears to be an attempt to write IF/THEN statements.
Let's look at the last 3 first since they at least look like valid SYNTAX.
if score <1 then newscore = low;
if score <=1 then newscore = mid;
if score <=3 then newscore = high;
So when SCORE is less than 1 you set NEWSCORE to the value LOW. But there is no variable named LOW in the NEWDATA dataset. So SAS will create a new variable and default its value to missing. Similarly for the variables MID and HIGH.
Did you want variable NEWSCORE to be character? If so you should first define it as such. What length will it need to store the longest possible value? Perhaps only 4 bytes to store 'high'?
length newscore $4;
if score <1 then newscore = 'low';
if score <=1 then newscore = 'mid';
if score <=3 then newscore = 'high';
Now we need to look at the logic error in these three statements when taken together. If SCORE is less than 1 then NEWSCORE is set to 'low' by the first IF/THEN. It will then be set to 'mid' by the second, and finally it will be set to 'high' by the last.
You should probably use some ELSE statements in there so that when one test succeeds the other tests are skipped.
if score <1 then newscore = 'low';
else if score <=1 then newscore = 'mid';
else if score <=3 then newscore = 'high';
Now the only remaining logic questions are what to do with missing values of SCORE and with values larger than 3. With this code missing values will cause NEWSCORE to be set to 'low', since a missing value is smaller than any actual number. For values of SCORE larger than 3, NEWSCORE will currently be blank since none of the conditions will be true.
Perhaps you meant that values larger than 3 should be HIGH (your example data has a value of 4.3). In that case you might want to do something like this instead.
if score >3 then newscore = 'high';
else if score >= 1 then newscore = 'mid';
else if score >= 0 then newscore = 'low';
Now values larger than 3 will have NEWSCORE='high', values from 1 to 3, inclusive, will have NEWSCORE='mid', values from zero to less than 1 will have NEWSCORE='low', and missing values and negative values will have NEWSCORE=' '.
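One way to check those boundaries is to run the same three statements against a few test values, including a missing value and a negative one. This is only a sketch; the data set name CHECK and the test values are made up for illustration.

```sas
/* Hypothetical boundary check for the reversed IF/ELSE ladder */
data check;
  input score;
  length newscore $4;
  if score > 3 then newscore = 'high';
  else if score >= 1 then newscore = 'mid';
  else if score >= 0 then newscore = 'low';
  put score= newscore=;   /* one log line per test value */
cards;
.
-1
0
.45
1
3
4.3
;
```

In the log, the missing value and -1 should show a blank NEWSCORE, 0 and .45 should show 'low', 1 and 3 should show 'mid', and 4.3 should show 'high'.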
Now back to the other two IF / THEN statements.
if maj nursing psychology then newmaj = sci;
if maj dance then newmajo = art;
These are invalid syntax since you cannot have two or three variables listed in a row without operators between them.
I suspect that you only wanted MAJ to be a variable reference and the others to be string literals, so as to test if the value of MAJ is one of those strings. You can use the IN operator for lists of one or more values. And if the list is only one value you can use the = (equality test) operator.
Also do you want to create two different new variables? NEWMAJ and NEWMAJO ? Or just one? And again how long should it be defined to store the longest value it will need to hold?
length newmaj $3 ;
if maj in ('nursing' 'psychology') then newmaj = 'sci';
else if maj='dance' then newmaj = 'art';
The RUN; statement looks right. Since there is no in-line data to end the code for the data step, adding a RUN statement lets SAS (and other programmers reading your code) know that you have finished defining the data step.
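Putting the pieces from this walkthrough together, the whole program might look like the sketch below. It assumes that you want NEWDATA_2 created from NEWDATA, a single NEWMAJ variable, and scores above 3 treated as 'high'; those are guesses about your intent, so adjust them to your actual requirements.

```sas
/* Read the raw data with explicit lengths so 'psychology' is not truncated */
data newdata;
  length name $7 age 8 maj $10 score 8;
  input name age maj score;
cards;
sara 25 nursing .9
kim 26 dance 1
charlie 21 psychology 4.3
anna 18 dance .45
;

/* Derive the recoded variables from NEWDATA (assumed intent) */
data newdata_2;
  set newdata;
  length newmaj $3 newscore $4;
  if maj in ('nursing' 'psychology') then newmaj = 'sci';
  else if maj = 'dance' then newmaj = 'art';
  if score > 3 then newscore = 'high';
  else if score >= 1 then newscore = 'mid';
  else if score >= 0 then newscore = 'low';
run;

proc print data=newdata_2;
run;
```

With the example data this should give sara sci/low, kim art/mid, charlie sci/high and anna art/low.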
To add to all the explanations already provided: recoding values with SAS formats instead of if/then/else constructs is another, often quite efficient, option.
With formats it is often not even necessary to create new variables, because many SAS procedures can use formats directly (see examples below).
data have;
input name$ age maj :$20. score;
cards;
sara 25 nursing .9
kim 26 dance 1
charlie 21 psychology 4.3
anna 18 dance .45
;
run;
proc format;
value $maj
'nursing','psychology' = 'sci'
'dance' = 'art'
other = 'other'
;
value score
low -< 1 = 'low'
1 -< 3 = 'mid'
3 - high = 'high'
;
run;
/* create new variable using formats */
data want;
set have;
newmaj=put(maj,$maj.);
newscore=put(score,score.);
run;
/** examples using original values with formats for reporting */
proc print data=have;
format maj $maj. score score.;
run;
proc freq data=have;
format maj $maj.;
table maj;
run;