Hello, I'm a new SAS user and I'm having a bit of difficulty with manipulating data using the data step. I'm working with almost entirely numeric data, and I frequently want to define a new variable as a function of both other variables and other observations.
For example, I'm currently working with data that was collected over a number of years, up to 2010. I have three relevant variables: YEAR and RATIO. To create ADJRATIO, I want to follow this procedure:
For the ith observation AGJRATIO_i:
If YEAR_i=2010, then ADJRATIO_i=RATIO_i. Else:
Find the value for RATIO that (1) was observed in 2010, (2) is less than RATIO_i, and (3) is maximal, given (1) and (2). More mathematically:
Find k such that YEAR_k=2010, RATIO_i > RATIO_k, and RATIO_k >= RATIO_j, for all j such that YEAR_j=2010 as well. Then ADJRATIO_i = RATIO_k
Coming from a background in MatLab and R, my strong instrinct is that this should be very easy to code. Instead, I'm finding it fiendishly difficult, and both asking friends more familiar with SAS and searching online have given me the impression that this is a fundamental difficulty in the language, rather than something I missed in the help index. So far, the best solution I've found is to make several datasets which each keep only one variable, transpose them, concatenate horizontally, and use arrays to do the arithmetic I'm aiming for. This will get *ridiculously* messy when I need to refer to more than one variable, as well as more than one observation, and I'm hoping there's a better solution.
Does anyone have any thoughts that might be helpful to someone who's used to datasets being treated as MxN arrays, and being able to easily refer to (Observation_m, Variable_n)? Many thanks in advance.
There are a number of ways to accomplish what you want. Here is one:
/*
First, use proc sql to create two macro variables that contain your POVRATIO
andPERPOVRATIO variables, for 2010, in descending order of the POVRATIO
values
*/
proc sql noprint;
select POVRATIO,PERPOVRATIO
into :povratio separated by "~",
:perpovratio separated by "~"
from input
where inty =2010
order by povratio descending
;
quit;
/*
Then, in a datastep, set the x and y values for 2010, otherwise compare the
POVRATIO values with the 2010 values until one is found that is less than
or equal to the current POVRATIO. When found, set the x and y values
accordingly
*/
data want (drop=i);
set input;
if inty eq 2010 then do;
x=povratio;
y=perpovratio;
end;
else do;
i=1;
do while (scan("&povratio.",i,"~") ne "");
if input(scan("&povratio.",i,"~"),best12.) le povratio then do;
x=input(scan("&povratio.",i,"~"),best12.);
y=input(scan("&perpovratio.",i,"~"),best12.);
leave;
end;
i+1;
end;
end;
run;
Can you provide a very simple example, in the form of two datasteps, one that shows an example of the data you have, and another that shows an example of the file you want to produce from that data?
That will make it easier for others to show you how to accomplish the task in SAS.
art297 wrote:
Can you provide a very simple example, in the form of two datasteps, one that shows an example of the data you have, and another that shows an example of the file you want to produce from that data?
Of course; I'm sorry, I should've thought of that. I've attached code to the original post, and here it is as text:
*--------------------------------------------------------------------;
data work.input;
input INTY POVRATIO PERPOVRATIO;
datalines;
1990 4 10
1991 5 20
1992 2 30
1993 5 40
1994 6 50
1994 8 55
1995 3 60
1996 8 70
1997 7 80
1998 5 90
1999 8 95
2010 4.9 110
2010 1.6 120
2010 2.1 130
2010 7.9 140
2010 4.6 150
run;
data work.output;
input INTY POVRATIO PERPOVRATIO x y;
datalines;
1990 4 10 2.1 130
1991 5 20 4.9 110
1992 2 30 1.6 120
1993 5 40 4.9 110
1994 6 50 4.9 110
1994 8 55 7.9 140
1995 3 60 2.1 130
1996 8 70 7.9 140
1997 7 80 4.9 110
1998 5 90 4.9 110
1999 8 95 7.9 140
2010 4.9 110 4.9 110
2010 1.6 120 1.6 120
2010 2.1 130 2.1 130
2010 7.9 140 7.9 140
2010 4.6 150 4.6 150
run;
*--------------------------------------------------------------------;
The idea is that for 2010, x is just POVRATIO. For previous years, x takes the largest 2010 observation for POVRATIO less than that observation's POVRATIO. So for the two observations in 1994 with POVRATIO 6 and 8, we assign x to be 4.9 and 7.9 respectively, because 4.9 is the largest 2010 value of POVRATIO less than 6, and 7.9 is the largest 2010 value less than 8.
As a follow up, I'm also trying to use the 2010 assignment to define another new variable, y. I'd like to assign to y the PERPOVRATIO value of the same observation as was found for x. That is, for any pre-2010 observation that had 4.9 assigned to x, I'd assign 110 to y. If 1.6 was assigned to x, then 120 would be assigned to y, and so on.
I've worked a bit previously with lag() and retain, but it doesn't seem that either would be useful in this situation (assuming it's not just that my SASfu weak, which is a distinct possibility).
There are a number of ways to accomplish what you want. Here is one:
/*
First, use proc sql to create two macro variables that contain your POVRATIO
andPERPOVRATIO variables, for 2010, in descending order of the POVRATIO
values
*/
proc sql noprint;
select POVRATIO,PERPOVRATIO
into :povratio separated by "~",
:perpovratio separated by "~"
from input
where inty =2010
order by povratio descending
;
quit;
/*
Then, in a datastep, set the x and y values for 2010, otherwise compare the
POVRATIO values with the 2010 values until one is found that is less than
or equal to the current POVRATIO. When found, set the x and y values
accordingly
*/
data want (drop=i);
set input;
if inty eq 2010 then do;
x=povratio;
y=perpovratio;
end;
else do;
i=1;
do while (scan("&povratio.",i,"~") ne "");
if input(scan("&povratio.",i,"~"),best12.) le povratio then do;
x=input(scan("&povratio.",i,"~"),best12.);
y=input(scan("&perpovratio.",i,"~"),best12.);
leave;
end;
i+1;
end;
end;
run;
Hmm, fascinating. I'll need to do some reading on proc SQL and macro (variables), but I see the structure of what you're doing.
Thanks a lot, this is exactly what I wanted to do but couldn't grok.
Hi Art,
I thought when work with macro variables we have to use "%scan", can you explain why "scan" also works?
Thank you!
You are confusing the SAS macro language and macro functions with macro variables. Using %scan, in this case in open code, will treat i as a character and inform you that it is expecting a numeric variable.
The macro variable, in this case, is simply text substitution.
However, I should point out that I am a psychologist, NOT a programmer. Possibly, one of our programmers can provide you with a better definition of the differences.
Thank you Art for your clarification. I understand now.
data work.input; input INTY POVRATIO PERPOVRATIO; datalines; 1990 4 10 1991 5 20 1992 2 30 1993 5 40 1994 6 50 1994 8 55 1995 3 60 1996 8 70 1997 7 80 1998 5 90 1999 8 95 2010 4.9 110 2010 1.6 120 2010 2.1 130 2010 7.9 140 2010 4.6 150 ; run; data temp ; set input(rename=(INTY=_INTY POVRATIO=_POVRATIO PERPOVRATIO=_PERPOVRATIO) where=(_INTY=2010)); run; data want(drop=_: dif); set input; if inty ne 2010 then do; dif=9999; do i=1 to _nobs; set temp nobs=_nobs point=i; if dif gt (POVRATIO-_POVRATIO) and POVRATIO gt _POVRATIO then do; dif=POVRATIO-_POVRATIO; x=_POVRATIO;y=_PERPOVRATIO; end; end; end; else do; x=POVRATIO;y=PERPOVRATIO; end; run;
Ksharp
I combined Art and ksharp's methods:
data work.input;
input INTY POVRATIO PERPOVRATIO;
datalines;
1990 4 10
1991 5 20
1992 2 30
1993 5 40
1994 6 50
1994 8 55
1995 3 60
1996 8 70
1997 7 80
1998 5 90
1999 8 95
2010 4.9 110
2010 1.6 120
2010 2.1 130
2010 7.9 140
2010 4.6 150
;
run;
proc sql;
create table temp as
select INTY as _INTY,
POVRATIO as _POVRATIO,
PERPOVRATIO as _PERPOVRATIO
from input
where _INTY=2010
order by 2 desc;
quit;
data want(drop=_: );
set input;
if inty ne 2010 then do;
do i=1 to _obs;
set temp nobs=_obs point=i;
if povratio>_povratio then do;
x=_povratio;
y=_perpovratio;
leave;
end;
end;
end;
else do;
x=povratio;
y=perpovratio;
end;
run;
Things you can look up that might help, lag function (be careful with this), retain statement and IML (if you have it licensed).
Otherwise, sometimes for things that refer to previous steps or observations (or futures ones) you may need to have multiple datasteps or proc sql.
As Art stated, you need to provide the two sample datasets (not datasteps I think), what you have and what you want.
Fareeza, FWIW: The reason I always suggest that people provide the forum with "an example in the form of a datastep" is so that those who respond don't have to take the time to create a sample dataset.
Ah...I think I know what you mean now. Thanks for the clarification.
Hi Art,
What does "FWIW" mean? English is not my first language.
Thank you!
For What It's Worth
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.