Hello I have a set of data, I need to calculate the Area Under the Curve, how can I achieve that in SAS?
ID | Time | Concentration |
2 | 0 | 1.120540964 |
2 | 4 | 0.581928655 |
2 | 8 | 0.532928263 |
2 | 12 | 0.366326931 |
2 | 16 | 0.481379851 |
2 | 20 | 0.614268914 |
2 | 24 | 0.692669541 |
2 | 28 | 0.340258722 |
2 | 32 | 0.447275578 |
2 | 36 | 0.394551156 |
2 | 40 | 0.611916895 |
2 | 44 | 0.611916895 |
2 | 48 | 0.476871815 |
3 | 0 | 1.466453674 |
3 | 4 | 0.768142401 |
3 | 8 | 0.464399817 |
3 | 12 | 0.500912825 |
3 | 16 | 0.751939754 |
3 | 20 | 0.829758101 |
3 | 24 | 0.81195801 |
3 | 28 | 0.523505249 |
3 | 32 | 0.508443633 |
3 | 36 | 0.356230032 |
3 | 40 | 0.306481059 |
3 | 44 | 0.452304884 |
3 | 48 | 0.329073482 |
4 | 0 | 0.569163694 |
4 | 4 | 0.423900119 |
4 | 8 | 0.372968688 |
4 | 12 | 0.503963535 |
4 | 16 | 0.447483155 |
4 | 20 | 0.760602457 |
4 | 24 | 0.752080856 |
4 | 28 | 1.273285771 |
4 | 32 | 0.326397146 |
4 | 36 | 0.448474039 |
4 | 40 | 0.478596908 |
4 | 44 | 0.714229092 |
4 | 48 | 0.563416568 |
5 | 0 | 0.829721362 |
5 | 4 | 1.272832817 |
5 | 8 | 0.621904025 |
5 | 12 | 0.367840557 |
5 | 16 | 0.47871517 |
5 | 20 | 1.062113003 |
5 | 24 | 0.796633127 |
5 | 28 | 0.395123839 |
5 | 32 | 0.367260062 |
5 | 36 | 0.935178019 |
5 | 40 | 1.938660991 |
5 | 44 | 0.949690402 |
5 | 48 | 1.617066563 |
Thank you
What role does ID play in this?
When you say "under the curve" do you mean under a smoothed curve going through those points or the sums of areas of the trapezoids defined by successive time values?
ID is the name is different subjects in my study. I prefer to have the area of the trapezoids
Thank you
Do you want this done for each id or as a whole?
For each ID, thank you
One way:
data want; set have; by id; retain area; lt = lat(time); lc = lag(concentration); if first.id then area=0; else area = area + (time - lt)*mean(concentration, lc); run;
Area here is the cumulative area and the last value of Id has to total.
You didn't show what the output should look like. If you only want final total then add : If last.id; to the end of the code.
@Rick_SAS has written a blog about it .
Since you specify "area under CURVE", I presume you want one observation per ID:
data have;
input ID Time Concentration ;
datalines;
2 0 1.120540964
2 4 0.581928655
2 8 0.532928263
2 12 0.366326931
2 16 0.481379851
2 20 0.614268914
2 24 0.692669541
2 28 0.340258722
2 32 0.447275578
2 36 0.394551156
2 40 0.611916895
2 44 0.611916895
2 48 0.476871815
3 0 1.466453674
3 4 0.768142401
3 8 0.464399817
3 12 0.500912825
3 16 0.751939754
3 20 0.829758101
3 24 0.81195801
3 28 0.523505249
3 32 0.508443633
3 36 0.356230032
3 40 0.306481059
3 44 0.452304884
3 48 0.329073482
4 0 0.569163694
4 4 0.423900119
4 8 0.372968688
4 12 0.503963535
4 16 0.447483155
4 20 0.760602457
4 24 0.752080856
4 28 1.273285771
4 32 0.326397146
4 36 0.448474039
4 40 0.478596908
4 44 0.714229092
4 48 0.563416568
5 0 0.829721362
5 4 1.272832817
5 8 0.621904025
5 12 0.367840557
5 16 0.47871517
5 20 1.062113003
5 24 0.796633127
5 28 0.395123839
5 32 0.367260062
5 36 0.935178019
5 40 1.938660991
5 44 0.949690402
5 48 1.617066563
run;
data want (keep=id area);
set have;
by id;
area + dif(time)*mean(concentration,lag(concentration));
if first.id then area=0;
if last.id;
run;
The DIF function is equivalent to X-lag(X).
Thank you very much for your response, but the code doesn't work. this is the message that comes out. I'm really new in SAS, the only reason I use is because the pressure in my department, apparently if you don't use SAS you are not considerate an animal scientist
376 ;
377
378 data want (keep=id area);
379 set have;
380 by id;
381 area + dif(time)*mean(concentration,lag(concentration));
382 if first.id then area=0;
383 if last.id;
384 run;
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
1 at 381:19
NOTE: There were 52 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.04 seconds
So If there is a problem, you need to be more specific about what it is.
Mark,
I agree that your program works. But I question whether it is correctly calculating the area under the curve.
I know better than to challenge you in a math competition but, years ago, I wrote programs in basic and fortran that calculated a Gini. Out of curiosity, I took the basic code and, without trying to optimize it, converted it so that it would run in SAS. I compared the results with some examples I found on the web (e.g., https://www3.nd.edu/~wbrooks/GiniNotes.pdf ).
My results matched their results, but didn't match the results obtained with your code.
Am I missing something?
Here is what I ran:
data have;
input id time concentration;
cards;
1 0.02 0.10
1 0.03 0.05
1 0.10 0.15
1 0.50 0.50
1 0.35 0.20
;
data want;
set have (rename=(concentration=x time=y));
if missing(y) then y=1;
p=x/(x+y);
run;
proc sort data=want;
by id descending p;
run;
data want;
do until (last.id);
set want;
by id;
if first.id then do;
sumx=x;
sumy=y;
end;
else do;
sumx+x;
sumy+y;
end;
end;
do until (last.id);
set want;
by id;
output;
end;
run;
data want (keep=g s);
set want;
by id;
if first.id then do;
g=0;
cumx=0;
cumy=0;
end;
xpct=x/sumx;
ypct=y/sumy;
G+(((CUMY*2)+YPCT)*XPCT);
CUMX+XPCT;
CUMY+YPCT;
S+ABS(XPCT-YPCT);
if last.id then do;
G=(1-G)*100;
S=S*.5*100;
output;
end;
run;
Art, CEO, AnalystFinder.com
Art:
I don't think the OP here is asking for Gini (area between a 45 degree line and a Lorenz curve connecting cumulative totals of ordered scores of the criterion variable). So I wouldn't expect your program to produce the same value as my program above.
My code above is supposed to sum 12 trapezoids per ID, arising from 13 observations in sequence, where the trapezoid height is the diference in successive times, and the two bases are the current and previous concentration scores.
If there's a request for an overall measure of inequality in the distribution of the criterion, I totally missed it.
regards,
Mark
@ Mark: I totally agree! Would you agree with the following?:
@Aridany: The more I look at your data and question, I think that @mkeintz's approach will do what you want. I had been thinking about it like a Gini problem (i.e., a measure of inequality), but I don't think that is what it is.
You want to know the area under the curve, but you haven't indicated what is being plotted.
The following is based on plotting percent of total concentration as a function of time (thus the total area of each plot would be 48). Without some kind of standardization like that, I'm not sure if the results for each id are comparable with each other:
data need (drop=total);
do until (last.id);
set have;
by id;
total+concentration;
end;
do until (last.id);
set have;
by id;
pct=concentration/total;
output;
end;
run;
data want (keep=id area_below area_above);
set need;
by id;
area_below + dif(time)*mean(pct,lag(pct));
area_above + dif(time)*mean(1-pct,1-lag(pct));
if first.id then do;
area_below=0;
area_above=0;
end;
if last.id;
run;
Art, CEO, AnalystFinder.com
Hello,
Thanks everybody for the answers, it looks like you really know what you are saying. The biggest problem I have is I don't understand what you mean by "want", "need" and "have" code. I'm not a mathematician, I just need the area under the curve, I don't understand all that you are saying, so I'm going to use Excel to calculate that. I prefer to be a weirdo in my department for not using SAS than kill myself trying to understand what you all are saying. Thanks again all for your effort. Really surprised me how much you know. Impressive. I prefer to keep using statistic packages more user frendly like Minitab or SPSS.
Have a great day,
Ary
WANT, NEED, HAVE. This is not a sas-specific terminology. nor is it mathematical.
As a problem generalization, the user typically has a dataset, which we generically call HAVE. And the user commonly desires to produce results in the form of another datasets, which we generically call WANT. And sometimes you must create an intermediate data set to solve the problem in two steps instead of one. We generically call that NEED.
If you had created your own dataset name when you posed the problem, we may very well have used those names instead.
You didn't specify how you wanted the result (as a printable file, or a data set), which I think is understandable. However, you apparently didn't look at the resulting dataset that my code produced. And only now do I think that you wanted the area-under-curve printed out in the log or listing as opposed to being stored in a data set.
User-friendliness, even in SAS, can benefit from clarifying the problem statement when respondents generate something other than what you want. Mutual patience can be rewarding.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.