DATA Step, Macro, Functions and more

Area under the curve calculation

Reply
Occasional Contributor
Posts: 5

Area under the curve calculation

Hello I have a set of data, I need to calculate the Area Under the Curve, how can I achieve that in SAS?

 

IDTimeConcentration
201.120540964
240.581928655
280.532928263
2120.366326931
2160.481379851
2200.614268914
2240.692669541
2280.340258722
2320.447275578
2360.394551156
2400.611916895
2440.611916895
2480.476871815
301.466453674
340.768142401
380.464399817
3120.500912825
3160.751939754
3200.829758101
3240.81195801
3280.523505249
3320.508443633
3360.356230032
3400.306481059
3440.452304884
3480.329073482
400.569163694
440.423900119
480.372968688
4120.503963535
4160.447483155
4200.760602457
4240.752080856
4281.273285771
4320.326397146
4360.448474039
4400.478596908
4440.714229092
4480.563416568
500.829721362
541.272832817
580.621904025
5120.367840557
5160.47871517
5201.062113003
5240.796633127
5280.395123839
5320.367260062
5360.935178019
5401.938660991
5440.949690402
5481.617066563

 

Thank you

Super User
Posts: 13,942

Re: Area under the curve calculation

What role does ID play in this?

 

When you say "under the curve" do you mean under a smoothed curve going through those points or the sums of areas of the trapezoids defined by successive time values?

Occasional Contributor
Posts: 5

Re: Area under the curve calculation

ID is the name is different subjects in my study. I prefer to have the area of the trapezoids

 

Thank you

PROC Star
Posts: 1,405

Re: Area under the curve calculation

Do you want this done for each id or as a whole?

Occasional Contributor
Posts: 5

Re: Area under the curve calculation

For each ID, thank you

Super User
Posts: 13,942

Re: Area under the curve calculation

[ Edited ]

One way:

data want;
   set have;
   by id;
   retain area;
   lt = lat(time);
   lc = lag(concentration);
   if first.id then area=0;
   else area = area + (time - lt)*mean(concentration, lc);
run;

Area here is the cumulative area and the last value of Id has to total.

 

You didn't show what the output should look like. If you only want final total then add : If last.id; to the end of the code.

Super User
Posts: 10,850

Re: Area under the curve calculation

@Rick_SAS has written a blog about it . 

Trusted Advisor
Posts: 1,399

Re: Area under the curve calculation

Since you specify "area under CURVE", I presume you want one observation per ID:

 

data have;
 input ID Time Concentration ;
datalines;
2  0 1.120540964 
2  4 0.581928655 
2  8 0.532928263 
2 12 0.366326931 
2 16 0.481379851 
2 20 0.614268914 
2 24 0.692669541 
2 28 0.340258722 
2 32 0.447275578 
2 36 0.394551156 
2 40 0.611916895 
2 44 0.611916895 
2 48 0.476871815 
3  0 1.466453674 
3  4 0.768142401 
3  8 0.464399817 
3 12 0.500912825 
3 16 0.751939754 
3 20 0.829758101 
3 24 0.81195801 
3 28 0.523505249 
3 32 0.508443633 
3 36 0.356230032 
3 40 0.306481059 
3 44 0.452304884 
3 48 0.329073482 
4  0 0.569163694 
4  4 0.423900119 
4  8 0.372968688 
4 12 0.503963535 
4 16 0.447483155 
4 20 0.760602457 
4 24 0.752080856 
4 28 1.273285771 
4 32 0.326397146 
4 36 0.448474039 
4 40 0.478596908 
4 44 0.714229092 
4 48 0.563416568 
5  0 0.829721362 
5  4 1.272832817 
5  8 0.621904025 
5 12 0.367840557 
5 16 0.47871517 
5 20 1.062113003 
5 24 0.796633127 
5 28 0.395123839 
5 32 0.367260062 
5 36 0.935178019 
5 40 1.938660991 
5 44 0.949690402 
5 48 1.617066563 
run;

data want (keep=id area);
  set have;
  by id;
  area + dif(time)*mean(concentration,lag(concentration));
  if first.id then area=0;
  if last.id;
run;

 

The DIF function is equivalent to X-lag(X).

Occasional Contributor
Posts: 5

Re: Area under the curve calculation

Thank you very much for your response, but the code doesn't work. this is the message that comes out. I'm really new in SAS, the only reason I use is because the pressure in my department, apparently if you don't use SAS you are not considerate an animal scientist

 

376  ;
377
378  data want (keep=id area);
379    set have;
380    by id;
381    area + dif(time)*mean(concentration,lag(concentration));
382    if first.id then area=0;
383    if last.id;
384  run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line)Smiley SadColumn).
      1 at 381:19
NOTE: There were 52 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 4 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.04 seconds

Trusted Advisor
Posts: 1,399

Re: Area under the curve calculation

  1. Did you look at the data? If so, what's wrong with it?
  2. What about the log makes you think something is wrong?  Here's what I get from the  log:
    1. 52 observations were read in (i.e. 13 observations each for 4 id's).
    2. 4 observations were written out - one per id, yielding the total area for each id.
    3. There is a note that a missing value was generated ONCE in line 381.  So the other 51 times line 381 was executed, a  missing value was NOT generated.

      What was that one time?   Since the DIF function is defined as dif(x)=x-lag(x), then the missing value was generated for the first observation, since  lag(x) would be missing at the point.

      However, while a missing value is generated for the DIF(x) expression in observation 1, it's not propagated to area, because the statement is effectively
          area + .       (a summing statement)
      rather than area=area+.   (an assignment statement)

      The summing statement for obs 1 is
         0+.
      which yields a 0 in observation 1.

So If there is a  problem, you need to be more specific about what it is.

Super User
Posts: 8,220

Re: Area under the curve calculation

Mark,

 

I agree that your program works. But I question whether it is correctly calculating the area under the curve.

 

I know better than to challenge you in a math competition but, years ago, I wrote programs in basic and fortran that calculated a Gini. Out of curiosity, I took the basic code and, without trying to optimize it, converted it so that it would run in SAS. I compared the results with some examples I found on the web (e.g., https://www3.nd.edu/~wbrooks/GiniNotes.pdf ).

 

My results matched their results, but didn't match the results obtained with your code.

 

Am I missing something?

 

Here is what I ran:

data have;
  input id time concentration;
  cards;
1 0.02 0.10 
1 0.03 0.05 
1 0.10 0.15 
1 0.50 0.50 
1 0.35 0.20
;

data want;
  set have (rename=(concentration=x time=y));
  if missing(y) then y=1;
  p=x/(x+y);
run;

proc sort data=want;
  by id descending p;
run;

data want;
  do until (last.id);
    set want;
    by id;
    if first.id then do;
      sumx=x;
      sumy=y;
    end;
    else do;
      sumx+x;
      sumy+y;
    end;
  end;
  do until (last.id);
    set want;
    by id;
    output;
  end;
run;

data want (keep=g s);
  set want;
  by id;
  if first.id then do;
    g=0;
    cumx=0;
    cumy=0;
  end;
  xpct=x/sumx;
  ypct=y/sumy;
  G+(((CUMY*2)+YPCT)*XPCT);
  CUMX+XPCT;
  CUMY+YPCT;
  S+ABS(XPCT-YPCT);
  if last.id then do;
    G=(1-G)*100;
    S=S*.5*100;
    output;
  end;
run;

Art, CEO, AnalystFinder.com

 

Trusted Advisor
Posts: 1,399

Re: Area under the curve calculation

Art:

 

I don't think the OP here is asking for Gini  (area between a 45 degree line and a Lorenz curve connecting cumulative totals of ordered scores of the criterion variable).  So I wouldn't expect your program to produce the same value as my program above.

 

My code above is supposed to sum 12 trapezoids per ID, arising from 13 observations in sequence, where the trapezoid height is the diference in successive times, and the two bases are the current and previous concentration scores.

 

If there's a request for an overall measure of inequality in the distribution of the criterion, I totally missed it.

 

 

regards,

Mark

Super User
Posts: 8,220

Re: Area under the curve calculation

@ Mark: I totally agree! Would you agree with the following?:

 

@Aridany: The more I look at your data and question, I think that @mkeintz's approach will do what you want. I had been thinking about it like a Gini problem (i.e., a measure of inequality), but I don't think that is what it is.

 

You want to know the area under the curve, but you haven't indicated what is being plotted.

 

The following is based on plotting percent of total concentration as a function of time (thus the total area of each plot would be 48). Without some kind of standardization like that, I'm not sure if the results for each id are comparable with each other:

 

data need (drop=total);
  do until (last.id);
    set have;
    by id;
    total+concentration;
  end;
  do until (last.id);
    set have;
    by id;
    pct=concentration/total;
    output;
  end;
run;

data want (keep=id area_below area_above);
  set need;
  by id;
  area_below + dif(time)*mean(pct,lag(pct));
  area_above + dif(time)*mean(1-pct,1-lag(pct));
  if first.id then do;
    area_below=0;
    area_above=0;
  end;
  if last.id;
run;

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 5

Re: Area under the curve calculation

Hello,

 

Thanks everybody for the answers, it looks like you really know what you are saying. The biggest problem I have is I don't understand what you mean by "want", "need" and "have" code. I'm not a mathematician, I just need the area under the curve, I don't understand all that you are saying, so I'm going to use Excel to calculate that. I prefer to be a weirdo in my department for not using SAS than kill myself trying to understand what you all are saying. Thanks again all for your effort. Really surprised me how much you know. Impressive. I prefer to keep using statistic packages more user frendly like Minitab or SPSS.

 

Have a great day,

Ary 

Trusted Advisor
Posts: 1,399

Re: Area under the curve calculation

@Aridany

 

WANT, NEED, HAVE.  This is not a sas-specific terminology. nor is it mathematical.

 

As a problem generalization, the user typically has a dataset, which we generically call HAVE.  And the user commonly desires to produce results in the form of another datasets, which we generically call WANT.   And sometimes you must create an intermediate data set to solve the problem in two steps instead of one.  We generically call that NEED.

 

If you had created your own dataset name when you posed the problem, we may very well have used those names instead.

 

You didn't specify how you wanted the result (as a printable file, or a data set), which I think is understandable.  However, you apparently didn't look at the resulting dataset that my code produced.  And only now do I think that you wanted the area-under-curve printed out in the log or listing as opposed to being stored in a data set.

 

User-friendliness, even in SAS, can benefit from clarifying the problem statement when respondents generate something other than what you want.  Mutual patience can be rewarding.

 

 

 

 

Ask a Question
Discussion stats
  • 14 replies
  • 159 views
  • 2 likes
  • 6 in conversation