Solved: generating new variable based on cumulative sum

stevenyan0127 · Posted 04-08-2022 08:49 PM

Hi, I have a dataset with 3 varibales, ID, Date, Cost in a long format. I want to generate a new variable called "Date_over_100" that shows the date when one's cumulative cost have exceeded 100. I would appreciate all the help there is. Thanks in advance!

mkeintz · Posted 04-09-2022 06:59 AM

For each ID, I think you want one observation, with variable ID and DATE, representing the first date in which the cumulative sum of costs exceeds 100.

If so, then:

data want;
  set a;
  by id;
  if first.id then sum_cost=0;
  sum_cost+cost;
  if (lag(sum_cost)<=100 or first.id=1) and sum_cost>100;
run;

The "subsetting if" outputs an observation only when the current sum_cost>100 and the observation-in-hand is either the first observation for a given ID, or is immediately preceded by an observation with sum_cost<=100.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 04-08-2022 10:52 PM

Help us help you.

Please provide a working DATA step with your sample data.

And please provide a well-defined table showing what you expect the output dataset to look like.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

stevenyan0127 · Posted 04-09-2022 12:14 AM

Hi, thanks for the reply. I have started a new thread with very detailed steps. https://communities.sas.com/t5/SAS-Programming/generating-new-variable-based-on-cumulative-sum/m-p/8... Would you mind taking a look? Thank you very much!

stevenyan0127 · Posted 04-09-2022 12:13 AM

Hi, I have a dataset with 3 varibales, ID, Date, Cost in a long format. I want to generate a new variable called "Date_over_100" that shows the date when one's cumulative cost have exceeded 100. I would appreciate all the help there is. Thanks in advance! I have achieved this in R, but need help to do so in SAS.

R:

1. I created the dataset first. I have repeated the generated 10 IDs so there are 100 in total, and generated 10 repetitions of 10 dates for each ID, and generated values for cost by random sampling 100 numbers between 30 and 50. And the output would look like the attachment file.

set.seed(11)
ID <- c(rep(1:10,each=10))
Date <- rep(sample(seq(as.Date('2008/01/01'), as.Date('2008/01/11'), by="day",replace=T), 10),10)
Cost <- round(runif(100,30,50),digits = 2)
df <- data.frame(ID,Date,Cost)
df <- df%>%
      arrange(Date)

2. I created a variable named "Date_Over_100" to show the date when one's cumulative cost has exceeded 100. Below are codes and output.

df_new <- df%>%
  group_by(ID) %>%
  summarize(Date_over_100 = Date[which.max(cumsum(Cost) > 100)])

3. I then transformed the original dataset into wide format.

df_wide <- df%>%
  arrange(Date)%>%
  pivot_wider(names_from = Date, values_from = Cost)

Screen Shot 2022-04-09 at 12.07.03 AM.png

4. Lastly, I merged the wide format dataset with the dataset I got in 3.

df_final <- inner_join(df_new,df_wide,by="ID")

Screen Shot 2022-04-09 at 12.07.38 AM.png

Now in SAS:

1. Creating the dataset.

data a;
   call streaminit(123);
	do ID = 1 to 10;
	   do Date = '01Jan2008'd to '10Jan2008'd;
              Cost = rand("integer",30,50);
	      output;
	   end;
	end;
   format Date yymmddd10.;
run;

First 15 rows of the output:

Screen Shot 2022-04-08 at 11.54.37 PM.png

And next I need to create the variable "Date_over_100" that shows the date when one's cumulative cost has exceeded 100.

I have tried to replicate what I have done in R into SAS but haven't succeeded. I would appreciate any help there is. Thanks in advance!

Kurt_Bremser · Posted 04-09-2022 02:26 AM

See this:

dara want;
set a;
by id;
retain flag;
if first.id
then do;
  sum_cost = cost;
  flag = 1;
end;
else sum_cost + cost; /* sum statement implies retain of sum_cost */
if sum_cost > 100
then do;
  if flag then output;
  flag = 0;
end;
drop flag;
run;

Untested, posted from my tablet.

PS I merged your questions, no need for a new thread.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

stevenyan0127 · Posted 04-09-2022 05:42 PM

Thank you Kurt! It works perfect!

mkeintz · Posted 04-09-2022 06:59 AM

For each ID, I think you want one observation, with variable ID and DATE, representing the first date in which the cumulative sum of costs exceeds 100.

If so, then:

data want;
  set a;
  by id;
  if first.id then sum_cost=0;
  sum_cost+cost;
  if (lag(sum_cost)<=100 or first.id=1) and sum_cost>100;
run;

The "subsetting if" outputs an observation only when the current sum_cost>100 and the observation-in-hand is either the first observation for a given ID, or is immediately preceded by an observation with sum_cost<=100.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

stevenyan0127 · Posted 04-09-2022 05:42 PM

Thank you very much! It works perfect!

generating new variable based on max cumulative sum

Re: generating new variable based on max cumulative sum

Re: generating new variable based on max cumulative sum

Re: generating new variable based on max cumulative sum

generating new variable based on cumulative sum

Re: generating new variable based on cumulative sum

Re: generating new variable based on cumulative sum

Re: generating new variable based on max cumulative sum

Re: generating new variable based on max cumulative sum

Classroom Training Available!