Hi everyone, SAS learner. I have a dataset from which I need to do some calculations on some of the columns and output in a new columns. The data set temp1 is already been sorted and group by with ID and Date and I generated a counters for different categories.
For each ID on that date, I need to do two calculations for category C (Customer)
Form1C = Princ * Rate2;
Form2C= (Princ - Div ) * Rate1;
Output for C category :
Diff_C = Form 2C- Form1C
Similary for the category D (Div Customer)
Form1D = Princ * Rate2/100;
Form2D =( Princ + Gains) * Rate1
Output for 😧
Diff_D = Form2D - Form1D
** There are certain conditions to consider when calculating if the first record for any given ID has a Rate1 =0 and Flag set as 'Y' then Diff_C and Diff_C should be set to Initial Amt.
** If it is not first record for that ID and if the Rate1=0 and Flag = 'Y' Then the Diff_C and Diff_D should take the previous value calculated for that ID.
/*Sample dataset */
data temp1;
input ID Date $ Initial Amt Princ Div Gains Flag $ Rn RnG Category $ Rate1 Rate2;
datalines;
934 18-Jul-17 1.67 16694.38 300 0 N 1 0 C 8 7
934 18-Aug-17 1.67 16662.95 298 0 N 2 0 C 8 7
934 17-Sep-17 1.67 16619.13 296 0 N 3 0 C 0 6
934 18-Oct-17 1.67 16591.16 295 0 N 4 0 C 8 7
484 4-Jul-17 0 6982 200 0 N 1 0 C 13 12
484 4-Aug-17 0 6900 198 0 N 2 0 C 13 12
484 4-Sep-17 0 6818 196 0 N 3 0 C 0 12
484 4-Oct-17 0 6736 194 0 N 4 0 C 13 12
404 18-Feb-17 1.56 1200 165 0 Y 1 1 D 6.5 5.5
404 18-Mar-17 1.56 1180 0 9 Y 2 2 D 0 5.5
404 18-Apr-17 1.56 1160 0 8.98 Y 3 3 D 0 5.5
404 18-May-17 1.56 1140 0 8.96 Y 4 4 D 0 5.5
404 18-Jun-17 1.56 1120 0 8.94 Y 5 5 D 6.5 5.5
404 18-Jul-17 1.56 1100 0 8.92 Y 6 6 D 6.5 5.5
;
run
Here is the code that I wrote which is working. However I would like to know how it can implemented in the most efficient way since the data set will have records in millions. Appreciate any help.
/** My sample code ****/
proc sql;
create table ctemp_1 as
select
R.*,
(Form2C-Form1C) as Diff_C,
(Form2D-Form1D) as Diff_D
FROM (
select
ID,
Date,
Initial Amt,
Princ,
Div,
Gains,
Flag,
Rn,
RnG,
Category,
Rate1,
Rate2,
Princ* Rate2 as Form1C,
(Princ- Div ) * Rate1 as Form2C,
Princ* (Rate2/100) as Form1D,
( Princ+ Gains) * Rate1 as Form2D
FROM temp1 ) R
order by 1, 2;
QUIT;
/** I am update the records to . if the Rate1=0 and Flag is Y ***/
proc sql;
update ctemp_1
set Diff_C = .,
Diff_D = .
where Rate1 =0 and Flag = 'Y' ;
Quit;
/*** My final Output here ************/
data want;
set ctemp_1;
by ID Date;
retain _Diff_C _Diff_D;
if first.ID and Rate1=0 and Rn =1 and RnG =0 Then Diff_C = Initial Amt;
if first.ID and Rate1=0 and Rn =1 and RnG =1 Then Diff_D = Initial Amt;
if not missing(Diff_C) then _Diff_C = Diff_C;
If not missing(Diff_D) then _Diff_D = Diff_D;
if missing (Diff_C) then Diff_C = _Diff_C;
if missing (Diff_D) then Diff_D = _Diff_D;
drop _:;
run;
Sample output
ID | Date | Flag | Rn | RnG | Category | Rate1 | Diff_C | Diff_D |
934 | 18-Jul-17 | N | 1 | 0 | C | 8 | 14294.38 | 132386.4 |
934 | 18-Aug-17 | N | 2 | 0 | C | 8 | 14278.95 | 132137.2 |
934 | 17-Sep-17 | N | 3 | 0 | C | 0 | 14278.95 | 132137.2 |
934 | 18-Oct-17 | N | 4 | 0 | C | 8 | 14231.16 | 131567.9 |
484 | 4-Jul-17 | N | 1 | 0 | C | 13 | 4382 | 89928.16 |
484 | 4-Aug-17 | N | 2 | 0 | C | 13 | 4326 | 88872 |
484 | 4-Sep-17 | N | 3 | 0 | C | 0 | 4326 | 88872 |
484 | 4-Oct-17 | N | 4 | 0 | C | 13 | 4214 | 86759.68 |
404 | 18-Feb-17 | Y | 1 | 1 | D | 6.5 | 127.5 | 7734 |
404 | 18-Mar-17 | Y | 2 | 2 | D | 0 | 127.5 | 7734 |
404 | 18-Apr-17 | Y | 3 | 3 | D | 0 | 127.5 | 7734 |
404 | 18-May-17 | Y | 4 | 4 | D | 0 | 127.5 | 7734 |
404 | 18-Jun-17 | Y | 5 | 5 | D | 6.5 | 1120 | 7276.51 |
404 | 18-Jul-17 | Y | 6 | 6 | D | 6.5 | 1100 | 7147.48 |
@sasnew_484 wrote:
Rules? Such as within a an Id (seems likely)? If so, what if the value is missing for the first record of an Id? The previous would be from a different ID.
>> if the value is missing or the rate1 is zero for the first record of an ID, then it would have the initial amount instead. However, if the record is not the first and the rate1 =0, then the computed value in this case Diff_C and Diff_D will have its previous value.
Sorry, I am super new to Sas, I usually work using Tsql and Python. I am not aware of few things. The dates in the actual data are represent as 'ddmmyyyy' format. The sample data when I used on excel it came out as dd-Mon-yy.
When you use Proc Contents or other tools to examine the properties of the SAS variable does it have a FORMAT such as ddmmyy assigned? In SAS FORMAT has a very specific meaning as the way a value is displayed. The same value of a variable can be displayed using a different SAS format at different times. So "looks like" is not a Format.
You can run this code and look at the log for some example of displaying the same numeric value using different Formats.
data _null_; x=1; put "X with date9 format " x= date9.; put "X with Z8.1 format " x= z8.1 ; put "X with datetime18. format " x= datetime18.; put "X with time8. format " x= time8.; run;
Dates, time and datetime values are numeric with certain ranges. The Format is so the value makes sense to humans. If your variable is not numeric then it is not a date value.
@sasnew_484 wrote:
Hi everyone, SAS learner. I have a dataset from which I need to do some calculations on some of the columns and output in a new columns. The data set temp1 is already been sorted and group by with ID and Date and I generated a counters for different categories.
For each ID on that date, I need to do two calculations for category C (Customer)
Form1C = Princ * Rate2;
Form2C= (Princ - Div ) * Rate1;
Output for C category :
Diff_C = Form 2C- Form1C
Similary for the category D (Div Customer)
Form1D = Princ * Rate2/100;
Form2D =( Princ + Gains) * Rate1
Output for 😧
Diff_D = Form2D - Form1D
** There are certain conditions to consider when calculating if the first record for any given ID has a Rate1 =0 and Flag set as 'Y' then Diff_C and Diff_C should be set to Initial Amt.
** If it is not first record for that ID and if the Rate1=0 and Flag = 'Y' Then the Diff_C and Diff_D should take the previous value calculated for that ID.
/*Sample dataset */ data temp1; input ID Date $ Initial Amt Princ Div Gains Flag $ Rn RnG Category $ Rate1 Rate2; datalines; 934 18-Jul-17 1.67 16694.38 300 0 N 1 0 C 8 7 934 18-Aug-17 1.67 16662.95 298 0 N 2 0 C 8 7 934 17-Sep-17 1.67 16619.13 296 0 N 3 0 C 0 6 934 18-Oct-17 1.67 16591.16 295 0 N 4 0 C 8 7 484 4-Jul-17 0 6982 200 0 N 1 0 C 13 12 484 4-Aug-17 0 6900 198 0 N 2 0 C 13 12 484 4-Sep-17 0 6818 196 0 N 3 0 C 0 12 484 4-Oct-17 0 6736 194 0 N 4 0 C 13 12 404 18-Feb-17 1.56 1200 165 0 Y 1 1 D 6.5 5.5 404 18-Mar-17 1.56 1180 0 9 Y 2 2 D 0 5.5 404 18-Apr-17 1.56 1160 0 8.98 Y 3 3 D 0 5.5 404 18-May-17 1.56 1140 0 8.96 Y 4 4 D 0 5.5 404 18-Jun-17 1.56 1120 0 8.94 Y 5 5 D 6.5 5.5 404 18-Jul-17 1.56 1100 0 8.92 Y 6 6 D 6.5 5.5 ; run
Here is the code that I wrote which is working. However I would like to know how it can implemented in the most efficient way since the data set will have records in millions. Appreciate any help.
/** My sample code ****/ proc sql; create table ctemp_1 as select R.*, (Form2C-Form1C) as Diff_C, (Form2D-Form1D) as Diff_D FROM ( select ID, Date, Initial Amt, Princ, Div, Gains, Flag, Rn, RnG, Category, Rate1, Rate2, Princ* Rate2 as Form1C, (Princ- Div ) * Rate1 as Form2C, Princ* (Rate2/100) as Form1D, ( Princ+ Gains) * Rate1 as Form2D FROM temp1 ) R order by 1, 2; QUIT; /** I am update the records to . if the Rate1=0 and Flag is Y ***/ proc sql; update ctemp_1 set Diff_C = ., Diff_D = . where Rate1 =0 and Flag = 'Y' ; Quit; /*** My final Output here ************/ data want; set ctemp_1; by ID Date; retain _Diff_C _Diff_D; if first.ID and Rate1=0 and Rn =1 and RnG =0 Then Diff_C = Initial Amt; if first.ID and Rate1=0 and Rn =1 and RnG =1 Then Diff_D = Initial Amt; if not missing(Diff_C) then _Diff_C = Diff_C; If not missing(Diff_D) then _Diff_D = Diff_D; if missing (Diff_C) then Diff_C = _Diff_C; if missing (Diff_D) then Diff_D = _Diff_D; drop _:; run;
Sample output
ID Date Flag Rn RnG Category Rate1 Diff_C Diff_D 934 18-Jul-17 N 1 0 C 8 14294.38 132386.4 934 18-Aug-17 N 2 0 C 8 14278.95 132137.2 934 17-Sep-17 N 3 0 C 0 14278.95 132137.2 934 18-Oct-17 N 4 0 C 8 14231.16 131567.9 484 4-Jul-17 N 1 0 C 13 4382 89928.16 484 4-Aug-17 N 2 0 C 13 4326 88872 484 4-Sep-17 N 3 0 C 0 4326 88872 484 4-Oct-17 N 4 0 C 13 4214 86759.68 404 18-Feb-17 Y 1 1 D 6.5 127.5 7734 404 18-Mar-17 Y 2 2 D 0 127.5 7734 404 18-Apr-17 Y 3 3 D 0 127.5 7734 404 18-May-17 Y 4 4 D 0 127.5 7734 404 18-Jun-17 Y 5 5 D 6.5 1120 7276.51 404 18-Jul-17 Y 6 6 D 6.5 1100 7147.48
The data in your data step code has no "missing" values. So you need to define what you mean by missing and likely "previous value" as well.
@sasnew_484 wrote:
Thanks for your response. My bad, I should have edited it properly. I meant for the computed values such as Diff_C and Diff_D, if the Rate1 has zero value then fill them with previous computed value.
Rules? Such as within a an Id (seems likely)? If so, what if the value is missing for the first record of an Id? The previous would be from a different ID.
Are your dates really character values? That makes "previous" moderately obnoxious to determine because a sort order would have '17-Sep-17' before '18-Jul-17'. BTW you really should use 4 digit years to remove any possible confusion.
Proc SQL is generally not to the tool to use when you need to process records in a given order.
Rules? Such as within a an Id (seems likely)? If so, what if the value is missing for the first record of an Id? The previous would be from a different ID.
>> if the value is missing or the rate1 is zero for the first record of an ID, then it would have the initial amount instead. However, if the record is not the first and the rate1 =0, then the computed value in this case Diff_C and Diff_D will have its previous value.
Sorry, I am super new to Sas, I usually work using Tsql and Python. I am not aware of few things. The dates in the actual data are represent as 'ddmmyyyy' format. The sample data when I used on excel it came out as dd-Mon-yy.
@sasnew_484 wrote:
Rules? Such as within a an Id (seems likely)? If so, what if the value is missing for the first record of an Id? The previous would be from a different ID.
>> if the value is missing or the rate1 is zero for the first record of an ID, then it would have the initial amount instead. However, if the record is not the first and the rate1 =0, then the computed value in this case Diff_C and Diff_D will have its previous value.
Sorry, I am super new to Sas, I usually work using Tsql and Python. I am not aware of few things. The dates in the actual data are represent as 'ddmmyyyy' format. The sample data when I used on excel it came out as dd-Mon-yy.
When you use Proc Contents or other tools to examine the properties of the SAS variable does it have a FORMAT such as ddmmyy assigned? In SAS FORMAT has a very specific meaning as the way a value is displayed. The same value of a variable can be displayed using a different SAS format at different times. So "looks like" is not a Format.
You can run this code and look at the log for some example of displaying the same numeric value using different Formats.
data _null_; x=1; put "X with date9 format " x= date9.; put "X with Z8.1 format " x= z8.1 ; put "X with datetime18. format " x= datetime18.; put "X with time8. format " x= time8.; run;
Dates, time and datetime values are numeric with certain ranges. The Format is so the value makes sense to humans. If your variable is not numeric then it is not a date value.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.