I want to create multiple rows based on time interval between two variables. I've the data as below. We have 'Duration' value in the source data which difference between 'Start_Mo' and 'End_Mo'. Value of 'End_Mo' will always be greater than 'Start_Mo'.
Source data:
data have;
input Branch $ Segment $
Start_Mo :anydtdte7. End_Mo :anydtdte7. Duration Id Cmpny $;
format start_mo end_mo yymmd7.;
datalines;
71831234 7183_AS4 2017-03 2017-09 3 1234 Capgemini
71831248 7183_AS4 2016-12 2017-09 4 78901 TCS
71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemini
71831234 7183_AS5 2017-06 2017-03 5 78901 TCS
;
run;
Now I want to create as many as rows based on the value from the variable 'Duration' and I've to create one more variable called 'Cal_Month' which should have values in quarters between time interval of the variables 'Start_Mo' and 'End_Mo'
with group by 'Segment' and 'Branch'
Desired Result is,
Branch Segment Start_Mo End_Mo Duration Cmpny cal_month 71831234 7183_AS4 2017-03 2017-09 3 Capgemin 2017-03 71831234 7183_AS4 2017-03 2017-09 3 Capgemin 2017-06 71831234 7183_AS4 2017-03 2017-09 3 Capgemin 2017-09 71831248 7183_AS4 2016-12 2017-09 4 TCS 2016-12 71831248 7183_AS4 2016-12 2017-09 4 TCS 2017-03 71831248 7183_AS4 2016-12 2017-09 4 TCS 2017-06 71831248 7183_AS4 2016-12 2017-09 4 TCS 2017-09 71831234 7183_AS5 2017-03 2017-12 4 Capgemin 2017-03 71831234 7183_AS5 2017-03 2017-12 4 Capgemin 2017-06 71831234 7183_AS5 2017-03 2017-12 4 Capgemin 2017-09 71831234 7183_AS5 2017-03 2017-12 4 Capgemin 2017-12 71831234 7183_AS5 2017-03 2018-03 5 TCS 2017-03 71831234 7183_AS5 2017-03 2018-03 5 TCS 2017-06 71831234 7183_AS5 2017-03 2018-03 5 TCS 2017-09 71831234 7183_AS5 2017-03 2018-03 5 TCS 2017-12 71831234 7183_AS5 2017-03 2018-03 5 TCS 2018-03
For example, values of 'Cal_Month' should be in 2017-03, 2017-06 and 2017-09 if 'Duration' value is 3 for each 'Segment'
In real life I've data with 'Duration' from 1 to 9 and 'Segment' is also have multiple values. Just to get the idea to tackle this Scenario, I just shared the sample data here.
I tried with the code as below, but it's not producing the desired results.
data want;
set have;
by Segment;
cal_month=start_mo;
do until (cal_month > end_mo);
output;
cal_month=intnx('month',cal_month,duration);
end;
format cal_month yymmd7.;
run;
Your logic is almost there. Try this
data want;
set have;
cal_month=start_mo;
do while (cal_month <= End_Mo);
output;
cal_month = intnx('qtr', cal_month, 1, 's');
end;
format cal_month yymmd7.;
run;
Result (Edited to comply with correct sample data)
Branch Segment Start_Mo End_Mo Duration Id Cmpny cal_month 71831234 7183_AS4 2017-03 2017-09 3 1234 Capgemin 2017-03 71831234 7183_AS4 2017-03 2017-09 3 1234 Capgemin 2017-06 71831234 7183_AS4 2017-03 2017-09 3 1234 Capgemin 2017-09 71831234 7183_AS4 2017-03 2017-12 4 78901 TCS 2017-03 71831234 7183_AS4 2017-03 2017-12 4 78901 TCS 2017-06 71831234 7183_AS4 2017-03 2017-12 4 78901 TCS 2017-09 71831234 7183_AS4 2017-03 2017-12 4 78901 TCS 2017-12 71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemin 2017-03 71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemin 2017-06 71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemin 2017-09 71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemin 2017-12 71831234 7183_AS5 2017-03 2018-03 5 78901 TCS 2017-03 71831234 7183_AS5 2017-06 2018-03 5 78901 TCS 2017-06 71831234 7183_AS5 2017-06 2018-03 5 78901 TCS 2017-09 71831234 7183_AS5 2017-06 2018-03 5 78901 TCS 2017-12 71831234 7183_AS5 2017-06 2018-03 5 78901 TCS 2018-03
@PeterClemmensen thank you. But it seems do while is taking more time to process with 10000+ records. Is there a chance to do it with other ways?
More time than what? You can do it with Do Until logic as you tried first, but I doubt that it'll be faster
data want;
set have;
cal_month=start_mo;
do until (cal_month > End_Mo);
output;
cal_month = intnx('qtr', cal_month, 1, 's');
end;
format cal_month yymmd7.;
run;
@David_Billa wrote:
@PeterClemmensen thank you. But it seems do while is taking more time to process with 10000+ records.
Only if you work with a mechanical computer from WW2. Anything that ever ran SAS does this in seconds, and nowadays in fractions of a second:
data have;
input Branch $ Segment $
Start_Mo :anydtdte7. End_Mo :anydtdte7. Duration Id Cmpny $;
format start_mo end_mo yymmd7.;
do i = 1 to 10000;
output;
end;
drop i;
datalines;
71831234 7183_AS4 2017-03 2017-09 3 1234 Capgemini
71831234 7183_AS4 2017-03 2017-12 4 78901 TCS
71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemini
71831234 7183_AS5 2017-06 2018-03 5 78901 TCS
;
data want;
set have;
cal_month=start_mo;
do while (cal_month <= End_Mo);
output;
cal_month = intnx('qtr', cal_month, 1, 's');
end;
format cal_month yymmd7.;
run;
Log excerpt:
88 data want;
89 set have;
90 cal_month=start_mo;
91 do while (cal_month <= End_Mo);
92 output;
93 cal_month = intnx('qtr', cal_month, 1, 's');
94 end;
95 format cal_month yymmd7.;
96 run;
NOTE: There were 40000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 150000 observations and 8 variables.
NOTE: Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
real time 0.16 seconds
cpu time 0.15 seconds
But since you have a duration variable already, do it in an iterative do loop:
data want;
set have;
format cal_month yymmd7.;
do period = 1 to duration;
cal_month = intnx('quarter',start_mo,period-1,'b');
output;
end;
drop period;
run;
whic is even faster:
73 data want;
74 set have;
75 format cal_month yymmd7.;
76 do period = 1 to duration;
77 cal_month = intnx('quarter',start_mo,period-1,'b');
78 output;
79 end;
80 drop period;
81 run;
NOTE: There were 40000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 160000 observations and 8 variables.
NOTE: Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
real time 0.09 seconds
cpu time 0.08 seconds
Your code don't seem to work to produce the desired results. I should get 16 records in the desired Output but I could see only 11 after executing the code which you gave. I feel we still need to tweak the logic.
The error is in the given sample data in the dates in the last row. This is the correct example data given your desired result.
Both of my posted code snippets produce the correct result.
data have;
input Branch $ Segment $
Start_Mo :anydtdte7. End_Mo :anydtdte7. Duration Id Cmpny $;
format start_mo end_mo yymmd7.;
datalines;
71831234 7183_AS4 2017-03 2017-09 3 1234 Capgemini
71831234 7183_AS4 2017-03 2017-12 4 78901 TCS
71831234 7183_AS5 2017-03 2017-12 4 1234 Capgemini
71831234 7183_AS5 2017-03 2018-03 5 78901 TCS
;
run;
Yes, your code is working with the example data. When I apply your code in the real life data with 40 variables 10500 observations it keep on running. That's the reason, I asked you to help me with alternative methods. If possible, I would like to know how to achive the results with 'do until'
This should not take forever. I suspect that this is a data problem. Do you have missing values for Start_Mo?
Can you update your sample data to be representable of your actual data?
@PeterClemmensen I have updated the sample data and also the desired result in my initial post. Please note that have to create as many as rows based on two variables 'Segment' and 'Branch' and this is also I have updated in my post. If you see the sample data now, I have updated the value of 'Branch' for one of the record in 'have' data.
Thanks in advance for your effort in helping me to identify the logic.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.