Hi,
So this is about counting horizontally from different points for each ID.
Each row represent an individual over 12 years in week increments (52 sometimes 53 weeks in the year).
For every time (week) there is a match with the code 1, 2 or 3 it is to count for that ID, but it must be depending on the startdate if there is one - if there is no startdate it just has to be counting from the first time it meets a 1, 2 or 3. I want to know how many had 78 counts (for 78 weeks - but not for 78 consecutive weeks, gaps are allowed: that is missings in between or other classifications than 1, 2, 3) from when it starts counting.
So variables go from c0101 to c1252 (that's a lot - shown below in short form) ... that means from the year 2001 and the 1st week to the year 2012 and the 52nd week. c0833 would be the 33rd week of year 2008.
I have a million rows.
ID c0101 c0102 c0103 c0104 ... c0833 ... c1252 Startdate 1 1 1 1 1 01JAN01 2 2 2 3 5 5 5 1 1 29MAY08 4 1 1 15MAR04 5 6 7 3 1 8 9 99 99 99 25APR03 10 8 8 11 12 17FEB06 13 14 1 1 1 15 16
I hope someone could give me an idea of what to use. Was thinking of temporary arrays, but there will need to be many of them...
Hi Pearson,
I am sorry. I will try to recapitulate.
Dataset consists of several horizontal rows, each one representing an individual.
First column identifies the individual.
Then columns represent one week after another. Btw all the week-variables are texts - pondering about making these numerics.
There are week variables from the first week of 2001 to the last week of 2012 - giving some 626 variables of these (10 * 52 weeks + 2 * 53 weeks - 2004 and 2009 have 53 weeks). To me they are named oddly so that the first 52 variables run from c0101 to c0152 and the next variable is c0201 and so on ... up to c1252 - pondering about renaming the variables in a way that makes more sense if I were to use array.
For every week to every individual there can be a status code else it is empty. The status codes I am out for are 1, 2 or 3.
I want to know how many individuals reached 78 of the status codes 1,2 or 3 over the time period. But the twist is that there may also be defined a startdate (last column) for when it makes sense to count from. If there is no startdate then it should simply just starting counting from when it meets the first 1,2 or 3.
Hope it makes sense!
Regards
Well, the simplest answer is to normalise your data, go from wide data, to long data (there are many posts and papers on this topic). Then your data might look something like:
ID STARTDATE WEEKNO CODE
1 01JAN2010 1 .
1 01JAN2010 2 1
....
Then its quite simple, you could create a flag on the row where STARTDATE week is matching:
ID STARTDATE WEEKNO CODE STARTFLAG
1 01JAN2010 1 . Y
1 01JAN2010 2 1
....
Then its simple aggregates to sum not null records where weekno > min(weekno) where starflag="Y". I can try to get some code out later on, or maybe tomorrow as in meeting all day.
1) Use INTCK function to calculate starting week number based on startdate.
2) Add all week variables to array (no need to rename but have to add year by year eg: c0101-c0152 c0201-c0252 ...)
3) Start do loop from starting week number to 626
4) calculate count or sum as needed.
Here is what I have coded so far. But I haven't worked how to incorporate the start date.
data want; set have; array weeks{*} c0101--c1252; sum=0; d _i_ = 1 to 626 until (first_1 > .); if weeks{_i_} in ('1','2','3') then first_1=_i_; end; if (. lt first_1 le 626) then do _i_ = first_1 to min(first_1 + 77, dim(weeks)); sum + weeks{_i_}; end; count=0; do i=1 to 721 while(count<78); if weeks(i) in ('1','2','3') then count+1; end; run;
To incorporate start date you need to create a variable that stores week number of the start date. Following will work as long as your start date is not 31DEC2012.
if nmiss(startdate)=0 then startweek=ceil(('01JAN2001'd-startdate+1)/7));
else startweek=1;
You cannot set up array in this fashion since the weeks are not sequentially numbered for eg there is no column c0153. You have to add range for each year separately. Use following statements:
array weeks{*} c0101-c0152 c0201-c0252 ...;
Then your do loop should start from calculated startweek
do _i_=startweek to dim(temp);
In order to pick the correct variable which matches with a start date you need to know exactly how these variables have been created and use the same algorithm. There are some challenges in that like below example illustrates.
proc format;
picture yyww (default=4)
other='%0y%0V' (datatype=date)
;
quit;
data test;
format date date9.;
date='01jan2001'd;
YearWeek=put(date,yyww.);
week=week(date);
output;
date='01jan2012'd;
YearWeek=put(date,yyww.);
week=week(date);
output;
date='30dec2012'd;
YearWeek=put(date,yyww.);
week=week(date);
output;
run;
Below example code where I've made up the year and week calculation. The hardedst bit was to create some sample data.
Step 1: Create sample data
data have;
length id 8 yw $4 var 8;
format StartDate date9.;
format _date date9.;
do id=1 to 100;
StartDate=floor(ranuni(1)*2000) +'01jan2005'd;
_date=StartDate;
do while(_date<='31Dec2015'd);
yw=put(_date,year2.)||put(ceil((_date-intnx('year',_date,-1,'e'))/7),z2.);
var=ceil(ranuni(1)*14);
if var>5 then call missing(var);
output;
_date=_date+7;
end;
end;
run;
proc transpose data=have out=source(drop=_:) prefix=c;
by id;
id yw;
var var;
run;
proc sql noprint;
select name into :varlist separated by ' '
from dictionary.columns
where libname='WORK' and memname='SOURCE' and upcase(name) like 'C%'
order by name
;
quit;
data source;
length id StartDate &varlist 8;
format StartDate date9.;
merge have(keep=id StartDate) source;
by id;
if first.id;
run;
Step 2: Count 1,2,3.
The important parts in the code:
- Variables in array must be in sorted order (from earliest to latest "date").
- The hash holds the number of the array element - key is the variable name
- The algorithm used to calculate the starting variable name based on start date must use the same logic than what has been used to create the variables
data want(drop=_:);
length Count123 8;
set source;
array vars {*} &varlist;
if _n_=1 then
do;
length _varname $32 _arr_element 8;
dcl hash h();
_rc=h.defineKey('_varname');
_rc=h.defineData('_arr_element');
_rc=h.defineDone();
do _arr_element=1 to dim(vars);
_varname=upcase(vname(vars[_arr_element]));
_rc=h.add();
end;
end;
if missing(StartDate) then _arr_element=1;
else
do;
_varname='C'||put(StartDate,year2.)||put(ceil((StartDate-intnx('year',StartDate,-1,'e'))/7),z2.);
_rc=h.find();
end;
do _i=_arr_element to dim(vars);
if vars[_i] in (1,2,3) then Count123=sum(Count123,1);
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.