BookmarkSubscribeRSS Feed
C_andrade
Calcite | Level 5

Hi,

 

So this is about counting horizontally from different points for each ID. 

 

Each row represent an individual over 12 years in week increments (52 sometimes 53 weeks in the year).

 

For every time (week) there is a match with the code 1, 2 or 3 it is to count for that ID, but it must be depending on the startdate if there is one - if there is no startdate it just has to be counting from the first time it meets a 1, 2 or 3. I want to know how many had 78 counts (for 78 weeks - but not for 78 consecutive weeks, gaps are allowed: that is missings in between or other classifications than 1, 2, 3) from when it starts counting.

 

So variables go from c0101 to  c1252 (that's a lot - shown below in short form) ... that means from the year 2001 and the 1st week to the year 2012 and the 52nd week. c0833 would be the 33rd week of year 2008.

 

I have a million rows.

 

ID    c0101     c0102     c0103     c0104     ...    c0833 ...    c1252    Startdate
1         1      1           1        1                                      01JAN01   
2                                     2                 2                     
3         5                  5        5                 1            1       29MAY08   
4         1      1                                                           15MAR04
5         
6         
7                           3                            1                       
8
9         99               99         99                                     25APR03
10                          8          8
11
12                                                                          17FEB06
13
14        1               1             1
15
16

I hope someone could give me an idea of what to use. Was thinking of temporary arrays, but there will need to be many of them... 

8 REPLIES 8
pearsoninst
Pyrite | Level 9
I still feel question is not so clear ..
C_andrade
Calcite | Level 5

Hi Pearson,

 

I am sorry. I will try to recapitulate.

 

Dataset consists of several horizontal rows, each one representing an individual.

 

First column identifies the individual.

 

Then columns represent one week after another. Btw all the week-variables are texts - pondering about making these numerics.

 

There are week variables from the first week of 2001 to the last week of 2012 - giving some 626 variables of these (10 * 52 weeks + 2 * 53 weeks - 2004 and 2009 have 53 weeks). To me they are named oddly so that the first 52 variables run from c0101 to c0152 and the next variable is c0201 and so on ... up to c1252 - pondering about renaming the variables in a way that makes more sense if I were to use array.

 

For every week to every individual there can be a status code else it is empty. The status codes I am out for are 1, 2 or 3.

 

I want to know how many individuals reached 78 of the status codes 1,2 or 3 over the time period. But the twist is that there may also be defined a startdate (last column) for when it makes sense to count from. If there is no startdate then it should simply just starting counting from when it meets the first 1,2 or 3.

 

Hope it makes sense!

 

Regards

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, the simplest answer is to normalise your data, go from wide data, to long data (there are many posts and papers on this topic).  Then your data might look something like:

ID       STARTDATE     WEEKNO      CODE

1        01JAN2010        1                    .

1        01JAN2010        2                    1

....

 

Then its quite simple, you could create a flag on the row where STARTDATE week is matching:

ID       STARTDATE     WEEKNO      CODE    STARTFLAG

1        01JAN2010        1                    .            Y 

1        01JAN2010        2                    1

....

 

Then its simple aggregates to sum not null records where weekno > min(weekno) where starflag="Y".  I can try to get some code out later on, or maybe tomorrow as in meeting all day.

Reeza
Super User
From the start date determine what your starting index should be, ie 01Jan01 maps to C0101 and that maps to 1.

So loop from 1 to dimension of array or number of columns and count the number of 1/2/3 allowing your loop to break if you hit 78 since you've met your metric.

Figuring out the starting index is the hardest part, relatively. It will be something like (year-year(start))*52 +week(year).

And then something like the following (untested)

count=0;
do i=index to num_columns while(count<78);
if col(i) in (1, 2,3) then count+1;
end;
ndp
Quartz | Level 8 ndp
Quartz | Level 8

1) Use INTCK function to calculate starting week number based on startdate.

2) Add all week variables to array (no need to rename but have to add year by year eg: c0101-c0152 c0201-c0252 ...)

3) Start do loop from starting week number to 626

4) calculate count or sum as needed.

C_andrade
Calcite | Level 5

Here is what I have coded so far. But I haven't worked how to incorporate the start date.

 

data want;
set have;
array weeks{*} c0101--c1252;
sum=0;
d _i_ = 1 to 626 until (first_1 > .);
if weeks{_i_} in ('1','2','3') then first_1=_i_;
end;
if (. lt first_1 le 626) then do _i_ = first_1 to min(first_1 + 77, dim(weeks));
sum + weeks{_i_}; 
end;
count=0;
do i=1 to 721 while(count<78);
if weeks(i) in ('1','2','3') then count+1;
end;
run;
ndp
Quartz | Level 8 ndp
Quartz | Level 8

To incorporate start date you need to create a variable that stores week number of the start date. Following will work as long as your start date is not 31DEC2012.

 

if nmiss(startdate)=0 then startweek=ceil(('01JAN2001'd-startdate+1)/7));

else startweek=1;

 

You cannot set up array in this fashion since the weeks are not sequentially numbered for eg there is no column c0153. You have to add range for each year separately. Use following statements:

 

array weeks{*} c0101-c0152 c0201-c0252 ...;

 

Then your do loop should start from calculated startweek

 

do _i_=startweek to dim(temp);

 

 

Patrick
Opal | Level 21

In order to pick the correct variable which matches with a start date you need to know exactly how these variables have been created and use the same algorithm. There are some challenges in that like below example illustrates.

proc format;
  picture yyww (default=4)
    other='%0y%0V' (datatype=date)
  ;
quit;

data test;
  format date date9.;
  date='01jan2001'd;
  YearWeek=put(date,yyww.);
  week=week(date);
  output;
  date='01jan2012'd;
  YearWeek=put(date,yyww.);
  week=week(date);
  output;
  date='30dec2012'd;
  YearWeek=put(date,yyww.);
  week=week(date);
  output;
run;

Capture.PNG

 

Below example code where I've made up the year and week calculation. The hardedst bit was to create some sample data.

 

Step 1: Create sample data

data have;
  length id 8 yw $4 var 8;
  format StartDate date9.;
  format _date date9.;
  do id=1 to 100;
    StartDate=floor(ranuni(1)*2000) +'01jan2005'd;
    _date=StartDate;
    do while(_date<='31Dec2015'd);
      yw=put(_date,year2.)||put(ceil((_date-intnx('year',_date,-1,'e'))/7),z2.);
      var=ceil(ranuni(1)*14);
      if var>5 then call missing(var);
      output;
      _date=_date+7;
    end; 
  end;
run;

proc transpose data=have out=source(drop=_:) prefix=c;
  by id;
  id yw;
  var var;
run;
proc sql noprint;
  select name into :varlist separated by ' '
  from dictionary.columns
  where libname='WORK' and memname='SOURCE' and upcase(name) like 'C%'
  order by name
  ;
quit;

data source;
  length id StartDate &varlist 8;
  format StartDate date9.;
  merge have(keep=id StartDate) source;
  by id;
  if first.id;
run;

 

Step 2: Count 1,2,3. 

The important parts in the code:

- Variables in array must be in sorted order (from earliest to latest "date").

- The hash holds the number of the array element - key is the variable name

- The algorithm used to calculate the starting variable name based on start date must use the same logic than what has been used to create the variables

 

data want(drop=_:);
  length Count123 8;
  set source;
  array vars {*} &varlist;
  if _n_=1 then 
    do;
      length _varname $32 _arr_element 8;
      dcl hash h();
      _rc=h.defineKey('_varname');
      _rc=h.defineData('_arr_element');
      _rc=h.defineDone();
      do _arr_element=1 to dim(vars);
        _varname=upcase(vname(vars[_arr_element]));
        _rc=h.add();
      end;
    end;

  if missing(StartDate) then _arr_element=1;
  else
    do;
      _varname='C'||put(StartDate,year2.)||put(ceil((StartDate-intnx('year',StartDate,-1,'e'))/7),z2.);
      _rc=h.find();
    end;

  do _i=_arr_element to dim(vars);
    if vars[_i] in (1,2,3) then Count123=sum(Count123,1);
  end;
run;

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 2625 views
  • 0 likes
  • 6 in conversation