DATA Step, Macro, Functions and more

How to scan a binary variable for density

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 10
Accepted Solution

How to scan a binary variable for density


Hi,

I have been working with a programmer for several months trying to find an acceptable solution to a programming issue with no luck.  Basically, I have a SAS dataset that has a binary variable (0,1) with one record for each minute of the day.  I would like to find periods of at least 10 minutes of the day where there is a specific density of 1's, namely periods of at least 80% 1's and 20% or less 0's.  I would like to create a new variable that flags each minute that is part of one of these dense periods of 1's.  Also, I would like 5 consecutive zero's to indicate an automatic end of a period dense in 1's.  I think this is likely a difficult programming issue and that it may not have a solution that is 100% accurate, but I am wondering if anyone has tried something like this in SAS before and has advice or can point me to any reading that might be helpful.


Accepted Solutions
Solution
‎09-17-2014 07:51 PM
SAS Employee
Posts: 340

Re: How to scan a binary variable for density

Reading the file sequentially. If there is a 0->1 switch, it is a potential starting point for a run. So we store it in a hash object.

For each record (0 or 1):

     we update the data for each stored potential run: number of 1s

     if actual record is 1 we need to check conditions of a valid run (length>=10, 80% purity)

     if check is OK, we also store the end position of the (valid) run. If there is already an end position for this valid run, we can safely overwrite it (only longest run needed)

When there are 5 consecutive 0s (or end of file), we can output the good runs (start pos, end pos)

Postprocessing: merging overlapping runs.

Following code is not tested (lot of syntax and logical errors), will test, when I have SAS.

Tested.

data have;

  input bit @@;

datalines;

0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1

0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 1

1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1

0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

0 0 0 0 0

;

run;

data periods(keep=startPos endPos);

     if _n_=1 then do;

          dcl hash runs(ordered:'y');

          dcl hiter riter('runs');

          runs.defineKey('startPos');

          runs.defineData('startPos');

          runs.defineData('num1s');

          runs.defineData('endPos');

          runs.defineDone();

     end;

     set have end=eof;

     pos+1;

     by bit notsorted;

     if bit=1 and first.bit then do;/*0->1 switch*/

          runs.add(key: pos, data: pos, data:0, data:0);/*potential run, storing it*/

     end;

  if bit=1 then do;

          rc=riter.first();

          do while(rc=0);/*looping through all potential runs*/

               num1s=num1s+bit;/*maintaining count of 1s*/

               if pos-startPos+1>=10 and num1s/(pos-startPos+1)>=0.8 then do;/*good run*/

                    endPos=pos;/*setting end position*/

               end;

               runs.replace();/*storing*/

               rc=riter.next();

          end;

       numCons0=0;

     end;

     else do;

          numCons0+1;

     end;

     if numCons0=5 or eof then do;/*cut the runs: output good runs, delete everything*/

          rc=riter.first();

          do while(rc=0);/*looping through all potential runs*/

               if endPos>0 then output;

               rc=riter.next();

          end;

          runs.clear();

     end;

run;

/*Post processing*/

proc sort data=periods out=periods_sorted;

     by endPos startPos;

run;

data periodsNoOverlap;

     set periods_sorted;

     by endPos;

     if first.endPos;/*keeping the lowest startPos*/

run;

Sorry, but right now, I'm not sure, maybe not all the overlapping time intervals are removed by this procedure.

There are still overlapping areas after this procedure Do you realy want to remove them? Experiment with this application, maybe you want to refine the definition of "run".

According to the currend definition "maximal" runs are: 35-56, 49-80

Message was edited by: Gergely Bathó

View solution in original post


All Replies
Respected Advisor
Posts: 4,659

Re: How to scan a binary variable for density

Do you have a SAS/ETS licence? If so, have you looked at the EXPAND procedure? I think it would provide a good starting point towards your goals.

PG

PG
Super User
Posts: 10,548

Re: How to scan a binary variable for density

Do you have the date and time information in a SAS datetime variable?

When you say at least 10 minutes (10 sequential records) the "at least" says more are acceptable. How many is the largest number of sequential records are you willing to consider as a single run? Can or should these groups of records have any overlap? For example, suppose records 1 and 2 are both 0 but 3-10 are 1 followed by 11 and 12 as zeroes. There are 3 potential sets: records 1-10, 2-11 and 3 to 12. What is the rule for determining the sequence to evaluate in this case?

Also if you could post some example or dummy data that shows the behavior, so we don't have to try to recreate it sight unseen, to try code against would be helpful.

Occasional Contributor
Posts: 10

Re: How to scan a binary variable for density

I do have the date and time information as a SAS datetime variable, although thus far I have created a variable for minutes 1-1440 for each day and have been using that as my sequence number.

10 minutes is the minimum, more than 10 minutes are acceptable and expected.  There is no set maximum for the length of the string.  However do note that the 0/1 data is a measure of physical activity intensity for a person at that minute, so I don't observe it continuing for many hours in the data.  Most strings are 10-20 minutes long.  A handful approach a couple of hours.

Thanks for the example-I realize I forogt one rule.  The string must start and end with a 1.  However, your example of ambigous strings is still relevant.  This issue is why I said there might not be a 100% accurate/right way to write the program.  Ideally, I would like the program to maximize the number of points flagged in cases of ambiguity.  Otherwise, I think the rule would just have to be an administrative decision to include the first such points.  E.g. 1 0 0 1 1 1 1 1 1 1 0 0 1 1  would mark the last 11 points and leave off the first 3 (because 11>10) but 1 0 0 1 1 1 1 1 1 1 0 0 1 would mark the first 10 observations and leave off the last 3 since the first and last 10 yield the same number of points.

In terms of the data, it is mostly 0s.  There are blips of 1s here and there and then the small sections of more dense 1s that I am trying to pick out.  The dataset is quite large, so identifying them by hand is very cumbersome.  Below is an example string.  The five zeros at the beginning and end set the section off from the rest of the data for the day.  I think the 2 underlined sections maximize the number of points flagged while still being less than or equal to 20% 0's.  The two sections are split because including those middle 2 zeros would bring the string above 20% zero.

0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0

Solution
‎09-17-2014 07:51 PM
SAS Employee
Posts: 340

Re: How to scan a binary variable for density

Reading the file sequentially. If there is a 0->1 switch, it is a potential starting point for a run. So we store it in a hash object.

For each record (0 or 1):

     we update the data for each stored potential run: number of 1s

     if actual record is 1 we need to check conditions of a valid run (length>=10, 80% purity)

     if check is OK, we also store the end position of the (valid) run. If there is already an end position for this valid run, we can safely overwrite it (only longest run needed)

When there are 5 consecutive 0s (or end of file), we can output the good runs (start pos, end pos)

Postprocessing: merging overlapping runs.

Following code is not tested (lot of syntax and logical errors), will test, when I have SAS.

Tested.

data have;

  input bit @@;

datalines;

0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1

0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 1

1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1

0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

0 0 0 0 0

;

run;

data periods(keep=startPos endPos);

     if _n_=1 then do;

          dcl hash runs(ordered:'y');

          dcl hiter riter('runs');

          runs.defineKey('startPos');

          runs.defineData('startPos');

          runs.defineData('num1s');

          runs.defineData('endPos');

          runs.defineDone();

     end;

     set have end=eof;

     pos+1;

     by bit notsorted;

     if bit=1 and first.bit then do;/*0->1 switch*/

          runs.add(key: pos, data: pos, data:0, data:0);/*potential run, storing it*/

     end;

  if bit=1 then do;

          rc=riter.first();

          do while(rc=0);/*looping through all potential runs*/

               num1s=num1s+bit;/*maintaining count of 1s*/

               if pos-startPos+1>=10 and num1s/(pos-startPos+1)>=0.8 then do;/*good run*/

                    endPos=pos;/*setting end position*/

               end;

               runs.replace();/*storing*/

               rc=riter.next();

          end;

       numCons0=0;

     end;

     else do;

          numCons0+1;

     end;

     if numCons0=5 or eof then do;/*cut the runs: output good runs, delete everything*/

          rc=riter.first();

          do while(rc=0);/*looping through all potential runs*/

               if endPos>0 then output;

               rc=riter.next();

          end;

          runs.clear();

     end;

run;

/*Post processing*/

proc sort data=periods out=periods_sorted;

     by endPos startPos;

run;

data periodsNoOverlap;

     set periods_sorted;

     by endPos;

     if first.endPos;/*keeping the lowest startPos*/

run;

Sorry, but right now, I'm not sure, maybe not all the overlapping time intervals are removed by this procedure.

There are still overlapping areas after this procedure Do you realy want to remove them? Experiment with this application, maybe you want to refine the definition of "run".

According to the currend definition "maximal" runs are: 35-56, 49-80

Message was edited by: Gergely Bathó

Occasional Contributor
Posts: 10

Re: How to scan a binary variable for density

Thank you, this looks very promising.  I have passed it on to the programmer I am working with-she hasn't used hash objects in SAS before so hopefully this will be a fresh perspective for her.  For your question, yes I would want to get rid of overlapping bits (edit: not necessarily get rid of, rather only include them in one run.  The run in which the points are not included would need to fit the conditions on its own without those points as well).  I need to isolate independent bouts of activity.  My overall goal is to describe where these bouts of activity are occurring, so if points are double counted it will alter that distribution.

SAS Employee
Posts: 340

Re: How to scan a binary variable for density

I have not checked it manually, but the algorithm did (and I believe it is a good algorithm for this task): 35-56, 49-80 are valid runs, but 35-80 is not! How would you merge and/or split them? The answer in this specific case in your previous post (the underlined runs), but what is the general rule? Split and merge until you find a maximal coverage for the whole day (for every section surrounded by 5 consecutive zeros), with no overlap?? If this is the case, it sounds like something for SAS/OR.

+1: There is one more speed-up possibility in the code, but right now I don't want to confuse your programmer Smiley Happy Tell me if it runs slow.

Occasional Contributor
Posts: 10

Re: How to scan a binary variable for density

I changed my mind.  I thought more about what you said about changing the definition.  I think it does make sense to include any point that is part of a potential run, but just not double counting the overlap points.  So all I need to do is expand the dataset produced by this code to the minute level and then delete the duplicates to yield any point that is part of a run that meets the definition, which I know how to do.  Thanks!

Contributor
Posts: 26

Re: How to scan a binary variable for density

Think this will get you started.

/* create some test data */

DATA test;

     DO t = '0:00'T TO '23:59'T;

          v = (Uniform(0) > .5);

          OUTPUT test;

          t = Round(Intnx('second',t,59),.1);

     END;

     FORMAT t timeampm.;

RUN;

DATA t_all t_80(WHERE=(p>= .8));

/* create two data sets, t_all has all records, t_80 has 80% */

     SET test;

     ARRAY vs(*) vs1-vs10;

     vs1 = v;

     vs2 = Lag1(v);

     vs3 = Lag2(v);

... * repeat pattern;

     vs9 = Lag8(v);

     vs10 = Lag9(v);

     s = Sum(OF vs(*));

     p = s/10;

RUN;

Contributor
Posts: 26

Re: How to scan a binary variable for density

You could supplement the DATA t_all ... by

DATA ... t_f(WHERE=(f));

LENGTH str $ 10;

at bottom:

str = catt(OF vs(*));

f = (Find(str,'00000')>=1);

Super User
Posts: 9,691

Re: How to scan a binary variable for density

OK. But I have to say there would be an override range . Is that what you expected ?

data have;
  input bit @@;
datalines;
0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1
0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 1 1
1 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1
0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0
;
run;
proc transpose data=have out=temp;
run;
data x1(keep=start end want );
 set temp;
 length want $ 3000 ;
 array c{*} col: ;
 do start=1 to dim(c)-9;
   if c{start}=1 then do;
    do end=start+9 to dim(c);
      if c{end}=1 then do;
         five_zero=0; zero=0;
         do i=start to end;
            if c{i}=0 then do; five_zero+1; zero+1;end;
             else five_zero=0;
          end;

          if      five_zero ge 5 then leave;  div=divide(zero,(end-start+1));
          if div le 0.2 and zero ne 0 then do;
                                                                   do j=start to end;
                                                                       want=catx(' ',want,c{j}); 
                                                                      end;
                                                                              output; want=' ';
                                                                  end;
       end;
    end;
   end;
 end;
run;
data x2;
 set x1;
 by start;
 if last.start;
run;
data want;
 set x2;
 by end;
 if first.end;
run;

Xia Keshan

Message was edited by: xia keshan

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 456 views
  • 0 likes
  • 6 in conversation