Solved: How to count sequential values by observation

aeb · Posted 09-25-2019 06:47 PM

Hi all,

I have a dataset of laboratory results. Labs were collected at multiple time points. I am specifically interested in knowing if a patient tested negative at sequential time points. Specifically, if they tested negative twice or more in a row. Here's an example dataset, where neg1-neg5 is a yes/no (1/0) for a negative value. I do not want to reorder the dataset, as it is important to keep the results in the order in which they were received.

This is an example of what I have:

patid	neg1	neg2	neg3	neg4	neg5
1	0	1	0	0	1
2	1	1	1	0	0
3	0	1	1	0	0
4	0	1	1	1	1
5	0	0	0	1	0

And I would like something like this, where #seqneg is the number of sequential negatives by patient.

patid	neg1	neg2	neg3	neg4	neg5	#seqneg
1	0	1	0	0	1	0
2	1	1	1	0	0	3
3	0	1	1	0	0	2
4	0	1	1	1	1	4
5	0	0	0	1	0	0

Thank you!!

mkeintz · Posted 09-25-2019 11:18 PM

You apparently want the length of the longest sequence of '1', as long as that length is at least 2.

You could start by converting your sequence of numbers to a string (e.g. 0 1 1 1 0 becomes '01110'). You could then iteratively search STRNG for substrings '11111', '1111', '111', '11', '1', '' until you have success. Such a program would look like this:

data have;
infile cards truncover;
input patid	neg1	neg2	neg3	neg4	neg5;
cards;
1	0	1	0	0	1
2	1	1	1	0	0
3	0	1	1	0	0
4	0	1	1	1	1
5	0	0	0	1	0
;

data want (drop=strng);
  set have;
  length strng $5;
  strng=cats(of neg:);

  do seqneg=5 to 1 by -1 until(find(strng,repeat('1',seqneg-1)));
  end;
  if  seqneg=1 then seqneg=0;
run;

The "do seqneg=5 to 1 by -1 until (.....) stops when either the UNTIL condition resolves to a non-zero, or when seqneg reaches 1. The until is

until(find(strng,repeat('1',seqneg-1)));

which says search STRNG for the character "1" with SEQNEG-1 repetitions. If the FIND is unsuccessful, the function returns a zero, but otherwise returns the position in which the substring is found.

For a set of 10 NEG variables just change the length of STRNG to $10 and start the do loop at 10 instead of 5.

If this looks a little complicated then you could change the code to:

data want2 (drop=strng substrng found);
  set have;
  length strng $5  substrng $5;
  strng=cats(of neg:);
  do seqneg=5 to 1 by -1 until(found^=0);
    substrng=repeat('1',seqneg-1);
    found=find(strng,trim(substrng));
  end;
  if  seqneg=1 then seqneg=0;
run;

which utilizes more intermediate variables.

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 09-25-2019 07:05 PM

What about this case?

patid neg1 neg2 neg3 neg4 neg5
6 1 1 0 1 1

aeb · Posted 09-25-2019 07:24 PM

Hi Reeza,

You are actually anticipating my next step. At this point I'm only interested in any occurrence of two or more - but I will eventually want to know if they were interrupted by a positive result.

I should also note, I have up to 10 results per patient (most patients do not have 10).

Thank you!

aeb · Posted 09-25-2019 07:40 PM

And to specifically answer the question in your scenario, I would want #seqneg to be 2.

novinosrin · Posted 09-25-2019 08:17 PM

Hi @aeb @Reeza 's intuitive reasoning is brilliant, so the following is solely for your sample. Not asking you to rely on this , but my share of fun


data have;
infile cards truncover;
input patid	neg1	neg2	neg3	neg4	neg5;
cards;
1	0	1	0	0	1
2	1	1	1	0	0
3	0	1	1	0	0
4	0	1	1	1	1
5	0	0	0	1	0
;

data want;
set have;
_k=cats(of neg:);
_f=findc(_k,'1');
_l=findc(_k,'1','b');
want= (countc(_k,'1')=_l-_f+1)*countc(_k,'1')*(_l-_f>0) ;
drop _:;
run;

mkeintz · Posted 09-25-2019 11:18 PM

You apparently want the length of the longest sequence of '1', as long as that length is at least 2.

You could start by converting your sequence of numbers to a string (e.g. 0 1 1 1 0 becomes '01110'). You could then iteratively search STRNG for substrings '11111', '1111', '111', '11', '1', '' until you have success. Such a program would look like this:

data have;
infile cards truncover;
input patid	neg1	neg2	neg3	neg4	neg5;
cards;
1	0	1	0	0	1
2	1	1	1	0	0
3	0	1	1	0	0
4	0	1	1	1	1
5	0	0	0	1	0
;

data want (drop=strng);
  set have;
  length strng $5;
  strng=cats(of neg:);

  do seqneg=5 to 1 by -1 until(find(strng,repeat('1',seqneg-1)));
  end;
  if  seqneg=1 then seqneg=0;
run;

The "do seqneg=5 to 1 by -1 until (.....) stops when either the UNTIL condition resolves to a non-zero, or when seqneg reaches 1. The until is

until(find(strng,repeat('1',seqneg-1)));

which says search STRNG for the character "1" with SEQNEG-1 repetitions. If the FIND is unsuccessful, the function returns a zero, but otherwise returns the position in which the substring is found.

For a set of 10 NEG variables just change the length of STRNG to $10 and start the do loop at 10 instead of 5.

If this looks a little complicated then you could change the code to:

data want2 (drop=strng substrng found);
  set have;
  length strng $5  substrng $5;
  strng=cats(of neg:);
  do seqneg=5 to 1 by -1 until(found^=0);
    substrng=repeat('1',seqneg-1);
    found=find(strng,trim(substrng));
  end;
  if  seqneg=1 then seqneg=0;
run;

which utilizes more intermediate variables.

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

aeb · Posted 09-25-2019 11:57 PM

Mark and novinosrin:

Thank you both, very much.

novinosrin - you were right, code worked in specific scenarios but not all.

Mark - worked and I appreciate the explanation.

Thanks again!

PGStats · Posted 09-26-2019 12:12 AM

If you want to stick with numbers and auto adjust to any number of neg values:

data have;
infile cards truncover;
input patid neg1    neg2    neg3    neg4    neg5;
cards;
1   0   1   0   0   1
2   1   1   1   0   0
3   0   1   1   0   0
4   0   1   1   1   1
5   0   0   0   1   0
;

data want;
set have;
array a dum1 neg: dum2;
seq = 0;
do i = 2 to dim(a);
	if a{i} and not a{i-1} then start = i;
	else if a{i-1} and not a{i} then do;
		if i - 1 > start then
			if i - start > seq then seq = i - start;
		end;
	end;
drop dum1 dum2 i start;
run;

Variables dum1 and dum2 pad the array and allow us to skip testing end conditions. /p>

PG

Haris · Posted 09-26-2019 12:19 PM

An approach with transposition:

data c;
input PID neg1 neg2 neg3 neg4 neg5;
cards;
1 0 1 0 0 1
2 1 1 1 0 0
3 0 1 1 0 0
4 0 1 1 1 1
5 0 0 0 1 0
6 1 1 0 1 1
;
proc transpose data=c out=ct; var Neg: ; by PID;
run;

data cc; set ct; by PID;
	if first.PID=1									then count=0; else;
	if first.PID NE 1 AND Col1=1 AND lag(Col1)=1	then Count+1; else;
	if Col1=0										then Count=0;
run;

proc sql;
	select PID, max(Count) as Count
	from CC
	group by PID
;
quit;

If you skip the last line `IF COL1=0` it will continue the count: e.g., PID=6 will have the count of two rather than one.

RichardDeVen · Posted 09-26-2019 01:51 PM

Thinking outside the box, you can coerce an input statement to read a concatenation of the flag variables transformed to appear like an csv input line. Lots of tricks mashed into this answer:

DATA Step with SET and INFILE
Dummy DATALINES for INPUT to introduce _INFILE_
INFILE CARDS with TRUNCOVER to prevent LOST CARD
Held input with @@ to prevent dataline advancement
INPUT @1 to reposition parse start point
INPUT with (one:) variable list specifier to read into an array
INPUT with (binary32.:) informat list to read 1's as binary data upto next delimiter (,) using : list input modifier
MAX of one: to determine largest run of 1's
LOG2 of 1+MAX to get number of 1's in largest value that must be 2^p - 1

* faux data with 32 neg flag variables;

data have;
  do patid = 1 to 20;
    array neg neg1-neg32;
    do over neg;
      neg = ranuni(123) < 0.35;
    end;
    output;
  end;
run;

data want;
  attrib longest_run length=8;

  set have;
  infile cards dlm=',' truncover;

  if _n_ = 1 then input @1 @@; * activate _infile_;

  _infile_ = translate(cats(of neg:),',','0'); * repurpose _infile_ value;
  * put _infile_;

  array one one1-one32;
  input @1 (one:) (binary32.:) @@;  * read the _infile_ (input buffer);
  _error_ = 0;

  longest_run = log2 (1+max(of one:));

  drop one:;
cards;
.
run;

How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Re: How to count sequential values by observation

Ready to join fellow brilliant minds for the SAS Hackathon?