Solved: Re: checking for sequential order

y_fu · Posted 08-29-2018 12:50 PM

I have a large data set, and one of the sorted column is like the following:

Sample_ID

A-001

A-002

A-003

A-004

A-010

A-011

A-012

B-001

B-002

and so on...

I want to add a new column to identify the samples that out of sequential order (discontinue ID numbers), the new column I want is:

Sample_ID Out_of_order

A-001

A-002

A-003

A-004

A-010 Yes

A-011

A-012

B-001

B-002

B-004 Yes

and so on...

I want to know if there are any way to approach it or similar result, thank you.

novinosrin · Posted 08-29-2018 01:27 PM

data have;
input Sample_ID $;
c=compress(Sample_ID,,'ka');
n=input(compress(Sample_ID,,'kd'),8.);
cards;
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
B-004
;
data want;
set have;
by c;
k=dif(n);
if not first.c and k ne 1 then Out_of_order='YES';
keep sample_id Out_of_order;
run;

View solution in original post

DBailey · Posted 08-29-2018 01:05 PM

in a data step, you can use the lag function to look at previous rows:

options pagesize= linesize= pageno=1 nodate;

data one;
   input x @@;
   y=lag1(x);
   z=lag2(x);
   datalines;
1 2 3 4 5 6
;

proc print data=one;
   title 'LAG Output';
run;

Output from Generating Two Lagged Values

                                  LAG Output                                 1

                              Obs    x    y    z

                               1     1    .    .
                               2     2    1    .
                               3     3    2    1
                               4     4    3    2
                               5     5    4    3
                               6     6    5    4

y_fu · Posted 08-29-2018 01:10 PM

I have a large data set that has more than 1000 obs. I do not think lag function will be efficient.

PeterClemmensen · Posted 08-29-2018 01:22 PM

Why is Out_Of_Order not equal to 'Yes' for B-001?

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 08-29-2018 01:22 PM

@y_fu wrote:

I have a large data set that has more than 1000 obs. I do not think lag function will be efficient.

If the values are sorted then you only need to look at the previous record.

data have;
   input sample_id $;
datalines;
A-001
A-002
A-003
A-004
A-010 
A-011
A-012
B-001
B-002
B-004
;
run;

data want;
   set have;
   prefix=scan(sample_id,1,'-');
   lprefix=scan(lag(sample_id),1,'-');
   order = input(scan(sample_id,2,'-'),4.);
   lorder = input(scan(lag(sample_id),2,'-'),4.);
   Outoforder = (prefix=lprefix) and (order ne (lorder+1));
   drop prefix lprefix order lorder;
run;

You do not specify if the A, B etc value has to be checked in any way.

You also do not specify if the first A, B etc value has to be 001 or not.

y_fu · Posted 08-29-2018 01:27 PM

Sorry, I should be more specific.

The A,B,C etc. check will not be necessary, and all of the group of samples will start with -001

novinosrin · Posted 08-29-2018 01:27 PM

data have;
input Sample_ID $;
c=compress(Sample_ID,,'ka');
n=input(compress(Sample_ID,,'kd'),8.);
cards;
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
B-004
;
data want;
set have;
by c;
k=dif(n);
if not first.c and k ne 1 then Out_of_order='YES';
keep sample_id Out_of_order;
run;

y_fu · Posted 08-29-2018 01:33 PM

Thank you for the help! It is the result I want.

It is nice to learn something new about how to use compress function.

SAS Training: Just a Click Away