I have a large data set, and one of the sorted column is like the following:
Sample_ID
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
and so on...
I want to add a new column to identify the samples that out of sequential order (discontinue ID numbers), the new column I want is:
Sample_ID Out_of_order
A-001
A-002
A-003
A-004
A-010 Yes
A-011
A-012
B-001
B-002
B-004 Yes
and so on...
I want to know if there are any way to approach it or similar result, thank you.
data have;
input Sample_ID $;
c=compress(Sample_ID,,'ka');
n=input(compress(Sample_ID,,'kd'),8.);
cards;
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
B-004
;
data want;
set have;
by c;
k=dif(n);
if not first.c and k ne 1 then Out_of_order='YES';
keep sample_id Out_of_order;
run;
in a data step, you can use the lag function to look at previous rows:
options pagesize= linesize= pageno=1 nodate; data one; input x @@; y=lag1(x); z=lag2(x); datalines; 1 2 3 4 5 6 ; proc print data=one; title 'LAG Output'; run;
Output from Generating Two Lagged Values
LAG Output 1 Obs x y z 1 1 . . 2 2 1 . 3 3 2 1 4 4 3 2 5 5 4 3 6 6 5 4
I have a large data set that has more than 1000 obs. I do not think lag function will be efficient.
Why is Out_Of_Order not equal to 'Yes' for B-001?
@y_fu wrote:
I have a large data set that has more than 1000 obs. I do not think lag function will be efficient.
If the values are sorted then you only need to look at the previous record.
data have; input sample_id $; datalines; A-001 A-002 A-003 A-004 A-010 A-011 A-012 B-001 B-002 B-004 ; run; data want; set have; prefix=scan(sample_id,1,'-'); lprefix=scan(lag(sample_id),1,'-'); order = input(scan(sample_id,2,'-'),4.); lorder = input(scan(lag(sample_id),2,'-'),4.); Outoforder = (prefix=lprefix) and (order ne (lorder+1)); drop prefix lprefix order lorder; run;
You do not specify if the A, B etc value has to be checked in any way.
You also do not specify if the first A, B etc value has to be 001 or not.
Sorry, I should be more specific.
The A,B,C etc. check will not be necessary, and all of the group of samples will start with -001
data have;
input Sample_ID $;
c=compress(Sample_ID,,'ka');
n=input(compress(Sample_ID,,'kd'),8.);
cards;
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
B-004
;
data want;
set have;
by c;
k=dif(n);
if not first.c and k ne 1 then Out_of_order='YES';
keep sample_id Out_of_order;
run;
Thank you for the help! It is the result I want.
It is nice to learn something new about how to use compress function.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.