I have a large data set, and one of the sorted column is like the following:
Sample_ID
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
and so on...
I want to add a new column to identify the samples that out of sequential order (discontinue ID numbers), the new column I want is:
Sample_ID Out_of_order
A-001
A-002
A-003
A-004
A-010 Yes
A-011
A-012
B-001
B-002
B-004 Yes
and so on...
I want to know if there are any way to approach it or similar result, thank you.
data have;
input Sample_ID $;
c=compress(Sample_ID,,'ka');
n=input(compress(Sample_ID,,'kd'),8.);
cards;
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
B-004
;
data want;
set have;
by c;
k=dif(n);
if not first.c and k ne 1 then Out_of_order='YES';
keep sample_id Out_of_order;
run;
in a data step, you can use the lag function to look at previous rows:
options pagesize= linesize= pageno=1 nodate; data one; input x @@; y=lag1(x); z=lag2(x); datalines; 1 2 3 4 5 6 ; proc print data=one; title 'LAG Output'; run;
Output from Generating Two Lagged Values
LAG Output 1 Obs x y z 1 1 . . 2 2 1 . 3 3 2 1 4 4 3 2 5 5 4 3 6 6 5 4
I have a large data set that has more than 1000 obs. I do not think lag function will be efficient.
Why is Out_Of_Order not equal to 'Yes' for B-001?
@y_fu wrote:
I have a large data set that has more than 1000 obs. I do not think lag function will be efficient.
If the values are sorted then you only need to look at the previous record.
data have; input sample_id $; datalines; A-001 A-002 A-003 A-004 A-010 A-011 A-012 B-001 B-002 B-004 ; run; data want; set have; prefix=scan(sample_id,1,'-'); lprefix=scan(lag(sample_id),1,'-'); order = input(scan(sample_id,2,'-'),4.); lorder = input(scan(lag(sample_id),2,'-'),4.); Outoforder = (prefix=lprefix) and (order ne (lorder+1)); drop prefix lprefix order lorder; run;
You do not specify if the A, B etc value has to be checked in any way.
You also do not specify if the first A, B etc value has to be 001 or not.
Sorry, I should be more specific.
The A,B,C etc. check will not be necessary, and all of the group of samples will start with -001
data have;
input Sample_ID $;
c=compress(Sample_ID,,'ka');
n=input(compress(Sample_ID,,'kd'),8.);
cards;
A-001
A-002
A-003
A-004
A-010
A-011
A-012
B-001
B-002
B-004
;
data want;
set have;
by c;
k=dif(n);
if not first.c and k ne 1 then Out_of_order='YES';
keep sample_id Out_of_order;
run;
Thank you for the help! It is the result I want.
It is nice to learn something new about how to use compress function.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.