I have a large dataset with many observations (dataset). Each observation has at least 2 values (e.g. value1, value5).
data dataset;
input ID A $ B $ C $ D $ E $ F $ ;
datalines ;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value531 value1 value2309 value12 value7
5 value200 value3 value2141 value919 value981 value234
;
run;
I also have a reference file (reference) which contains 3 columns (NUM COMB1 and COMB2), and I want to use this file to identify all observations in the dataset which contain certain combinations of values (e.g. observation #4 in the dataset contains value2309 and value7 which is a combination found in the reference file), therefore observation #4 should be flagged in some way to indicate that the dataset contains this combination.
data reference;
Input NUM COMB1 $ COMB2 $;
datalines;
1 value7 value1, value999, value2309, value17
2 value200 value3, value24, value883, value223
3 value12 value1, value234, value914, value981
4 value8 value999, value782, value888
5 value116 value26, value1
6 value900 value2, value9
;
run;
I have tried to do this using an array, but for some reason my array was only reading the first value in COMB2 column (e.g. Value7 from COMB1 and Value1 from COMB2, but not Value7 and Value999, Value7 and Value2309, etc. which meant some combinations were not being flagged).
I am unable to share the array I was using, but would be grateful if anyone could offer a solution/method to what I’m trying to achieve?
Good use of temporary arrays here. I changed NUM to character to make it easier to define the array and because I don't know if it's possible to have an array that's both character and numeric.
data reference;
infile cards dlm='|';
Input NUM $ COMB1 $ COMB2 : $200.;
datalines;
1| value7 | value1, value999, value2309, value17
2| value200 | value3, value24, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value999, value782, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;
data dataset;
input ID A $ B $ C $ D $ E $ F $;
datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value531 value1 value2309 value12 value7
5 value200 value3 value2141 value919 value981 value234
;
run;
proc sql noprint;
select count(*) into :num_flags from reference;
quit;
%put &num_flags.;
data flagged;
*create temporary array to hold the reference data set;
array M(&num_flags, 3) $ _temporary_;
if _n_=1 then
do j=1 to &num_flags;
set Reference;
M(j, 1)=NUM;
M(j, 2)=COMB1;
M(j, 3)=COMB2;
end;
set dataset;
*loop over reference dataset rows;
do i=1 to &num_flags;
*determine number of words in COMB2;
nwords=countw(M(i, 3));
*if COMB1 is list of A-F then search for values in COMB2;
if whichc(M(i, 2), of A--F) then
do k=1 to nwords;
*isolate each word;
word=scan(M(i, 3), k, ",");
*if found Comb2 item, then assign flag and output that line;
if whichc(word, of A--F)>0 then
do;
flag=M(i, 1);
output;
end;
end;
end;
drop NUM COMB1 COMB2 i j k nwords word;
run;
Please verify that the reference data is being generated correctly. I think you have too many comma's in the dataset.
Does the order of the combination matter? ie is value1 & Value 7 the same as value 7 & value 1.
A quick method would be to use CALL SORT and use INDEX or FIND function to search for the strings together.
Please also show what your expected output would be from this data to ensure we're understanding your requirements correctly.
I'm confused about your REFERENCE dataset, when I run your code, I get:
NUM COMB1 COMB2 1 value7 value1, 2 value200 value3, 3 value12 value1, 4 value8 value999 5 value116 value26, 6 value900 value2,
Do you mean for Comb2 to be a character variables storing a comma-delimited list of values? That will make this a mess to code.
I suggest you clarify the structure of the REFERENCE dataset, and then show the output you would WANT from your sample data. I also think you should share the array code you have tried, using this sample data.
Finally, it might help if you can provide a logical description of the problem. For example something like "I have a dataset with one row per patient, and the variables are ICD-9 diagnosis codes. I have another dataset that lists the combination of diagnosis codes that define a disease. I want to add a disease flag to each patient, indicating which disease they have." If you're worried about confidentiality, it doesn't have to be the truth. : )
@Reeza and @Quentin I’m sorry, yes the reference file should look like this:
NUM | COMB1 | COMB2 |
1 | value7 | value1, value999, value2309, value17 |
2 | value200 | value3, value24, value883, value223 |
3 | value12 | value1, value234, value914, value981 |
4 | value8 | value999, value782, value888 |
5 | value116 | value26, value1 |
@Reeza , the order does not matter.
@Quentin Yes, Comb2 is a comma-delimited list of values (the file is huge, I import it to SAS as a CSV – obviously generating a dataset like this in SAS using datalines is more involved that I had thought, sorry again).
Logical description:
It’s a medical dataset, and each row (ID) represents a different patient – the description you provide is basically what it is, different surgical interventions which each patient received during their time in hospital, and the reference file contains high risk combinations which I wish to identify in the medical dataset.
So in order to be flagged, a patient would need to have both Comb1 and one value from Comb2, correct? So a patient with VALUE7 and VALUE1 would be flagged per the first row of the reference table, but a patient with VALUE1 and VALUE999 would not be flagged?
Do you need to know the reason(s) a patient was flagged (i.e. which row of the reference table flagged the patient), or just the fact that the patient was flagged?
Yes, it's clear. And tricky. I've thought of some bad approaches, involving multi-dimensional arrays (I always take multi-dimensional arrays as a sign of a bad approach. : ) Look forward to seeing what the community comes up with.
Good use of temporary arrays here. I changed NUM to character to make it easier to define the array and because I don't know if it's possible to have an array that's both character and numeric.
data reference;
infile cards dlm='|';
Input NUM $ COMB1 $ COMB2 : $200.;
datalines;
1| value7 | value1, value999, value2309, value17
2| value200 | value3, value24, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value999, value782, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;
data dataset;
input ID A $ B $ C $ D $ E $ F $;
datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value531 value1 value2309 value12 value7
5 value200 value3 value2141 value919 value981 value234
;
run;
proc sql noprint;
select count(*) into :num_flags from reference;
quit;
%put &num_flags.;
data flagged;
*create temporary array to hold the reference data set;
array M(&num_flags, 3) $ _temporary_;
if _n_=1 then
do j=1 to &num_flags;
set Reference;
M(j, 1)=NUM;
M(j, 2)=COMB1;
M(j, 3)=COMB2;
end;
set dataset;
*loop over reference dataset rows;
do i=1 to &num_flags;
*determine number of words in COMB2;
nwords=countw(M(i, 3));
*if COMB1 is list of A-F then search for values in COMB2;
if whichc(M(i, 2), of A--F) then
do k=1 to nwords;
*isolate each word;
word=scan(M(i, 3), k, ",");
*if found Comb2 item, then assign flag and output that line;
if whichc(word, of A--F)>0 then
do;
flag=M(i, 1);
output;
end;
end;
end;
drop NUM COMB1 COMB2 i j k nwords word;
run;
Thank you @Reeza ! I accepted your code as the solution, but I might have more questions later, I hope that's ok. Thank you
data reference;
infile cards dlm='|';
Input NUM $ COMB1 $ COMB2 : $200.;
datalines;
1| value7 | value999, value10, value2309, value17
2| value200 | value24, value3, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value782, value999, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;
data dataset;
input ID A $ B $ C $ D $ E $ F $;
datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value7 value531 value2309 value112 value1
5 value200 value3 value2141 value919 value981 value234
;
run;
proc sql noprint;
select count(*) into :num_flags from reference;
quit;
%put &num_flags.;
data flagged;
*create temporary array to hold the reference data set;
array M(&num_flags, 3) $ _temporary_;
if _n_=1 then
do j=1 to &num_flags;
set Reference;
M(j, 1)=NUM;
M(j, 2)=COMB1;
M(j, 3)=COMB2;
end;
set dataset;
*loop over reference dataset rows;
do i=1 to &num_flags;
*determine number of words in COMB2;
nwords=countw(M(i, 3));
*if COMB1 is list of A-F then search for values in COMB2;
if whichc(M(i, 2), of A--F) then
do k=1 to nwords;
*isolate each word;
word=scan(M(i, 3), k, ",");
*if found Comb2 item, then assign flag and output that line;
if whichc(word, of A--F)>0 then
do;
flag=M(i, 1);
output;
end;
end;
end;
drop NUM COMB1 COMB2 i j k nwords word;
run;
Unfortunately I’m finding the same issue I was having with my original array, where combinations are only being identified if the COMB2 value occurs first in the list in the reference file.
In the above example, the following combinations (below in bold) are present in the dataset:
data dataset;
input ID A $ B $ C $ D $ E $ F $;
datalines;
1 value27 value31 value101 value999 value1 value7 /* VALUE 7 & 999 (#1 in ref file) */
2 value999 value1 value101 value8 value1 value886 /* VALUE 8 & 999 (#4 in ref file) */
3 value200 value31 value101 value49 value1 value3 /* VALUE 200 & 3 (#2 in ref file) */
4 value78 value7 value531 value2309 value112 value1 /* VALUE 7 & 2309 (#1 in ref file) */
5 value200 value3 value2141 value919 value981 value234 /* VALUE 200 & 3 (#2 in ref file) */
;
run;
However, when I run the code, only ID #1 from the reference file is flagged (as value999 appears as the first value in the COMB2 list. For the other combinations, the COMB2 value appears somewhere in the list after position 1, e.g. the 2nd row in the dataset has combination VALUE 999 & 8, this is ID 4 in the reference file where 999 appears as the 2nd value in COMB2 col).
How can I make the array read through all values in the COMB2 list for a match?
Always check your data is read in correctly.
The default length for a character variable is 8 characters. value2309 is 9 characters for example. The array also defaulted to 8 characters.
Adding a length statement fixed the issue for me.
data reference;
infile cards dlm='|';
Input NUM $ COMB1 $ COMB2 : $200.;
datalines;
1| value7 | value999, value10, value2309, value17
2| value200 | value24, value3, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value782, value999, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;
data dataset;
length a b c d e f $10.;
input ID A $ B $ C $ D $ E $ F $;
datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value7 value531 value2309 value112 value1
5 value200 value3 value2141 value919 value981 value234
;
run;
proc sql noprint;
select count(*) into :num_flags from reference;
quit;
%put &num_flags.;
data flagged;
*create temporary array to hold the reference data set;
array M(&num_flags, 3) $200. _temporary_;
if _n_=1 then
do j=1 to &num_flags;
set Reference;
M(j, 1)=NUM;
M(j, 2)=COMB1;
M(j, 3)=COMB2;
end;
set dataset;
*loop over reference dataset rows;
do i=1 to &num_flags;
*determine number of words in COMB2;
nwords=countw(M(i, 3));
*if COMB1 is list of A-F then search for values in COMB2;
put nwords;
if whichc(M(i, 2), of A--F) then
do k=1 to nwords;
*isolate each word;
word=scan(M(i, 3), k, ",");
*if found Comb2 item, then assign flag and output that line;
if whichc(compress(trim(word)), of A--F)>0 then
do;
flag=M(i, 1);
output;
end;
end;
end;
drop NUM COMB1 COMB2 i j k nwords word;
run;
Thank you @Reeza the length statement fixed it!!😎
It might help to show the code you attempted.
If the approach is that you need to test every observation in Dataset against all of the combinations in Reference you likely will need to have a step that uses SQL to do that match up values before attempting to compare anything. Which means a different step if array is even likely.
BTW I see no way that your Reference data has only 3 variables. I see at least 5 if not 6 variables on the first row of Reference. So perhaps you need to reconsider your description and make sure that your data step actually makes a Reference set that looks like your description. You also need to show an a example of the result.
If your Dataset is comparing just pairs of values from 6 variables then there are 15 combinations to consider ( Comb(6,2) function in SAS). Which means depending on your actual content of Reference that you may have 15 matches of combinations. So it is very important to show what the desired result looks like when one or more observations in Dataset actually matches more than one combination from Reference. "Flag in some way" and "to indicate that the dataset contains this combination" means that could be adding a considerable number of variables. 1 for each possible flag value and something else to hold the matched combination.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.