Solved: Identify value combinations in dataset based on reference file

Epi_Stats · Posted 05-08-2023 10:51 AM

I have a large dataset with many observations (dataset). Each observation has at least 2 values (e.g. value1, value5).

data dataset;
	input ID A $ B $ C $ D $ E $ F $ ;
datalines ;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value531 value1 value2309 value12 value7
5 value200 value3 value2141 value919 value981 value234
;
run;

I also have a reference file (reference) which contains 3 columns (NUM COMB1 and COMB2), and I want to use this file to identify all observations in the dataset which contain certain combinations of values (e.g. observation #4 in the dataset contains value2309 and value7 which is a combination found in the reference file), therefore observation #4 should be flagged in some way to indicate that the dataset contains this combination.

data reference;
Input NUM COMB1 $ COMB2 $; 
datalines; 
1 value7 value1, value999, value2309, value17
2 value200 value3, value24, value883, value223
3 value12 value1, value234, value914, value981
4 value8 value999, value782, value888
5 value116 value26, value1
6 value900 value2, value9
;
run;

I have tried to do this using an array, but for some reason my array was only reading the first value in COMB2 column (e.g. Value7 from COMB1 and Value1 from COMB2, but not Value7 and Value999, Value7 and Value2309, etc. which meant some combinations were not being flagged).

I am unable to share the array I was using, but would be grateful if anyone could offer a solution/method to what I’m trying to achieve?

Reeza · Posted 05-08-2023 12:42 PM

Good use of temporary arrays here. I changed NUM to character to make it easier to define the array and because I don't know if it's possible to have an array that's both character and numeric.

data reference;
    infile cards dlm='|';
    Input NUM $ COMB1 $ COMB2 : $200.;
    datalines;
1| value7 | value1, value999, value2309, value17
2| value200 | value3, value24, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value999, value782, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;

data dataset;
    input ID A $ B $ C $ D $ E $ F $;
    datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value531 value1 value2309 value12 value7
5 value200 value3 value2141 value919 value981 value234
;
run;

proc sql noprint;
select count(*) into :num_flags from reference;
quit;

%put &num_flags.;

data flagged;


    *create temporary array to hold the reference data set;
    array M(&num_flags, 3) $ _temporary_;

    
    if _n_=1 then
        do j=1 to &num_flags;
            set Reference;
            M(j, 1)=NUM;
            M(j, 2)=COMB1;
            M(j, 3)=COMB2;
        end;
        
        
    set dataset;
    *loop over reference dataset rows;

    do i=1 to &num_flags;
        *determine number of words in COMB2;
        nwords=countw(M(i, 3));
        *if COMB1 is list of A-F then search for values in COMB2;

        if whichc(M(i, 2), of A--F) then
            do k=1 to nwords;
                *isolate each word;
                word=scan(M(i, 3), k, ",");
                *if found Comb2 item, then assign flag and output that line;

                if whichc(word, of A--F)>0 then
                    do;
                        flag=M(i, 1);
                        output;
                    end;
            end;
    end;
    drop NUM COMB1 COMB2 i j k nwords word;
run;

View solution in original post

Reeza · Posted 05-08-2023 11:22 AM

Please verify that the reference data is being generated correctly. I think you have too many comma's in the dataset.

Does the order of the combination matter? ie is value1 & Value 7 the same as value 7 & value 1.

A quick method would be to use CALL SORT and use INDEX or FIND function to search for the strings together.

Please also show what your expected output would be from this data to ensure we're understanding your requirements correctly.

Quentin · Posted 05-08-2023 11:24 AM

I'm confused about your REFERENCE dataset, when I run your code, I get:

NUM     COMB1       COMB2

 1     value7      value1,
 2     value200    value3,
 3     value12     value1,
 4     value8      value999
 5     value116    value26,
 6     value900    value2,

Do you mean for Comb2 to be a character variables storing a comma-delimited list of values? That will make this a mess to code.

I suggest you clarify the structure of the REFERENCE dataset, and then show the output you would WANT from your sample data. I also think you should share the array code you have tried, using this sample data.

Finally, it might help if you can provide a logical description of the problem. For example something like "I have a dataset with one row per patient, and the variables are ICD-9 diagnosis codes. I have another dataset that lists the combination of diagnosis codes that define a disease. I want to add a disease flag to each patient, indicating which disease they have." If you're worried about confidentiality, it doesn't have to be the truth. : )

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Epi_Stats · Posted 05-08-2023 11:45 AM

@Reeza and @Quentin I’m sorry, yes the reference file should look like this:

NUM	COMB1	COMB2
1	value7	value1, value999, value2309, value17
2	value200	value3, value24, value883, value223
3	value12	value1, value234, value914, value981
4	value8	value999, value782, value888
5	value116	value26, value1

@Reeza , the order does not matter.

@Quentin Yes, Comb2 is a comma-delimited list of values (the file is huge, I import it to SAS as a CSV – obviously generating a dataset like this in SAS using datalines is more involved that I had thought, sorry again).

Logical description:

It’s a medical dataset, and each row (ID) represents a different patient – the description you provide is basically what it is, different surgical interventions which each patient received during their time in hospital, and the reference file contains high risk combinations which I wish to identify in the medical dataset.

Quentin · Posted 05-08-2023 11:55 AM

So in order to be flagged, a patient would need to have both Comb1 and one value from Comb2, correct? So a patient with VALUE7 and VALUE1 would be flagged per the first row of the reference table, but a patient with VALUE1 and VALUE999 would not be flagged?

Do you need to know the reason(s) a patient was flagged (i.e. which row of the reference table flagged the patient), or just the fact that the patient was flagged?

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Epi_Stats · Posted 05-08-2023 12:05 PM

Yes correct, and yes need to know the reason why the patient was flagged, but this can be derived from the row NUM in the reference file (e.g. in the reference file, let's suppose row #1 means bowel resection (coded as VALUE7) and cyst removal surgical (coded as value1, value999, value2309, or value17) procedures).

So if VALUE7 and one of the values from COMB2 column in the reference file are found in the dataset, then NUM #1 might be returned, which can then be used to explain the "high risk procedures" - if that makes sense?

Quentin · Posted 05-08-2023 12:21 PM

Yes, it's clear. And tricky. I've thought of some bad approaches, involving multi-dimensional arrays (I always take multi-dimensional arrays as a sign of a bad approach. : ) Look forward to seeing what the community comes up with.

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Reeza · Posted 05-08-2023 12:22 PM

What happens when you meet you multiple combinations?

Reeza · Posted 05-08-2023 12:42 PM

Good use of temporary arrays here. I changed NUM to character to make it easier to define the array and because I don't know if it's possible to have an array that's both character and numeric.

data reference;
    infile cards dlm='|';
    Input NUM $ COMB1 $ COMB2 : $200.;
    datalines;
1| value7 | value1, value999, value2309, value17
2| value200 | value3, value24, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value999, value782, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;

data dataset;
    input ID A $ B $ C $ D $ E $ F $;
    datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value531 value1 value2309 value12 value7
5 value200 value3 value2141 value919 value981 value234
;
run;

proc sql noprint;
select count(*) into :num_flags from reference;
quit;

%put &num_flags.;

data flagged;


    *create temporary array to hold the reference data set;
    array M(&num_flags, 3) $ _temporary_;

    
    if _n_=1 then
        do j=1 to &num_flags;
            set Reference;
            M(j, 1)=NUM;
            M(j, 2)=COMB1;
            M(j, 3)=COMB2;
        end;
        
        
    set dataset;
    *loop over reference dataset rows;

    do i=1 to &num_flags;
        *determine number of words in COMB2;
        nwords=countw(M(i, 3));
        *if COMB1 is list of A-F then search for values in COMB2;

        if whichc(M(i, 2), of A--F) then
            do k=1 to nwords;
                *isolate each word;
                word=scan(M(i, 3), k, ",");
                *if found Comb2 item, then assign flag and output that line;

                if whichc(word, of A--F)>0 then
                    do;
                        flag=M(i, 1);
                        output;
                    end;
            end;
    end;
    drop NUM COMB1 COMB2 i j k nwords word;
run;

Epi_Stats · Posted 05-08-2023 01:49 PM

Thank you @Reeza ! I accepted your code as the solution, but I might have more questions later, I hope that's ok. Thank you

Epi_Stats · Posted 05-09-2023 04:33 AM

data reference;
    infile cards dlm='|';
    Input NUM $ COMB1 $ COMB2 : $200.;
    datalines;
1| value7 | value999, value10, value2309, value17
2| value200 | value24, value3, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value782, value999, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;

data dataset;
    input ID A $ B $ C $ D $ E $ F $;
    datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value7 value531 value2309 value112 value1
5 value200 value3 value2141 value919 value981 value234
;
run;

proc sql noprint;
select count(*) into :num_flags from reference;
quit;

%put &num_flags.;

data flagged;


    *create temporary array to hold the reference data set;
    array M(&num_flags, 3) $ _temporary_;

    
    if _n_=1 then
        do j=1 to &num_flags;
            set Reference;
            M(j, 1)=NUM;
            M(j, 2)=COMB1;
            M(j, 3)=COMB2;
        end;
        
        
    set dataset;
    *loop over reference dataset rows;

    do i=1 to &num_flags;
        *determine number of words in COMB2;
        nwords=countw(M(i, 3));
        *if COMB1 is list of A-F then search for values in COMB2;

        if whichc(M(i, 2), of A--F) then
            do k=1 to nwords;
                *isolate each word;
                word=scan(M(i, 3), k, ",");
                *if found Comb2 item, then assign flag and output that line;

                if whichc(word, of A--F)>0 then
                    do;
                        flag=M(i, 1);
                        output;
                    end;
            end;
    end;
    drop NUM COMB1 COMB2 i j k nwords word;
run;

Unfortunately I’m finding the same issue I was having with my original array, where combinations are only being identified if the COMB2 value occurs first in the list in the reference file.

In the above example, the following combinations (below in bold) are present in the dataset:

data dataset;

input ID A $ B $ C $ D $ E $ F $;

datalines;

1 value27 value31 value101 value999 value1 value7 /* VALUE 7 & 999 (#1 in ref file) */

2 value999 value1 value101 value8 value1 value886 /* VALUE 8 & 999 (#4 in ref file) */

3 value200 value31 value101 value49 value1 value3 /* VALUE 200 & 3 (#2 in ref file) */

4 value78 value7 value531 value2309 value112 value1 /* VALUE 7 & 2309 (#1 in ref file) */

5 value200 value3 value2141 value919 value981 value234 /* VALUE 200 & 3 (#2 in ref file) */

;

run;

However, when I run the code, only ID #1 from the reference file is flagged (as value999 appears as the first value in the COMB2 list. For the other combinations, the COMB2 value appears somewhere in the list after position 1, e.g. the 2nd row in the dataset has combination VALUE 999 & 8, this is ID 4 in the reference file where 999 appears as the 2nd value in COMB2 col).

How can I make the array read through all values in the COMB2 list for a match?

Reeza · Posted 05-09-2023 11:50 AM

Always check your data is read in correctly.

The default length for a character variable is 8 characters. value2309 is 9 characters for example. The array also defaulted to 8 characters.

Adding a length statement fixed the issue for me.

data reference;
    infile cards dlm='|';
    Input NUM $ COMB1 $ COMB2 : $200.;
    datalines;
1| value7 | value999, value10, value2309, value17
2| value200 | value24, value3, value883, value223
3| value12 | value1, value234, value914, value981
4| value8 | value782, value999, value888
5| value116 |value26, value1
6| value900 | value2, value9
;
run;

data dataset;
    length a b c d e f $10.;
    input ID A $ B $ C $ D $ E $ F $;
    datalines;
1 value27 value31 value101 value999 value1 value7
2 value999 value1 value101 value8 value1 value886
3 value200 value31 value101 value49 value1 value3
4 value78 value7 value531 value2309 value112 value1
5 value200 value3 value2141 value919 value981 value234
;
run;

proc sql noprint;
select count(*) into :num_flags from reference;
quit;

%put &num_flags.;

data flagged;


    *create temporary array to hold the reference data set;
    array M(&num_flags, 3) $200. _temporary_;

    
    if _n_=1 then
        do j=1 to &num_flags;
            set Reference;
            M(j, 1)=NUM;
            M(j, 2)=COMB1;
            M(j, 3)=COMB2;
        end;
        
        
    set dataset;
    *loop over reference dataset rows;

    do i=1 to &num_flags;
        *determine number of words in COMB2;
        nwords=countw(M(i, 3));
        *if COMB1 is list of A-F then search for values in COMB2;
        
        put nwords;

        if whichc(M(i, 2), of A--F) then
            do k=1 to nwords;
                *isolate each word;
                word=scan(M(i, 3), k, ",");
            
                *if found Comb2 item, then assign flag and output that line;

                if whichc(compress(trim(word)), of A--F)>0 then
                    do;
                        flag=M(i, 1);
                        output;
                    end;
            end;
    end;
    drop NUM COMB1 COMB2 i j k nwords word;
run;

Epi_Stats · Posted 05-09-2023 12:06 PM

Thank you @Reeza the length statement fixed it!!😎

ballardw · Posted 05-08-2023 11:34 AM

It might help to show the code you attempted.

If the approach is that you need to test every observation in Dataset against all of the combinations in Reference you likely will need to have a step that uses SQL to do that match up values before attempting to compare anything. Which means a different step if array is even likely.

BTW I see no way that your Reference data has only 3 variables. I see at least 5 if not 6 variables on the first row of Reference. So perhaps you need to reconsider your description and make sure that your data step actually makes a Reference set that looks like your description. You also need to show an a example of the result.

If your Dataset is comparing just pairs of values from 6 variables then there are 15 combinations to consider ( Comb(6,2) function in SAS). Which means depending on your actual content of Reference that you may have 15 matches of combinations. So it is very important to show what the desired result looks like when one or more observations in Dataset actually matches more than one combination from Reference. "Flag in some way" and "to indicate that the dataset contains this combination" means that could be adding a considerable number of variables. 1 for each possible flag value and something else to hold the matched combination.

Epi_Stats · Posted 05-09-2023 04:35 AM

Hi ballardw, thank you for your help. I have now included desired results above

Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Re: Identify value combinations in dataset based on reference file

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!