Hello,
I have sample data, A, which contains Result
and Result_Name
columns. I have another sample data, B, which contains Result_Name
and Result_Code
. Data B acts as a result coding guide. I want to use Data B as a reference to check whether Data A has any entries that don't match the Data B coding guide. The result I'm looking for is shown as "Want." One particular condition is that if the Result_Name
contains "Non-coded ID format" or "Non-coded text," the match result should be set to 'TBD.' Please let me know how to approach this.
data Sample_A;
infile datalines dsd;
input LabID : $15. Test_Code : $15. Result : $25. Result_Name : $25.;
datalines;
21156, 100157-7_100157-7, BA.2.86, PCR_result,
21156, 100157-7_100157-7, XBB.1.16.3, PCR_result,
21156, 100157-7_100157-7, FL.4, PCR_result,
21156, 100157-7_100157-7, EG.5.1.3, PCR_result,
21156, 100157-6_100157-8, 10828004, Conclusion_result,
21156, 100157-6_100157-8, 260415000, Conclusion_result,
21156, 100157-6_100157-8, 260373001, Conclusion_result,
21156, 100157-6_100157-8, 10000004, Conclusion_result,
21156, 100357-6_100157-8, 5656, Non-coded ID format,
21156, 100357-6_100157-8, 7878, Non-coded ID format,
21156, 104157-6_100157-8, LKJH, Non-coded text,
21156, 104157-6_100157-8, PPGH, Non-coded text,
24597, 100157-7_100157-7, XBB.1.5.70, PCR_result,
24597, 100157-7_100157-7, FL.15, PCR_result,
24597, 100157-7_100157-7, BA.2.86, PCR_result,
24597, 100157-6_100157-8, 260373001, Conclusion_result,
24597, 100157-6_100157-8, 10828005, Conclusion_result,
24597, 100157-6_100157-8, 10987004, Conclusion_result,
24597, 100357-6_100157-8, 3245, Non-coded ID format,
24597, 104157-6_100157-8, OPUY, Non-coded text,
;
data Sample_B;
infile datalines dsd;
input Result_Name : $25. Result_Code : $25. ;
datalines;
PCR_result, BA.2.86,
PCR_result, BA.5.86,
PCR_result, FL.3,
PCR_result, FL.8,
PCR_result, FL.10,
PCR_result, XBB.1.16.3,
PCR_result, XBB.1.5.70,
PCR_result, XBB.3.66.3,
PCR_result, XBB.8.15.88,
PCR_result, EG.5.1.3,
PCR_result, JK.0.14.39,
Conclusion_result, 10828003,
Conclusion_result, 10828004,
Conclusion_result, 10828005,
Conclusion_result, 10828009,
Conclusion_result, 10987004,
Conclusion_result, 260373001,
Conclusion_result, 260383001,
Conclusion_result, 260215000,
Conclusion_result, 260415000,
Non-coded ID format,Identifier,
Non-coded text, Text,
;
data Want;
infile datalines dsd;
input LabID : $15. Test_Code : $15. Result : $25. Result_Name : $25. Match : $5.;
datalines;
21156, 100157-7_100157-7, BA.2.86, PCR_result, Y,
21156, 100157-7_100157-7, XBB.1.16.3, PCR_result, Y,
21156, 100157-7_100157-7, FL.4, PCR_result, N,
21156, 100157-7_100157-7, EG.5.1.3, PCR_result, Y,
21156, 100157-6_100157-8, 10828001, Conclusion_result, N,
21156, 100157-6_100157-8, 260415000, Conclusion_result, Y,
21156, 100157-6_100157-8, 260373001, Conclusion_result, Y,
21156, 100157-6_100157-8, 10000004, Conclusion_result, N,
21156, 100357-6_100157-8, 5656, Non-coded ID format, TBD,
21156, 100357-6_100157-8, 7878, Non-coded ID format, TBD,
21156, 104157-6_100157-8, LKJH, Non-coded text, TBD,
21156, 104157-6_100157-8, PPGH, Non-coded text, TBD,
24597, 100157-7_100157-7, XBB.1.5.70, PCR_result, Y,
24597, 100157-7_100157-7, FL.15, PCR_result, N,
24597, 100157-7_100157-7, BA.2.86, PCR_result, Y,
24597, 100157-6_100157-8, 260373001, Conclusion_result, Y,
24597, 100157-6_100157-8, 11828005, Conclusion_result, N,
24597, 100157-6_100157-8, 10987064, Conclusion_result, N,
24597, 100357-6_100157-8, 3245, Non-coded ID format, TBD,
24597, 104157-6_100157-8, OPUY, Non-coded text, TBD,
;
Hello @ybz12003,
I would probably read the reference dataset B into a hash object, which can then be queried after checking for "Non-coded ..." cases:
data Want;
set Sample_A;
if _n_=1 then do;
dcl hash h(dataset:'Sample_B(rename=(Result_Code=Result))');
h.definekey('Result_Name', 'Result');
h.definedone();
if 0 then set Sample_B(rename=(Result_Code=Result));
end;
length Match $5;
if Result_Name in ("Non-coded ID format", "Non-coded text") then Match='TBD';
else Match=ifc(h.check()=0,'Y','N');
run;
You may want to modify the IF condition for the "Non-coded ..." cases if the criterion is not an exact match with those strings.
Note that length 15 is not sufficient to accommodate the values listed for variable Test_Code. More importantly, three values of variable Result differ between dataset Sample_A and your sample dataset Want. This results in three observations where my suggested code finds a match, whereas your Want dataset has Match='N', based on the different values (see observations 5, 17 and 18).
Hello @ybz12003,
I would probably read the reference dataset B into a hash object, which can then be queried after checking for "Non-coded ..." cases:
data Want;
set Sample_A;
if _n_=1 then do;
dcl hash h(dataset:'Sample_B(rename=(Result_Code=Result))');
h.definekey('Result_Name', 'Result');
h.definedone();
if 0 then set Sample_B(rename=(Result_Code=Result));
end;
length Match $5;
if Result_Name in ("Non-coded ID format", "Non-coded text") then Match='TBD';
else Match=ifc(h.check()=0,'Y','N');
run;
You may want to modify the IF condition for the "Non-coded ..." cases if the criterion is not an exact match with those strings.
Note that length 15 is not sufficient to accommodate the values listed for variable Test_Code. More importantly, three values of variable Result differ between dataset Sample_A and your sample dataset Want. This results in three observations where my suggested code finds a match, whereas your Want dataset has Match='N', based on the different values (see observations 5, 17 and 18).
Instead of character "Y" and "N", use numeric 1 and 0. These are boolean values which can be used directly in conditions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.