Solved: Assign a "Y/N" variable in dataset A by using a reference dataset B?

ybz12003 · Posted 07-15-2024 09:58 AM

Hello,

I have sample data, A, which contains Result and Result_Name columns. I have another sample data, B, which contains Result_Name and Result_Code. Data B acts as a result coding guide. I want to use Data B as a reference to check whether Data A has any entries that don't match the Data B coding guide. The result I'm looking for is shown as "Want." One particular condition is that if the Result_Name contains "Non-coded ID format" or "Non-coded text," the match result should be set to 'TBD.' Please let me know how to approach this.

data Sample_A;
      infile datalines dsd;
  input LabID : $15. Test_Code : $15.  Result : $25. Result_Name : $25.;
datalines;
21156, 100157-7_100157-7, BA.2.86, PCR_result, 
21156, 100157-7_100157-7, XBB.1.16.3, PCR_result, 
21156, 100157-7_100157-7, FL.4, PCR_result, 
21156, 100157-7_100157-7, EG.5.1.3, PCR_result,
21156, 100157-6_100157-8, 10828004, Conclusion_result,
21156, 100157-6_100157-8, 260415000, Conclusion_result,
21156, 100157-6_100157-8, 260373001, Conclusion_result,
21156, 100157-6_100157-8, 10000004, Conclusion_result, 
21156, 100357-6_100157-8, 5656, Non-coded ID format, 
21156, 100357-6_100157-8, 7878, Non-coded ID format, 
21156, 104157-6_100157-8, LKJH, Non-coded text, 
21156, 104157-6_100157-8, PPGH, Non-coded text, 
24597, 100157-7_100157-7, XBB.1.5.70, PCR_result, 
24597, 100157-7_100157-7, FL.15, PCR_result, 
24597, 100157-7_100157-7, BA.2.86, PCR_result, 
24597, 100157-6_100157-8, 260373001, Conclusion_result, 
24597, 100157-6_100157-8, 10828005, Conclusion_result, 
24597, 100157-6_100157-8, 10987004, Conclusion_result, 
24597, 100357-6_100157-8, 3245, Non-coded ID format, 
24597, 104157-6_100157-8, OPUY, Non-coded text,  
;

data Sample_B;
      infile datalines dsd;
  input Result_Name : $25. Result_Code : $25. ;
datalines;
PCR_result, BA.2.86, 
PCR_result, BA.5.86,
PCR_result, FL.3, 
PCR_result, FL.8,
PCR_result, FL.10, 
PCR_result, XBB.1.16.3,
PCR_result, XBB.1.5.70,
PCR_result, XBB.3.66.3, 
PCR_result, XBB.8.15.88,
PCR_result, EG.5.1.3, 
PCR_result, JK.0.14.39, 
Conclusion_result, 10828003, 
Conclusion_result, 10828004,
Conclusion_result, 10828005,
Conclusion_result, 10828009,
Conclusion_result, 10987004, 
Conclusion_result, 260373001,
Conclusion_result, 260383001, 
Conclusion_result, 260215000,
Conclusion_result, 260415000, 
Non-coded ID format,Identifier,
Non-coded text, Text,
;

data Want;
      infile datalines dsd;
  input LabID : $15. Test_Code : $15.  Result : $25. Result_Name : $25. Match : $5.;
datalines;
21156, 100157-7_100157-7, BA.2.86, PCR_result, Y,
21156, 100157-7_100157-7, XBB.1.16.3, PCR_result, Y, 
21156, 100157-7_100157-7, FL.4, PCR_result, N, 
21156, 100157-7_100157-7, EG.5.1.3, PCR_result, Y,
21156, 100157-6_100157-8, 10828001, Conclusion_result, N,
21156, 100157-6_100157-8, 260415000, Conclusion_result, Y,
21156, 100157-6_100157-8, 260373001, Conclusion_result, Y,
21156, 100157-6_100157-8, 10000004, Conclusion_result, N,
21156, 100357-6_100157-8, 5656, Non-coded ID format, TBD,
21156, 100357-6_100157-8, 7878, Non-coded ID format, TBD,
21156, 104157-6_100157-8, LKJH, Non-coded text, TBD, 
21156, 104157-6_100157-8, PPGH, Non-coded text, TBD,
24597, 100157-7_100157-7, XBB.1.5.70, PCR_result, Y,
24597, 100157-7_100157-7, FL.15, PCR_result, N,
24597, 100157-7_100157-7, BA.2.86, PCR_result, Y,
24597, 100157-6_100157-8, 260373001, Conclusion_result, Y,
24597, 100157-6_100157-8, 11828005, Conclusion_result, N,
24597, 100157-6_100157-8, 10987064, Conclusion_result,  N,
24597, 100357-6_100157-8, 3245, Non-coded ID format, TBD,
24597, 104157-6_100157-8, OPUY, Non-coded text, TBD,
;

FreelanceReinh · Posted 07-15-2024 12:18 PM

Hello @ybz12003,

I would probably read the reference dataset B into a hash object, which can then be queried after checking for "Non-coded ..." cases:

data Want;
set Sample_A;
if _n_=1 then do;
  dcl hash h(dataset:'Sample_B(rename=(Result_Code=Result))');
  h.definekey('Result_Name', 'Result');
  h.definedone();
  if 0 then set Sample_B(rename=(Result_Code=Result));
end;
length Match $5;
if Result_Name in ("Non-coded ID format", "Non-coded text") then Match='TBD';
else Match=ifc(h.check()=0,'Y','N');
run;

You may want to modify the IF condition for the "Non-coded ..." cases if the criterion is not an exact match with those strings.

Note that length 15 is not sufficient to accommodate the values listed for variable Test_Code. More importantly, three values of variable Result differ between dataset Sample_A and your sample dataset Want. This results in three observations where my suggested code finds a match, whereas your Want dataset has Match='N', based on the different values (see observations 5, 17 and 18).

View solution in original post

FreelanceReinh · Posted 07-15-2024 12:18 PM

Hello @ybz12003,

I would probably read the reference dataset B into a hash object, which can then be queried after checking for "Non-coded ..." cases:

data Want;
set Sample_A;
if _n_=1 then do;
  dcl hash h(dataset:'Sample_B(rename=(Result_Code=Result))');
  h.definekey('Result_Name', 'Result');
  h.definedone();
  if 0 then set Sample_B(rename=(Result_Code=Result));
end;
length Match $5;
if Result_Name in ("Non-coded ID format", "Non-coded text") then Match='TBD';
else Match=ifc(h.check()=0,'Y','N');
run;

You may want to modify the IF condition for the "Non-coded ..." cases if the criterion is not an exact match with those strings.

Note that length 15 is not sufficient to accommodate the values listed for variable Test_Code. More importantly, three values of variable Result differ between dataset Sample_A and your sample dataset Want. This results in three observations where my suggested code finds a match, whereas your Want dataset has Match='N', based on the different values (see observations 5, 17 and 18).

ybz12003

Thank you for the suggestion; that is an awesome program.

Kurt_Bremser · Posted 07-15-2024 03:07 PM

Instead of character "Y" and "N", use numeric 1 and 0. These are boolean values which can be used directly in conditions.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Re: Assign a "Y/N" variable in dataset A by using a reference dataset B?

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!