I have an unusual assignment where I have a list of expected lab tests categorized by "Sample Type", "Sample Name", and "Temperature". I then receive a file from the lab denoting what lab samples were received with its temperature. I am supposed to detect if anything in the lab file is not found in the expected list. That part would be easy to just merge by the 3 variables. The hard part comes in when they want me to detail which variable was the cause of not finding a match.
For example, the SAS code below shows a small sample of expected labs and a small sample of received labs. Because there are some "typos" or a different sample received, I want to retain those mismatches, but also show why they mismatched. So, in the LAB_FILE dataset, rows 3, 6, 7, and 8 should be retained. However, I am not sure how to denote which variable caused it to not be found. Line 3 should return unexpected Type and Name. Line 6 should return unexpected Name. Line 7 should return unexpected Type. Line 8 should return unexpected Temperature.
I have noticed that when I can't figure out how to do something, someone always comes to my rescue with SQL, so maybe I just need to use a different approach. Does anyone have any helpful ideas?
data list;
input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
datalines;
Plasma EDTA K2 BIO PLSM 1 Frozen
Plasma EDTA K2 BIO PLSM 2 Frozen
WB EDTA K2 139 VARIANT Frozen
WB EDTA K2 39 VARIANT Frozen
WB EDTA K2 FLSEQ Frozen
WB EDTA K2 HEMOR Ambient
;
run;
data lab_file;
input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
datalines;
Plasma EDTA K2 BIO PLSM 1 Frozen
Plasma EDTA K2 BIO PLSM 2 Frozen
Plasma EDTA K3 BIO PLSM 3 Frozen
WB EDTA K2 139 VARIANT Frozen
WB EDTA K2 39 VARIANT Frozen
WB EDTA K2 29 VARIANT Frozen
WB EDTA HEMOR Ambient
WB EDTA K2 FLSEQ Ambient
;
run;
proc sort data=list;
by sample_type sample_name temperature;
proc sort data=lab_file;
by sample_type sample_name temperature;
run;
/* I want to keep rows 3, 6, 7, 8 from LAB_FILE but denote which variable causes a mismatch */
data compare;
merge list (in=a) lab_file (in=b);
by sample_type sample_name temperature;
if a & b then delete;
if b;
run;
Hello,
Interesting question. 🤔. Had to think about it. 🤔.
In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).
If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.
When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!
data list; set list; rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
create table work.mismatch as
select t1.rownum as rownum_t1 , t1.sample_type as sample_type_t1
, t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
, t2.rownum as rownum_t2 , t2.sample_type as sample_type_t2
, t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
from list t1
, lab_file t2
where NOT
( t1.sample_type = t2.sample_type
AND t1.sample_name = t2.sample_name
AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
set work.mismatch;
sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */
Cheers,
Koen
Hello,
Interesting question. 🤔. Had to think about it. 🤔.
In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).
If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.
When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!
data list; set list; rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
create table work.mismatch as
select t1.rownum as rownum_t1 , t1.sample_type as sample_type_t1
, t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
, t2.rownum as rownum_t2 , t2.sample_type as sample_type_t2
, t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
from list t1
, lab_file t2
where NOT
( t1.sample_type = t2.sample_type
AND t1.sample_name = t2.sample_name
AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
set work.mismatch;
sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */
Cheers,
Koen
Koen,
Thanks for your help. I never would have thought of this method. It may still take some modifying to get what I fully need, but you gave me a good start. I was informed that the main variable of interest is Sample_Name, so I started merging by just Sample_Name. If it is not found in the lab file, then I know it is unexpected. Then I compare the associated Sample_Type and Temperature. This seems to be working for what I need at the moment, but I'm not sure how reliable it will be. Still, thanks for giving me more to think about if I need further manipulation.
Dallas
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.