I have an unusual assignment where I have a list of expected lab tests categorized by "Sample Type", "Sample Name", and "Temperature". I then receive a file from the lab denoting what lab samples were received with its temperature. I am supposed to detect if anything in the lab file is not found in the expected list. That part would be easy to just merge by the 3 variables. The hard part comes in when they want me to detail which variable was the cause of not finding a match.
For example, the SAS code below shows a small sample of expected labs and a small sample of received labs. Because there are some "typos" or a different sample received, I want to retain those mismatches, but also show why they mismatched. So, in the LAB_FILE dataset, rows 3, 6, 7, and 8 should be retained. However, I am not sure how to denote which variable caused it to not be found. Line 3 should return unexpected Type and Name. Line 6 should return unexpected Name. Line 7 should return unexpected Type. Line 8 should return unexpected Temperature.
I have noticed that when I can't figure out how to do something, someone always comes to my rescue with SQL, so maybe I just need to use a different approach. Does anyone have any helpful ideas?
data list;
input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
datalines;
Plasma EDTA K2 BIO PLSM 1 Frozen
Plasma EDTA K2 BIO PLSM 2 Frozen
WB EDTA K2 139 VARIANT Frozen
WB EDTA K2 39 VARIANT Frozen
WB EDTA K2 FLSEQ Frozen
WB EDTA K2 HEMOR Ambient
;
run;
data lab_file;
input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
datalines;
Plasma EDTA K2 BIO PLSM 1 Frozen
Plasma EDTA K2 BIO PLSM 2 Frozen
Plasma EDTA K3 BIO PLSM 3 Frozen
WB EDTA K2 139 VARIANT Frozen
WB EDTA K2 39 VARIANT Frozen
WB EDTA K2 29 VARIANT Frozen
WB EDTA HEMOR Ambient
WB EDTA K2 FLSEQ Ambient
;
run;
proc sort data=list;
by sample_type sample_name temperature;
proc sort data=lab_file;
by sample_type sample_name temperature;
run;
/* I want to keep rows 3, 6, 7, 8 from LAB_FILE but denote which variable causes a mismatch */
data compare;
merge list (in=a) lab_file (in=b);
by sample_type sample_name temperature;
if a & b then delete;
if b;
run;
Hello,
Interesting question. 🤔. Had to think about it. 🤔.
In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).
If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.
When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!
data list; set list; rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
create table work.mismatch as
select t1.rownum as rownum_t1 , t1.sample_type as sample_type_t1
, t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
, t2.rownum as rownum_t2 , t2.sample_type as sample_type_t2
, t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
from list t1
, lab_file t2
where NOT
( t1.sample_type = t2.sample_type
AND t1.sample_name = t2.sample_name
AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
set work.mismatch;
sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */
Cheers,
Koen
Hello,
Interesting question. 🤔. Had to think about it. 🤔.
In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).
If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.
When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!
data list; set list; rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
create table work.mismatch as
select t1.rownum as rownum_t1 , t1.sample_type as sample_type_t1
, t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
, t2.rownum as rownum_t2 , t2.sample_type as sample_type_t2
, t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
from list t1
, lab_file t2
where NOT
( t1.sample_type = t2.sample_type
AND t1.sample_name = t2.sample_name
AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
set work.mismatch;
sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */
Cheers,
Koen
Koen,
Thanks for your help. I never would have thought of this method. It may still take some modifying to get what I fully need, but you gave me a good start. I was informed that the main variable of interest is Sample_Name, so I started merging by just Sample_Name. If it is not found in the lab file, then I know it is unexpected. Then I compare the associated Sample_Type and Temperature. This seems to be working for what I need at the moment, but I'm not sure how reliable it will be. Still, thanks for giving me more to think about if I need further manipulation.
Dallas
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.