BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
djbateman
Lapis Lazuli | Level 10

I have an unusual assignment where I have a list of expected lab tests categorized by "Sample Type", "Sample Name", and "Temperature".  I then receive a file from the lab denoting what lab samples were received with its temperature.  I am supposed to detect if anything in the lab file is not found in the expected list.  That part would be easy to just merge by the 3 variables.  The hard part comes in when they want me to detail which variable was the cause of not finding a match.

 

For example, the SAS code below shows a small sample of expected labs and a small sample of received labs.  Because there are some "typos" or a different sample received, I want to retain those mismatches, but also show why they mismatched.  So, in the LAB_FILE dataset, rows 3, 6, 7, and 8 should be retained.  However, I am not sure how to denote which variable caused it to not be found.  Line 3 should return unexpected Type and Name.  Line 6 should return unexpected Name.  Line 7 should return unexpected Type.  Line 8 should return unexpected Temperature.

 

I have noticed that when I can't figure out how to do something, someone always comes to my rescue with SQL, so maybe I just need to use a different approach.  Does anyone have any helpful ideas?

 

data list;
	input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
	datalines;
Plasma EDTA K2	BIO PLSM 1	Frozen
Plasma EDTA K2	BIO PLSM 2	Frozen
WB EDTA K2		139 VARIANT	Frozen
WB EDTA K2		39 VARIANT	Frozen
WB EDTA K2		FLSEQ		Frozen
WB EDTA K2		HEMOR		Ambient
;
run;

data lab_file;
	input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
	datalines;
Plasma EDTA K2	BIO PLSM 1	Frozen
Plasma EDTA K2	BIO PLSM 2	Frozen
Plasma EDTA K3	BIO PLSM 3	Frozen
WB EDTA K2		139 VARIANT	Frozen
WB EDTA K2		39 VARIANT	Frozen
WB EDTA K2		29 VARIANT	Frozen
WB EDTA			HEMOR		Ambient
WB EDTA K2		FLSEQ		Ambient
;
run;

proc sort data=list;
	by sample_type sample_name temperature;
proc sort data=lab_file;
	by sample_type sample_name temperature;
run;

/* I want to keep rows 3, 6, 7, 8 from LAB_FILE but denote which variable causes a mismatch */
data compare;
	merge list (in=a) lab_file (in=b);
	by sample_type sample_name temperature;
	if a & b then delete;
	if b;
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

Hello,

 

Interesting question. 🤔. Had to think about it. 🤔.

In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).

If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.

When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!

data list;     set list;     rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
 create table work.mismatch as
 select   t1.rownum as rownum_t1           , t1.sample_type as sample_type_t1 
        , t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
        , t2.rownum as rownum_t2           , t2.sample_type as sample_type_t2 
        , t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
 from   list     t1
      , lab_file t2
 where NOT 
 (    t1.sample_type = t2.sample_type 
  AND t1.sample_name = t2.sample_name 
  AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
 set work.mismatch;
 sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
 sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
 temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */

Cheers,

Koen

View solution in original post

2 REPLIES 2
sbxkoenk
SAS Super FREQ

Hello,

 

Interesting question. 🤔. Had to think about it. 🤔.

In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).

If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.

When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!

data list;     set list;     rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
 create table work.mismatch as
 select   t1.rownum as rownum_t1           , t1.sample_type as sample_type_t1 
        , t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
        , t2.rownum as rownum_t2           , t2.sample_type as sample_type_t2 
        , t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
 from   list     t1
      , lab_file t2
 where NOT 
 (    t1.sample_type = t2.sample_type 
  AND t1.sample_name = t2.sample_name 
  AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
 set work.mismatch;
 sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
 sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
 temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */

Cheers,

Koen

djbateman
Lapis Lazuli | Level 10

Koen,

 

Thanks for your help.  I never would have thought of this method.  It may still take some modifying to get what I fully need, but you gave me a good start.  I was informed that the main variable of interest is Sample_Name, so I started merging by just Sample_Name. If it is not found in the lab file, then I know it is unexpected.  Then I compare the associated Sample_Type and Temperature.  This seems to be working for what I need at the moment, but I'm not sure how reliable it will be.  Still, thanks for giving me more to think about if I need further manipulation.

 

Dallas

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1531 views
  • 0 likes
  • 2 in conversation