BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
djbateman
Lapis Lazuli | Level 10

I have an unusual assignment where I have a list of expected lab tests categorized by "Sample Type", "Sample Name", and "Temperature".  I then receive a file from the lab denoting what lab samples were received with its temperature.  I am supposed to detect if anything in the lab file is not found in the expected list.  That part would be easy to just merge by the 3 variables.  The hard part comes in when they want me to detail which variable was the cause of not finding a match.

 

For example, the SAS code below shows a small sample of expected labs and a small sample of received labs.  Because there are some "typos" or a different sample received, I want to retain those mismatches, but also show why they mismatched.  So, in the LAB_FILE dataset, rows 3, 6, 7, and 8 should be retained.  However, I am not sure how to denote which variable caused it to not be found.  Line 3 should return unexpected Type and Name.  Line 6 should return unexpected Name.  Line 7 should return unexpected Type.  Line 8 should return unexpected Temperature.

 

I have noticed that when I can't figure out how to do something, someone always comes to my rescue with SQL, so maybe I just need to use a different approach.  Does anyone have any helpful ideas?

 

data list;
	input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
	datalines;
Plasma EDTA K2	BIO PLSM 1	Frozen
Plasma EDTA K2	BIO PLSM 2	Frozen
WB EDTA K2		139 VARIANT	Frozen
WB EDTA K2		39 VARIANT	Frozen
WB EDTA K2		FLSEQ		Frozen
WB EDTA K2		HEMOR		Ambient
;
run;

data lab_file;
	input Sample_Type $ 1-16 Sample_Name $ 17-28 Temperature $ 29-36;
	datalines;
Plasma EDTA K2	BIO PLSM 1	Frozen
Plasma EDTA K2	BIO PLSM 2	Frozen
Plasma EDTA K3	BIO PLSM 3	Frozen
WB EDTA K2		139 VARIANT	Frozen
WB EDTA K2		39 VARIANT	Frozen
WB EDTA K2		29 VARIANT	Frozen
WB EDTA			HEMOR		Ambient
WB EDTA K2		FLSEQ		Ambient
;
run;

proc sort data=list;
	by sample_type sample_name temperature;
proc sort data=lab_file;
	by sample_type sample_name temperature;
run;

/* I want to keep rows 3, 6, 7, 8 from LAB_FILE but denote which variable causes a mismatch */
data compare;
	merge list (in=a) lab_file (in=b);
	by sample_type sample_name temperature;
	if a & b then delete;
	if b;
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
sbxkoenk
SAS Super FREQ

Hello,

 

Interesting question. 🤔. Had to think about it. 🤔.

In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).

If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.

When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!

data list;     set list;     rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
 create table work.mismatch as
 select   t1.rownum as rownum_t1           , t1.sample_type as sample_type_t1 
        , t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
        , t2.rownum as rownum_t2           , t2.sample_type as sample_type_t2 
        , t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
 from   list     t1
      , lab_file t2
 where NOT 
 (    t1.sample_type = t2.sample_type 
  AND t1.sample_name = t2.sample_name 
  AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
 set work.mismatch;
 sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
 sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
 temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */

Cheers,

Koen

View solution in original post

2 REPLIES 2
sbxkoenk
SAS Super FREQ

Hello,

 

Interesting question. 🤔. Had to think about it. 🤔.

In my opinion, you can only do a cartesian join of all the non-exactly-matches and then compare the by-variables with
COMPGED (returns the generalized edit distance between two strings).

If you see three zeros, then your mismatch is due to the case, leading blanks and/or quotation marks.

When you do not have three zeros, look for lines where you have 2 zeros and the other compged-value being very small. That can be a typo-error. And so on. You will get it when looking at the output table!!

data list;     set list;     rownum=_N_; run;
data lab_file; set lab_file; rownum=_N_; run;
PROC SQL noprint;
 create table work.mismatch as
 select   t1.rownum as rownum_t1           , t1.sample_type as sample_type_t1 
        , t1.sample_name as sample_name_t1 , t1.temperature as temperature_t1
        , t2.rownum as rownum_t2           , t2.sample_type as sample_type_t2 
        , t2.sample_name as sample_name_t2 , t2.temperature as temperature_t2
 from   list     t1
      , lab_file t2
 where NOT 
 (    t1.sample_type = t2.sample_type 
  AND t1.sample_name = t2.sample_name 
  AND t1.temperature = t2.temperature )
; QUIT;
data work.mismatch;
 set work.mismatch;
 sample_type_match=compged(sample_type_t1,sample_type_t2,"iLN");
 sample_name_match=compged(sample_name_t1,sample_name_t2,"iLN");
 temperature_match=compged(temperature_t1,temperature_t2,"iLN");
run;
/* end of program */

Cheers,

Koen

djbateman
Lapis Lazuli | Level 10

Koen,

 

Thanks for your help.  I never would have thought of this method.  It may still take some modifying to get what I fully need, but you gave me a good start.  I was informed that the main variable of interest is Sample_Name, so I started merging by just Sample_Name. If it is not found in the lab file, then I know it is unexpected.  Then I compare the associated Sample_Type and Temperature.  This seems to be working for what I need at the moment, but I'm not sure how reliable it will be.  Still, thanks for giving me more to think about if I need further manipulation.

 

Dallas

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 525 views
  • 0 likes
  • 2 in conversation