Merging data in 2 levels

Prudhvi_007 · Posted 05-14-2022 02:23 AM

Hi Everyone. I m currently using SAS 9.4 version . I have to merge two tables A and B based on SUBJID and LABID. The SUBJID is of 10 characters and LABID is of 12 characters . My requirement is first I have to check if I can match SUBJID and first 12 digits of LABID.

SUBJID	LABID
EQ98765599	987653456986
MN65473829	634528754100
OP32419877	362567375709

SUBJID	LABID
EQ98765599	987653456986
MN65473829	634528754187
OP32419877	362567375709

The subjects highlighted in green are exact match where as the subject highlighted in red are not exact match because only first 10 characters are matching in LABID and 11th ,12th characters are different in LABID , so they wont come in output. Is it possible to merge the data like if 12 digits of LABID are matching then output the data , if 12 digits of LABID are not matching then check if 10 digits of LABID are matching then output the data.

Ksharp · Posted 05-14-2022 02:38 AM

https://communities.sas.com/t5/SAS-Programming/Good-Forum-Practices-Ask-a-question/m-p/813327

s_lassen · Posted 05-14-2022 06:15 AM

The message the @Ksharp referred to appears to be deleted, but my guess is that it contains a suggestion that you present test data as data step code, like this:

data have1;
length SUBJID LABID $20;
input subjid labid;
cards;
EQ98765599 987653456986
MN65473829 634528754100
OP32419877 1362567375709
;run;

Data have2;
length SUBJID LABID $20;
input subjid labid;
cards;
EQ98765599 987653456986
MN65473829 634528754187
OP32419877 1362567375709
;run;

I think the easiest way to make the match is to use SQL:

proc sql;
  create table want as select 
    have1.*,
    have2.labid as labid2, 
    case
      when have1.labid=have2.labid then 1
      else 0
    end as match
  from have1 join have2 on have1.subjid=have2.subjid and substr(have1.labid,1,10)=substr(have2.labid,1,10)
  ;

This will merge all the data on the criteria you mentioned. The variable MATCH will contain a 1 if the two LABIDs are identical, and otherwise a 0. If there are multiple occurences of each SUBJID and LABID in your data, you may want to add additional criteria, as you will otherwise get a cartesian product (all combinations of possible matches from the two tables).

Sajid01 · Posted 05-14-2022 08:05 AM

Hello @Prudhvi_007
it is indeed possible to have an output of matching and non-matching observations. One can use either the SAS dataset merge or Proc SQL whichever one is comfortable with.

mkeintz · Posted 05-14-2022 11:59 AM

These questions are not answered by your sample data, but the answers would significantly impact the appropriate responses:

Do you expect any many-to-many matches of SUBJID/12-digit-LABID? If yes, then do you want to keep implied cartesian crossing of all SUBJID/12-digit matches?
If you do not expect the above conditino, then are your 12-digit LABID's constructed in such a way that any of them share the first 10 digits? If so, and assuming a given row is matched on SUBJID, but no 12-digit-LABID match, what do you want to do if there is a set of many-to-many matches on SUBJID/10-digit-LABID?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 05-14-2022 03:24 PM

The clearest way is to make a new variable that has just the first 10 characters of the LABID. Let's call it LABSUBJ.

data fix1;
  set one;
  labsubj = substr(labid,1,10);
run;

data fix2;
  set two;
  labsubj = substr(labid,1,10);
run;

data want;
  merge fix1 fix2(rename=(labid=labid2));
  by subjid labsubj;
  exact = labid = labid2;
run;

proc print;
run;

Result

Obs      SUBJID         LABID         labsubj         labid2       exact

 1     EQ98765599    987653456986    9876534569    987653456986      1
 2     MN65473829    634528754100    6345287541    634528754187      0
 3     OP32419877    362567375709    3625673757    362567375709      1

If there are multiple observations of for the same values of SUBJID and LABSUBJ in both datasets then you probably need to do a many to many join which will require SQL code instead.

proc sql ;
create table both as 
  select a.*
       , coalesce(substr(a.labid,1,10),substr(b.labid,1,10)) as LABSUBJ
       , b.labid as labid2
       , a.labid=b.labid as exact
  from one a 
  full join two b
  on a.subjid = b.subjid
  and substr(a.labid,1,10) eqt b.labid
  order by 1,2,3,4
;
quit;

Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

Re: Merging data in 2 levels

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away