PROC SQL join based on a common Substring

Zeus_Olympus · Posted 09-14-2022 02:38 PM

Dear all,

I have two tables A and B

  data a;
     subj=1; dat='HI_76_KAM_BAK_887_PD34_KAR'; output;
     subj=2; dat='FT_01_S_NDRI_KAR'; output;
     subj=3; dat='DY_02_DEALIRITEON_843_KAR'; output;
     subj=4; dat='DY_02_RET_DEALIRITEON_ST_843_KAR'; output;
     subj=5; dat='BV_719_B_SIAFT_JH_500_KAR'; output;
     subj=6; dat='BV_303_B_SIAFT_500_KAR'; output;
     subj=7; dat='BV_711_B_SIAFT_KAR'; output;
     subj=8; dat='BV_714_XEMEYDE5_CESIC_890_CAS_KAR'; output;
     subj=9; dat='BV_714_MIL_XMD5_CSIC_CAS_KAR'; output;
     subj=10; dat='BV_714_TJIRSIC_CAS_KAR'; output;
     subj=11; dat='DY_30_KAM_BAK_YR_956_KAR'; output;
     subj=12; dat='DY_30_PIC_KAM_BAK_956_KAR'; output;
     subj=13; dat='AM_07_MH_KAR'; output;
     subj=14; dat='AM_53_JIDIDUM_DYDERS_KAR'; output;

    run;
    
    data b;
     id=1; dat_new='KAM_BAK_887'; other_var=1; output;
     id=2; dat_new='S_NDRI'; other_var=2; output;
     id=3; dat_new='DEALIRITEON_843'; other_var=3; output;
     id=4; dat_new='B_SIAFT_500'; other_var=11; output;
     id=5; dat_new='XEMEYDE5_CESIC_CAS'; other_var=21; output;
     id=6; dat_new='KAM_BAK_956'; other_var=25; output;
     id=7; dat_new='MH'; other_var=21; other_var=28; output;
     id=7; dat_new='JIDIDUM_DYDERS'; other_var=29; output;
run;

I need to appropriately join them and get the table WANT as follows:

data want;
dat=HI_76_KAM_BAK_887_PD34_KAR;    dat_new=KAM_BAK_887;output; 
dat=FT_01_S_NDRI_KAR;              dat_new=S_NDRI;output;
dat=DY_02_DEALIRITEON_843_KAR;     dat_new=DEALIRITEON_843;output;
dat=BV_303_B_SIAFT_500_KAR;        dat_new=B_SIAFT_500;output;
dat=BV_714_MIL_XMD5_CSIC_CAS_KAR;  dat_new=XEMEYDE5_CESIC_CAS;output;
dat=DY_30_PIC_KAM_BAK_956_KAR;     dat_new=KAM_BAK_956;output;
dat=AM_07_MH_KAR;                  dat_new=MH;output;
dat=AM_53_JIDIDUM_DYDERS_KAR;      dat_new=JIDIDUM_DYDERS;output;
run;

I have tried to use something like this within a PROC SQL join but I have failed.

ifc(indexw(strip(dat),strip(dat_new),'_'),'OK','NOT OK');

Any suggestions would be more than welcome.

Thanking you in advance.

PaigeMiller · Posted 09-14-2022 03:11 PM

Note: I have modified the code for A and B to include LENGTH statements.

data a;
    length dat $ 40;
     subj=1; dat='HI_76_KAM_BAK_887_PD34_KAR'; output;
     subj=2; dat='FT_01_S_NDRI_KAR'; output;
     subj=3; dat='DY_02_DEALIRITEON_843_KAR'; output;
     subj=4; dat='DY_02_RET_DEALIRITEON_ST_843_KAR'; output;
     subj=5; dat='BV_719_B_SIAFT_JH_500_KAR'; output;
     subj=6; dat='BV_303_B_SIAFT_500_KAR'; output;
     subj=7; dat='BV_711_B_SIAFT_KAR'; output;
     subj=8; dat='BV_714_XEMEYDE5_CESIC_890_CAS_KAR'; output;
     subj=9; dat='BV_714_MIL_XMD5_CSIC_CAS_KAR'; output;
     subj=10; dat='BV_714_TJIRSIC_CAS_KAR'; output;
     subj=11; dat='DY_30_KAM_BAK_YR_956_KAR'; output;
     subj=12; dat='DY_30_PIC_KAM_BAK_956_KAR'; output;
     subj=13; dat='AM_07_MH_KAR'; output;
     subj=14; dat='AM_53_JIDIDUM_DYDERS_KAR'; output;

    run;
    
    data b;
        length dat_new $ 24;
     id=1; dat_new='KAM_BAK_887'; other_var=1; output;
     id=2; dat_new='S_NDRI'; other_var=2; output;
     id=3; dat_new='DEALIRITEON_843'; other_var=3; output;
     id=4; dat_new='B_SIAFT_500'; other_var=11; output;
     id=5; dat_new='XEMEYDE5_CESIC_CAS'; other_var=21; output;
     id=6; dat_new='KAM_BAK_956'; other_var=25; output;
     id=7; dat_new='MH'; other_var=21; other_var=28; output;
     id=7; dat_new='JIDIDUM_DYDERS'; other_var=29; output;
run;

proc sql;
    create table want as select
        a.dat,
        b.dat_new
    from a left join b 
        on find(a.dat,b.dat_new,'t')>=1
        where not missing(b.dat_new);
quit;

The reason that INDEXW (and also FINDW) will not work is that in record one, the words it looks at with the '_' delimiter are HI 76 KAM BAK 887 etc and it never finds one that matches KAM_BAK.

The approach in my code will fail if in data set A, there are values of DAT that contain MH, for example, if DAT='AMHERST' this will match data set b ID=7 DAT_NEW='MH'. I do not have at this moment a way to get around this.

Maybe regular expressions will have a better solution, but I leave that to others.

--
Paige Miller

mkeintz · Posted 09-14-2022 03:14 PM

What is the matching rule you wish to apply?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PROC SQL join based on a common Substring

Re: PROC SQL join based on a common Substring

Re: PROC SQL join based on a common Substring

PROC SQL join based on a common Substring

Re: PROC SQL join based on a common Substring

Re: PROC SQL join based on a common Substring

Registration is open

SAS Training: Just a Click Away