DATA Step, Macro, Functions and more

One-many match merge

Reply
Frequent Contributor
Posts: 127

One-many match merge

I have a dataset (dataset1) with provider state and provider key. In another dataset(dataset2) I have a column load_file. Looking upon this load_file tells which provider it belongs to. Each provider sends monthly , So the file name changes for every particular month for particular provider. Dataset1 has 250 distinct records and dataset2 has 2 million records. I want to match merge two datasets based on the provider key anywhere in load_file .

 

Dataset1

State     Provider_key

TX           123

FL            124

CT           125

FL            255

GA          UNI

IL             223

 

dataset2:

Load_file

20152303_TX_123_20142356.TXT

IL_223_201323.CSV

20152303_IL_223_20160201.TXT

2016_UNI_FREEDOM_UNITY_20162303.TXT

20150506_TX123_20150506.TXT

 

 

And I want like:

Load_fileSTATEPROVIDER_KEY
20152303_TX_123_20142356.TXTTX123
IL_223_201323.CSVIL223
20152303_IL_223_20160201.TXTIL223
2016_UNI_FREEDOM_UNITY_20162303.TXTGAUNI
20150506_TX123_20150506.TXTTX123

 

 

Thanks,
Suryakiran
Super User
Posts: 5,257

Re: One-many match merge

As a start consider the CONTAINS SQL operator.
But since you don't seem to have a strict file naming policy, it may be hard to assure you match with the correct file name.
Data never sleeps
Trusted Advisor
Posts: 1,394

Re: One-many match merge

According to your posted dataset2 examples, I could think that the provider key 

is always present after the state code separated by underscores, but then I noticed

next input line:

     2016_UNI_FREEDOM_UNITY_20162303.TXT

without a provider key at all.

 

Check dataset2  and try to build rules how to substring the state and the provider key

and what to do in case it is unavailable.

Frequent Contributor
Posts: 127

Re: One-many match merge

That is what my problem is. The dataset2 dosen't have proper format for the file name. It's my bad, the providers are not following the proper naming convention. What I'm doing is using If condition with FIND() function which gives me lengthy code. For 250 providers I need to wirte 250 lines IF than ELSE code.

 

e.g:

IF FIND(Load_file,'123','i') then do; state ='TX'; Provider ='123'; End;

ELSE IF FIND(Load_file,'UNI','i') then do; state ='GA'; Provider ='UNI'; End;

...........................................................................

............................................etc

 

 

 

Thanks,
Suryakiran
Super User
Posts: 10,516

Re: One-many match merge


SuryaKiran wrote:

That is what my problem is. The dataset2 dosen't have proper format for the file name. It's my bad, the providers are not following the proper naming convention. What I'm doing is using If condition with FIND() function which gives me lengthy code. For 250 providers I need to wirte 250 lines IF than ELSE code.

 

e.g:

IF FIND(Load_file,'123','i') then do; state ='TX'; Provider ='123'; End;

ELSE IF FIND(Load_file,'UNI','i') then do; state ='GA'; Provider ='UNI'; End;

...........................................................................

............................................etc

 

 

 


You need to be very careful of using FIND with numeric sequences because 123 can occur in what appears to be dates in you strings such as 20123003. And if any of the dates are in yyyymmdd then 20151231.

Note that UNI also might appear in other placess such as UNIVERSITY perhaps.

Respected Advisor
Posts: 4,654

Re: One-many match merge

Use a hash for efficiency:

 

data D1;
input State :$2. Provider_key :$8.;
datalines;
TX           123
FL            124
CT           125
FL            255
GA          UNI
IL             223
;

data D2;
input Load_file :$128.;
datalines;
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
;

data D3;
length State $2 Provider_key $8;
if _n_ = 1 then do;
	declare hash w(dataset:"D1");
	w.definekey("Provider_key");
	w.definedata("State");
	w.definedone();
	call missing(State, Provider_key);
	end;
set D2;
do i = 1 by 1 until(pos<=0);
    call scan(Load_file, i, pos, len, "_. ");
    if pos > 0 then do;
        Provider_key = substr(Load_file, pos, len);
        if w.find()=0 then output;
        end;
    end;
drop i pos len;
run;

proc print; run;


PG
Respected Advisor
Posts: 4,654

Re: One-many match merge

Or if you want to match keys such as TX123, use:

 

data D1;
input State :$2. Provider_key :$8.;
datalines;
TX           123
FL            124
CT           125
FL            255
GA          UNI
IL             223
;

data D2 / view=D2;
set D1;
length key $10;
key = Provider_key;
output;
key = cats(State, Provider_key);
output;
run;

data D3;
input Load_file :$128.;
datalines;
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
;

data D4;
length key $10 State $2 Provider_key $8;
if _n_ = 1 then do;
	declare hash w(dataset:"D2");
	w.definekey("key");
	w.definedata("Provider_key");
	w.definedata("State");
	w.definedone();
	call missing(key, State, Provider_key);
	end;
set D3;
do i = 1 by 1 until(pos<=0);
    call scan(Load_file, i, pos, len, "_. ");
    if pos > 0 then do;
        key = substr(Load_file, pos, len);
        if w.find()=0 then output;
        end;
    end;
drop i pos len key;
run;

proc print; run;


PG
Super User
Posts: 10,516

Re: One-many match merge

Is this name correct? 20150506_TX123_20150506.TXT

Or is it actually 20150506_TX_123_20150506.TXT?

If the Provider_key portion of the name ALWAYS appears between underscores and never occurs with underscores when not serving as the key value then something like this may work.

 

proc sql;
   create table want as
   select a.load_file, b.state, b.provider_key
   from dataset1 as a, dataset2 as b
   where findw(a.loadfile,b.provider_key,"_")>0
   ;
quit;

But if the names are not consistent about use of the provider key value and or state this may become a situation where you have to use multiple searches similar to

 

 

where findw(a.loadfile,b.provider_key,"_")>0

      or index(a.loadfile,catx('_',b.state,b.provider_key)>0

      or index(a.loadfile,cats(b.state,b.provider_key)>0

Ask a Question
Discussion stats
  • 7 replies
  • 235 views
  • 0 likes
  • 5 in conversation