I have a dataset (dataset1) with provider state and provider key. In another dataset(dataset2) I have a column load_file. Looking upon this load_file tells which provider it belongs to. Each provider sends monthly , So the file name changes for every particular month for particular provider. Dataset1 has 250 distinct records and dataset2 has 2 million records. I want to match merge two datasets based on the provider key anywhere in load_file .
Dataset1
State Provider_key
TX 123
FL 124
CT 125
FL 255
GA UNI
IL 223
dataset2:
Load_file
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
And I want like:
Load_file | STATE | PROVIDER_KEY |
20152303_TX_123_20142356.TXT | TX | 123 |
IL_223_201323.CSV | IL | 223 |
20152303_IL_223_20160201.TXT | IL | 223 |
2016_UNI_FREEDOM_UNITY_20162303.TXT | GA | UNI |
20150506_TX123_20150506.TXT | TX | 123 |
According to your posted dataset2 examples, I could think that the provider key
is always present after the state code separated by underscores, but then I noticed
next input line:
2016_UNI_FREEDOM_UNITY_20162303.TXT
without a provider key at all.
Check dataset2 and try to build rules how to substring the state and the provider key
and what to do in case it is unavailable.
That is what my problem is. The dataset2 dosen't have proper format for the file name. It's my bad, the providers are not following the proper naming convention. What I'm doing is using If condition with FIND() function which gives me lengthy code. For 250 providers I need to wirte 250 lines IF than ELSE code.
e.g:
IF FIND(Load_file,'123','i') then do; state ='TX'; Provider ='123'; End;
ELSE IF FIND(Load_file,'UNI','i') then do; state ='GA'; Provider ='UNI'; End;
...........................................................................
............................................etc
@SuryaKiran wrote:
That is what my problem is. The dataset2 dosen't have proper format for the file name. It's my bad, the providers are not following the proper naming convention. What I'm doing is using If condition with FIND() function which gives me lengthy code. For 250 providers I need to wirte 250 lines IF than ELSE code.
e.g:
IF FIND(Load_file,'123','i') then do; state ='TX'; Provider ='123'; End;
ELSE IF FIND(Load_file,'UNI','i') then do; state ='GA'; Provider ='UNI'; End;
...........................................................................
............................................etc
You need to be very careful of using FIND with numeric sequences because 123 can occur in what appears to be dates in you strings such as 20123003. And if any of the dates are in yyyymmdd then 20151231.
Note that UNI also might appear in other placess such as UNIVERSITY perhaps.
Use a hash for efficiency:
data D1;
input State :$2. Provider_key :$8.;
datalines;
TX 123
FL 124
CT 125
FL 255
GA UNI
IL 223
;
data D2;
input Load_file :$128.;
datalines;
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
;
data D3;
length State $2 Provider_key $8;
if _n_ = 1 then do;
declare hash w(dataset:"D1");
w.definekey("Provider_key");
w.definedata("State");
w.definedone();
call missing(State, Provider_key);
end;
set D2;
do i = 1 by 1 until(pos<=0);
call scan(Load_file, i, pos, len, "_. ");
if pos > 0 then do;
Provider_key = substr(Load_file, pos, len);
if w.find()=0 then output;
end;
end;
drop i pos len;
run;
proc print; run;
Or if you want to match keys such as TX123, use:
data D1;
input State :$2. Provider_key :$8.;
datalines;
TX 123
FL 124
CT 125
FL 255
GA UNI
IL 223
;
data D2 / view=D2;
set D1;
length key $10;
key = Provider_key;
output;
key = cats(State, Provider_key);
output;
run;
data D3;
input Load_file :$128.;
datalines;
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
;
data D4;
length key $10 State $2 Provider_key $8;
if _n_ = 1 then do;
declare hash w(dataset:"D2");
w.definekey("key");
w.definedata("Provider_key");
w.definedata("State");
w.definedone();
call missing(key, State, Provider_key);
end;
set D3;
do i = 1 by 1 until(pos<=0);
call scan(Load_file, i, pos, len, "_. ");
if pos > 0 then do;
key = substr(Load_file, pos, len);
if w.find()=0 then output;
end;
end;
drop i pos len key;
run;
proc print; run;
Is this name correct? 20150506_TX123_20150506.TXT
Or is it actually 20150506_TX_123_20150506.TXT?
If the Provider_key portion of the name ALWAYS appears between underscores and never occurs with underscores when not serving as the key value then something like this may work.
proc sql;
create table want as
select a.load_file, b.state, b.provider_key
from dataset1 as a, dataset2 as b
where findw(a.loadfile,b.provider_key,"_")>0
;
quit;
But if the names are not consistent about use of the provider key value and or state this may become a situation where you have to use multiple searches similar to
where findw(a.loadfile,b.provider_key,"_")>0
or index(a.loadfile,catx('_',b.state,b.provider_key)>0
or index(a.loadfile,cats(b.state,b.provider_key)>0
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.