BookmarkSubscribeRSS Feed
SuryaKiran
Meteorite | Level 14

I have a dataset (dataset1) with provider state and provider key. In another dataset(dataset2) I have a column load_file. Looking upon this load_file tells which provider it belongs to. Each provider sends monthly , So the file name changes for every particular month for particular provider. Dataset1 has 250 distinct records and dataset2 has 2 million records. I want to match merge two datasets based on the provider key anywhere in load_file .

 

Dataset1

State     Provider_key

TX           123

FL            124

CT           125

FL            255

GA          UNI

IL             223

 

dataset2:

Load_file

20152303_TX_123_20142356.TXT

IL_223_201323.CSV

20152303_IL_223_20160201.TXT

2016_UNI_FREEDOM_UNITY_20162303.TXT

20150506_TX123_20150506.TXT

 

 

And I want like:

Load_fileSTATEPROVIDER_KEY
20152303_TX_123_20142356.TXTTX123
IL_223_201323.CSVIL223
20152303_IL_223_20160201.TXTIL223
2016_UNI_FREEDOM_UNITY_20162303.TXTGAUNI
20150506_TX123_20150506.TXTTX123

 

 

Thanks,
Suryakiran
7 REPLIES 7
LinusH
Tourmaline | Level 20
As a start consider the CONTAINS SQL operator.
But since you don't seem to have a strict file naming policy, it may be hard to assure you match with the correct file name.
Data never sleeps
Shmuel
Garnet | Level 18

According to your posted dataset2 examples, I could think that the provider key 

is always present after the state code separated by underscores, but then I noticed

next input line:

     2016_UNI_FREEDOM_UNITY_20162303.TXT

without a provider key at all.

 

Check dataset2  and try to build rules how to substring the state and the provider key

and what to do in case it is unavailable.

SuryaKiran
Meteorite | Level 14

That is what my problem is. The dataset2 dosen't have proper format for the file name. It's my bad, the providers are not following the proper naming convention. What I'm doing is using If condition with FIND() function which gives me lengthy code. For 250 providers I need to wirte 250 lines IF than ELSE code.

 

e.g:

IF FIND(Load_file,'123','i') then do; state ='TX'; Provider ='123'; End;

ELSE IF FIND(Load_file,'UNI','i') then do; state ='GA'; Provider ='UNI'; End;

...........................................................................

............................................etc

 

 

 

Thanks,
Suryakiran
ballardw
Super User

@SuryaKiran wrote:

That is what my problem is. The dataset2 dosen't have proper format for the file name. It's my bad, the providers are not following the proper naming convention. What I'm doing is using If condition with FIND() function which gives me lengthy code. For 250 providers I need to wirte 250 lines IF than ELSE code.

 

e.g:

IF FIND(Load_file,'123','i') then do; state ='TX'; Provider ='123'; End;

ELSE IF FIND(Load_file,'UNI','i') then do; state ='GA'; Provider ='UNI'; End;

...........................................................................

............................................etc

 

 

 


You need to be very careful of using FIND with numeric sequences because 123 can occur in what appears to be dates in you strings such as 20123003. And if any of the dates are in yyyymmdd then 20151231.

Note that UNI also might appear in other placess such as UNIVERSITY perhaps.

PGStats
Opal | Level 21

Use a hash for efficiency:

 

data D1;
input State :$2. Provider_key :$8.;
datalines;
TX           123
FL            124
CT           125
FL            255
GA          UNI
IL             223
;

data D2;
input Load_file :$128.;
datalines;
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
;

data D3;
length State $2 Provider_key $8;
if _n_ = 1 then do;
	declare hash w(dataset:"D1");
	w.definekey("Provider_key");
	w.definedata("State");
	w.definedone();
	call missing(State, Provider_key);
	end;
set D2;
do i = 1 by 1 until(pos<=0);
    call scan(Load_file, i, pos, len, "_. ");
    if pos > 0 then do;
        Provider_key = substr(Load_file, pos, len);
        if w.find()=0 then output;
        end;
    end;
drop i pos len;
run;

proc print; run;


PG
PGStats
Opal | Level 21

Or if you want to match keys such as TX123, use:

 

data D1;
input State :$2. Provider_key :$8.;
datalines;
TX           123
FL            124
CT           125
FL            255
GA          UNI
IL             223
;

data D2 / view=D2;
set D1;
length key $10;
key = Provider_key;
output;
key = cats(State, Provider_key);
output;
run;

data D3;
input Load_file :$128.;
datalines;
20152303_TX_123_20142356.TXT
IL_223_201323.CSV
20152303_IL_223_20160201.TXT
2016_UNI_FREEDOM_UNITY_20162303.TXT
20150506_TX123_20150506.TXT
;

data D4;
length key $10 State $2 Provider_key $8;
if _n_ = 1 then do;
	declare hash w(dataset:"D2");
	w.definekey("key");
	w.definedata("Provider_key");
	w.definedata("State");
	w.definedone();
	call missing(key, State, Provider_key);
	end;
set D3;
do i = 1 by 1 until(pos<=0);
    call scan(Load_file, i, pos, len, "_. ");
    if pos > 0 then do;
        key = substr(Load_file, pos, len);
        if w.find()=0 then output;
        end;
    end;
drop i pos len key;
run;

proc print; run;


PG
ballardw
Super User

Is this name correct? 20150506_TX123_20150506.TXT

Or is it actually 20150506_TX_123_20150506.TXT?

If the Provider_key portion of the name ALWAYS appears between underscores and never occurs with underscores when not serving as the key value then something like this may work.

 

proc sql;
   create table want as
   select a.load_file, b.state, b.provider_key
   from dataset1 as a, dataset2 as b
   where findw(a.loadfile,b.provider_key,"_")>0
   ;
quit;

But if the names are not consistent about use of the provider key value and or state this may become a situation where you have to use multiple searches similar to

 

 

where findw(a.loadfile,b.provider_key,"_")>0

      or index(a.loadfile,catx('_',b.state,b.provider_key)>0

      or index(a.loadfile,cats(b.state,b.provider_key)>0

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 994 views
  • 0 likes
  • 5 in conversation