Pumpp
Obsidian | Level 7

Hi,

I have two datasets (a sample is attached). One is a Name dataset (more than 600,000 observations) and the other is a Country dataset (more than 200,000 observations). I split the Country dataset into multiple datasets based on the country name, giving me close to 100 unique datasets.

I want to join the two datasets, filtering one based on Name and the other based on Country.

For example: in the Name dataset, I first want to filter on the name Andy; since Andy's country is US, I want to join this with the US dataset. In the second step, I want to filter on the name Andrew and join it with the New Zealand dataset. In the third step, I filter on Luca and join it with the Italy dataset. And I want to do this process for all the unique names.

Is there any way to do this? Or is there a macro that could simplify my coding?

Thanks 

6 REPLIES
Kurt_Bremser
Super User

... and this is another reason why one does not split datasets.

 

With the full dataset, it is a perfectly normal (and simple) join on country and whatever else key you have.
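
A minimal sketch, assuming the un-split tables are work.namedata and work.countrydata and that Country is the common key (column names taken from the sample data further down):

proc sql;
  create table joined as
    select a.*, b.location, b.unique_code
    from work.namedata as a
    left join work.countrydata as b
      on a.country = b.country;
quit;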

Pumpp
Obsidian | Level 7

I initially tried it without splitting the dataset, but the problem is that the dataset gets too heavy after the join, and I need to do further analysis after this joining step.

My analysis includes splitting each word of the address field (from the Name dataset) into its own observation and trying to match it with the location field from the Country dataset. This process takes a lot of time and crashes the SAS server, which is why I decided to split the datasets and then do the further analysis.

I finally append the smaller datasets back into one (using PROC APPEND) and delete the small datasets after my analysis using PROC DATASETS, roughly as sketched below.
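
Something like this, where work.part_us and work.part_italy stand in for two of the per-country result datasets:

proc append base=work.final data=work.part_us force;
run;
proc append base=work.final data=work.part_italy force;
run;

/* remove the per-country pieces once they are appended */
proc datasets library=work nolist;
  delete part_us part_italy;
quit;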

 

Kurt_Bremser
Super User

How many words do you have in the address field?

 

But you can match by loading the whole country dataset into a hash object and then calling the FIND() method in a DO loop over the words until you get a match.

 

Please show us the match code that fails. I have a suspicion that you inadvertently created a Cartesian join over everything.

Pumpp
Obsidian | Level 7

I have approximately 12-20 words in the Address field. My Location field in the Country dataset has 3-5 words. I cannot use FIND() because the spellings in both datasets are incorrect, so I use functions like COMPGED and COMPLEV to match the words.
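
For instance (a quick illustration; the misspelled pairs are taken from the sample data below):

data _null_;
  /* COMPGED: generalized edit distance; COMPLEV: Levenshtein distance */
  ged = compged("washington", "Washington DC");
  lev = complev("rotoru", "Rotorua");
  put ged= lev=;
run;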

 

Below is the code:

""""""

data work.namedata;
  set work.namedata;
  n+1;
  /* split the address into one observation per word, stripping punctuation and digits */
  do i=1 to countw(Address," ");
    Address_Split=compress(scan(Address,i," ","m"),", / : @ # & ( ) ; \ . ! 0 1 2 3 4 5 6 7 8 9");
    output;
  end;
  drop n i;
run;

 

data work.namedata;
  set work.namedata;
  Add_split_lag = lag(Address_Split); /* previous word */
  Location_Count = _N_;
run;

 

proc sort data=work.namedata;
  by descending Location_Count;
run;

 

data work.namedata;
  set work.namedata;
  /* with the order reversed, LAG() yields the following word */
  Add_split_lead = lag(Address_split);
run;

 

proc sql;
  create table Name_Country_Cart as
    select *
    from Namedata as a
    left join Countrydata as b
      on a.Country=b.Country;
quit;

 

data Name_Country_Cart_01;
  set Name_Country_Cart;
  m+1;
  do j=1 to countw(Location," ");
    n+1;
    do i=1 to countw(Sub_Location," ");
      Sub_Split=compress(scan(Sub_Location,i," ","m"),", / - : @ # & ( ) ; \ . ! 0 1 2 3 4 5 6 7 8 9");
      output;
      Location_Split=compress(scan(Location,j," ","m"),", / - : @ # & ( ) ; \ . ! 0 1 2 3 4 5 6 7 8 9");
      output;
    end;
  end;
  drop j m i n;
run;

 

proc sort data=Name_Country_Cart_01 nodupkey;
  by Name Address Address_split Add_split_lag Add_split_lead Location Location_Split Sub_Split;
  where Sub_split is not missing;
run;

 

data Name_Country_Cart_02;
  set Name_Country_Cart_01;
  where Location_Split is not missing;
  Ged_Score_SDT = compged(Address_split, Location_Split);
  Lev_Score_SDT = complev(Address_split, Location_Split);
run;


 

The code runs for some observations but gets stuck after a certain point while creating Name_Country_Cart_01, which is why I decided to split the Country dataset and run the Name dataset one name at a time.

Kurt_Bremser
Super User

Given that you have 200,000 obs in country, and up to 5 locations per obs, a dataset with all locations will result in less than a million observations.

With a word length of < 48, that would result in about 80 MB for a hash object (roughly a million entries at under 100 bytes each).

You can read sequentially through your namedata dataset and load the country dataset into a hash object with key=country and multidata=yes.

For every observation read, iterate through the address string, and then iterate through the hash entries, calculating the scores; whenever you get a better score, keep that element. After the inner iteration is finished (all entries for the given country), you have the best match for a given word; do the same keep maneuver in the word loop, and once that is finished, you have your overall best match, which you output.

This would prevent the massive joins you do when you join all entries for a country to each name entry before searching for the best match.

 

Here is simple code (without the outer loop through the address field) that finds the best match:

data namedata;
infile datalines dlm="," dsd truncover;
input Name $ Age Country :$20. Address :$20. ID;
datalines;
Andy,24,US,washington,39059
Andrew,39,New Zealand,rotoru,96560
Luca,43,Italy,milan,74125
Keisha,35,Australia,brisbane,22247
;

data country;
infile datalines dlm="," dsd truncover;
input Country :$20. Location :$20. Unique_Code;
datalines;
Australia,Brisbane,51429
Australia,Sydney,61435
Australia,Wales,59084
France,Milan,3551
France,Paris,44103
India,Mumbai,14194
India,Baroda,67199
India,Chennai,92362
India,Bangalore,28911
Italy,Milan,80461
Italy,Trentino,70530
Italy,Naples,28138
New Zealand,Auckland,18822
New Zealand,Rotorua,23534
New Zealand,Stewart Island,14496
New Zealand,Queenstown,15307
Poland,Warsaw,27502
Poland,Ojcow,41760
Poland,Wroclaw,93556
UK,London,48265
UK,Birmingham,22487
UK,Leeds,47103
UK,Scotland,54970
US,New York,37748
US,Los Angeles,37567
US,Washington DC,51738
US,Chicago,10136
US,Las Vegas,47226
US,Nashville,62726
;

data want;
set namedata;
if _n_ = 1
then do;
  length
    location $20
    unique_code 8
    loc $20
    uc 8
  ;
  /* load the country dataset once; MULTIDATA allows several locations per country */
  declare hash c (dataset:"country", multidata:"yes");
  c.definekey("country");
  c.definedata("location","unique_code");
  c.definedone();
  call missing(location,unique_code);
end;
/* walk through all locations for this country, keeping the best (lowest) score */
score = 999;
rc = c.find();
do while (rc = 0);
  if score > compged(address,location)
  then do;
    score = compged(address,location);
    loc = location;
    uc = unique_code;
  end;
  rc = c.find_next();
end;
drop location unique_code rc score;
run;
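
An untested sketch of the same step with the outer loop over the address words added (WORD, BEST, and W are new helper variables; everything else is as above):

data want;
set namedata;
if _n_ = 1
then do;
  length
    location $20
    unique_code 8
    loc $20
    uc 8
    word $20
  ;
  declare hash c (dataset:"country", multidata:"yes");
  c.definekey("country");
  c.definedata("location","unique_code");
  c.definedone();
  call missing(location,unique_code);
end;
best = 9999;
/* outer loop: every word of the address */
do w = 1 to countw(address," ");
  word = scan(address,w," ");
  /* inner loop: every location stored for this country */
  rc = c.find();
  do while (rc = 0);
    score = compged(word,location);
    if score < best
    then do;
      best = score;
      loc = location;
      uc = unique_code;
    end;
    rc = c.find_next();
  end;
end;
drop location unique_code rc score best word w;
run;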
RichardDeVen
Barite | Level 11

So, for the case of ANDY, you want six rows output?

 

Create a DATA Step that restacks the countries and join to that.

 

Example
Presume the library contains tables named country_<country name>.

 

data all_country / view=all_country;
  set permlib.country_: indsname=dsname;
  /* recover the country name: take the part after the libref, drop the COUNTRY_ prefix, and turn underscores back into blanks */
  country = translate(substr(scan(dsname,2,'.'),9),' ','_');
run;

proc sql;
  create table want as 
  select names.*, all_country.location, all_country.code
  from names
  left join all_country
  on names.country = all_country.country
  ;
quit;

Depending on your storage libref and data set names, some tweaking might be needed, including uppercase comparisons.
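
For instance, a case-insensitive comparison could look like this (a sketch; UPCASE() applied to both sides of the join key):

proc sql;
  create table want as 
  select names.*, all_country.location, all_country.code
  from names
  left join all_country
  on upcase(names.country) = upcase(all_country.country)
  ;
quit;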


Discussion stats
  • 6 replies
  • 1799 views
  • 1 like
  • 3 in conversation