Hi Guys,
I want to ask some help from you guys regarding reading the data and saving results by using do-loop.
I have two dataset "start" and "end" and merge two dataset if I find two names similar (using compged).
Because the number of observation is big, it takes too long so I want to do it by splitting dataset.
proc sql noprint ;
create table want as
select *
from start inner join end
on (compged(start.companyname,end.name,'i') le 50)
order by companyname;
quit ;
Above is the code I wan to run by splitting sample 'end' into multiple subsamples.
Is there a way that
- using 'start' data as a whole
- split 'end' data for, say, every 1000 observations
- and run the above codes multiple times
- then I will have tens of 'want's
- join them and have one final 'want'.
I would really appreciate it if anyone has some suggestions!
Thanks!
I'm not sure it is really what you ultimately want to (or should) do, but you can split end as follows:
DATA end_partition1;
SET end (firstobs=i obs=j);
RUN;
So for sets of 1,000 observations, do i=1 j=1000, then i=1001, j=2000, etc etc
You aren't really changing the number of comparisons though are you? If not, it's not going to be any faster.
But it is way faster when I tried one subsample by manually splitting it.
Also, when I run it as a whole, I get the message below.
The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
If you subsplit are you still doing the same number of comparisons?
If the values are in a single data set, and you're sub-setting the big data set you're reducing the number of comparisons until you run it against the full data set. The only other thing I can think of, is your subsets have an additional criteria that you're not accounting for in the cross join.
You should be getting the same messages regardless of subset or the full data set.
In case you want it, this macro will make 4 data files (chunk1, chunk5, chunk9, and chunk13) out of the test file. Maybe you could adapt it.
data end;
input Plan;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
;
run;
%macro datasplitter(chunksize);
%do i=1 %to 16 %by &chunksize;
data chunk&i;
%let j= %eval(&i + &chunksize - 1);
SET end (firstobs=&i obs=&j);
run;
%end;
%mend;
%datasplitter(4);
A better programmer would know how to replace the hardcoded 16 with the total number of observations and make things more flexible. I don't.
Even though you propose to break down your comparison to chunks of size N_start by 1,000 chunks, you will still end up with N_start *
N_end comparison to do - no savings in overall time.
COMPGED on such a Cartesian comparison is expensive. So I would suggest:
(0) make sure all names are entirely in upper (or lower) case.
(1) replace all instances of duplicate names in each data set with a single record containing the name and pointers/record id's of the original observations.
(2) look for cases of exact equality between start and end - save the matches and remove the matched names from start and end
(3) can you standardize names? I.e. convert trailing "JUNIOR" to "JR." and "JR" also to "JR.", etc. Rerun #2.
(4) avoid making comparisons in which COMPGED is certain to be over 50. For instance establish the number of letters in the names. Then do the COMPGED comparison for lengths of X in start only to cases with lengths of (say) X-4 through X+4 in end. Then length X+1 in start to X-3 through X+5 in end. Or if your names have multiple words, you might additionally filter based on the number of words in the name.
These are the sorts of processes I used to match records for mutual funds by name. Those names were almost all multiple words (vanguard explorer admiral shares) and often abbreviated in random ways (vgrd exp adm shrs) or sometimes with reordered words(vgrd expl shrs adm). This was all done to generate best matches to subsequently be inspected manually. False positives were to be avoided at the expense of missed matches.
Thanks all of you guys for the suggestions!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.