01-05-2018 04:54 PM
I want to ask for some help regarding reading data and saving results using a do-loop.
I have two datasets, "start" and "end", and I merge them when I find two names that are similar (using COMPGED).
Because the number of observations is large, it takes too long, so I want to do it by splitting the dataset.
proc sql noprint;
    create table want as
    select *
    from start inner join end
        on (compged(start.companyname, end.name, 'i') le 50)
    order by companyname;
quit;
Above is the code I want to run after splitting the sample 'end' into multiple subsamples.
Is there a way that
- using 'start' data as a whole
- split 'end' data for, say, every 1000 observations
- and run the above codes multiple times
- then I will have tens of 'want' tables
- combine them into one final 'want' (a rough sketch of what I am picturing is below).
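Something like this is what I have in mind (completely untested; the macro name, the count lookup into &n_end, and PROC APPEND for stacking are just my guesses at how it could be wired up):

%macro chunked_join(chunksize=1000);
    /* count the observations in END so the loop knows where to stop */
    proc sql noprint;
        select count(*) into :n_end trimmed from end;
    quit;

    %do i = 1 %to &n_end %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);

        /* take one slice of END (observations &i through &j) */
        data end_chunk;
            set end (firstobs=&i obs=&j);
        run;

        /* fuzzy-join the slice against the full START data */
        proc sql noprint;
            create table want_chunk as
            select *
            from start inner join end_chunk
                on (compged(start.companyname, end_chunk.name, 'i') le 50);
        quit;

        /* stack the pieces into one final WANT */
        proc append base=want data=want_chunk force;
        run;
    %end;

    /* final sort to match the ORDER BY in the original query */
    proc sort data=want;
        by companyname;
    run;
%mend chunked_join;

%chunked_join(chunksize=1000);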
I would really appreciate it if anyone has some suggestions!
01-05-2018 05:05 PM
I'm not sure it is really what you ultimately want to (or should) do, but you can split end as follows:
DATA end_partition1; SET end (firstobs=i obs=j); RUN;
So for sets of 1,000 observations, use i=1, j=1000, then i=1001, j=2000, and so on. For example:
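data end_partition1;                 /* observations 1 through 1,000 */
    set end (firstobs=1 obs=1000);
run;

data end_partition2;                 /* observations 1,001 through 2,000 */
    set end (firstobs=1001 obs=2000);
run;

(Note that OBS= names the last observation to read, not a count, so FIRSTOBS=1001 OBS=2000 reads the second thousand.)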
01-05-2018 05:21 PM
But it was much faster when I tried one subsample that I had split off manually.
Also, when I run it as a whole, I get the message below.
The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
01-05-2018 05:34 PM
If you sub-split, are you still doing the same number of comparisons?
If the values are in a single data set and you're subsetting the big data set, you're reducing the number of comparisons only until you run it against the full data set. The only other thing I can think of is that your subsets have an additional criterion that you're not accounting for in the cross join.
You should be getting the same message whether you run a subset or the full data set.
01-05-2018 06:26 PM - edited 01-05-2018 06:28 PM
In case you want it, this macro will make 4 data files (chunk1, chunk5, chunk9, and chunk13) out of the test file. Maybe you could adapt it.
data end;
    input Plan;
    datalines;
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
;
run;

%macro datasplitter(chunksize);
    %do i=1 %to 16 %by &chunksize;
        data chunk&i;
            %let j = %eval(&i + &chunksize - 1);
            set end (firstobs=&i obs=&j);
        run;
    %end;
%mend;

%datasplitter(4);
A better programmer would know how to replace the hardcoded 16 with the total number of observations and make things more flexible. I don't.
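For what it's worth, one way that might work (I have not tested this) is to look the count up with PROC SQL and use it as the loop bound; the macro variable name &nobs is just an illustration:

%macro datasplitter(chunksize);
    /* get the total observation count instead of hardcoding 16 */
    proc sql noprint;
        select count(*) into :nobs trimmed from end;
    quit;

    %do i=1 %to &nobs %by &chunksize;
        data chunk&i;
            %let j = %eval(&i + &chunksize - 1);
            set end (firstobs=&i obs=&j);
        run;
    %end;
%mend;

%datasplitter(4);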
01-06-2018 04:06 PM
Even though you propose to break the comparison down into chunks of N_start by 1,000 observations, you will still end up with N_start * N_end comparisons to do, so there are no savings in overall time.
COMPGED on such a Cartesian comparison is expensive. So I would suggest:
(0) make sure all names are entirely in upper (or lower) case.
(1) replace all instances of duplicate names in each data set with a single record containing the name and pointers/record id's of the original observations.
(2) look for cases of exact equality between start and end - save the matches and remove the matched names from start and end
(3) can you standardize names? E.g., convert trailing "JUNIOR" to "JR." and "JR" also to "JR.", etc. Then rerun #2.
(4) avoid making comparisons in which COMPGED is certain to be over 50. For instance establish the number of letters in the names. Then do the COMPGED comparison for lengths of X in start only to cases with lengths of (say) X-4 through X+4 in end. Then length X+1 in start to X-3 through X+5 in end. Or if your names have multiple words, you might additionally filter based on the number of words in the name.
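For instance, (0) and (4) could be sketched roughly like this (data set and variable names follow the question; the +/-4 length window and the cutoff of 50 are just the example values above, and this is untested):

/* upcase the names once and store their lengths */
data start2;
    set start;
    companyname = upcase(companyname);
    s_len = length(companyname);
run;

data end2;
    set end;
    name  = upcase(name);
    e_len = length(name);
run;

/* only run COMPGED on pairs whose name lengths are within 4 of each other */
proc sql noprint;
    create table want as
    select *
    from start2 inner join end2
        on  abs(start2.s_len - end2.e_len) le 4
        and compged(start2.companyname, end2.name, 'i') le 50
    order by companyname;
quit;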
These are the sorts of processes I used to match records for mutual funds by name. Those names were almost all multiple words (vanguard explorer admiral shares) and often abbreviated in random ways (vgrd exp adm shrs) or sometimes with reordered words (vgrd expl shrs adm). This was all done to generate best matches that would subsequently be inspected manually. False positives were to be avoided at the expense of missed matches.