DATA Step, Macro, Functions and more

Run compged by splitting dataset and then merge the output by using do loop

Occasional Contributor
Posts: 15

Run compged by splitting dataset and then merge the output by using do loop

Hi Guys,

I'd like to ask for your help with reading data and saving the results using a do loop.

 

I have two datasets, "start" and "end", and I want to join them wherever two names are similar (using COMPGED).

Because the number of observations is large, the join takes too long, so I want to do it by splitting the dataset.

 

proc sql noprint;
   create table want as
   select *
   from start inner join end
      on compged(start.companyname, end.name, 'i') le 50
   order by companyname;
quit;

Above is the code I want to run after splitting the sample 'end' into multiple subsamples.

Is there a way to:

- use the 'start' data as a whole

- split the 'end' data into chunks of, say, 1,000 observations each

- run the above code once per chunk

- so that I end up with tens of 'want' datasets

- and append them into one final 'want'?
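One possible shape for this is sketched below; the dataset and variable names come from the question, while the macro name, the chunk size, and the total-observation count are placeholders you would supply (this is untested against your data):

```sas
/* Sketch only: loop over END in chunks using the FIRSTOBS=/OBS=
   dataset options, run the COMPGED join for each chunk, and
   append the partial results into one table. */
%macro matchchunks(chunksize, totalobs);
    %local i j;
    %do i = 1 %to &totalobs %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);

        proc sql noprint;
            create table want_part as
            select *
            from start inner join end (firstobs=&i obs=&j) as e
                on compged(start.companyname, e.name, 'i') le 50;
        quit;

        /* FORCE tolerates minor attribute differences between parts */
        proc append base=want_all data=want_part force;
        run;
    %end;
%mend matchchunks;

%matchchunks(1000, 50000);  /* e.g. chunks of 1,000 over 50,000 obs */
```

Note that WANT_ALL should not exist before the first run (PROC APPEND creates it on the first call), and this does not reduce the total number of COMPGED comparisons, as the replies below point out.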

 

I would really appreciate it if anyone has some suggestions!

Thanks!

Super Contributor
Posts: 266

Re: Run compged by splitting dataset and then merge the output by using do loop

I'm not sure it is really what you ultimately want to (or should) do, but you can split end as follows:

 

DATA end_partition1;
    SET end (firstobs=1 obs=1000);
RUN;

  So for sets of 1,000 observations, use firstobs=1, obs=1000 for the first chunk, then firstobs=1001, obs=2000, and so on.

Super User
Posts: 23,306

Re: Run compged by splitting dataset and then merge the output by using do loop

You aren't really changing the number of comparisons though, are you? If not, it's not going to be any faster.

Occasional Contributor
Posts: 15

Re: Run compged by splitting dataset and then merge the output by using do loop

But it was much faster when I tried it on one subsample that I split off manually.

Also, when I run it on the whole dataset, I get the message below.

 

The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

Super Contributor
Posts: 320

Re: Run compged by splitting dataset and then merge the output by using do loop

If you're not getting the Cartesian join message on a subsample, then perhaps SAS is choosing to use a hash table for the smaller version rather than the bigger one?
Super User
Posts: 23,306

Re: Run compged by splitting dataset and then merge the output by using do loop

If you subsplit, are you still doing the same number of comparisons?

 

If the values are in a single data set and you're only subsetting the big data set, you're reducing the number of comparisons until you run it against the full data set. The only other thing I can think of is that your subsets have an additional criterion that you're not accounting for in the cross join.

 

You should be getting the same messages regardless of subset or the full data set.

Super Contributor
Posts: 320

Re: Run compged by splitting dataset and then merge the output by using do loop

It's possible that the current query time is too long for a timeout or some other issue, where 10 queries of 1 hour each are acceptable but 1 query of 10 hours isn't.
Super Contributor
Posts: 266

Re: Run compged by splitting dataset and then merge the output by using do loop

[ Edited ]

In case you want it, this macro will make 4 data files (chunk1, chunk5, chunk9, and chunk13) out of the test file.  Maybe you could adapt it.

 

data end;
   input Plan;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
;
run;


%macro datasplitter(chunksize);
    %do i = 1 %to 16 %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);
        data chunk&i;
            set end (firstobs=&i obs=&j);
        run;
    %end;
%mend;

%datasplitter(4);

 A better programmer would know how to replace the hardcoded 16 with the total number of observations and make things more flexible.  I don't.
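For what it's worth, one common way to look up the observation count is the OPEN/ATTRN/CLOSE macro functions; a sketch generalizing the macro above (untested, with the dataset name passed as a parameter):

```sas
/* Sketch: generalize the splitter by reading the dataset's
   observation count with %SYSFUNC(ATTRN(..., NLOBS)). */
%macro datasplitter(ds, chunksize);
    %local dsid nobs rc i j;
    %let dsid = %sysfunc(open(&ds));
    %let nobs = %sysfunc(attrn(&dsid, nlobs));
    %let rc   = %sysfunc(close(&dsid));

    %do i = 1 %to &nobs %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);
        data chunk&i;
            set &ds (firstobs=&i obs=&j);
        run;
    %end;
%mend datasplitter;

%datasplitter(end, 4);   /* same four chunks as the hardcoded version */
```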

Trusted Advisor
Posts: 1,312

Re: Run compged by splitting dataset and then merge the output by using do loop

Even though you propose to break your comparison down into chunks of size N_start by 1,000, you will still end up with N_start * N_end comparisons to do - no savings in overall time.

 

COMPGED on such a Cartesian comparison is expensive.  So I would suggest:

 

  (0) make sure all names are entirely in upper (or lower) case.

 

  (1) replace all instances of duplicate names in each data set with a single record containing the name and pointers/record IDs of the original observations.

 

  (2) look for cases of exact equality between start and end - save the matches and remove the matched names from both start and end.

 

  (3) can you standardize names? I.e., convert trailing "JUNIOR" to "JR.", and "JR" also to "JR.", etc. Then rerun #2.

 

  (4) avoid making comparisons in which COMPGED is certain to be over 50. For instance, establish the number of letters in each name, then do the COMPGED comparison for names of length X in start only against names of length (say) X-4 through X+4 in end; then length X+1 in start against X-3 through X+5 in end. Or, if your names have multiple words, you might additionally filter on the number of words in the name.
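Steps (0) and (4) combined might look something like the sketch below; the standardization rules and the plus-or-minus-4 length window are illustrative only, and SAS may still evaluate both ON conditions for every pair, so the length test is a filter rather than a guaranteed shortcut:

```sas
/* Sketch: upcase/standardize the names, then restrict COMPGED to
   pairs whose name lengths differ by at most 4 characters. */
data start2;
    set start;
    companyname = upcase(strip(companyname));
    companyname = tranwrd(companyname, ' JUNIOR', ' JR.');  /* illustrative rule */
run;

data end2;
    set end;
    name = upcase(strip(name));
    name = tranwrd(name, ' JUNIOR', ' JR.');
run;

proc sql noprint;
    create table want as
    select *
    from start2 s inner join end2 e
        on abs(lengthn(s.companyname) - lengthn(e.name)) le 4
           and compged(s.companyname, e.name, 'i') le 50
    order by s.companyname;
quit;
```

To get a real speedup from the length filter, you could go further and split both datasets into length bands so each PROC SQL call only sees candidate pairs.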

 

These are the sorts of processes I used to match records for mutual funds by name. Those names were almost all multiple words (vanguard explorer admiral shares) and often abbreviated in random ways (vgrd exp adm shrs), or sometimes with reordered words (vgrd expl shrs adm). This was all done to generate best matches that were subsequently inspected manually; false positives were to be avoided at the expense of missed matches.

Occasional Contributor
Posts: 15

Re: Run compged by splitting dataset and then merge the output by using do loop

Thanks all of you guys for the suggestions!
