Re: Run compged by splitting dataset and then merge the output by usin...

Sangho · Posted 01-05-2018 04:54 PM

Hi Guys,

I want to ask some help from you guys regarding reading the data and saving results by using do-loop.

I have two dataset "start" and "end" and merge two dataset if I find two names similar (using compged).

Because the number of observation is big, it takes too long so I want to do it by splitting dataset.

proc sql noprint ;
 create table want as
 select *
 from start inner join end
 on (compged(start.companyname,end.name,'i') le 50)
 order by companyname;
quit ;

Above is the code I wan to run by splitting sample 'end' into multiple subsamples.

Is there a way that

- using 'start' data as a whole

- split 'end' data for, say, every 1000 observations

- and run the above codes multiple times

- then I will have tens of 'want's

- join them and have one final 'want'.

I would really appreciate it if anyone has some suggestions!

Thanks!

HB · Posted 01-05-2018 05:05 PM

I'm not sure it is really what you ultimately want to (or should) do, but you can split end as follows:

DATA end_partition1;
    SET end (firstobs=i obs=j);
RUN;

So for sets of 1,000 observations, do i=1 j=1000, then i=1001, j=2000, etc etc

Reeza · Posted 01-05-2018 05:10 PM

You aren't really changing the number of comparisons though are you? If not, it's not going to be any faster.

Sangho · Posted 01-05-2018 05:21 PM

But it is way faster when I tried one subsample by manually splitting it.

Also, when I run it as a whole, I get the message below.

The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

snoopy369 · Posted 01-05-2018 05:31 PM

If you're not getting the cartesian join message on a subsample, then perhaps SAS is choosing to use a hash table in the smaller version rather than the bigger?

Reeza · Posted 01-05-2018 05:34 PM

If you subsplit are you still doing the same number of comparisons?

If the values are in a single data set, and you're sub-setting the big data set you're reducing the number of comparisons until you run it against the full data set. The only other thing I can think of, is your subsets have an additional criteria that you're not accounting for in the cross join.

You should be getting the same messages regardless of subset or the full data set.

snoopy369 · Posted 01-05-2018 05:29 PM

It's possible that the current query time is too long for a timeout or some other issue, where 10 queries of 1 hour each are acceptable but 1 query of 10 hours isn't.

HB · Posted 01-05-2018 06:26 PM

In case you want it, this macro will make 4 data files (chunk1, chunk5, chunk9, and chunk13) out of the test file. Maybe you could adapt it.

data end;
   input Plan;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
;
run;


%macro datasplitter(chunksize);
	%do i=1 %to 16 %by &chunksize;
    	data chunk&i;
			%let j= %eval(&i + &chunksize - 1);
			SET end (firstobs=&i obs=&j);
		run;
	%end;
%mend;

%datasplitter(4);

A better programmer would know how to replace the hardcoded 16 with the total number of observations and make things more flexible. I don't.

mkeintz · Posted 01-06-2018 04:06 PM

Even though you propose to break down your comparison to chunks of size N_start by 1,000 chunks, you will still end up with N_start *

N_end comparison to do - no savings in overall time.

COMPGED on such a Cartesian comparison is expensive. So I would suggest:

(0) make sure all names are entirely in upper (or lower) case.

(1) replace all instances of duplicate names in each data set with a single record containing the name and pointers/record id's of the original observations.

(2) look for cases of exact equality between start and end - save the matches and remove the matched names from start and end

(3) can you standardize names? I.e. convert trailing "JUNIOR" to "JR." and "JR" also to "JR.", etc. Rerun #2.

(4) avoid making comparisons in which COMPGED is certain to be over 50. For instance establish the number of letters in the names. Then do the COMPGED comparison for lengths of X in start only to cases with lengths of (say) X-4 through X+4 in end. Then length X+1 in start to X-3 through X+5 in end. Or if your names have multiple words, you might additionally filter based on the number of words in the name.

These are the sorts of processes I used to match records for mutual funds by name. Those names were almost all multiple words (vanguard explorer admiral shares) and often abbreviated in random ways (vgrd exp adm shrs) or sometimes with reordered words(vgrd expl shrs adm). This was all done to generate best matches to subsequently be inspected manually. False positives were to be avoided at the expense of missed matches.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sangho · Posted 01-07-2018 12:36 PM

Thanks all of you guys for the suggestions!

Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

Re: Run compged by splitting dataset and then merge the output by using do loop

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away