Sangho
Obsidian | Level 7

Hi Guys,

I'd like to ask for some help with reading data and saving results using a do-loop.

 

I have two datasets, "start" and "end", and I merge them whenever two names are similar (using COMPGED).

Because the number of observations is large, this takes too long, so I want to do it by splitting the dataset.

 

proc sql noprint;
    create table want as
    select *
    from start inner join end
    on (compged(start.companyname, end.name, 'i') le 50)
    order by companyname;
quit;

Above is the code I want to run after splitting the sample 'end' into multiple subsamples.

Is there a way to

- use the 'start' data as a whole,

- split the 'end' data into chunks of, say, 1,000 observations each,

- run the above code once per chunk,

- end up with tens of 'want' tables,

- and stack them into one final 'want'?

 

I would really appreciate it if anyone has some suggestions!

Thanks!

HB
Barite | Level 11

I'm not sure this is really what you ultimately want to (or should) do, but you can split end as follows:

 

DATA end_partition1;
    SET end (firstobs=i obs=j);  /* replace i and j with actual observation numbers */
RUN;

So for chunks of 1,000 observations, use i=1, j=1000, then i=1001, j=2000, and so on (note that obs= is the number of the last observation to read, not a count).
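If you wanted one final want at the end, a loop along these lines could wrap that split around your original join and stack the pieces with PROC APPEND. Just a sketch: chunked_join, end_chunk, and want_chunk are made-up names, and you would pass the total observation count in by hand.

%macro chunked_join(chunksize, total);
    %local i j;
    %do i = 1 %to &total %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);

        /* carve out the next chunk of END */
        data end_chunk;
            set end (firstobs=&i obs=&j);
        run;

        /* run the original fuzzy join against this chunk only */
        proc sql noprint;
            create table want_chunk as
            select *
            from start inner join end_chunk
            on (compged(start.companyname, end_chunk.name, 'i') le 50);
        quit;

        /* PROC APPEND creates WANT on the first pass, then appends */
        proc append base=want data=want_chunk force;
        run;
    %end;
%mend;

%chunked_join(1000, 10000);  /* chunk size, total obs in END; adjust to your data */

Sort WANT once at the very end rather than ordering each chunk, and delete it before rerunning, since APPEND adds to whatever is already there.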

Reeza
Super User

You aren't really changing the number of comparisons though, are you? If not, it's not going to be any faster.

Sangho
Obsidian | Level 7

But it was much faster when I tried one subsample that I had split manually.

Also, when I run it as a whole, I get the message below.

 

The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

snoopy369
Barite | Level 11
If you're not getting the Cartesian join message on a subsample, then perhaps SAS is choosing to use a hash table for the smaller version but not for the bigger one?
Reeza
Super User

If you split it into subsets, are you still doing the same number of comparisons?

 

If the values are in a single data set and you're subsetting the big data set, you're only reducing the number of comparisons until you run it against the full data set. The only other thing I can think of is that your subsets have an additional criterion that you're not accounting for in the cross join.

 

You should be getting the same messages whether you run a subset or the full data set.

snoopy369
Barite | Level 11
It's possible that the current query time is too long for a timeout or some other issue, where 10 queries of 1 hour each are acceptable but 1 query of 10 hours isn't.
HB
Barite | Level 11

In case you want it, this macro will make 4 data files (chunk1, chunk5, chunk9, and chunk13) out of the test file.  Maybe you could adapt it.

 

data end;
   input Plan;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
;
run;


%macro datasplitter(chunksize);
    %do i=1 %to 16 %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);
        data chunk&i;
            set end (firstobs=&i obs=&j);
        run;
    %end;
%mend;

%datasplitter(4);

 A better programmer would know how to replace the hardcoded 16 with the total number of observations and make things more flexible.  I don't.
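For the record, one common pattern is to read the observation count with the OPEN/ATTRN/CLOSE functions instead of hardcoding 16. A sketch (datasplitter2 is a made-up name):

%macro datasplitter2(dsn, chunksize);
    %local dsid nobs rc i j;
    %let dsid = %sysfunc(open(&dsn));          /* open the data set       */
    %let nobs = %sysfunc(attrn(&dsid, nobs));  /* total observation count */
    %let rc   = %sysfunc(close(&dsid));
    %do i = 1 %to &nobs %by &chunksize;
        %let j = %eval(&i + &chunksize - 1);
        data chunk&i;
            set &dsn (firstobs=&i obs=&j);
        run;
    %end;
%mend;

%datasplitter2(end, 4);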

mkeintz
PROC Star

Even though you propose to break your comparison down into chunks of size N_start by 1,000, you will still end up with N_start * N_end comparisons to do, so there are no savings in overall time.

 

COMPGED on such a Cartesian comparison is expensive.  So I would suggest:

 

  (0) make sure all names are entirely in upper (or lower) case.

 

  (1) replace all instances of duplicate names in each data set with a single record containing the name and pointers/record id's of the original observations.

 

  (2) look for cases of exact equality between start and end - save the matches and remove the matched names from start and end

 

  (3) can you standardize names? E.g., convert trailing "JUNIOR" to "JR.", and "JR" also to "JR.", etc. Then rerun #2.

 

  (4) avoid making comparisons in which COMPGED is certain to be over 50. For instance, record the number of letters in each name, then run the COMPGED comparison for names of length X in start only against names of length (say) X-4 through X+4 in end, then names of length X+1 in start against X-3 through X+5 in end, and so on. Or, if your names have multiple words, you might additionally filter on the number of words in the name.
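As a sketch of how (4) might look folded into the original query: the cheap length test filters pairs first, and COMPGED's optional cutoff argument lets it stop computing as soon as the cost exceeds the threshold. The window of 4 is illustrative; the cutoff is 51 because COMPGED returns the cutoff value itself whenever the true distance exceeds it, so a cutoff of exactly 50 would let worse matches pass the le 50 test.

proc sql noprint;
    create table want as
    select *
    from start inner join end
    on  abs(lengthn(start.companyname) - lengthn(end.name)) le 4  /* skip hopeless pairs   */
    and compged(start.companyname, end.name, 51, 'i') le 50       /* cutoff stops work early */
    order by companyname;
quit;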

 

These are the sorts of processes I used to match records for mutual funds by name. Those names were almost all multiple words (vanguard explorer admiral shares) and often abbreviated in random ways (vgrd exp adm shrs) or sometimes had reordered words (vgrd expl shrs adm). This was all done to generate best matches to be inspected manually afterwards. False positives were to be avoided at the expense of missed matches.

Sangho
Obsidian | Level 7

Thanks all of you guys for the suggestions!
