novinosrin
Tourmaline | Level 20

A technique that can be used to reduce the hash memory footprint and still avoid sorting the entire file is to index both data sets by the splitting variable first and then use BY-group processing. For example:


proc sql noprint;
  create index cusip on ibes (cusip);
  create index cusip on combine (cusip);
quit;

data want;
  if _n_ = 1 then do;
    dcl hash h (multidata:'y');
    h.definekey ("cusip");
    h.definedata ("ANNDATS", "value");
    h.definedone ();
  end;
  do until (last.cusip);
    set ibes(in=a) combine(in=b);
    by cusip;
    if a then rc = h.add();   /* load the IBES rows for this CUSIP group into the hash */
    else do;
      /* here, when B, do your FIND/FIND_NEXT retrieval and whatever logic you want */
    end;
  end;
  h.clear();
  drop rc;
run;

This way, the hash memory footprint is confined to the extent of the largest CUSIP by-group, at the expense of the overhead of creating the index and of indexed data retrieval. HTH.

hashman
Ammonite | Level 13

@constancezeng:

1. From what you've shown, you have a data type conflict between SEDOL defined as numeric in the LENGTH statement and its character type in the data sets you're processing (it contains alpha data, so it can't be numeric). First the compiler sees the LENGTH statement and stores SEDOL in the compiler symbol table as numeric; then it sees COMBINE referenced in the first SET and reads its descriptor. But in the latter, SEDOL is defined as character, hence the conflict. If you haven't seen a related error before, you will. Solution: Eliminate the LENGTH statement. (In the docs, it's used to create a PDV host variable for the corresponding hash variable, but you don't need it here since the compiler creates it using its attributes stored in the descriptor of COMBINE.)   
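
A contrived, self-contained sketch of the conflict (this is not your actual code, and the SEDOL value is made up) and of the fix:

data combine ;
  sedol = "B1YW440" ;   /* SEDOL is character in the data set              */
run ;

data want ;
  length sedol 8 ;      /* compiler records SEDOL as numeric here ...      */
  set combine ;         /* ... then finds it character in COMBINE's
                           descriptor: type conflict, the step fails       */
run ;

/* the fix: drop the LENGTH statement and let the compiler pick up
   SEDOL's attributes from the descriptor of COMBINE                       */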

 

2. I don't see from your code how the CUSIP data type conflict is possible. It cannot have anything to do with the hash table, as no such hash variable is defined. Judging from your code, CUSIP can only come from IBES since it's not kept in COMBINE and not defined elsewhere in the step. Hence, the code that resulted in this error cannot be the code you showed.  

 

3. As @novinosrin said, it's safer to use CALL MISSING (PRC, DATE), though in your case assigning standard numeric missing values to these variables explicitly doesn't hurt, as both are numeric. However, what you're effectively executing is an inner join: if SEDOL isn't found in IBES, you don't output the record. Therefore, you don't need to set PRC and DATE to missing at all; just remove the corresponding statements.

 

4. Memory shortage is what you may encounter once your step has compiled successfully and commences loading the table GROUP. If you're on a 64-bit system, its item length is 64 bytes if SEDOL is $16 or shorter and 80 bytes if it's longer. With 16 million items loaded, you're looking at about 1.25 GB of RAM footprint. If your SAS session is allotted 4 GB (which is not much to ask), you should be okay.
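
A quick back-of-the-envelope check of that ballpark (16 million items from your description, item lengths per the 64-bit hash object layout above):

data _null_ ;
  do item_length = 64, 80 ;
    gb = 16e6 * item_length / 1e9 ;   /* items times bytes per item, in GB */
    put item_length= gb= ;
  end ;
run ;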

 

5. The reason your shot at doing it with SQL has bombed is most likely an attempt by SQL to sort COMBINE behind the scenes. Since you want an inner join, you can change this default behavior by coding the MAGIC=103 option with the proc SQL statement. It will force the SQL optimizer to select the SQXJHSH access method - in other words, to store IBES in an internal hash table and look it up for every record in COMBINE without sorting the latter; i.e. to do implicitly what you're trying to do by using the hash explicitly. Note that it will only work for an inner join (not left, right, etc.), but since that's what you need, you're in luck here.
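
A hedged sketch of what that looks like (the table and column names here are guesses at your setup; adjust as needed):

proc sql magic=103 ;                /* prod the optimizer into SQXJHSH (hash join)  */
  create table want as
  select c.*, i.prc, i.date
  from combine c, ibes i
  where c.sedol = i.sedol ;         /* inner join only; left/right/full won't hash  */
quit ;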

 

6. Yet another option is: Index the file IBES and use the index to look it up for every record in COMBINE using the SET statement with the KEY= option. You can find a gazillion SAS articles on how to do that (including handling duplicate keys on the driver side) or just read the definitive opus on SAS indexes by Michael Raithel:

 

https://www.amazon.com/Complete-Guide-SAS-Indexes/dp/1590478495

 

The method is slower than a hash (or implicitly hashed SQL), but its advantage is that you don't have to worry about the memory footprint. 
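
A minimal sketch of the indexed-lookup pattern from point 6, assuming IBES carries a simple index on CUSIP and these variable names; duplicate-key handling is left out for brevity:

data want ;
  set combine ;                     /* driver: one lookup per record           */
  set ibes key=cusip / unique ;     /* direct access via the index on CUSIP    */
  if _iorc_ ne 0 then do ;          /* no match in IBES                        */
    _error_ = 0 ;                   /* reset the error flag set by the miss    */
    delete ;                        /* inner-join behavior: drop the record    */
  end ;
run ;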

 

Kind regards

Paul D.

 

constancezeng
Calcite | Level 5

Hey Paul. I'd like to first thank you and @novinosrin for your answers. I just tried the magic option that you suggested; unfortunately, I do not think that I have sufficient space, according to SAS. 

 

 

Capture4.PNG

 

In addition to that, using what novinosrin suggested, SAS has produced an output. But you can see that a lot of observations are dropped. I am still trying to figure out the reason; maybe I coded it the wrong way again (I have indexed cusip in both datasets). 

Capture5.PNG

Again, thanks for all your comments! I appreciate your work and look forward to your reply! 

hashman
Ammonite | Level 13

@constancezeng:

You just don't have enough space in your WORK library to write the output data set you are writing.

As long as this is the case, it doesn't matter what kind of file matching method you're using.

Either clean your WORK library of the junk you don't need or work with your admin to give you more disk space in WORK (or write to a different, permanent, library where you do have enough space).

 

From your log, SQL reacts as I've expected: Note the "merge join has been transformed into a hash join" message.

And yes, you're creating an inner join: (a) because of how your SQL query is structured and (b) otherwise, you wouldn't receive the above message from the optimizer; no join other than an inner join can be prodded into hashing by MAGIC=103.

If you indeed want a left join, you have to code it explicitly. But remember that MAGIC=103 won't work in this case (it makes no sense to me and a number of other big data guys I know, but the fact is, it won't).
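
For completeness, an explicitly coded left join might look like this (hypothetical column names; and again, MAGIC=103 will not turn this into a hash join):

proc sql ;
  create table want as
  select c.*, i.prc, i.date
  from combine c left join ibes i
  on c.sedol = i.sedol ;
quit ;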

 

Two more suggestions:

  • Turn on the system option FULLSTIMER to see memory usage.
  • Turn on the system option MSGLEVEL=I and add the option _METHOD (in addition to MAGIC=103) to the SQL statement to see the SQL access methods reported in the log.

In fact, since you are dealing with fairly large data, I'd recommend keeping these system options on at all times, i.e. just code:

option fullstimer msglevel=i ;

The more relevant info in the SAS log, the better.
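
Putting the two suggestions together with the earlier query (the query itself is the same hypothetical sketch as above):

options fullstimer msglevel=i ;

proc sql magic=103 _method ;        /* _METHOD prints the chosen access methods    */
  create table want as
  select c.*, i.prc, i.date
  from combine c, ibes i
  where c.sedol = i.sedol ;         /* look for SQXJHSH in the log                 */
quit ;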

 

Kind regards

Paul D.

    

DonH
Lapis Lazuli | Level 10

A corollary to @hashman's suggestion about WORK: if you have access to a location with more space, define a libname to it (e.g., MYWORK) and then use the USER option to have SAS use that location for any data sets with a one-level name. That allows you to use a different location for your temporary data sets without having to update each reference.
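
A minimal sketch (the path is a placeholder for wherever you have room):

libname mywork "/bigdisk/sastemp" ;   /* location with more free space             */
options user=mywork ;                 /* one-level names now resolve to MYWORK     */

data want ;                           /* WANT is written to MYWORK, not WORK       */
  set combine ;
run ;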

Just make sure to delete all the SAS data sets you created. If the location only contains your WORK data sets (e.g., if you create a sub-folder in that location with more space), you can simply run something like this to delete them all:

proc datasets lib=mywork nolist kill;
run;

constancezeng
Calcite | Level 5

Thank you Paul! Problem solved after getting a new hard drive. In the end I used the magic=103 option; it took around 40 minutes to finish the merge, which is amazing.

 

Thanks for all the replies! Wish you a great weekend. 

novinosrin
Tourmaline | Level 20

Very well acknowledged; @DonH's and @hashman's magic= join algorithm insights are incredible. Nice. However, as Guru knows, I pay attention and take notes, and I feel like I've gone back to kindergarten: my understanding was (until this thread) that the choice among the magic=101, 102, or 103 join algorithms is made by the SQL optimizer, and that the user overrides it knowing the data and, of course, the resources available. Well well, my attention to detail has taken a beating yet again, and now I'm confused about the default: if there is one, why that default, and is the default really optimal, or even reliable?

Jesus! I like the subject "Harsh" better than "Hash object". Makes sense, as the SQL optimizer drives me nuts.

hashman
Ammonite | Level 13

@novinosrin:

The optimizer might choose the SQXJHSH access method by default if it deems one table small enough to load into a memory-resident hash table. In the days of yore, when the possibility of coding MAGIC= was yet undisclosed, it was possible to prod the optimizer into choosing SQXJHSH by coding the BUFFERSIZE= option with a big argument, such as 10000000, basically telling SQL "you have enough memory". It usually works when one table is quite big (and would seem laborious to sort) and the other is much smaller. And of course now, just as then, it works only for inner joins, for reasons not quite clear to me. So if you want a left, right, or full join via a hash, the hash object is the only choice.
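
For instance, a hedged sketch of the BUFFERSIZE= nudge (BIGTAB, SMALLTAB, KEY, and VALUE are hypothetical names):

proc sql buffersize=10000000 _method ;  /* tell SQL it has plenty of memory        */
  create table want as
  select b.*, s.value
  from bigtab b, smalltab s
  where b.key = s.key ;                 /* inner join: look for SQXJHSH in the log */
quit ;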

 

One exception is a situation where you have a simple integer key or a key that can be very inexpensively converted to an integer, such as an up to $6 (or $7 on the mainframe) character key that converts to an integer as INPUT (Key, PIB6.). In this case, simple array hashing via linear probing is faster than the hash object and much more conservative memory-wise.
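
A rough sketch of that kind of array-based scheme (everything here is hypothetical: a $6 key named KEY, a numeric satellite VALUE, lookup table SMALL, driver table BIG, and a prime table size comfortably larger than the number of distinct keys):

data want ;
  array hkey [0:999982] _temporary_ ;        /* key slots                          */
  array hval [0:999982] _temporary_ ;        /* parallel satellite slots           */
  /* load the small table */
  do until (done1) ;
    set small (keep=key value) end=done1 ;
    _k = input (key, pib6.) ;                /* cheap char-to-integer conversion   */
    _h = mod (_k, 999983) ;                  /* home slot                          */
    do while (hkey[_h] ne .) ;               /* linear probing on collision        */
      _h = mod (_h + 1, 999983) ;
    end ;
    hkey[_h] = _k ;
    hval[_h] = value ;
  end ;
  /* look up every record of the big table */
  do until (done2) ;
    set big end=done2 ;
    _k = input (key, pib6.) ;
    _h = mod (_k, 999983) ;
    found = 0 ;
    do while (hkey[_h] ne .) ;
      if hkey[_h] = _k then do ;
        value = hval[_h] ;
        found = 1 ;
        leave ;
      end ;
      _h = mod (_h + 1, 999983) ;
    end ;
    if found then output ;                   /* inner-join behavior                */
  end ;
  stop ;
  drop _k _h found ;
run ;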

 

Kind regards

Paul D.

 

 

 

novinosrin
Tourmaline | Level 20

Got it. Very very intriguing and interesting. Thank you Guru as always for your time. Hmm, Super info. 
