Solved: Re: Merge large datasets using sql or harsh objects (need help!!)

constancezeng · Posted 08-30-2019 03:43 PM

Hey guys. New sas user here. I am having trouble merging two super large datasets named ibes and combine (around 30g). I've tried using proc sql but it took me forever and still I cannot have the results.

So I am thinking about merging those two datasets based on sedol=cusip (same company id but different var names in ibes and combine) and ANNDATS<=date<=ANNDATS+365. So each date will correspond to a PRC in combine; and each price in ibes will have 200 PRC (from combine). At the end I need to calculate the number of PRCs for each price in ibes dateset.

I read some posts that suggested a use of hush objects. Here is my attempt:

But somehow sas has returned me an error: : Variable cusip has been defined as both character and numeric..

Please let me know if you have better ideas in terms of how to deal with those two datasets! Appreciate for your help!!

hashman · Posted 08-30-2019 11:05 PM

@constancezeng:

1. From what you've shown, you have a data type conflict between SEDOL defined as numeric in the LENGTH statement and its character type in the data sets you're processing (it contains alpha data, so it can't be numeric). First the compiler sees the LENGTH statement and stores SEDOL in the compiler symbol table as numeric; then it sees COMBINE referenced in the first SET and reads its descriptor. But in the latter, SEDOL is defined as character, hence the conflict. If you haven't seen a related error before, you will. Solution: Eliminate the LENGTH statement. (In the docs, it's used to create a PDV host variable for the corresponding hash variable, but you don't need it here since the compiler creates it using its attributes stored in the descriptor of COMBINE.)

2. I don't see from your code how the CUSIP data type conflict is possible. It cannot have anything to do with the hash table, as no such hash variable is defined. Judging from your code, CUSIP can only come from IBES since it's not kept in COMBINE and not defined elsewhere in the step. Hence, the code that resulted in this error cannot be the code you showed.

3. As @novinosrin said, it's safer to use CALL MISSING (PRC, DATE), though in your case assigning standard numeric missing values to these variables explicitly doesn't hurt, as both are numeric. However, in your case what you're effectively executing is an inner join: If SEDOL isn't found in IBES, you don't output the record. Therefore, you don't need to set PRC and DATE to missing at all; so, just remove the corresponding statements.

4. Memory shortage is what you may encounter once your step has compiled successfully and commences on loading the table GROUP. If you're on a 64-bit system, its item length is 64 if SEDOL is $16 or shorter and 80 if it's longer. With 16 million items loaded, you're looking at about 1.25 GB of RAM footprint. If your SAS session is allotted 4 GB (which is not much to ask), you should be okay.

5. The reason your shot at doing it with SQL has bombed is most likely an attempt by SQL to sort COMBINE behind the scenes. Since you want an inner join, you can change this default behavior by coding the MAGIC=103 option with the proc SQL statement. It will force the SQL optimizer to select the SQXJHSH access method - in other words, to store IBES in an internal hash table and look it up for every record in COMBINE without sorting the latter; i.e. to do implicitly what you're trying to do by using the hash explicitly. Note that it will only work for an inner join (not left, right, etc.), but since that's what you need, you're in luck here.

6. Yet another option is: Index the file IBES and use the index to look it up for every record in COMBINE using the SET statement with the KEY= option. You can find a gazillion SAS articles on how to do that (including handling duplicate keys on the driver side) or just read the definitive opus on SAS indexes by Michael Raithel:

https://www.amazon.com/Complete-Guide-SAS-Indexes/dp/1590478495

The method is slower than a hash (or implicitly hashed SQL), but it's advantage is that you don't have to worry about the memory footprint.

Kind regards

Paul D.

View solution in original post

novinosrin · Posted 08-30-2019 04:01 PM

1. Why do you need a length or attrib statement when at compilation SAS will read the descriptor portion of the dataset and creates a PDV to hold an observation(record) at execution?

2. Hash object variables inherit their type/length from PDV host variables. In your case I can't see any variable(like that of assignment /explicit retain)outside of the PDV host that is loaded/retrieved in/from hash object. --Very important fundamental !!!!!!!

3. prc=.; date=.; *Why play with representation when call missing will assign missing values appropriately pertaining to type inherited from PDV host variable

 rc=group.find(key:sedol);

 if rc ne 0 then call missing(prc,date);

constancezeng · Posted 08-30-2019 04:37 PM

Hey novinosrin! Thank you so much for the reply. I guess you can tell that I don't know much about SAS while I am struggling to learn by working... Do you have any suggestions? I am trying the following code:

novinosrin · Posted 08-30-2019 04:43 PM

Fair enough. This time your approach syntactically is fine. Did it meet your logical requirement i.e did you get the results what you need?

You are the best judge to assess the results.

Kurt_Bremser · Posted 08-31-2019 01:15 AM

Do you get ERRORs or WARNINGs? Is the result not what you expected?

Please post the complete log of the step (copy/paste into a window opened with the {i} button).

For posting code, just copy/paste it into a "little running man" window. Don't post pictures.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 08-30-2019 04:47 PM

If you don't define the variables before naming them in the HASH object method calls then the data step compiler will define them as numeric.

I usually use a line like this to define the variables from the existing dataset before creating the hash object without actually reading in any of the data.

if 0 then set dsname;

novinosrin · Posted 08-30-2019 04:57 PM

@Tom wrote:

If you don't define the variables before naming them in the HASH object method calls then the data step compiler will define them as numeric.

I usually use a line like this to define the variables from the existing dataset before creating the hash object without actually reading in any of the data.
if 0 then set dsname;

Thank you Sir . I am afraid PDV host gets the precedence whether or not variables are defined before naming in Hash obj

For example,

data _null_ ;
   dcl hash H () ;
   h.definekey  ("name") ;
   h.definedata ("name","height","weight") ;
   h.definedone () ;
name="xxx";
height=6868;
weight=5676;
rc=h.add();
h.output(dataset:'test');
stop;
run;

proc contents data=test;
run;

Or a better example

data _null_ ;
   dcl hash H () ;
   h.definekey  ("name") ;
   h.definedata ("name","height","weight") ;
   h.definedone () ;
do until(z);
 set sashelp.class end=z;
 rc=h.add();
end;
h.output(dataset:'test');
stop;
run;

hashman · Posted 08-31-2019 12:20 AM

@Tom:

>If you don't define the variables before naming them in the HASH object method calls then the data step compiler will define them as numeric.<

Nope. The compiler doesn't see anything inside the DefineKey or DefineData methods' parentheses, as they're executed after it has already done its job, i.e. at the run time. As far as the hash object directives go, the compiler isn't concerned with anything but (a) the correctness of the object.dot syntax, (b) making sure that the non-scalar variable (used as the hash object name, such as H) is defined only once, and (c) checking that if a hash object is referenced by a method, operator, or attribute, it's been defined.

In other words, if you do this:

1 data _null_ ;                                                                 
2   dcl hash h () ;                                                             
3   h.definekey ("name") ;                                                      
4   h.definedata ("weight") ;                                                   
5   h.definedone () ;                                                           
6 run ;                                                                         
                                                                                
ERROR: Undeclared key symbol name for hash object at line 5 column 3.           
ERROR: DATA STEP Component Object failure.  Aborted during the EXECUTION phase.

the step compiles just fine and doesn't define any non-auto PDV variables at all; the compiler sees neither NAME nor WEIGHT. It is during the run time (aka EXECUTION phase in the hash log speak) that DefineDone checks for the presence of NAME in the PDV, and having not found it, aborts the step (it doesn't even bother to go on to check for WEIGHT). But if you code:

data _null_ ;                  
  if 0 then set sashelp.class ;
  dcl hash h () ;              
  h.definekey ("name") ;       
  h.definedata ("weight") ;    
  h.definedone () ;            
  stop ;                       
run ;

everything is hunky-dory. However, if you coded:

data _null_ ;              
  dcl hash h () ;          
  h.definekey ("name") ;   
  h.definedata ("weight") ;
  h.definedone () ;        
  stop ;                   
  set sashelp.class ;      
run ;

everything is fine, too. DefineDone (at run time) couldn't care less where in the code - before it or after it - the compiler has seen NAME and WEIGHT at compile time; all it cares about is that the host variables called NAME and WEIGHT are present in the PDV when DefineDone is called at run time. Note that if your assertion that these variables are defined as numeric if the compiler sees them after the hash definition cluster were true, the compiler would abort the step at compile time, as there would be a data type conflict between NAME coming from SASHELP.CLASS and DefineKey upstream.

I more than understand that from the standpoint of an experienced pre-hash SAS programmer, this kind of behavior looks odd. Indeed, before the advent of the hash object, we had accustomed that if a variable is named anywhere in the step, the compiler sees it and compiles it according to its set of (fairly convoluted) rules. When I got my hands on the early version 9.0 circa 2002, I was struck by this behavior, just like anyone in my shoes would be. Luckily, for a length of time I was in direct email contact with the developers, who helped me absorb the concept of parameter type matching rather quickly - in fact, one of them wrote, verbatim, "the compiler doesn't see anything within those parentheses", at which point if kind of instantly clicked.

This is the very reason @DonH and I dedicated the whole section 2.2 (CREATE Operation) to illuminating the things of this nature from different angles in:

https://www.amazon.com/Management-Solutions-Using-Table-Operations/dp/1629601438

Kind regards

Paul D.

DonH · Posted 08-31-2019 08:29 AM

As a followup to @hashman's spot-on comments, you might want to review the communities article The SET Statement's Compile Time Functions that provides some additional context for why you sometimes see a conflict about numeric for character conflicts when defining hash objects.

The key point is that the compiler does not recognize quoted strings as variable names (as @hashman said in his comment). If you never reference the variable that is defined to the hash object that is a quoted string in your DATA step, you get the execution time message Undeclared key symbol . . . If you do reference it elsewhere in your program, the compiler will add it to the PDV at compile time and typically will define the variable to the PDV as a numeric variable. The article above provides some examples.

Astounding · Posted 08-30-2019 06:29 PM

How many records are in each of your data sets, as you begin? (Ballpark number is OK.)

constancezeng · Posted 08-30-2019 07:19 PM

I have around 300,000,000 rows in combine and 16,000,000 rows in ibes. Thanks for asking!

Tom · Posted 08-30-2019 07:35 PM

You want to compare 300M records to each of 16M records? No wonder your query never finished.

Are your source datasets already sorted?

What are the min/max number of records per BY group for both of the files? Perhaps one of those will be small enough that you can load each BY group from one of the files into memory/hash.

SASKiwi · Posted 08-30-2019 06:36 PM

Have you tried a simple DATA step MERGE? This requires sorted or indexed keys though. If you are essentially joining all rows from both source tables, performance will be limited by IO anyway, regardless of what joining technique you use.

You should also explore other options like using COMPRESS to shrink the size of your SAS datasets to begin with. If your datasets are wide and contain a lot of text then compressing can really speed things up.

constancezeng · Posted 08-30-2019 07:22 PM

Yes I tried. Both merge and sql gave me an error says im running out of space. And I used the option compress already. Thank you for your suggestions!

SASKiwi · Posted 08-30-2019 07:31 PM

In my experience COMPRESS = BINARY is best. Are you running SAS on a PC or a remote server? You need to fix the out of space issue first as any join technique is going to need a similar amount of free space. Your SAS administrator should be able to help if you are using a remote SAS server.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away