BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
gsanchez123
Calcite | Level 5

Hello, 

I am attempting my first table lookup, but I am not achieving my intended goal after reading several tutorials. I’d really appreciate some guidance. I don’t know SQL yet, but I hear it can be a great tool for merges.

 

This is what my primary dataset looks like: 

RowId

DTE_BIRTH

Name

Primary

Secondary

Third

1508634

 

 

A0470

B8290

#######

3895072

 

 

A0780

B800

R1970

3895072

 

 

A0780

B800

#######

318053

 

 

A0840

B800

#######

 

My reference table looks like this:

CDE_DIAG_andVERSION

CDE_DIAG

CDE_ICD_VERSION

DSC_25

109

A0470

9

CHOLERA DUE TO VIBRIO CHOLERAE                              

 

I would like to attach the character string (DSC_25) from the reference table which corresponds to the primary diagnosis variable above(Primary in the main dataset, CDE_DIAG in the reference table). 

 

Here is the main piece of code I have written. I changed the Primary and DSC_25 variables into a 6-length character string variable named mergevariable (previously one was a numeric, and the other was a character). But my results end up looking like a concatenation, D'oh!

 

proc sort data=sth_icd10; by mergevariable; run;
proc sort data=reftable_merge; by mergevariable; run;

data all;
merge sth_icd10 reftable_merge;
by mergevariable;
drop mergevariable;
run;

 

Appreciate any thoughts or guidance on how I can perform this simple task and how I can grow more as a beginner programmer wanting to do more data merges and table lookups. Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

Your pseudo-code is correct.

I assume that sth_icd10 is your primary dataset with the "mergevariable"  name is primary 

and that reftable_merge is your reffernce table name with the "mergevariable" name is cde_diag.

 

There are several methods to deal with what you want:

 

1) 

proc sort data=sth_icd10; by primary; run;
proc sort data=reftable_merge; by cde_diag; run;

data all;
merge sth_icd10 
      reftable_merge (rename = (cde_diag = primary));
by primary;
      /*  drop mergevariable; << no need */
run;

2) use sql:

   

proc sql;
        create table all
        as select a.* , b.dsc_25
        from sth_icd10 as a
        left join reftable_merge as b
        on primary = cde_diag;
quit;

3) using format: (I believe this is the most efficient)

    

data cntl;
  set reftable_merge;
       retain fmtname "$DESC";
      start = cde_diag;
      end = start;
      label = dsc_25;   
      keep fmtname start end label;
run;
proc sort data=cntl; by start; run;
proc format lib=worl cntlin=cntl noprint; run; data all; set sth_icd10; length dsc_25 $10; /*adapt length to max desired */ dsc_25 = put(primary, $desc. ); run;

View solution in original post

17 REPLIES 17
Shmuel
Garnet | Level 18

Your pseudo-code is correct.

I assume that sth_icd10 is your primary dataset with the "mergevariable"  name is primary 

and that reftable_merge is your reffernce table name with the "mergevariable" name is cde_diag.

 

There are several methods to deal with what you want:

 

1) 

proc sort data=sth_icd10; by primary; run;
proc sort data=reftable_merge; by cde_diag; run;

data all;
merge sth_icd10 
      reftable_merge (rename = (cde_diag = primary));
by primary;
      /*  drop mergevariable; << no need */
run;

2) use sql:

   

proc sql;
        create table all
        as select a.* , b.dsc_25
        from sth_icd10 as a
        left join reftable_merge as b
        on primary = cde_diag;
quit;

3) using format: (I believe this is the most efficient)

    

data cntl;
  set reftable_merge;
       retain fmtname "$DESC";
      start = cde_diag;
      end = start;
      label = dsc_25;   
      keep fmtname start end label;
run;
proc sort data=cntl; by start; run;
proc format lib=worl cntlin=cntl noprint; run; data all; set sth_icd10; length dsc_25 $10; /*adapt length to max desired */ dsc_25 = put(primary, $desc. ); run;
gsanchez123
Calcite | Level 5

Thanks, Schmuel. For my primary dataset I have 800k observations, for the reference table there are 10k potential ICD codes. Which method do you think would be most appropriate for this situation? Forgot to mention this important detail. 

Shmuel
Garnet | Level 18

The merge method will always run, no matter how big are the data sets.

The format method and sql method are worth to try and compare run time.

Kurt_Bremser
Super User

@gsanchez123 wrote:

Thanks, Schmuel. For my primary dataset I have 800k observations, for the reference table there are 10k potential ICD codes. Which method do you think would be most appropriate for this situation? Forgot to mention this important detail. 


10,000 codes. Let's assume the combined length of code and text sum up to 100 characters, so you will need ~1MB for the raw data of the format. That will fit easily in available memory.

You can make the format permanent by storing it in a catalog that is included in your fmtsearch option.

 

So I advise using the format method, as it will combine efficiency with easily readable code.

hashman
Ammonite | Level 13

@Kurt_Bremser :

With 10,000, it hardly matters which lookup technique to use as long as it's not woefully inefficient (like a brute force sequential search). Methinks SQL is the most concise and easy to code - and most efficient to boot since (for an inner join) the optimizer will store the needed keys (whether compound or not) and data internally in a hash table. Hence, on part of the programmer, no effort to organize the most efficient search would be required.  

 

Of course, if the search must be done in the DATA step, it's a different story, and a format comes in quite handy. However, in this case I'll still vote for the hash object for several reasons:

 

  • It's much more versatile and fits the purpose more squarely since it's designed to look up discrete key values, while the formats are designed to look up ranges (for them, looking up discrete key values is a side effect of looking up ranges with identical endpoints).
  • The hash object can store items with duplicate key-values and iterate through same-key groups. I don't know whether or how it can be done with a format.
  • With the hash object, there's no need to store anything in a catalog and/or specify the format search path. The DATASET argument-tag goes directly to the library with the data set from which to extract the lookup data; and 10,000 items will load in 0 seconds.
  • The hash object handles compound keys and multiple satellite data variables naturally. With a format, one has to jump through hoops to handle these eventualities.
  • If the lookup source data set should change, there's no need to recompile anything with the hash object, while the format has to be recompiled every time the lookup source data changes.  

Suum cuique.

 

Kind regards

Paul D.

  

Kurt_Bremser
Super User

The reasons why I prefer the format:

  • IMO it's easier to code, especially for a beginner
  • once it's defined, it only takes the fmtsearch option, no other code needs to be run
  • it can be used anywhere, not only in a data step

The hash object is data step only, and needs to be repeatedly coded in every step where the lookup is needed, so at least some macro coding will be required; all things not easily done by most of the people we encounter here.

hashman
Ammonite | Level 13

@Kurt_Bremser :

 

Let me state up front that I love user-defined formats and find them indispensable under various circumstances. Having said that:

 

  • IMO it's easier to code, especially for a beginner

Depends on what the "beginner" begins with. Coding CNTLIN= data set is simple in its most primitive form, just like coding a hash is simple (in fact, even simpler and less verbose) in its primitive form. In fact, in/formats are quite a tricky SAS topic, and even advanced folks happen to trip over them. For example, I've seen quite a few at a loss why an in/format can't directly pair a numeric argument with a numeric response. In/formats are the vehicle that makes it possible for SAS to extract all its mileage from only two data types. For example, try making a newbie get the distinction between a character format and character informat, much less understand why an in/format can pair character-to-character but not numeric-to-numeric.    

 

  • once it's defined, it only takes the fmtsearch option, no other code needs to be run
  • it can be used anywhere, not only in a data step

But of course.

 

  • The hash object is data step only.

Nope; at least not any more. You can FCMP it and use as a function anywhere you want, including SQL. All you have to supply is a CMPLIB reference as a system option. Moreover, unlike in/formats, a hash can easily pair numeric-to-numeric without type conversions, not to mention that it can handle composite keys (with components of either data type) as an argument and composite data as a response. 

 

  • and needs to be repeatedly coded in every step where the lookup is needed, so at least some macro coding will be required

Nope, for the same reason as stated above; and for the same reason no macro coding is required, either. In fact, "macro thinking" is foreign to hash object programming because of its run time nature and the fact that its argument tags accept SAS run-time expressions.    

 

  • all things not easily done by most of the people we encounter here

Truer words have never been said. But if we confine out discussion here to what is "easily done by most of the people we encounter here", neither they will learn much from more experienced folks nor the latter will learn much from each other.   

 

Kind regards

Paul D.

 

 

Kurt_Bremser
Super User

As I said, I prefer the format approach, which does not rule out using any other method where appropriate (Maxim 14). As you can see, there is Maxim 51.

But there's also Maxim 29.

If you pack hash code into a user defined function, you somehow lose the composite data functionality of hash (either you have to use multiple functions, or deliver a compound result that needs post-processing).

And you have an additional layer of code between you and the original data source, which may make it harder to follow the data path through and puts even more importance on proper documentation/coding habits and the necessary infrastructure for this. Something that J. Random Sasuser rarely has.

(Granted, creating a global format will also need proper documentation and keeping the source code for the format at hand)

Correct me if I'm wrong here.

 

Nice discussion, BTW. And I once again was reminded of Maxim 13.

 

hashman
Ammonite | Level 13

@Kurt_Bremser :

 

A nice discussion indeed; thanks for keeping the embers hot. 

I don't see where you may be wrong: We're merely talking about two tools, both of which can be used well for the same purpose. Let's just agree that we're only talking about table lookup, since with respect to broader functionalities, hash dwarfs in/formats by an order of magnitude. As far as table lookup is concerned:

  • If you pack hash code into a user defined function, you somehow lose the composite data functionality of hash

Not really, since you can compile a call routine rather than a function and parameterize it with any VARARGS you want to return the data values in case of a match to the PDV. 

  • And you have an additional layer of code between you and the original data source.

True, but I believe only for the time being. The fact that in FCMP the hash data source has to be hard-coded is an abnormality running contrary to everything the hash is about. In the DATA step, the DATASET argument tag accepts a character expression. In time, FCMP will accept it as the value of a function's argument; and then when it's called, the data source will be evident from the code.

 

Kind regards

Paul D.  

ChrisNZ
Tourmaline | Level 20
Another major difference between hash and formats is the scalability. Don't even try to load a format with millions of values.Both have their place of course, and it's nice to be able to choose.
hashman
Ammonite | Level 13

@Shmuel

"There are several methods to deal with what you want".

  • Yes, "including but not limited to". Without mentioning the SAS indexes and the hash object, the list is quite incomplete. 

 

"using format: (I believe this is the most efficient)"

  • That's what many folks had thought before the advent of direct-addressing lookup methods in SAS (key-indexing, bitmapping, hashing).
  • As a lookup table, a format is more efficient than MERGE, unless the input is intrinsically sorted.
  • Not necessarily more efficient than SQL - depends on a whole variety of circumstances (for example, whether or not the optimizer decides to hash).
  • Less efficient and more resource-consumptive than direct-addressing methods.
  • More cumbersome if more than one variable is needed as the format's response. Delimiter-concatenating on the way in and unstringing on the way out is possible; but it's a pretty crude kludge , inefficient to boot. (A better subterfuge is a numeric format with the observation number N as a response and SET ... POINT=N downstream after the search to get the values of the needed variables if there's a match.)
  • Dealing with compound keys is no less cumbersome, as they have to be concatenated both in the format and in the code consuming it. And concatenated smartly, too, since mindlessly feeding the key components into CATS or CATX can result into scrambling two different compound keys, such as [A1,B] and [A,1B], into the same format key [A1B] (one solution to avoid it is to use CATQ with a delimiter).    

 

As a side note, the CNTLIN= data set requires only 3 variables: FMTNAME, START, LABEL. END should be included if the endpoints of the format's ranges are different; but this is not the case here (and proc FORMAT will set END=LABEL itself if only LABEL is included). But there's one extra variable, which, IMO, should never be omitted from CNTLIN=, and that variable is HLO="o", coupled with a label value different from any other range response (corresponds to OTHER= on the IN/VALUE statement). Without including it, one doesn't let SAS know which response to use if there's no match. E.g., in case of a character format and in lieu of appropriately set OTHER=, the value of the argument will be given to the format's response if the argument is not within one of the format's ranges. In most situations, this is not what one would expect the format to do.

 

Kind regards

Paul D.  

Shmuel
Garnet | Level 18

@hashman , thanks for your comprehensive answer.

 

hashman
Ammonite | Level 13

@Shmuel:

Thanks for thanking for what I've chided myself for excessive verbosity. Still have to work on my English (alas, my third language) to make it at least as terse as some of my SAS programs ;).  

ChrisNZ
Tourmaline | Level 20

@hashman I reckon too much explanation (not saying that's the case in your reply) is better than too little in these pages

I could give longer answers at times but I don't have the time or the motivation (and I don't chide myself for it 🙂 ) .

So good on you. And good on your English too (as far as I can see from my English-as-a-late-second language) viewpoint.

 

That's what I meant by giving you a like, but since it's expend-your-thoughts time, here we go.

 

Back to terse mode now...  Have a nice weekend!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 2457 views
  • 5 likes
  • 5 in conversation