Hello,
I am attempting my first table lookup, but I am not achieving my intended goal after reading several tutorials. I’d really appreciate some guidance. I don’t know SQL yet, but I hear it can be a great tool for merges.
This is what my primary dataset looks like:
RowId | DTE_BIRTH | Name | Primary | Secondary | Third |
1508634 |
|
| A0470 | B8290 | ####### |
3895072 |
|
| A0780 | B800 | R1970 |
3895072 |
|
| A0780 | B800 | ####### |
318053 |
|
| A0840 | B800 | ####### |
My reference table looks like this:
CDE_DIAG_andVERSION | CDE_DIAG | CDE_ICD_VERSION | DSC_25 |
109 | A0470 | 9 | CHOLERA DUE TO VIBRIO CHOLERAE |
I would like to attach the character string (DSC_25) from the reference table which corresponds to the primary diagnosis variable above(Primary in the main dataset, CDE_DIAG in the reference table).
Here is the main piece of code I have written. I changed the Primary and DSC_25 variables into a 6-length character string variable named mergevariable (previously one was a numeric, and the other was a character). But my results end up looking like a concatenation, D'oh!
proc sort data=sth_icd10; by mergevariable; run;
proc sort data=reftable_merge; by mergevariable; run;
data all;
merge sth_icd10 reftable_merge;
by mergevariable;
drop mergevariable;
run;
Appreciate any thoughts or guidance on how I can perform this simple task and how I can grow more as a beginner programmer wanting to do more data merges and table lookups. Thanks.
Your pseudo-code is correct.
I assume that sth_icd10 is your primary dataset with the "mergevariable" name is primary
and that reftable_merge is your reffernce table name with the "mergevariable" name is cde_diag.
There are several methods to deal with what you want:
1)
proc sort data=sth_icd10; by primary; run;
proc sort data=reftable_merge; by cde_diag; run;
data all;
merge sth_icd10
reftable_merge (rename = (cde_diag = primary));
by primary;
/* drop mergevariable; << no need */
run;
2) use sql:
proc sql;
create table all
as select a.* , b.dsc_25
from sth_icd10 as a
left join reftable_merge as b
on primary = cde_diag;
quit;
3) using format: (I believe this is the most efficient)
data cntl;
set reftable_merge;
retain fmtname "$DESC";
start = cde_diag;
end = start;
label = dsc_25;
keep fmtname start end label;
run;
proc sort data=cntl; by start; run;
proc format lib=worl cntlin=cntl noprint; run;
data all;
set sth_icd10;
length dsc_25 $10; /*adapt length to max desired */
dsc_25 = put(primary, $desc. );
run;
Your pseudo-code is correct.
I assume that sth_icd10 is your primary dataset with the "mergevariable" name is primary
and that reftable_merge is your reffernce table name with the "mergevariable" name is cde_diag.
There are several methods to deal with what you want:
1)
proc sort data=sth_icd10; by primary; run;
proc sort data=reftable_merge; by cde_diag; run;
data all;
merge sth_icd10
reftable_merge (rename = (cde_diag = primary));
by primary;
/* drop mergevariable; << no need */
run;
2) use sql:
proc sql;
create table all
as select a.* , b.dsc_25
from sth_icd10 as a
left join reftable_merge as b
on primary = cde_diag;
quit;
3) using format: (I believe this is the most efficient)
data cntl;
set reftable_merge;
retain fmtname "$DESC";
start = cde_diag;
end = start;
label = dsc_25;
keep fmtname start end label;
run;
proc sort data=cntl; by start; run;
proc format lib=worl cntlin=cntl noprint; run;
data all;
set sth_icd10;
length dsc_25 $10; /*adapt length to max desired */
dsc_25 = put(primary, $desc. );
run;
Thanks, Schmuel. For my primary dataset I have 800k observations, for the reference table there are 10k potential ICD codes. Which method do you think would be most appropriate for this situation? Forgot to mention this important detail.
The merge method will always run, no matter how big are the data sets.
The format method and sql method are worth to try and compare run time.
@gsanchez123 wrote:
Thanks, Schmuel. For my primary dataset I have 800k observations, for the reference table there are 10k potential ICD codes. Which method do you think would be most appropriate for this situation? Forgot to mention this important detail.
10,000 codes. Let's assume the combined length of code and text sum up to 100 characters, so you will need ~1MB for the raw data of the format. That will fit easily in available memory.
You can make the format permanent by storing it in a catalog that is included in your fmtsearch option.
So I advise using the format method, as it will combine efficiency with easily readable code.
With 10,000, it hardly matters which lookup technique to use as long as it's not woefully inefficient (like a brute force sequential search). Methinks SQL is the most concise and easy to code - and most efficient to boot since (for an inner join) the optimizer will store the needed keys (whether compound or not) and data internally in a hash table. Hence, on part of the programmer, no effort to organize the most efficient search would be required.
Of course, if the search must be done in the DATA step, it's a different story, and a format comes in quite handy. However, in this case I'll still vote for the hash object for several reasons:
Suum cuique.
Kind regards
Paul D.
The reasons why I prefer the format:
The hash object is data step only, and needs to be repeatedly coded in every step where the lookup is needed, so at least some macro coding will be required; all things not easily done by most of the people we encounter here.
Let me state up front that I love user-defined formats and find them indispensable under various circumstances. Having said that:
Depends on what the "beginner" begins with. Coding CNTLIN= data set is simple in its most primitive form, just like coding a hash is simple (in fact, even simpler and less verbose) in its primitive form. In fact, in/formats are quite a tricky SAS topic, and even advanced folks happen to trip over them. For example, I've seen quite a few at a loss why an in/format can't directly pair a numeric argument with a numeric response. In/formats are the vehicle that makes it possible for SAS to extract all its mileage from only two data types. For example, try making a newbie get the distinction between a character format and character informat, much less understand why an in/format can pair character-to-character but not numeric-to-numeric.
But of course.
Nope; at least not any more. You can FCMP it and use as a function anywhere you want, including SQL. All you have to supply is a CMPLIB reference as a system option. Moreover, unlike in/formats, a hash can easily pair numeric-to-numeric without type conversions, not to mention that it can handle composite keys (with components of either data type) as an argument and composite data as a response.
Nope, for the same reason as stated above; and for the same reason no macro coding is required, either. In fact, "macro thinking" is foreign to hash object programming because of its run time nature and the fact that its argument tags accept SAS run-time expressions.
Truer words have never been said. But if we confine out discussion here to what is "easily done by most of the people we encounter here", neither they will learn much from more experienced folks nor the latter will learn much from each other.
Kind regards
Paul D.
As I said, I prefer the format approach, which does not rule out using any other method where appropriate (Maxim 14). As you can see, there is Maxim 51.
But there's also Maxim 29.
If you pack hash code into a user defined function, you somehow lose the composite data functionality of hash (either you have to use multiple functions, or deliver a compound result that needs post-processing).
And you have an additional layer of code between you and the original data source, which may make it harder to follow the data path through and puts even more importance on proper documentation/coding habits and the necessary infrastructure for this. Something that J. Random Sasuser rarely has.
(Granted, creating a global format will also need proper documentation and keeping the source code for the format at hand)
Correct me if I'm wrong here.
Nice discussion, BTW. And I once again was reminded of Maxim 13.
A nice discussion indeed; thanks for keeping the embers hot.
I don't see where you may be wrong: We're merely talking about two tools, both of which can be used well for the same purpose. Let's just agree that we're only talking about table lookup, since with respect to broader functionalities, hash dwarfs in/formats by an order of magnitude. As far as table lookup is concerned:
Not really, since you can compile a call routine rather than a function and parameterize it with any VARARGS you want to return the data values in case of a match to the PDV.
True, but I believe only for the time being. The fact that in FCMP the hash data source has to be hard-coded is an abnormality running contrary to everything the hash is about. In the DATA step, the DATASET argument tag accepts a character expression. In time, FCMP will accept it as the value of a function's argument; and then when it's called, the data source will be evident from the code.
Kind regards
Paul D.
"There are several methods to deal with what you want".
"using format: (I believe this is the most efficient)"
As a side note, the CNTLIN= data set requires only 3 variables: FMTNAME, START, LABEL. END should be included if the endpoints of the format's ranges are different; but this is not the case here (and proc FORMAT will set END=LABEL itself if only LABEL is included). But there's one extra variable, which, IMO, should never be omitted from CNTLIN=, and that variable is HLO="o", coupled with a label value different from any other range response (corresponds to OTHER= on the IN/VALUE statement). Without including it, one doesn't let SAS know which response to use if there's no match. E.g., in case of a character format and in lieu of appropriately set OTHER=, the value of the argument will be given to the format's response if the argument is not within one of the format's ranges. In most situations, this is not what one would expect the format to do.
Kind regards
Paul D.
@hashman , thanks for your comprehensive answer.
Thanks for thanking for what I've chided myself for excessive verbosity. Still have to work on my English (alas, my third language) to make it at least as terse as some of my SAS programs ;).
@hashman I reckon too much explanation (not saying that's the case in your reply) is better than too little in these pages
I could give longer answers at times but I don't have the time or the motivation (and I don't chide myself for it 🙂 ) .
So good on you. And good on your English too (as far as I can see from my English-as-a-late-second language) viewpoint.
That's what I meant by giving you a like, but since it's expend-your-thoughts time, here we go.
Back to terse mode now... Have a nice weekend!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.