I am checking a field for a valid postal code.
'Property State' IN ( "AL", "AK", "AZ", "AR", "CA", etc...
Is setting this up in a macro faster than having it in an if/then statement? I'm checking a file with a couple million rows and want to make sure I'm efficient as possible.
Thanks in advance, have a great weekend.
If I were using something like this very often I would set up a custom format. Possibly with a result of 'Valid' for valid codes and 'Invalid' for anything else.
proc format;
value &customformat
"AL", "AK", "AZ", "AR", "CA", etc = "Valid"
other = "Invalid"
;
run;
Then the code could be
If put(upcase(variable),$customformat.) ='Valid' then do;
end;
Else do;
end;
I use the approach to catch site codes that get misentered OR get added to data without forewarning.
I'm curious, What would be the macro version of this?
Other than using a macro variable so I don't have a long in() clause I have no idea. Looking for anything to be more efficient. There's a large amount of records that I need to check.
I did some testing and the use of INDEX() proved quite fast. I cannot imagine that looping through an array of 52 states for every observation would be faster. But ultimately the proof of the pudidng is in the eating.
states="AL|AK|AR|AZ|CA|...|WV|WY";
if (index(states, statecode)) > 0 then put 'State OK';
Retain the variable states and assign only when _N_=1 for even better performance.
In general this class of problems (table lookup) can be addressed multiple ways:
1) Format
2) Hash table
3) SQL
4) prxmatch (I tried; wasslower)
5) if/then/else
6) Character functions
7) ...
This time I choose 6) but 1) and 2) do in-memory searches and can be quite fast as well. They do require more programming. It is worth trying though. They usually work best when the list of values is in a dataset. But unless the USA plans to acquire new states or spin off existing ones any time soon the list should be quite stable.
Hope this helps,
- Jan.
If I were using something like this very often I would set up a custom format. Possibly with a result of 'Valid' for valid codes and 'Invalid' for anything else.
proc format;
value &customformat
"AL", "AK", "AZ", "AR", "CA", etc = "Valid"
other = "Invalid"
;
run;
Then the code could be
If put(upcase(variable),$customformat.) ='Valid' then do;
end;
Else do;
end;
I use the approach to catch site codes that get misentered OR get added to data without forewarning.
What version of SAS are you on?
Before 9.4 there appears to be some SAS out of the box formats for State Names/Codes that may be available to you, but they're gone in SAS 9.4.
I would echo the format solution.
Maybe one of these:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000211303.htm
Why not use an informat instead?
proc format;
invalue $stateFmt (default=2 just upcase)
"AL", "AK", "AZ", "AR", "CA" = [$upcase2.]
other = "??";
run;
data test;
length code stateCode $2;
input code;
stateCode = input(code, $stateFmt.);
codeError = stateCode = "??";
datalines;
al
Ak
AZ
AB
;
proc print; run;
Check functions: If it return missing value, then it should be invalid State Name. data _NULL_; fips=stfips ('NC'); put fips=; state=stname('NC'); put state=; state=stname('AA'); put state=; run;
@PGStats wrote:
Why not use an informat instead?
I used an informat for a while with one project but when I went back to the data supplier I then had to re-read the data to get the original unexpected value(s) and associated record identifiers to show them what the unexpected values were. By using the format with a report step I could then use a where statement in proc print to list the same information easier.
Volume of records (relatively low) and project requirements (very clean data) made the format approach a better fit in my case.
You are right @ballardw, if you use the informat to read the data, you lose the original information and that can lead to some problems later. That's why I created a new variable with the input function instead of using my informat within the input statement.
The informat approach nevertheless allows you to hide the UPCASE and LEFT operations in the format, and to exploit the _ERROR_ mechanism, if you wish.
@PGStats and I should have spent a bit of time realizing that not everyone is looking at dozens of variables with custom formats/informats. I mostly worry about informat when reading the data but could use this for a report approach as well.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.