Hi all!
In my dataset, I have a data column named "SIC", it is a char variable
A small example
SIC
3559
3559
3559
3559
4911
4911
4911
2070
2070
2070
And I need to exclude observation that ( 4900<SIC<4949) and (6000<SIC<6999)
Here is my code
data want;
set have;
SIC_ =input (SIC,8.);
if SIC_<4900 or (4949<SIC_ and SIC_ < 6000) or (SIC_>6999)
run
1. Can you tell me whether my datatype conversion as above is reasonable, I do not know when I should put 4. or 8. or 12., I have a look on this document but it is quite overwhelming to me
2. And is there any alternative way to code the condition above?
if SIC_<4900 or (4949<SIC_ and SIC_ < 6000) or (SIC_>6999)
Thanks in advance
You don't really need to convert data types:
data want;
set have;
where not (length(sic)=4
and
(('4900'<sic<'4949') or ('6000'<sic<'6999'))
);
run;
The "length(sic)=4" condition prevents a value like "493" (left justified) from qualifying as being between "4900" and "4949". This is what would happen when lexicographic ordering is being applied.
1) The 4., 8., or 12. is the informat. Meaning how SAS should read the variable. In this case, if you know you have only 4 digits, the right thing is to use 4.
2) If you don't really need the numeric value of SIC, you can use its converted value directly and do something like this
data have;
input SIC $;
datalines;
3559
3559
3559
3559
4911
4911
4911
2070
2070
2070
;
data want;
set have;
if input(sic, 4.) not in (4900:4949, 6000:6999);
run;
You don't really need to convert data types:
data want;
set have;
where not (length(sic)=4
and
(('4900'<sic<'4949') or ('6000'<sic<'6999'))
);
run;
The "length(sic)=4" condition prevents a value like "493" (left justified) from qualifying as being between "4900" and "4949". This is what would happen when lexicographic ordering is being applied.
@ResoluteCarbon wrote:
Hi @mkeintz
Is there anyway to convert this code from "where" to "if" please ?
TIA.
Context? Why is "IF" preferred over "WHERE"?
Did you try replacing "where" with "if"?
"Where" is typically faster as it applies to the data vector as it is read into the data set and "If" is applied later.
@ResoluteCarbon wrote:
Hi @mkeintz
Is there anyway to convert this code from "where" to "if" please ?
TIA.
Yes ... replace the word "where" with the word "if".
But, as @ballardw said, where is likely faster. That's because the where filter is outsourced to the data engine, and the data step never sees the rejected data. But the subsetting IF is not applied until after the data is read in.
Have you cleaned the data? Values like those below could throw off your calculations:
0000
493A
Hi @Astounding
It is very nice to see your caution.
Can I ask why I need to filter "0000" value?
And I am quite confident about that "493A" will not exist following some finance papers. SIC is a 4-digit code as described from the datasource.
Thanks.
"0000" would be selected, as it satisfies the condition SIC_ < 4900.
Not sure if you want it to be selected or not.
Missing values would also be selected, as they also satisfy that same condition. (But they would be removed by checking for length of 4.)
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.