BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ResoluteCarbon
Obsidian | Level 7

Hi all!

 

In my dataset, I have a data column named "SIC", it is a char variable

A small example

SIC
3559
3559
3559
3559
4911
4911
4911
2070
2070
2070

And I need to exclude observation that ( 4900<SIC<4949) and (6000<SIC<6999)

 

Here is my code

data want;
     set have;
SIC_ =input (SIC,8.); if SIC_<4900 or (4949<SIC_ and SIC_ < 6000) or (SIC_>6999) run

1. Can you tell me whether my datatype conversion as above is reasonable, I do not know when I should put 4. or 8. or 12., I have a look on this document but it is quite overwhelming to me

(https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=allprodslang&docsetTar...)

 

2. And is there any alternative way to code the condition above?

if SIC_<4900 or (4949<SIC_  and SIC_ < 6000) or (SIC_>6999)

Thanks in advance

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You don't really need to convert data types:

 

 data want;
   set have;
   where not (length(sic)=4 
              and
              (('4900'<sic<'4949') or ('6000'<sic<'6999'))
             );
run;

The "length(sic)=4" condition prevents a value like "493" (left justified) from qualifying as being between "4900" and "4949".  This is what would happen when lexicographic ordering is being applied.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

9 REPLIES 9
PeterClemmensen
Tourmaline | Level 20

1) The 4., 8., or 12. is the informat. Meaning how SAS should read the variable. In this case, if you know you have only 4 digits, the right thing is to use 4.

 

2) If you don't really need the numeric value of SIC, you can use its converted value directly and do something like this

 

data have;
input SIC $;
datalines;
3559
3559
3559
3559
4911
4911
4911
2070
2070
2070
;

data want;
   set have;
   if input(sic, 4.) not in (4900:4949, 6000:6999);
run;
mkeintz
PROC Star

You don't really need to convert data types:

 

 data want;
   set have;
   where not (length(sic)=4 
              and
              (('4900'<sic<'4949') or ('6000'<sic<'6999'))
             );
run;

The "length(sic)=4" condition prevents a value like "493" (left justified) from qualifying as being between "4900" and "4949".  This is what would happen when lexicographic ordering is being applied.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ResoluteCarbon
Obsidian | Level 7

Hi @mkeintz 

 

Is there anyway to convert this code from "where" to "if" please ?

 

TIA.

ballardw
Super User

@ResoluteCarbon wrote:

Hi @mkeintz 

 

Is there anyway to convert this code from "where" to "if" please ?

 

TIA.


Context? Why is "IF" preferred over "WHERE"?

 

Did you try replacing "where" with "if"?

 

"Where" is typically faster as it applies to the data vector as it is read into the data set and "If" is applied later.

mkeintz
PROC Star

@ResoluteCarbon wrote:

Hi @mkeintz 

 

Is there anyway to convert this code from "where" to "if" please ?

 

TIA.


Yes ... replace the word "where" with the word "if".

 

But, as @ballardw  said, where is likely faster.  That's because the where filter is outsourced to the data engine, and the data step never sees the rejected data.  But the subsetting IF is not applied until after the data is read in.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ResoluteCarbon
Obsidian | Level 7

@mkeintz 

 

So, you mean, in this specific case, where and if can be used interchangeable ?

 

TIA.

Astounding
PROC Star

Have you cleaned the data?  Values like those below could throw off your calculations:

 

0000

493A

ResoluteCarbon
Obsidian | Level 7

Hi @Astounding 

It is very nice to see your caution.

Can I ask why I need to filter "0000" value?

And I am quite confident about that "493A" will not exist following some finance papers. SIC is a 4-digit code as described from the datasource.

Thanks.

Astounding
PROC Star

"0000" would be selected, as it satisfies the condition SIC_ < 4900.  

 

Not sure if you want it to be selected or not.

 

Missing values would also be selected, as they also satisfy that same condition.  (But they would be removed by checking for length of 4.)

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2564 views
  • 6 likes
  • 5 in conversation