I have a dataset which has more than 10 million observations. In this, I have variable name NAICS it has different values. I want to remove all those values that start with 21. It contains different no of digits like 21, 2145, 210454 etc. So My intention is to remove all values with start with 21. There are some values of NAICS variable which end with 21 but I don't want them to be affected. Thanking in anticipation.
data have;
input NAICS;
datalines;
21234
2134664
42355
235353
214356
;
data want;
set have;
if substr(left(NAICS),1,2) NE '21';
run;
data want;
set have;
if NAICS =:'21' then delete;
run;
Assuming that variable is a character variable
@Jahanzaib: Is your variable numeric or character? And what do you mean by "removing" a value? Do you mean setting the variable to missing, or do you mean deleting the whole observation? In your example, all the values that you want to get rid of start with the number 21, does that mean that you do not want to get rid of values like 18212 or 321?
I assumed a numeric variable, but my solution will work for a character variable as well.
@Jahanzaib: Then the fastest and simplest is probably using a WHERE clause:
data want; set have; where NAICS not like '21%'; run;
- this assuming that the values are left aligned, otherwise use
where left(NAICS) not like '21%';
data have;
input NAICS;
datalines;
21234
2134664
42355
235353
214356
;
data want;
set have;
if substr(left(NAICS),1,2) NE '21';
run;
I would really advise you to try both methods on the 10mill records as I suspect this (from @s_lassen😞
data want; set have (where=(substr(naics,1,2) ne '21')); run;
Would be faster than the given solution from @PeterClemmensen:
data want; set have; if substr(left(naics),1,2) ne '21'; run;
I can't prove this offhand, but vaguely remember something aboutthe set reading in the data for each row and then outputting on the if, where the where clause restricts what is coming in, so slightly earlier in the process. If so then this fractional saving would add up of millions of records. But do test.
From your responses, i assume you are not familiar with wild card operators or colon modifiers. If you have a grasping you would have chosen @s_lassen 's answer without having to deal with functions that makes SAS work more.
If you understand collating sequences , the below solution is easy
data want;
set have;
if var>=:'90' then delete;
run;
Regards,
Naveen Srinivasan
Also, the same can be applied to a where clause:
data want;
set have;
where not (var>=:'90');
run;
Well 90-99 all consist of 9x, so you can just do:
data want; set have (where=(substr(naics,1,2) ne '21' and char(naics,1) ne '9')); run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.