Hi,
I have many accounting variables that are illogically negative. I want to delete all the negative values from all the variables except a few whose values can be negative. I can do that in the following ways:
if a<0 then delete;
if b<0 then delete;
so on. But this is not an efficient way to delete. Can anyone suggest me any efficient method?
@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.
Nope, it's not efficient. If you have negative values they will be the lowest value so check what your minimum is, if the minimum value across all 15 variables is less than 0 you delete the row.
@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.
if min(of listofVars) < 0 then delete;
Here is a reference that illustrates how to refer to variables and datasets in a short cut list to replace "listofVars" above.
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html
If you code
if a<0 then delete;
if b<0 then delete;
you will delete the entire record -- all of the variables. Is that what you want?
If you want to delete the entire record from the SAS dataset, then the DELETE command is reasonably efficient.
But I may not be understanding what you're trying to do. Can you post 1) some sample data and 2) what your results should look like?
Jim
The way you are doing it is reasonably efficient. You could make it a bit more efficient if you change
FROM:
DATA FINAL_DATA;
SET ACCT_DATA;
if a<0 then delete;
if b<0 then delete;
RUN;
TO:
DATA FINAL_DATA2;
SET ACCT_DATA;
if a>0 AND b>0;
RUN;
Is it a very large dataset? If it is not a large dataset, then you won't even notice any significant difference. If you run with your DELETE's, how long is it taking to run?
You could also use a PROC SQL with a DELETE FROM - WHERE construct, but the rows will only be logically deleted. The rows will not be physically deleted from the SAS dataset. This would not be my preference.
Jim
@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.
Oh, oh, you mean in terms of coding. "Efficient" usually means in terms of run time or CPU time.
Sure. Let me code something for you.
Jim
OK, so let's say the variables that should not be negative are variables A through O (15 variables). Let's make up some data:
Here's some code that will delete any of the variables A through O if any one of the variables is negative:
DATA Final_Data;
SET Acct_Data;
ARRAY Variables [*] A -- O;
DO _i = 1 to DIM(Variables);
if Variables[_i] < 0 THEN
DELETE;
END;
RUN;
Results:
How's that look?
One thing to be aware of, the variables have to be all next to each other in order to use the A -- O notation. If some of the variables that are allowed to be negative are in between A and O, you have to code the range a bit differently. For example, if D is allowed to be negative, you would have to code A -- C and E -- P.
Jim
Here's a quick fix for missing values:
DATA Final_Data;
DROP _:;
SET Acct_Data;
ARRAY Variables [*] A -- C E -- P;
DO _i = 1 to DIM(Variables);
if NOT MISSING(Variables[_i]) THEN
if Variables[_i] < 0 THEN
DELETE;
END;
RUN;
Jim
@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.
Nope, it's not efficient. If you have negative values they will be the lowest value so check what your minimum is, if the minimum value across all 15 variables is less than 0 you delete the row.
@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.
if min(of listofVars) < 0 then delete;
Here is a reference that illustrates how to refer to variables and datasets in a short cut list to replace "listofVars" above.
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html
Nice suggestion. I'll have to try that some time on a really big data set. I would think the MIN function is going to have to check for missing and then compare each value successively to determine the minimum non-missing value and then determine if that minimum value is less than zero. The array method has to check each for missing and then check each for less than zero -- but only until it finds a negative value. I'm betting that there wouldn't be much difference in CPU time, but I could be wrong, particularly if negative values were for some reason weighted toward the end of the array.
Jim
@Reeza, you are completely (completely) on point here.
I ran my array code and the MIN(of xxx) method you suggested with 500 million rows. The MIN(of xxx) method is considerably faster, BUT there's some subtlety here. Recall that @Rick_SAS's article mentioned different methods of specifying the variables referenced by the "of". While I was at it, I compared three types of references and my original array code:
With 500 million rows, the results were (CPU time):
My array code was about 2 CPU minutes slower than the fastest MIN(of xxx) code.
As you might expect, examples 2 and 4 ran in about the same time. They're both references by individual variable name and, when compiled, should be about the same.
Surprisingly, (to me) the example using an array in the MIN(of xxx) code, ran about a CPU minute slower than the non-array examples.
So some conclusions here:
This is all a bit esoteric, but still, these are good things to know. I had always assumed that array overhead was minimal and that arrays were a fast way of doing most things. Apparently this is not so, which is news to me. I learned a little something here, so I thank you for your code suggestion, @Reeza, even though I was not the original poster in this thread. 🙂
Jim
Since I know you'll all be just dying to try this in your environment, here is my test script. Seriously, though, if you come up with anything substantially different, I'd be curious to know about it. I'm running on Windows Server 2016 x64 with SAS 9.4 M6 submitted via EG 8.2.
Jim
%Time_Stamp(START);
**------------------------------------------------------------------------------**;
%LET Iterations = 100000000;
*%LET Iterations = 5;
%LET DSN = Acct_Data2;
**------------------------------------------------------------------------------**;
DATA ACCT_DATA;
INFILE DATALINES DSD DLM='|';
INPUT A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
;
DATALINES;
1|-5|4|-7|18|15|241|78|18|15|241|78|18|15|241|5
18|15|241|-78|18|15|241|78|18|15|241|78|18|15|241|.
8|25|256|-718|18|15|241|78|18|15|241|78|18|15|241|3
76|4|-212|-32|18|15|241|78|18|15|241|78|18|15|241|22
7|4|212|-32|18|15|241|78|18|15|241|78|18|15|241|-2
;
RUN;
**------------------------------------------------------------------------------**;
PROC CONTENTS DATA=Acct_Data
OUT=Acct_Data_Contents;
RUN;
**------------------------------------------------------------------------------**;
PROC SQL NOPRINT;
SELECT Name
INTO : Var_List SEPARATED BY ' '
FROM Acct_Data_Contents
WHERE Type = 1
AND Name ^= 'D';
QUIT;
%PUT &Nte1 &=Var_List;
**------------------------------------------------------------------------------**;
DATA Acct_Data2;
DROP _:;
SET Acct_Data;
DO _i = 1 TO &Iterations;
OUTPUT;
END;
RUN;
**------------------------------------------------------------------------------**;
DATA Final_Data1;
DROP _:;
SET &DSN;
ARRAY Variables [*] A -- C E -- P;
DO _i = 1 to DIM(Variables);
if NOT MISSING(Variables[_i]) THEN
if Variables[_i] < 0 THEN
DELETE;
END;
RUN;
**------------------------------------------------------------------------------**;
DATA Final_Data2;
DROP _:;
SET &DSN;
IF MIN(of A -- C E -- P) < 0 then delete;
RUN;
**------------------------------------------------------------------------------**;
DATA Final_Data3;
DROP _:;
SET &DSN;
ARRAY Variables [*] A -- C E -- P;
IF MIN(of Variables[*]) < 0 then delete;
RUN;
**------------------------------------------------------------------------------**;
DATA Final_Data4;
DROP _:;
SET &DSN;
IF MIN(of &Var_List) < 0 then delete;
RUN;
**------------------------------------------------------------------------------**;
&Null %MACRO Delete_Work_Data;
PROC DELETE DATA=Acct_Data;
RUN;
PROC DELETE DATA=Acct_Data2;
RUN;
%DO i = 1 %TO 4;
PROC DELETE DATA=Final_Data&i;
RUN;
%END;
%MEND Delete_Work_Data;
%Delete_Work_Data;
**------------------------------------------------------------------------------**;
%Time_Stamp(STOP);
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.