Solved: Re: Delete negative values of multiple variables

abdulla · Posted 09-30-2020 09:38 PM

Hi,

I have many accounting variables that are illogically negative. I want to delete all the negative values from all the variables except a few whose values can be negative. I can do that in the following ways:

if a<0 then delete;

if b<0 then delete;

so on. But this is not an efficient way to delete. Can anyone suggest me any efficient method?

Reeza · Posted 09-30-2020 10:47 PM

@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.

Nope, it's not efficient. If you have negative values they will be the lowest value so check what your minimum is, if the minimum value across all 15 variables is less than 0 you delete the row.

@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.

if min(of listofVars) < 0 then delete;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list to replace "listofVars" above.
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

View solution in original post

jimbarbour · Posted 09-30-2020 09:45 PM

If you code

if a<0 then delete;
if b<0 then delete;

you will delete the entire record -- all of the variables. Is that what you want?

If you want to delete the entire record from the SAS dataset, then the DELETE command is reasonably efficient.

But I may not be understanding what you're trying to do. Can you post 1) some sample data and 2) what your results should look like?

Jim

abdulla · Posted 09-30-2020 09:52 PM

Yes, if any variable has a negative value, I want to delete the row.

jimbarbour · Posted 09-30-2020 09:58 PM

The way you are doing it is reasonably efficient. You could make it a bit more efficient if you change

FROM:

DATA	FINAL_DATA;
	SET	ACCT_DATA;
	if a<0 then delete;
	if b<0 then delete;
RUN;

TO:

DATA	FINAL_DATA2;
	SET	ACCT_DATA;
	if a>0 AND b>0;
RUN;

Is it a very large dataset? If it is not a large dataset, then you won't even notice any significant difference. If you run with your DELETE's, how long is it taking to run?

You could also use a PROC SQL with a DELETE FROM - WHERE construct, but the rows will only be logically deleted. The rows will not be physically deleted from the SAS dataset. This would not be my preference.

Jim

abdulla · Posted 09-30-2020 09:55 PM

In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.

jimbarbour · Posted 09-30-2020 09:59 PM

@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.

Oh, oh, you mean in terms of coding. "Efficient" usually means in terms of run time or CPU time.

Sure. Let me code something for you.

Jim

jimbarbour · Posted 09-30-2020 10:10 PM

OK, so let's say the variables that should not be negative are variables A through O (15 variables). Let's make up some data:

Here's some code that will delete any of the variables A through O if any one of the variables is negative:

DATA	Final_Data;
	SET	Acct_Data;
	ARRAY	Variables	[*] A -- O;

	DO	_i	=	1	to	DIM(Variables);
		if	Variables[_i] <	0	THEN
			DELETE;
	END;
RUN;

Results:

How's that look?

One thing to be aware of, the variables have to be all next to each other in order to use the A -- O notation. If some of the variables that are allowed to be negative are in between A and O, you have to code the range a bit differently. For example, if D is allowed to be negative, you would have to code A -- C and E -- P.

Jim

abdulla · Posted 09-30-2020 10:47 PM

It seems like that it deletes the missing observations too. I don't want to delete missing observations or anything else.

jimbarbour · Posted 09-30-2020 10:54 PM

Here's a quick fix for missing values:

DATA	Final_Data;
	DROP _:;
	SET	Acct_Data;
	ARRAY	Variables	[*] A -- C E -- P;

	DO	_i	=	1	to	DIM(Variables);
		if	NOT	MISSING(Variables[_i])	THEN
			if	Variables[_i] <	0		THEN
				DELETE;
	END;
RUN;

Jim

abdulla · Posted 10-01-2020 01:25 AM

Thanks Jim.

Reeza · Posted 09-30-2020 10:47 PM

@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.

Nope, it's not efficient. If you have negative values they will be the lowest value so check what your minimum is, if the minimum value across all 15 variables is less than 0 you delete the row.

@abdulla wrote:
In my dataset, I have 15 variables in which there are some illogical negative values and there are 3 more variables in which negative values are logical. So, I want to delete all the rows that have negative values in the 15 variables. I can do that by writing 15 separate lines. But that is not efficient.

if min(of listofVars) < 0 then delete;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list to replace "listofVars" above.
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

jimbarbour · Posted 09-30-2020 11:12 PM

@Reeza,

Nice suggestion. I'll have to try that some time on a really big data set. I would think the MIN function is going to have to check for missing and then compare each value successively to determine the minimum non-missing value and then determine if that minimum value is less than zero. The array method has to check each for missing and then check each for less than zero -- but only until it finds a negative value. I'm betting that there wouldn't be much difference in CPU time, but I could be wrong, particularly if negative values were for some reason weighted toward the end of the array.

Jim

jimbarbour · Posted 10-01-2020 03:25 PM

@Reeza, you are completely (completely) on point here.

I ran my array code and the MIN(of xxx) method you suggested with 500 million rows. The MIN(of xxx) method is considerably faster, BUT there's some subtlety here. Recall that @Rick_SAS's article mentioned different methods of specifying the variables referenced by the "of". While I was at it, I compared three types of references and my original array code:

Original array code
IF MIN(of A -- C E -- P) < 0 then delete;
ARRAY Variables [*] A -- C E -- P; IF MIN(of Variables[*]) < 0 then delete;
IF MIN(of &Var_List) < 0 then delete;

With 500 million rows, the results were (CPU time):

0:07:15.07
0:05:18.89
0:06:16.18
0:05:24.98

My array code was about 2 CPU minutes slower than the fastest MIN(of xxx) code.

As you might expect, examples 2 and 4 ran in about the same time. They're both references by individual variable name and, when compiled, should be about the same.

Surprisingly, (to me) the example using an array in the MIN(of xxx) code, ran about a CPU minute slower than the non-array examples.

So some conclusions here:

Indexing through an array is considerably slower than a function with "of" and a variable list.
The way the variable list is referenced matters. Reference by array is significantly slower than by individual variable name.
Arrays appear to have a material amount of overhead associated with them.

This is all a bit esoteric, but still, these are good things to know. I had always assumed that array overhead was minimal and that arrays were a fast way of doing most things. Apparently this is not so, which is news to me. I learned a little something here, so I thank you for your code suggestion, @Reeza, even though I was not the original poster in this thread. 🙂

Jim

jimbarbour · Posted 10-01-2020 03:33 PM

Since I know you'll all be just dying to try this in your environment, here is my test script. Seriously, though, if you come up with anything substantially different, I'd be curious to know about it. I'm running on Windows Server 2016 x64 with SAS 9.4 M6 submitted via EG 8.2.

Jim

%Time_Stamp(START);

**------------------------------------------------------------------------------**;

%LET	Iterations	=	100000000;
*%LET	Iterations	=	5;
%LET	DSN			=	Acct_Data2;

**------------------------------------------------------------------------------**;

DATA	ACCT_DATA;
	INFILE	DATALINES DSD	DLM='|';
	INPUT	A
			B
			C
			D
			E
			F
			G
			H
			I
			J
			K
			L
			M
			N
			O
			P
			;
DATALINES;
1|-5|4|-7|18|15|241|78|18|15|241|78|18|15|241|5
18|15|241|-78|18|15|241|78|18|15|241|78|18|15|241|.
8|25|256|-718|18|15|241|78|18|15|241|78|18|15|241|3
76|4|-212|-32|18|15|241|78|18|15|241|78|18|15|241|22
7|4|212|-32|18|15|241|78|18|15|241|78|18|15|241|-2
;
RUN;

**------------------------------------------------------------------------------**;

PROC	CONTENTS	DATA=Acct_Data
					OUT=Acct_Data_Contents;
RUN;

**------------------------------------------------------------------------------**;

PROC	SQL	NOPRINT;
	SELECT	Name
			INTO	:	Var_List	SEPARATED BY ' '
		FROM	Acct_Data_Contents
			WHERE	Type	=	1
				AND	Name	^=	'D';
QUIT;

%PUT	&Nte1  &=Var_List;

**------------------------------------------------------------------------------**;

DATA	Acct_Data2;
	DROP	_:;
	SET	Acct_Data;
	DO _i	=	1	TO	&Iterations;
		OUTPUT;
	END;
RUN;

**------------------------------------------------------------------------------**;

DATA	Final_Data1;
	DROP _:;
	SET	&DSN;
	ARRAY	Variables [*] A -- C E -- P;

	DO	_i	=	1	to	DIM(Variables);
		if	NOT	MISSING(Variables[_i])	THEN
			if	Variables[_i] <	0		THEN
				DELETE;
	END;
RUN;

**------------------------------------------------------------------------------**;

DATA	Final_Data2;
	DROP _:;
	SET	&DSN;

	IF	MIN(of  A -- C E -- P) < 0 then delete;
RUN;

**------------------------------------------------------------------------------**;

DATA	Final_Data3;
	DROP _:;
	SET	&DSN;
	ARRAY	Variables [*] A -- C E -- P;  
	IF	MIN(of Variables[*]) < 0 then delete;
RUN;

**------------------------------------------------------------------------------**;

DATA	Final_Data4;
	DROP _:;
	SET	&DSN;

	IF	MIN(of &Var_List) < 0 then delete;
RUN;

**------------------------------------------------------------------------------**;

&Null	%MACRO	Delete_Work_Data;
	PROC	DELETE	DATA=Acct_Data;
	RUN;
	PROC	DELETE	DATA=Acct_Data2;
	RUN;

	%DO	i	=	1	%TO	4;
		PROC	DELETE	DATA=Final_Data&i;
		RUN;
	%END;
%MEND	Delete_Work_Data;

%Delete_Work_Data;

**------------------------------------------------------------------------------**;

%Time_Stamp(STOP);

Reeza · Posted 10-01-2020 04:14 PM

Does resolving the macro variable add any overhead as well? I want to assume no but I'm not sure...otherwise your results are pretty much what I would have expected 🙂

Thanks for doing the leg work on this, definitely an interesting read!

Registration is open

SAS Training: Just a Click Away