Solved: Re: Grouping same values together

akberali67 · Posted 11-11-2012 11:08 AM

Hi,

Is there a procedure or data step to group same values in variables together. This is my data and what I am trying to achieve:

Data:

Var1 Var2

A 1

B 2

C 2

D 1

E 3

F 6

G 6

H 6

I 6

J 5

K 4

Desired Result:

Var1 Var2

B 2

C 2

F 6

G 6

H 6

I 6

I basically want all the same value in the variable along side each other to be extracted and delete the rest. I am not sure whether its possible technically because manually the dataset is way too large. Thanks in advance for the help.

-Akber.

PGStats · Posted 11-11-2012 11:58 AM

The question seems to be : How to eliminate sequences of var2 of length = 1. Here is a simple way :

data have;
input Var1 $   Var2;
datalines;
A          1
B          2
C          2
D          1
E          3
F          6
G          6
H          6
I          6
J          5
K          4
;

data want;
set have; by var2 notsorted;
if not (first.var2 and last.var2);
run;

PG

View solution in original post

AncaTilea · Posted 11-11-2012 11:17 AM

I don't understand how did you summarize the data set that you say you want.

What happened to the values 1,3,4 and 5?

When you say you want all the same values in the variable, you mean var1 or var2 in your data set?

Please give a bit more detail.

Thanks!

akberali67 · Posted 11-11-2012 11:40 AM

Hi,

I didnt summarize this using code, that is what I was TRYING to find out.

I wanted to pick all rows with 'Var2' being exactly the same and together in sequence and delete the rest. Is there anyway to achieve this?

I mean occurrence of Var2 say equal to 2 more than once in continuous sequence. Thanks for trying to help.

-Akber.

PGStats · Posted 11-11-2012 11:58 AM

The question seems to be : How to eliminate sequences of var2 of length = 1. Here is a simple way :

data have;
input Var1 $   Var2;
datalines;
A          1
B          2
C          2
D          1
E          3
F          6
G          6
H          6
I          6
J          5
K          4
;

data want;
set have; by var2 notsorted;
if not (first.var2 and last.var2);
run;

PG

akberali67 · Posted 11-11-2012 12:39 PM

Hi PG,

Thanks for the answer. It is quite difficult to explain but what I am trying to do is: Keep rows with var2 recurring and delete the ones that are not for example 1, 3, 5, 4 which are not recurring again in the next row. Thanks again.

-Akber.

art297 · Posted 11-11-2012 01:15 PM

: PG has already provided the best answer, the one that is easiest to code and takes the least time to run. However, since I was comparing the processing time with another way to accomplish the same thing, thought I'd post another way of doing it.

I recommend using PG's method, but a better way of explaining what you wanted would be to say that you only want records that have var2 values that are either the same as the previous record or the same as the next record. PG's code does that nicely. The following does the same thing, but runs slightly slower (for a 5 million record file, PG's code ran in 1.3 seconds, where the following code took 1.5 seconds):

data want (drop=next);

set have;

set have ( firstobs = 2 keep = var2 rename = (var2 = next) )

have ( obs = 1 drop = _all_ );

if var2 eq lag(var2) or var2 eq next;

run;

Tal · Posted 11-11-2012 01:49 PM

Hey Arthur,

how would you know this info if you dont run the actual data?

(for a 5 million record file, PG's code ran in 1.3 seconds, where the following code took 1.5 seconds)

art297 · Posted 11-11-2012 03:04 PM

: I took the original example, concatenated 500,000 copies of it so each 11 records were in the same order as the original example, and ran both both sets of code on the expanded file. The only reason I made the file bigger was so that I could see the time differences. On just the example dataset, both programs took 0 seconds to run.

Tal · Posted 11-11-2012 04:36 PM

sorry but how did you manage to concatenate that many copies?

Any sort of abbreviation you've used here on the concatenation statement?

The little book of SAS does not have many of the tricks

art297 · Posted 11-11-2012 05:45 PM

: No trick. There is probably an easier way, but I just read and wrote out the dataset 500,000 times. I.e.:

data have;

input Var1 $ Var2;

datalines;

A 1

B 2

C 2

D 1

E 3

F 6

G 6

H 6

I 6

J 5

K 4

;

data test (drop=i);

do i=1 to 500000;

do j=1 to 11;

set have point=j;

output;

end;

stop;

run;

Tal · Posted 11-11-2012 07:47 PM

thanks alot

akberali67 · Posted 11-11-2012 01:53 PM

Thanks Arthur, ur code is classy and the explanation suttle as always.

AncaTilea · Posted 11-11-2012 12:12 PM

Thank you PG for clarifying.

I modified your code to see what's happening;

data want;

set have; by var2 notsorted;

*if not (first.var2 and last.var2);

if first.var2 = 1 and last.var2 = 1 then to_delete = 1;

run;

so I can see clearly what gets deleted.

Thanks!

akberali67 · Posted 11-11-2012 01:28 PM

Thanks for the help!

Registration is open

SAS Training: Just a Click Away