Solved: Re: Subsetting

robertrao · Posted 06-07-2013 01:24 PM

hi,

in1 out1 is a pair likewise is in2 out2

when both the values of the pair are missing then i want to delete those records....

when a single value is absent of the pair i want to keep those records intact

Thanks

HAVE
ID cat value

101 in1 30

101 out1 60

101 in2 .

101 out2 .

102 in1 20

102 out1 30

102 in2 69

102 out2 .

103 in1 20

103 out1 .

104 in1 10

104 out1 .

104 in2 .

104 out2 .

WANT

ID cat value

101 in1 30

101 out1 60

102 in1 20

102 out1 30

102 in2 69

102 out2 .

103 in1 20

103 out1 .

104 in1 10

104 out1 .

Astounding · Posted 06-07-2013 02:35 PM

This version should work:

data want;

n_missing=0;

do _n_=1 to 2;

set have;

if value=. then n_missing + 1;

end;

do _n_=1 to 2;

set have;

if n_missing < 2 then output;

end;

drop n_missing;

run;

One of the keys here is that each SET statement operates independently of the other. Each starts with the first observation, and reads the next observation whenever it executes.

The program does rely on the structure to the data being accurate. It doesn't even look at CAT. It just assumes that each pair of observations are a matching group that should either both remain or both be deleted.

Good luck.

View solution in original post

caveman529 · Posted 06-07-2013 01:34 PM

Maybe you can do a lag of value based on id and then delete that observation if both are missing on a row?

Anotherdream · Posted 06-07-2013 02:33 PM

data answer1;

set yourdata;

order=input(substr(reverse(strip(cat)),1,1),7.);

run;

proc sql;

create table answer2 as

select sum(case when value is null then 1 else 0 end) as missing

,id

,order

from answer1

group by id

,order;

quit;

run;

proc sql;

create table FinalAnswer as

select a1.id

,a1.cat

,a1.value

from answer1 a1

left join answer12 a12 on a1.id=a12.id and a1.order=a12.order

where a12.missing~=2;

quit;

run;

Notice this only works if you variables comes in groups of 2 and have numerics at the end of their variables (aka your example). However this type of code can be made far more generic pretty easily.

Astounding · Posted 06-07-2013 02:35 PM

This version should work:

data want;

n_missing=0;

do _n_=1 to 2;

set have;

if value=. then n_missing + 1;

end;

do _n_=1 to 2;

set have;

if n_missing < 2 then output;

end;

drop n_missing;

run;

One of the keys here is that each SET statement operates independently of the other. Each starts with the first observation, and reads the next observation whenever it executes.

The program does rely on the structure to the data being accurate. It doesn't even look at CAT. It just assumes that each pair of observations are a matching group that should either both remain or both be deleted.

Good luck.

robertrao · Posted 06-07-2013 02:55 PM

Hi,

Thanks for the reply.

I tried the datastep since i am a beginner and it seems to be like how I intend.

I dint understand your code fully..

Could you please explain to me...?????

I dont underastand why we are getting a n_missing of 2 when the value is missing??? etc

Thanks

Astounding · Posted 06-07-2013 03:06 PM

The top DO loop reads two observations (notice the loop has _N_=1 TO 2).

Before that loop starts, N_MISSING=0. Inside the loop, N_MISSING gets increased each time a missing value is found. So when the top loop ends, N_MISSING=2 whenever both observations had a missing value.

At that point, the heavy lifting is over. The bottom DO loop reads the same observations, and outputs them when fewer than 2 missing values were found.

robertrao · Posted 06-17-2013 04:30 PM

Hi,

How is that if i run the Upper DO loop alone it outputs only the second record of the pair????

Thanks

Astounding · Posted 06-17-2013 04:40 PM

That result seems perfectly normal. To execute a DATA step, SAS performs all the statements in the DATA step and then outputs whatever the results are. So the top DO loop reads two observations, and then when the DATA statements are complete, the results contain the contents of the second observation. That gets output. Then the process repeats. Execute the top DO loop again, and get the next two observations. When that is complete, the results contain the contents of the fourth observation, so output that.

Good to see that you are trying test programs to get a sense of how SAS will handle different situations.

robertrao · Posted 06-17-2013 04:52 PM

That was quick!!!!!

So after the first loop we have only the last record of the pair and the corresponding n_missing value.A total of 7 records for our example

The seond Do loop takes the 14 records again fresh(ignoring what the first loop has ). and how does it get to know the n_missing value for the forst record of the pair????.....

Thanks a ton

Astounding · Posted 06-17-2013 05:09 PM

I think you have the order a little mixed up here. When the first loop is over, you do not have 7 observations. Eventually, by the end of the DATA step, you will have run through the first loop 7 times. But the first loop executes once, reading the first two observations. Then the second loop executes, reading the same two observations. So the value of N_MISSING (based on the first two observations) just sits there and is available to the second loop. After the second loop finishes, the DATA step continues.

Once again, the first loop reads two observations (observations 3 and 4). Then the second loop reads the same observations.

It may help to examine a simpler DATA step:

data new;

set old;

name='Fred';

x=5;

run;

The SET statement executes many times here. The first time, it reads the first observation from OLD. For that first observation, NAME= executes, then X= executes. Whatever is left gets output and becomes the first observation in NEW. Then the DATA step continues. The SET statement executes again, reading the second observation from OLD. Then NAME= executes, X= executes, and the result gets output. So the SET statement reads a single observation at a time, starting at the beginning of the incoming data set, and executes as many times as needed (yes, there are more details to that part). If there are multiple SET statements (as in the original problem), each one operates independently of the other (each starts by reading the first observation, and continues wherever it left off).

Hope this helps.

robertrao · Posted 06-18-2013 09:14 AM

That was very detailed. ...

I understood the concept well after that explanation......One last Question.

IST DO LOOP

n_missing=1 ID=101 cat=in2 value=. _ERROR_=0 _N_=1

n_missing=2 ID=101 cat=out2 value=. _ERROR_=0 _N_=2 <<<-----how does SAS know that it has to put n_missing of this record (from top loop) to "both" the records of the bottom loop from the second SET stmnt??????

IIND DO LOOP

n_missing=2 ID=101 cat=in2 value=. _ERROR_=0 _N_=1

n_missing=2 ID=101 cat=out2 value=. _ERROR_=0 _N_=2

Thnks

Astounding · Posted 06-18-2013 10:48 AM

SAS has no choice. It doesn't have access to any previous values of N_MISSING. It only has access to the current value. (OK, technically we could use a LAG function if needed, but that's not in play here.) When the top DO loop finishes, there is only one final value for N_MISSING and that is the value that the second DO loop can use.

robertrao · Posted 06-18-2013 11:38 AM

Great. .Thanks for your patience....This is a great place to learn and share the knowledge ....Also when the second loop brings in the same 2 records.......which statement in the code specifically says to put the value of the N_MISSING from prior loop to both the new records.......Automatically retained?????

Thanks

Astounding · Posted 06-18-2013 11:55 AM

Automatically retained? Sort of, and OK to think of it that way. Again, consider a simpler DATA step:

data new;

set old;

name='Fred';

if name='Fred' then x=5;

run;

How does SAS know whether NAME is "Fred" when it gets to the IF/THEN statement? Whatever value was assigned by earlier statements justs sits there, and is automatically available.

Technically, RETAIN in SAS has slightly different meaning. It refers to the process that occurs when one observation is over and the next one begins.

data new;

set old;

put 'X is now ' x;

name='Fred';

if name='Fred' then x=5;

* RETAIN X;

run;

Try this program, both with and without the RETAIN statement, and it will illustrate the difference.

robertrao · Posted 06-18-2013 12:41 PM

Thanks so much.I appreciate your time

Regards

Registration is open

SAS Training: Just a Click Away