Solved: Re: DELETING PRECEDING RECORDS

d6k5d3 · Posted 01-08-2019 02:12 AM

Spent a long time trying to figure out how it can be done. But failed. Let's say my dataset look like this:

Sl. Return Conz

1 4.5563 missing

2 4.8562 missing

3 0.0000 1

4 3.5879 missing

5 0.0000 missing

6 0.0000 missing

7 0.0000 missing

8 0.0000 4

9 2.5879 missing

10 3.6879 missing

11 3.6869 missing

12 4.5859 missing

13 0.0000 missing

14 0.0000 missing

15 0.0000 missing

16 0.0000 missing

17 0.0000 missing

18 0.0000 6

19 4.5879 missing

20 3.5629 missing

This dataset is a result of a merge. Hence, Conz has missing values. Conz variable is a count of zero returns. I want to delete corresponding zero returns for Conz>= 4. So, row 5-8 and row 13-18 should get deleted.

How can I achieve this???

Much thanks!

mkeintz · Posted 01-08-2019 03:01 AM

You have to read and count observations until CONZ>=4 (or until you encounter end of data set have). Then set a keep count as the original count-CONZ (if CONZ>=4) or the original count otherwise. Reread the original count, outputting only up to the keep count.

data have;
  input Sl  Return  Conz;
datalines;
 1  4.5563  .
 2  4.8562  .
 3  0.0000  1
 4  3.5879  .
 5  0.0000  .
 6  0.0000  .
 7  0.0000  .
 8  0.0000  4
 9  2.5879  .
10  3.6879  .
11  3.6869  .
12  4.5859  .
13  0.0000  .
14  0.0000  .
15  0.0000  .
16  0.0000  .
17  0.0000  .
18  0.0000  6
19  4.5879  .
20  3.5629  .
run;

data want (drop=_:);
  do _n=1 by 1 until (conz>=4 or end_of_have);
    set have end=end_of_have;
  end;
  if conz>=4 then _keep=_n-conz;
  else _keep=_n;
  do _j=1 to _n;  /*Now reread*/
    set have;      
    if _j<=_keep then output; /*and selectively output*/
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

andreas_lds · Posted 01-08-2019 02:54 AM

Please post the data in a usable form: as data-step with datalines statement, so that we have all variables exactly in the form you have them.

d6k5d3 · Posted 01-08-2019 02:58 AM

Unfortunately, the dataset I have is huge one having more than 800k observations. There are 19 columns. I do not have the datalines. What I have described above is the problem which I need to solve. I am sorry; I cannot get how I can make it easy for you.

Kurt_Bremser · Posted 01-08-2019 03:04 AM

@d6k5d3 wrote:

Unfortunately, the dataset I have is huge one having more than 800k observations. There are 19 columns. I do not have the datalines. What I have described above is the problem which I need to solve. I am sorry; I cannot get how I can make it easy for you.

Obviously, we do not need all the observations, just enough to test code, and mainly for having the structure of your dataset. That's quite easy to do, a tool for this is in my footnotes.

Creating fake data for testing in a data step with datalines is one of THE essential skills you'll need, so learn that NOW.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 01-08-2019 03:01 AM

You have to read and count observations until CONZ>=4 (or until you encounter end of data set have). Then set a keep count as the original count-CONZ (if CONZ>=4) or the original count otherwise. Reread the original count, outputting only up to the keep count.

data have;
  input Sl  Return  Conz;
datalines;
 1  4.5563  .
 2  4.8562  .
 3  0.0000  1
 4  3.5879  .
 5  0.0000  .
 6  0.0000  .
 7  0.0000  .
 8  0.0000  4
 9  2.5879  .
10  3.6879  .
11  3.6869  .
12  4.5859  .
13  0.0000  .
14  0.0000  .
15  0.0000  .
16  0.0000  .
17  0.0000  .
18  0.0000  6
19  4.5879  .
20  3.5629  .
run;

data want (drop=_:);
  do _n=1 by 1 until (conz>=4 or end_of_have);
    set have end=end_of_have;
  end;
  if conz>=4 then _keep=_n-conz;
  else _keep=_n;
  do _j=1 to _n;  /*Now reread*/
    set have;      
    if _j<=_keep then output; /*and selectively output*/
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

d6k5d3 · Posted 01-08-2019 04:15 AM

Thank you very much! It works great! However, since I am a beginner, I am having some difficulty in getting the code. I wish I could understand:
what does it mean when you write _:, _n, _keep, _j (what's the significance of keeping an underscore. How is it different from _TYPE_ or _FREQ_? I don't know what I should search online to get an understanding of this.)

What is happening when you write "do _n=1 by 1". You did not mention 'by' in "do _j=1 to _n".

Much thanks once again!

Kurt_Bremser · Posted 01-08-2019 03:01 AM

According to your description, that's a simple where condition:

where return ne 0 or conz ne .;

You can use the same condition as a subsetting if in your merge data step.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

d6k5d3 · Posted 01-08-2019 03:11 AM

If I write "return ne 0", I would lose all zero return observations, which I do not want.

Kurt_Bremser · Posted 01-08-2019 03:16 AM

@d6k5d3 wrote:
If I write "return ne 0", I would lose all zero return observations, which I do not want.

PLEASE.READ.MY.WHOLE.POST. With diligence.

My condition will remove only observations where return is zero AND conz is missing.

Just test it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 01-08-2019 03:42 AM

@Kurt_Bremser

I think the OP is worried about a sequence of 2 records with return=0 followed by a record with CONZ=2. Although that situation is not in the sample date, I believe it is supposed to be kept.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Kurt_Bremser · Posted 01-08-2019 04:40 AM

@mkeintz wrote:

@Kurt_Bremser

I think the OP is worried about a sequence of 2 records with return=0 followed by a record with CONZ=2. Although that situation is not in the sample date, I believe it is supposed to be kept.

Well, that's what happens when someone posts incomplete example data in unusable form. Maxim 42.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!