Re: Keep accounts with missing value for variable in one month but not...

Justin9 · Posted 08-04-2020 12:16 PM

Hi, I've got one dataset that contains three variables:

1. ACCOUNT_ID (Type: Numeric, Format: 13.)

2. RUN_DATE (Type: Numeric, Group: Date, Format: DATE9.)

3. SCORE (Type: Character)

In my dataset, ac.test, I would like to create another dataset with all of the the account ID that had a missing SCORE value in July 2020, but not missing in June 2020. Can someone help provide me with code to answer the question, so the new dataset should contain account ID's where this is the case, as well as all its RUN_DATE and SCORE information from June 2020? Thanks!

My dataset contains over 500,000 records, so the code would ideally work for the entire dataset. The screenshots below are just an example (so the dataset would keep accounts that had a missing SCORE value in July, but not missing in June - in the example, the new dataset would contain 16211274, 16211277 and 16211278, as well as its RUN_DATE and SCORE values).

Note: I only have one dataset, and the lines are just to separate June and July 2020 in this example:

Kurt_Bremser · Posted 08-04-2020 12:24 PM

Join the dataset with itself:

proc sql;
create table want as
  select distinct a.account_id
  from have a, have b
  where
    a.account_id = b.account_id and
    put(a.run_date,yymmn6.) = '202006' and
    a.score ne . and
    put(b.run_date,yymmn6.) = '202007' and
    b.score = .
;
quit;

Untested; for tested code, supply usable example data in a data step with datalines.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Justin9 · Posted 08-04-2020 12:50 PM

Thanks for your reply. I've edited my post slightly - I've only got one dataset (that contains both June 2020 and July 2020) and the SCORE is a character variable, so I've edited the code you provided but the new output table doesn't give me the results as intended (how I want it to look in my example). Would it be possible to advise me on where I've gone wrong?

proc sql;

create table analysis as

select distinct ACCOUNT_ID

,RUN_DATE

,SCORE

from ac.test

where (put(RUN_DATE,yymmn6.)='202006' and SCORE ne ' ') and (put(RUN_DATE,yymmn6.)='202007' and SCORE=' ')

;

quit;

Kurt_Bremser · Posted 08-04-2020 01:12 PM

As stated, the code is not tested. For tested code, supply example data in usable form.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-05-2020 03:41 AM

@Justin9 wrote:

Thanks for your reply. I've edited my post slightly - I've only got one dataset (that contains both June 2020 and July 2020) and the SCORE is a character variable, so I've edited the code you provided but the new output table doesn't give me the results as intended (how I want it to look in my example). Would it be possible to advise me on where I've gone wrong?

proc sql;

create table analysis as

select distinct ACCOUNT_ID

,RUN_DATE

,SCORE

from ac.test

where (put(RUN_DATE,yymmn6.)='202006' and SCORE ne ' ') and (put(RUN_DATE,yymmn6.)='202007' and SCORE=' ')

;

quit;

I explicitly stated in my post that I join the dataset to ITSELF, so I am taking into account that you only have one dataset.

See here the proof that my code works (I just added two columns to the output, changed score to character, and deleted the DISTINCT keyword, as it is not needed):

data have;
input account_id $ run_date :date9. score $;
format run_date date9.;
datalines;
16211273 26jun2020 125
16211274 26jun2020 235
16211275 26jun2020 .
16211276 28jun2020 .
16211277 29jun2020 400
16211278 29jun2020 500
16211279 30jun2020 .
16211280 30jun2020 .
16211273 22jul2020 150
16211274 24jul2020 .
16211275 26jul2020 200
16211276 28jul2020 .
16211277 29jul2020 .
16211278 29jul2020 .
16211279 30jul2020 402
16211280 30jul2020 326
;

proc sql;
create table want as
  select a.account_id, a.run_date, a.score
  from have a, have b
  where
    a.account_id = b.account_id and
    put(a.run_date,yymmn6.) = '202006' and
    a.score ne "" and
    put(b.run_date,yymmn6.) = '202007' and
    b.score = ""
;
quit;

proc print data=want noobs;
run;

Result:

account_id	run_date	score
16211274	26JUN2020	235
16211277	29JUN2020	400
16211278	29JUN2020	500

Note that I had to waste a lot of time typing data off a picture(!). Next time, please provide data as shown in this code, so we can immediately start testing.

Pictures are the WORST way to present example data, by lunar distances.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Justin9 · Posted 08-04-2020 12:43 PM

Hi, I've got one dataset that contains three variables:

1. ACCOUNT_ID (Type: Numeric, Format: 13.)

2. RUN_DATE (Type: Numeric, Group: Date, Format: DATE9.)

3. SCORE (Type: Character)

In my dataset, ac.test, I would like to create another dataset with all of the the account ID that had a missing SCORE value in July 2020, but not missing in June 2020. Can someone help provide me with code to answer the question, so the new dataset should contain account ID's where this is the case, as well as all its RUN_DATE and SCORE information from June 2020? Thanks!

My dataset contains over 500,000 records, so the code would ideally work for the entire dataset. The screenshots below are just an example (so the dataset would keep accounts that had a missing SCORE value in July, but not missing in June - in the example, the new dataset would contain 16211274, 16211277 and 16211278, as well as its RUN_DATE and SCORE values).

Note: I only have one dataset, and the lines are just to separate June and July 2020 in this example:

Kurt_Bremser · Posted 08-04-2020 12:47 PM

Please do not double-post; instead read the answers that were given before replying yourself.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 08-04-2020 01:13 PM

data want;
  merge have (where=(run_date between '01jul2020'd and '31jul2020'd and score=.)  in=in1)
        have (where=(run_date between '01jun2020'd and '30jun2020'd and score^=.) in=in2);
  by account_id;
  if in1 and in2;

Note that the JULY 2020 filter is the first merge operand, and the JUNE filter is the last. So any common variables will get their values from the JUNE record.

This assumes your data are sorted by ID (but not necessarily by run_date within id).

Editted addition: But I see that your sample data is sorted by date/id, but within monthly groupings it appears to effectively sorted by id. If so then the above program should still work.

But if, within each month, date are NOT sorted by id within each month, you would need to pre-sort the records of interest:

proc sort data=have out=need;
 by account_id run_date;
 where
    (run_date between '01jul2020'd and '31jul2020'd and score=.) 
    or
    (run_date between '01jun2020'd and '30jun2020'd and score^=.) ;
run;
data want;
  set need;
  by account_id;
  if first.account_id=1 and last.account_id=0;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Keep accounts that had missing value for a variable in one month but not missing in previous month

Re: Keep accounts that had missing value for a variable in one month but not missing in previous mon

Re: Keep accounts that had missing value for a variable in one month but not missing in previous mon

Re: Keep accounts that had missing value for a variable in one month but not missing in previous mon

Re: Keep accounts that had missing value for a variable in one month but not missing in previous mon

Keep accounts with missing value for variable in one month but not missing in previous month

Re: Keep accounts with missing value for variable in one month but not missing in previous month

Re: Keep accounts with missing value for variable in one month but not missing in previous month

SAS Innovate 2025: Call for Content

Classroom Training Available!