BookmarkSubscribeRSS Feed
Justin9
Obsidian | Level 7

Hi, I've got one dataset that contains three variables:

1. ACCOUNT_ID (Type: Numeric, Format: 13.)

2. RUN_DATE (Type: Numeric, Group: Date, Format: DATE9.)

3. SCORE (Type: Character)

 

In my dataset, ac.test, I would like to create another dataset with all of the the account ID that had a missing SCORE value in July 2020, but not missing in June 2020. Can someone help provide me with code to answer the question, so the new dataset should contain account ID's where this is the case, as well as all its RUN_DATE and SCORE information from June 2020? Thanks!

 

My dataset contains over 500,000 records, so the code would ideally work for the entire dataset. The screenshots below are just an example (so the dataset would keep accounts that had a missing SCORE value in July, but not missing in June - in the example, the new dataset would contain 16211274, 16211277 and 16211278, as well as its RUN_DATE and SCORE values).

 

Note: I only have one dataset, and the lines are just to separate June and July 2020 in this example:

Score query.JPG  

Justin9_0-1596558183373.png

 

 

 

7 REPLIES 7
Kurt_Bremser
Super User

Join the dataset with itself:

proc sql;
create table want as
  select distinct a.account_id
  from have a, have b
  where
    a.account_id = b.account_id and
    put(a.run_date,yymmn6.) = '202006' and
    a.score ne . and
    put(b.run_date,yymmn6.) = '202007' and
    b.score = .
;
quit;

Untested; for tested code, supply usable example data in a data step with datalines.

Justin9
Obsidian | Level 7

Thanks for your reply. I've edited my post slightly - I've only got one dataset (that contains both June 2020 and July 2020) and the SCORE is a character variable, so I've edited the code you provided but the new output table doesn't give me the results as intended (how I want it to look in my example). Would it be possible to advise me on where I've gone wrong?

 

proc sql;

create table analysis as

   select distinct ACCOUNT_ID

                         ,RUN_DATE

                         ,SCORE

   from ac.test

   where (put(RUN_DATE,yymmn6.)='202006' and SCORE ne ' ') and (put(RUN_DATE,yymmn6.)='202007' and SCORE=' ') 

   ;

quit;

 

 

Kurt_Bremser
Super User

@Justin9 wrote:

Thanks for your reply. I've edited my post slightly - I've only got one dataset (that contains both June 2020 and July 2020) and the SCORE is a character variable, so I've edited the code you provided but the new output table doesn't give me the results as intended (how I want it to look in my example). Would it be possible to advise me on where I've gone wrong?

 

proc sql;

create table analysis as

   select distinct ACCOUNT_ID

                         ,RUN_DATE

                         ,SCORE

   from ac.test

   where (put(RUN_DATE,yymmn6.)='202006' and SCORE ne ' ') and (put(RUN_DATE,yymmn6.)='202007' and SCORE=' ') 

   ;

quit;

 

 


I explicitly stated in my post that I join the dataset to ITSELF, so I am taking into account that you only have one dataset.

See here the proof that my code works (I just added two columns to the output, changed score to character, and deleted the DISTINCT keyword, as it is not needed):

data have;
input account_id $ run_date :date9. score $;
format run_date date9.;
datalines;
16211273 26jun2020 125
16211274 26jun2020 235
16211275 26jun2020 .
16211276 28jun2020 .
16211277 29jun2020 400
16211278 29jun2020 500
16211279 30jun2020 .
16211280 30jun2020 .
16211273 22jul2020 150
16211274 24jul2020 .
16211275 26jul2020 200
16211276 28jul2020 .
16211277 29jul2020 .
16211278 29jul2020 .
16211279 30jul2020 402
16211280 30jul2020 326
;

proc sql;
create table want as
  select a.account_id, a.run_date, a.score
  from have a, have b
  where
    a.account_id = b.account_id and
    put(a.run_date,yymmn6.) = '202006' and
    a.score ne "" and
    put(b.run_date,yymmn6.) = '202007' and
    b.score = ""
;
quit;

proc print data=want noobs;
run;

Result:

account_id	run_date	score
16211274	26JUN2020	235
16211277	29JUN2020	400
16211278	29JUN2020	500

Note that I had to waste a lot of time typing data off a picture(!). Next time, please provide data as shown in this code, so we can immediately start testing.

Pictures are the WORST way to present example data, by lunar distances.

Justin9
Obsidian | Level 7

Hi, I've got one dataset that contains three variables:

1. ACCOUNT_ID (Type: Numeric, Format: 13.)

2. RUN_DATE (Type: Numeric, Group: Date, Format: DATE9.)

3. SCORE (Type: Character)

 

In my dataset, ac.test, I would like to create another dataset with all of the the account ID that had a missing SCORE value in July 2020, but not missing in June 2020. Can someone help provide me with code to answer the question, so the new dataset should contain account ID's where this is the case, as well as all its RUN_DATE and SCORE information from June 2020? Thanks!

 

My dataset contains over 500,000 records, so the code would ideally work for the entire dataset. The screenshots below are just an example (so the dataset would keep accounts that had a missing SCORE value in July, but not missing in June - in the example, the new dataset would contain 16211274, 16211277 and 16211278, as well as its RUN_DATE and SCORE values).

 

Note: I only have one dataset, and the lines are just to separate June and July 2020 in this example:

Score query.JPG  

Justin9_0-1596558183373.png

mkeintz
PROC Star
data want;
  merge have (where=(run_date between '01jul2020'd and '31jul2020'd and score=.)  in=in1)
        have (where=(run_date between '01jun2020'd and '30jun2020'd and score^=.) in=in2);
  by account_id;
  if in1 and in2;

Note that the JULY 2020 filter is the first merge operand, and the JUNE filter is the last.  So any common variables will get their values from the JUNE record.

 

This assumes your data are sorted by ID  (but not necessarily by run_date within id).

 

Editted addition:  But I see that your sample data is sorted by date/id, but within monthly groupings it appears to effectively sorted by id.  If so then the above program should still work.

 

But if, within each month, date are NOT sorted by id within each month, you would need to pre-sort the records of interest:

 

proc sort data=have out=need;
 by account_id run_date;
 where
    (run_date between '01jul2020'd and '31jul2020'd and score=.) 
    or
    (run_date between '01jun2020'd and '30jun2020'd and score^=.) ;
run;
data want;
  set need;
  by account_id;
  if first.account_id=1 and last.account_id=0;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 916 views
  • 1 like
  • 3 in conversation