BookmarkSubscribeRSS Feed
Bounce
Fluorite | Level 6

Hello SAS community,


I have a dataset with 3 rows: ID, Date, Event. ID and Date are pretty self-explanatory but Event contains different education phases for each ID: Kindergarten, Preschool, Primary, High School and University. Now, the data is full of errors - sometimes the dates are wrong in which it is out of sequence or duplicated. Sometimes, it may show that the customer participated in say for example, high school before primary school etc. (Gaps are completely fine - ie. Fine for a customer to have a date where he attended Kindergarten then very next line can be the date he went to University). My logic is to assume these are entries are completely erroneous.

 

How can I logically sort this so that any data out-of-sequence which does not make sense is deleted? (If customer had a 'University' stint in between 'Primary' and 'High School' then delete this row)

 

Data I have:

 

data have;

input id date event;

1 01012000 Kindergarten

1 01012001 High school

1 01012003 Primary

1 01012010 High School

 

2 01012000 University

2 01012010 High School

 

3 01012009 Kindergarten

3 01012010 Preschool

3 01012012 Kindergarten

3 01012019 High School

3 01012022 Kindergarten

3 01012024 University

;

run;

 

Data I want:

 

data want;

input id date event;

1 01012000 Kindergarten

1 01012003 Primary

1 01012010 High School

 

2 01012010 High School

 

3 01012009 Kindergarten

3 01012010 Preschool

3 01012019 High School

3 01012024 University

;

run;

 

Many thanks for your kind help SAS community.

 

3 REPLIES 3
mkeintz
PROC Star

@Bounce wrote:

Hello SAS community,

...

How can I logically sort this so that any data out-of-sequence which does not make sense is deleted? (If customer had a 'University' stint in between 'Primary' and 'High School' then delete this row)

 

Data I have:

 

data have;

input id date event;

 

2 01012000 University

2 01012010 High School

 


 

 

For ID 2 you have University followed by High School.  You could delete either one to meet your objective.  What rule are you using to choose which one to delete?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
andreas_lds
Jade | Level 19

First thing to do: convert the dates to sas-dates, so that sorting by date yields the expected results - also necessary to compare dates.

Jagadishkatam
Amethyst | Level 16

Please try the below code

 

data have;
input id date:ddmmyy8.  event&:$20.;
format date date9.;
cards;
1 01012000 Kindergarten
1 01012001 High school
1 01012003 Primary
1 01012010 High School
2 01012000 University
2 01012010 High School
3 01012009 Kindergarten
3 01012010 Preschool
3 01012012 Kindergarten
3 01012019 High School
3 01012022 Kindergarten
3 01012024 University
;

proc format;
invalue sch
'Kindergarten'=1
'High school'=3
'High School'=3
'Primary'=2
'University'=4
'Preschool'=1
;

proc sort data=have;
by id date;
run;

data have;
set have;
schnum=input(event,sch.);
by id date;
run;

proc sort data=have;
by id schnum;
run;

data want;
set have;
by id schnum;
retain schnum2 date2;
if first.id then schnum2=schnum;
if first.id then date2=date;
format date2 date9.;
if schnum>schnum2 and date<date2 then delete;
run;
Thanks,
Jag

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 514 views
  • 0 likes
  • 4 in conversation