Hi...I am trying to correct mistakes that occur in a dataset. The mistake is identified when a StudentUID has two different Programs for the same AcademicYear as it is possible to be in both but not in the same Academic Year. I would like to be able to select the program to be used in over writing the program that is an error. For an example, if the same StudntUID is enrolled in both International Business and Leadership and Management, I would like to be able to select International Business to either replace the Leadership and Management entry or delete the row with the Program as Leadership and Management would work too.... Any suggestion on how to approach and achieve this. Thanks.
data have;
length StudentUID 8 Program $ 38 AcademicYear $ 7;
format StudentUID best12. Program $char38. AcademicYear $char7.;
informat StudentUID best12. Program $char38. AcademicYear $char7.;
infile datalines4 dlm='7F'x missover dsd;
input StudentUID : best32. Program : $char38. AcademicYear : $char7.;
datalines4;
8642Network and Computer Technology2016/17
19269Network and Computer Technology2017/18
19269Cyber Defence and Cloud Administration2017/18
48311Cyber Defence and Cloud Administration2017/18
49684Hotel & Hospitality Services2016/17
49684Hospitality Management2016/17
52502Hospitality Management2017/18
52854Hotel & Hospitality Services2016/17
52966Leadership and Management2017/18
52966International Business2017/18
53283Leadership and Management2017/18
55605International Business2017/18
;;;;
Want:
StudentUID Program AcademicYear
8642 Network and Computer Technology 2016/17
19269 Cyber Defence and Cloud Administration 2017/18
48311 Cyber Defence and Cloud Administration 2017/18
49684 Hotel & Hospitality Services 2016/17
52502 Hospitality Management 2017/18
52854 Hotel & Hospitality Services 2016/17
52966 International Business 2017/18
53283 Leadership and Management 2017/18
55605 International Business 2017/18
see if this works.
Proc sort data=have nounikey uniqueout=tmp1 ;
by StudentUID AcademicYear;
run;
Proc format ;
value $program 'Hospitalilty Management and Hotel' = 'Hotel & Hospitality Services'
'Leadership and Management'='International Business'
'Network and Computer Technology'='Cyber Defence and Cloud Administration'
;
run;
data tmp2;
set have;
program=put(program,$program38.);
run;
proc sort data=tmp2 nodupkey out=tmp3;
by StudentUID AcademicYear;
run;
data final;
set tmp1 tmp3;
run;
data have;
length StudentUID 8 Program $ 38 AcademicYear $ 7;
format StudentUID best12. Program $char38. AcademicYear $char7.;
informat StudentUID best12. Program $char38. AcademicYear $char7.;
infile datalines4 dlm='7F'x missover dsd;
input StudentUID : best32. Program : $char38. AcademicYear : $char7.;
datalines4;
8642Network and Computer Technology2016/17
19269Network and Computer Technology2017/18
19269Cyber Defence and Cloud Administration2017/18
48311Cyber Defence and Cloud Administration2017/18
49684Hotel & Hospitality Services2016/17
49684Hospitality Management2016/17
52502Hospitality Management2017/18
52854Hotel & Hospitality Services2016/17
52966Leadership and Management2017/18
52966International Business2017/18
53283Leadership and Management2017/18
55605International Business2017/18
;
data want;
set have;
by studentUID notsorted;
if last.studentUID;
run;
/* I would like to be able to select International Business
to either replace the Leadership and Management entry
or delete the row with the Program as Leadership and Management would work too..*/
proc sort data=have out=want nodupkey;
by studentuid academicyear;
run;
Do you have any rules for identifying which program is entered in error and which is not?
if you just want to just eliminate one record from the duplicates no matter what, I think this step is good enough.
proc sort data=have out=want nodupkey;
by StudentUID AcademicYear;
run;
Hi Behata….Yes...when in both Hospitalilty Management and Hotel & Hospitality Services for the same academic year, then the correct program should have been Hotel & Hospitality Services. Similiarly, when in both International Business and Leadership and Management, International Business should have been the correct entry.
@twildone wrote:
Hi Behata….Yes...when in both Hospitalilty Management and Hotel & Hospitality Services for the same academic year, then the correct program should have been Hotel & Hospitality Services. Similiarly, when in both International Business and Leadership and Management, International Business should have been the correct entry.
How do you know this? Is there a different criterion for each person? What's the logic to say one is correct over another?
Hi Rezza,....no...its the same for all students. The program which is considered a mistakes is actually the first part or half of the other program.....so a student can either take the first part of the program or both parts of the correct program. What happens is sometimes a student will decide to only take the first part of the program ( for example, Hospitality Management) and then decides to take the entire program ( which would be Hotel & Hospitality Services) which they end up be enrolled in both program when actually they were enrolled in Hotel & Hospitality Services.
Create a data set with a ranked order for each of your programs. If a student happens to take two types of program in one semester you can then sort by the ordered variable and take the first or last based on that data set.
Hardcoding isn't a sustainable solution but creating that data set will likely take some effort.
@twildone wrote:
Hi Rezza,....no...its the same for all students. The program which is considered a mistakes is actually the first part or half of the other program.....so a student can either take the first part of the program or both parts of the correct program. What happens is sometimes a student will decide to only take the first part of the program ( for example, Hospitality Management) and then decides to take the entire program ( which would be Hotel & Hospitality Services) which they end up be enrolled in both program when actually they were enrolled in Hotel & Hospitality Services.
Your request needs clarification and a layout for program flow. how do we know which program is first or second when you may have unlisted programs in your larger dataset. And you did not fully tell us how to handle programs.
see if this works.
Proc sort data=have nounikey uniqueout=tmp1 ;
by StudentUID AcademicYear;
run;
Proc format ;
value $program 'Hospitalilty Management and Hotel' = 'Hotel & Hospitality Services'
'Leadership and Management'='International Business'
'Network and Computer Technology'='Cyber Defence and Cloud Administration'
;
run;
data tmp2;
set have;
program=put(program,$program38.);
run;
proc sort data=tmp2 nodupkey out=tmp3;
by StudentUID AcademicYear;
run;
data final;
set tmp1 tmp3;
run;
both programs provided give different results that your want sample, is something missing that would help us understand why you want
this record results?
49684 Hotel & Hospitality Services 2016/17
Hi VDD.....49684 is in both Hotel & Hospitality Services and Hospitality Management during 2016/17 so the mistake in this case is Hospitality Management so correct program should have been Hotel & Hospitality Services for 2016/17.
49684Hotel & Hospitality Services2016/17
49684Hospitality Management2016/17
you need a indexing or a formatting on the program, because based on your sample data for have and want an error could accrue in the selection process.
need a look up table maybe?
none tested sample
proc format;
value myprogram
"Network and Computer Technology"=1
'Cyber Defence and Cloud Administration'=2;
quit;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.