Hi...I am trying to correct mistakes that occur in a dataset. The mistake is identified when a StudentUID has two different Programs for the same AcademicYear as it is possible to be in both but not in the same Academic Year. I would like to be able to select the program to be used in over writing the program that is an error. For an example, if the same StudntUID is enrolled in both International Business and Leadership and Management, I would like to be able to select International Business to either replace the Leadership and Management entry or delete the row with the Program as Leadership and Management would work too.... Any suggestion on how to approach and achieve this. Thanks.
data have;
    length StudentUID 8 Program $ 38 AcademicYear $ 7;
    format StudentUID best12. Program $char38. AcademicYear $char7.;
    informat StudentUID best12. Program $char38. AcademicYear $char7.;
    infile datalines4 dlm='7F'x missover dsd;
    input StudentUID : best32. Program : $char38. AcademicYear : $char7.;
datalines4;
8642Network and Computer Technology2016/17
19269Network and Computer Technology2017/18
19269Cyber Defence and Cloud Administration2017/18
48311Cyber Defence and Cloud Administration2017/18
49684Hotel & Hospitality Services2016/17
49684Hospitality Management2016/17
52502Hospitality Management2017/18
52854Hotel & Hospitality Services2016/17
52966Leadership and Management2017/18
52966International Business2017/18
53283Leadership and Management2017/18
55605International Business2017/18
;;;;
Want:
StudentUID	Program	AcademicYear
8642	Network and Computer Technology	2016/17
19269	Cyber Defence and Cloud Administration	2017/18
48311	Cyber Defence and Cloud Administration	2017/18
49684	Hotel & Hospitality Services	2016/17
52502	Hospitality Management	2017/18
52854	Hotel & Hospitality Services	2016/17
52966	International Business	2017/18
53283	Leadership and Management	2017/18
55605	International Business	2017/18
see if this works.
Proc sort data=have nounikey uniqueout=tmp1  ;
	by StudentUID AcademicYear;
run;
Proc format ;
value $program  'Hospitalilty Management and Hotel' = 'Hotel & Hospitality Services'
	  			'Leadership and Management'='International Business'
				'Network and Computer Technology'='Cyber Defence and Cloud Administration'
;
run;
data tmp2;
	set have;
	program=put(program,$program38.);
run;
proc sort data=tmp2 nodupkey out=tmp3;
	by StudentUID AcademicYear;
run;
data final;
	set tmp1 tmp3;
run;data have;
    length StudentUID 8 Program $ 38 AcademicYear $ 7;
    format StudentUID best12. Program $char38. AcademicYear $char7.;
    informat StudentUID best12. Program $char38. AcademicYear $char7.;
    infile datalines4 dlm='7F'x missover dsd;
    input StudentUID : best32. Program : $char38. AcademicYear : $char7.;
datalines4;
8642Network and Computer Technology2016/17
19269Network and Computer Technology2017/18
19269Cyber Defence and Cloud Administration2017/18
48311Cyber Defence and Cloud Administration2017/18
49684Hotel & Hospitality Services2016/17
49684Hospitality Management2016/17
52502Hospitality Management2017/18
52854Hotel & Hospitality Services2016/17
52966Leadership and Management2017/18
52966International Business2017/18
53283Leadership and Management2017/18
55605International Business2017/18
;
data want;
	set have;
	by studentUID notsorted;
	if last.studentUID;
run;
/* I would like to be able to select International Business 
to either replace the Leadership and Management entry 
or delete the row with the Program as Leadership and Management would work too..*/
proc sort data=have out=want nodupkey;
by studentuid academicyear;
run;Do you have any rules for identifying which program is entered in error and which is not?
if you just want to just eliminate one record from the duplicates no matter what, I think this step is good enough.
proc sort data=have out=want nodupkey;
	by StudentUID AcademicYear;
run;
Hi Behata….Yes...when in both Hospitalilty Management and Hotel & Hospitality Services for the same academic year, then the correct program should have been Hotel & Hospitality Services. Similiarly, when in both International Business and Leadership and Management, International Business should have been the correct entry.
@twildone wrote:
Hi Behata….Yes...when in both Hospitalilty Management and Hotel & Hospitality Services for the same academic year, then the correct program should have been Hotel & Hospitality Services. Similiarly, when in both International Business and Leadership and Management, International Business should have been the correct entry.
How do you know this? Is there a different criterion for each person? What's the logic to say one is correct over another?
Hi Rezza,....no...its the same for all students. The program which is considered a mistakes is actually the first part or half of the other program.....so a student can either take the first part of the program or both parts of the correct program. What happens is sometimes a student will decide to only take the first part of the program ( for example, Hospitality Management) and then decides to take the entire program ( which would be Hotel & Hospitality Services) which they end up be enrolled in both program when actually they were enrolled in Hotel & Hospitality Services.
Create a data set with a ranked order for each of your programs. If a student happens to take two types of program in one semester you can then sort by the ordered variable and take the first or last based on that data set.
Hardcoding isn't a sustainable solution but creating that data set will likely take some effort.
@twildone wrote:
Hi Rezza,....no...its the same for all students. The program which is considered a mistakes is actually the first part or half of the other program.....so a student can either take the first part of the program or both parts of the correct program. What happens is sometimes a student will decide to only take the first part of the program ( for example, Hospitality Management) and then decides to take the entire program ( which would be Hotel & Hospitality Services) which they end up be enrolled in both program when actually they were enrolled in Hotel & Hospitality Services.
Your request needs clarification and a layout for program flow. how do we know which program is first or second when you may have unlisted programs in your larger dataset. And you did not fully tell us how to handle programs.
see if this works.
Proc sort data=have nounikey uniqueout=tmp1  ;
	by StudentUID AcademicYear;
run;
Proc format ;
value $program  'Hospitalilty Management and Hotel' = 'Hotel & Hospitality Services'
	  			'Leadership and Management'='International Business'
				'Network and Computer Technology'='Cyber Defence and Cloud Administration'
;
run;
data tmp2;
	set have;
	program=put(program,$program38.);
run;
proc sort data=tmp2 nodupkey out=tmp3;
	by StudentUID AcademicYear;
run;
data final;
	set tmp1 tmp3;
run;both programs provided give different results that your want sample, is something missing that would help us understand why you want
this record results?
49684 Hotel & Hospitality Services 2016/17
Hi VDD.....49684 is in both Hotel & Hospitality Services and Hospitality Management during 2016/17 so the mistake in this case is Hospitality Management so correct program should have been Hotel & Hospitality Services for 2016/17.
49684Hotel & Hospitality Services2016/17
49684Hospitality Management2016/17
you need a indexing or a formatting on the program, because based on your sample data for have and want an error could accrue in the selection process.
need a look up table maybe?
none tested sample
proc format;
	value myprogram
	"Network and Computer Technology"=1
	'Cyber Defence and Cloud Administration'=2;
	
quit;It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
