- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Could sure use some help on this problem please. I have a data set where there are instances of duplicate job codes where one instance the job title is abbreviated and the other instance the job title is complete. I need to keep the observation with the complete job title and ignore the observation with the abbreviated job title. As you can see with some observations the abbreviated form is first but in other cases the complete form is first.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It is easy to check both job_code and BU_level_1 - updated code:
proc sort data=have out=temp;
by job_code bu_level_1;
run;
data want;
set temp;
by Job_code bu_level_1;
length title $50; /* max expected length */
retain title;
if first.bu_level_1 then title = Job_title;
else if length(Job_Title) > length(title) then title = job_title;
if last.bu_level_1 then do;
Job_title = title;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I assume that:
- the only difference between observations of the same job_ID is the job title,
- the complete title is the longest one
- the file is sorted by job_id
then use next code to save wanted result:
data want;
set have;
by Job_code;
length title $50; /* max expected length */
retain title;
if first.job_code then title = Job_title;
else if length(Job_Title) > length(title) then title = job_title;
if last.Job_code then do;
Job_title = title;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your solution worked great, except for one thing....my fault I totally forgot to mention it.....in these examples, the job_code and job_title are duplicates but the BU_Level_01 are different, so in these cases I would need to keep both. So sorry for the confusion totally my fault for omitting this info. Other than that the code you provided works perfectly! Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It is easy to check both job_code and BU_level_1 - updated code:
proc sort data=have out=temp;
by job_code bu_level_1;
run;
data want;
set temp;
by Job_code bu_level_1;
length title $50; /* max expected length */
retain title;
if first.bu_level_1 then title = Job_title;
else if length(Job_Title) > length(title) then title = job_title;
if last.bu_level_1 then do;
Job_title = title;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Perfect! Thank you so much Shmuel works like a charm!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So by "abbreviation" you don't mean things like "ltd" (for limited) or "dir." (for director). Your examples seem to be such that one title is a sequence of characters that falls within the other title.
If so, and your data are sorted by jobcode:
data want;
set have;
by jobcode;
merge have have (firstobs=2 rename=(job_title=nxt_job_title));
if (first.jobcode=0 and length(job_title)>length(lag(job_title)))
or (last.jobcode=0 and length(job_title)>length(nxt_job_title))
or (first.jobcode=1 and last.jobcode=1);
run;
The subsetting if is satisfied by any of these three conditions
- The record-in-hand is the second for this jobcode and its jobtitle length is greater than the preceding (first for this jobcode) jobtitle.
- The record-in-hand is the first for this jobcode and its jobtitle length is greater than the following (second for this jobcode).
- The record-in-hand is the only one for this jobcode.
Note this program assumes you never have more than 2 records per jobcode.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your response, yes things like "Dir" would be considered a duplicate if there were a second observation with the same job_code but "Director" spelled out as seen here:
100041 Learning & Development Dir
100041 Learning & Development Director
This is a duplicate, the first needs dropped and the second is the observation I want to keep.