06-26-2013 11:24 AM
I'm trying to solve the following problem by merging multiple data sets.
I have a data set that contains information about employees:
|Department ID||Emp. ID||Role|
And another data set that contains information regarding their salary:
I want to combine those two data sets and remove the ID column in the first data set and the role column in the second data set. The resulting data set should only contain one column that contains both salary and role. I call this column "Attribute".
Roles within the departments are unique. Emp IDs are unique, as well. The problem is that the second data set does not contain the employee ID column. However, we know that there is only one IT Lead in the department and that his emp ID is 0012. That information is given in the first data set.
Another problem is that in the second data set, there are 5 possible values to choose from for each role, i.d. SE could be Software Engineer, SE, Soft-Eng, S.E., or S-Eng. These synonyms are predefined. The roles in the first data set are already standardized and do not vary.
I would really appreciate your help.
06-26-2013 01:27 PM
Do you already have a table or something that matches the "Project Lead" to "IT Lead", "Software Engineer" to "SE" and any other pairs of "Roles" that have different values but mean the same? If not, that's going to be a needed piece unless you only have the three values shown.
06-27-2013 02:48 AM
yes, that table exists. It looks like this:
|IT Lead||IT Lead, ITL, Lead, Project Lead, Program Manager|
|SE||Software Engineer, Soft.-Eng., S-eng, S-Engineer, SE|
06-26-2013 03:04 PM
We need to use Merge and Append to get the result.
input deptno empid role $;
1234 1 lead
1234 2 lead1
9999 1 it
input deptno salary $;
proc sort data=y1;by deptno;run;
proc sort data=y2;by deptno;run;
merge y1(in=a) y2(in=b);
if a and b;
data final(drop= salary role);
set y4 y5;
06-27-2013 02:49 AM
Thank you for your answer, yeshwanth. However, your code does not solve the problem, primarily because it ignores the role and only matches based on sorting
06-27-2013 04:47 AM
data emp; input DepartmentID $ EmpID $ Role & $20.; cards; 12345 0012 IT Lead 12345 0013 SE 23456 0014 SE ; run; data dic; infile cards truncover; input Synonym $100.; cards; IT Lead IT Lead, ITL, Lead, Project Lead, Program Manager SE Software Engineer, Soft.-Eng., S-eng, S-Engineer, SE ; run; data salary; input DepartmentID $ Salary $ Role & $20.; cards; 12345 70,000 Project Lead 12345 55,000 Software Engineer 23456 80,000 SE ; run; proc sql; create table x as select * from emp,dic where Synonym contains strip(Role) ; quit; data want; if _n_ eq 1 then do; if 0 then set salary(rename=(role=_role)); declare hash ha(dataset:'salary(rename=(role=_role))'); ha.definekey('DepartmentID','_Role'); ha.definedata('Salary'); ha.definedone(); end; set x; do i=1 to countw(Synonym,','); _role=strip(scan(Synonym,i,',')); if ha.find()=0 then do; output;call missing(salary);leave;end; end; run;
06-28-2013 01:38 AM
How about this one...
input dept_id emp_id role $12.;
12345 0012 IT Lead
12345 0013 SE
23456 0014 SE
input dept_id salary role $12.;
12345 70000 Project Lead
12345 55000 Software Eng.
23456 80000 SE
data both(drop = salary role);
merge one two(drop = role);
dept_id = dept_id;
ID = emp_id;
attributes = compress(put(salary,8.));
attributes = compress(role);