Re: Help with combining and merging data sets

hovicke · Posted 03-16-2023 10:52 AM

I have three separate data sets below. I want to use a data step to show for each of the new employees and interns who their manager is and which department they will work in. Also, which department has the most new people. I don't know how to sort the new_employee and new_intern data sets to then have them merge with the manager data set.

data New_Employee;
length name $15;
input name $ team training_course work_experience;
datalines;
Bohr,Neils 1 0 3
Zook,Carla 2 2 2
Penrose,Roger 1 3 1
Martinez,Maria 3 1 2
Orfali,Philip 2 1 1
;
run;


data New_Intern;
length name $15 major $20;
input name $ team training_course major $;
datalines;
Capalleti,Jimmy 3 1 FashionDesign
Chen,Len 2 0 BusinessAnalytics
Cannon,Annie 1 0 Mathematics
Davis,Brad 3 0 Art
Einstein,Albert 1 1 ComputerScience
;
run;


data Manager;
length mname $15 department $10;
input mname $ team department $;
datalines;
Wilson,Kenneth 1 Operation
Bardeen,John 2 Marketing
Sorrell,Joseph 3 Design
;
run;

PaigeMiller · Posted 03-16-2023 10:57 AM

The only variable you can possibly merge by is TEAM, as that exists in all three data sets.

Therefore, you would have to sort all three data sets by TEAM before doing the merge.

--
Paige Miller

s_lassen · Posted 03-17-2023 06:52 AM

Something like this?

proc sort data=New_Employee(keep=name team) out=emp;                                                                                    
  by team name;                                                                                                                         
run;                                                                                                                                    
                                                                                                                                        
proc sort data=New_Intern(keep=name team) out=int;                                                                                      
  by team name;                                                                                                                         
run;                                                                                                                                    
                                                                                                                                        
data all;                                                                                                                               
  set emp int;                                                                                                                          
  by team name;                                                                                                                         
run;                                                                                                                                    
                                                                                                                                        
proc sort data=Manager;                                                                                                                 
  by team;                                                                                                                              
run;                                                                                                                                    
                                                                                                                                        
data want;                                                                                                                              
  if 0 then set all manager; /* just to get the right order of variables */                                                                                                           
  do count=1 by 1 until(last.team);                                                                                                     
    set all;                                                                                                                            
    by team;                                                                                                                            
    end;                                                                                                                                
  do until(last.team);                                                                                                                  
    merge all manager;                                                                                                                  
    by team;                                                                                                                            
    output;                                                                                                                             
    end;                                                                                                                                
run;

Patrick · Posted 03-18-2023 08:10 PM

Below using SQL.

proc sql;
  select 
    m.*
    ,emp.emp_type
    ,emp.name
    ,emp.training_course
    ,emp.work_experience
    ,emp.major
  from manager m

  left join 
  (
    select 
      team
      ,'new_employee' as emp_type
      ,name
      ,training_course
      ,work_experience
      ,'' as major
    from new_employee
    union corr all
    select
      team
      ,'new_intern' as emp_type
      ,name
      ,training_course
      ,major
      ,. as work_experience
    from new_intern 
  ) emp
  on m.team=emp.team

  order by team, name
  ;
quit;

ichisdeno · Posted 03-21-2023 10:49 AM

I am trying to make a process run faster, that uses 3 years worth of hospital data. My goal is to just run the past 3 months of data, and append it to the existing large dataset. If a patient identifier appears in both tables, I want to only use the data from the new table. For example: data have_a; input ID amount; datalines; 1 10 3 15 4 20 7 10 7 15 9 12 10 14 ; run; data have_b; input ID amount; datalines; 2 15 3 20 4 10 4 15 5 12 7 20 8 15 9 10 11 20 ; run; data test; merge have_a have_b; by ID; run; My goal: If have_a and have_b contain any ID# in common, it should only keep the values from have_b. However, as it stands, ID#7 has 2 rows in the final dataset (with amounts 20 and 15) when it should only have 1. This is a very simplified version of my real data, which contains 20+ columns and millions of rows.

Appvalley https://vlc.onl/

ballardw · Posted 03-21-2023 11:25 AM

@ichisdeno

How does this relate to the original question about combining 3 data sets???

This looks like it should be a completely new topic as your requirements are very different from the original question starting this thread.

Also, you seem not to actually want an Append (add records) as an UPDATE, which would replace the values in the master table from a new (transaction) data set based on matching criteria, likely the ID. An update will add new records if the Id is new.

If you have corrupted your master data set by creating records with duplicate ids then you need to recover the old version before you start.

Help with combining and merging data sets