Solved: Re: Map and Aggregate Data

michokwu · Posted 01-27-2020 03:18 PM

Hello,

I would like to link and aggregate records (over 10 million) in a dataset to another dataset. Is there a way to achieve this in one task? Thank you

Data 1 - what I have

City	City_Code	Item 1	Item 2	Item 3	Item 4
City A	123001001	10	20	30	40
City B	123001002	20	30	40	50
City C	123002001	30	40	50	60
City D	123002002	40	50	60	70
City E	123003001	50	60	70	80
City F	123003002	60	70	80	90
City G	123004001	70	80	90	100
City H	123004002	80	90	100	110

Data 2 - what I want (the region code forms the beginning part of the city code)

Region	Region_Code	Item 1	Item 2	Item 3	Item 4
North	123001	30	50	70	90
South	123002	70	90	110	130
East	123003	110	130	150	170
West	123004	150	170	190	210

Reeza · Posted 01-27-2020 03:47 PM

You can do it quite easily in two steps.

This assumes that Region_Code is character.

proc means data=have SUM stackODS;
class city_code;
format city_code 6.; *uses only 6 characters for grouping, very important here;
var item1-item4;
ods output summary = want1;
output out=want2 sum= / autoname;
run;

Now apply the Region name to the data.

Or you can do it via one SQL pass but it's more complicated IMO.

proc sql;
create table want3 as
select substr(region_code, 1, 6) as region_code, put(region_code, $region_fmt.) as region, sum(item1) as item1, sum(item2) as item2, sum(item3) as item3, sum(item4) as item4
from have
group by calculated region_code;
quit;

View solution in original post

Reeza · Posted 01-27-2020 03:47 PM

You can do it quite easily in two steps.

This assumes that Region_Code is character.

proc means data=have SUM stackODS;
class city_code;
format city_code 6.; *uses only 6 characters for grouping, very important here;
var item1-item4;
ods output summary = want1;
output out=want2 sum= / autoname;
run;

Now apply the Region name to the data.

Or you can do it via one SQL pass but it's more complicated IMO.

proc sql;
create table want3 as
select substr(region_code, 1, 6) as region_code, put(region_code, $region_fmt.) as region, sum(item1) as item1, sum(item2) as item2, sum(item3) as item3, sum(item4) as item4
from have
group by calculated region_code;
quit;

michokwu · Posted 01-28-2020 03:25 PM

Thank you Reeza.

Please can you clarify 'Now apply the Region name to the data'

Reeza · Posted 01-28-2020 03:43 PM

If you run the PROC MEANS it will combine up the region_codes, ie 123001 but it doesn't know that that region_code means East. I suggest adding another format for that but you'll still need to apply it in a second step with the PROC MEANS approach but its pretty simple IMO.

michokwu · Posted 01-29-2020 04:02 PM

I can't seem to format the city code, even though the format contains fewer characters, when I run the program it retains the region code. The codes are numeric

michokwu · Posted 01-31-2020 03:55 PM

Thanks Reeza,

I also had to transpose the output into the format I preferred.

Reeza · Posted 02-02-2020 11:35 PM

You may not want the STACKODS option on PROC MEANS then, that changes how the output is structured.

hashman · Posted 01-27-2020 06:56 PM

@michokwu:

SQL is perhaps the best (or at least simplest) option:

data city ;                                                                                                                                                                                                                                                     
  input @1 City $6. City_Code:$9. Item1-item4 ;                                                                                                                                                                                                                 
  cards ;                                                                                                                                                                                                                                                       
City A  123001001  10  20   30   40                                                                                                                                                                                                                             
City B  123001002  20  30   40   50                                                                                                                                                                                                                             
City C  123002001  30  40   50   60                                                                                                                                                                                                                             
City D  123002002  40  50   60   70                                                                                                                                                                                                                             
City E  123003001  50  60   70   80                                                                                                                                                                                                                             
City F  123003002  60  70   80   90                                                                                                                                                                                                                             
City G  123004001  70  80   90  100                                                                                                                                                                                                                             
City H  123004002  80  90  100  110                                                                                                                                                                                                                             
;                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                
data region ;                                                                                                                                                                                                                                                   
  input Region $ Region_Code:$6. ;                                                                                                                                                                                                                              
  cards ;                                                                                                                                                                                                                                                       
North  123001                                                                                                                                                                                                                                                   
South  123002                                                                                                                                                                                                                                                   
East   123003                                                                                                                                                                                                                                                   
West   123004                                                                                                                                                                                                                                                   
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
proc sql ;                                                                                                                                                                                                                                                      
  create table want as                                                                                                                                                                                                                                          
  select region                                                                                                                                                                                                                                                 
       , region_code                                                                                                                                                                                                                                            
       , sum (item1) as item1                                                                                                                                                                                                                                   
       , sum (item2) as item2                                                                                                                                                                                                                                   
       , sum (item3) as item3                                                                                                                                                                                                                                   
       , sum (item4) as item4                                                                                                                                                                                                                                   
  from   city, region                                                                                                                                                                                                                                           
  where  region_code = put (city_code, $6.)                                                                                                                                                                                                                     
  group  1, 2                                                                                                                                                                                                                                                   
  order  2                                                                                                                                                                                                                                                      
  ;                                                                                                                                                                                                                                                             
quit ;

If you loathe to list all the items as above, especially if there are way many than 4, heed what @Reeza has said. It can be done in a single DATA step, too (i.e. without listing all the items), using the hash object; but if you see the requisite code, you'll sure appreciate the simplicity of SQL:

data _null_ ;                                                                                                                                                                                                                                                   
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    if 0 then set region city ;                                                                                                                                                                                                                                 
    array it item: ;                                                                                                                                                                                                                                            
    dcl hash h (ordered:"a") ;                                                                                                                                                                                                                                  
    h.definekey ("region_code") ;                                                                                                                                                                                                                               
    h.definedata ("region", "region_code") ;                                                                                                                                                                                                                    
    do over it ;                                                                                                                                                                                                                                                
      h.definedata (vname (it)) ;                                                                                                                                                                                                                               
    end ;                                                                                                                                                                                                                                                       
    h.definedone() ;                                                                                                                                                                                                                                            
    do until (lr) ;                                                                                                                                                                                                                                             
      set region end = lr ;                                                                                                                                                                                                                                     
      h.add() ;                                                                                                                                                                                                                                                 
    end ;                                                                                                                                                                                                                                                       
  end ;                                                                                                                                                                                                                                                         
  set city (rename=(item1-item4=_it1-_it4)) end = lc ;                                                                                                                                                                                                          
  array _it _it: ;                                                                                                                                                                                                                                              
  region_code = put (city_code, $6.) ;                                                                                                                                                                                                                          
  if h.find() ne 0 then call missing (of item:) ;                                                                                                                                                                                                               
  do over it ;                                                                                                                                                                                                                                                  
    it + _it ;                                                                                                                                                                                                                                                  
  end ;                                                                                                                                                                                                                                                         
  h.replace() ;                                                                                                                                                                                                                                                 
  if lc then h.output (dataset:"want") ;                                                                                                                                                                                                                        
run ;

OTOH, instead of listing all the sum(item1) as item1, ... in the SQL query, you can auto-construct a macro variable containing the necessary text:

data _null_ ;                                                                                                                                                                                                                                                   
  length sumit $ 32767 ;                                                                                                                                                                                                                                        
  do i = 1 to 4 ;                                                                                                                                                                                                                                               
    sumit = catx (",", sumit, cats ("sum(item", i, ") as item", i)) ;                                                                                                                                                                                           
  end ;                                                                                                                                                                                                                                                         
  call symputx ("sumit", sumit) ;                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
proc sql ;                                                                                                                                                                                                                                                      
  create table want as                                                                                                                                                                                                                                          
  select region                                                                                                                                                                                                                                                 
       , region_code                                                                                                                                                                                                                                            
       , &sumit                                                                                                                                                                                                                                                 
  from   city, region                                                                                                                                                                                                                                           
  where  region_code = put (city_code, $6.)                                                                                                                                                                                                                     
  group  1, 2                                                                                                                                                                                                                                                   
  order  2                                                                                                                                                                                                                                                      
  ;                                                                                                                                                                                                                                                             
quit ;

Kind regards

Paul D.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away