Re: Duplicate observations with new StateID column?

imdickson · Posted 07-31-2019 03:46 AM

data have;
input Id Twin Age1 Age2 Sex1 $ Sex2 $ Cost;
infile datalines missover;
datalines;1 1 45 35 M F 10
2 0 32 .  F  20
3 0 26 .  M  30
4 1 20 21 M F .;

With the dataset above, is there a way for me to duplicate the record set with a new column called "StateID" with every 1 full set having 1 StateID. For example, first full set of data will have StateId=1. 2nd set onwards will have Stateid=2 and all the way until the 13th set which will have StateId=13.

1 full set = 34 rows.

How can i achieve that?

Kurt_Bremser · Posted 07-31-2019 03:49 AM

So you have a dataset with 34 * 13 observations, and you want to increment your new column every 34 observations?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

imdickson · Posted 07-31-2019 03:52 AM

Meaning to say 1 full set of data has 34 rows.

What i want is every 34 rows of record, stateid will increase by 1.

So the first 34 rows will have StateID=1.

Second 34 rows will have StateID=2 and so forth.

imdickson · Posted 07-31-2019 03:54 AM

Currently there is only 34 rows of records in this table.

Kurt_Bremser · Posted 07-31-2019 04:07 AM

Simple do loop, and sorting afterwards:

data want;
set have;
do StateID = 1 to 13;
  output;
end;
run;

proc sort data=want;
by StateID Id;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hashman · Posted 07-31-2019 04:50 AM

@imdickson:

This is perhaps a relatively efficient way since it reads the input data set once but requires sorting if you want WANT to be sorted by StateID:

data have ;                                        
  retain stateID ;                                 
  input Id Twin Age1 Age2 (Sex1 Sex2) (:$1.) Cost ;
  cards ;                                          
1 1 45 35 M F 10                                   
2 0 32  . F . 20                                   
3 0 26  . M . 30                                   
4 1 20 21 M F  .                                   
;                                                  
run ;                                              
                                                   
data want ;                                        
  retain stateID ;                                 
  set have ;                                       
  do stateID = 1 to 13 ;                           
    output ;                                       
  end ;                                            
run ;                                              
                                                   
proc sort ;                                        
  by stateID ;                                     
run ;

This is a much less efficient way since you have to re-read the input data set 12 times; but you don't have to sort:

data want ;                        
  do StateID = 1 to 13 ;           
    do p = 1 to n ;                
      set have point = p nobs = n ;
      output ;                     
    end ;                          
  end ;                            
  stop ;                           
run ;

Yet another way, probably most efficient since it reads HAVE only once and requires no sorting or means to put StateID first in the variable list:

data v / view = v ;                           
  do StateID = 1 to 13 ;                      
    output ;                                  
  end ;                                       
run ;                                         
                                              
proc sql ;                                    
  create table want as select * from v, have ;
quit ;

It's still most efficient, even though the log says:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

The note is valid; however, it doesn't testify to an inefficiency because the Cartesian product is exactly what here you're aiming to create.

Kind regards

Paul D.

Duplicate observations with new StateID column?