Solved: How do i put most common input in a new variable

Diosfreak · Posted 10-31-2019 07:11 PM

My dataset looks like this

id gap insurance

1 -5 one

1 2 one

1 9 two

2 1 three

2 10 one

3 5 four

3 4 four

3 2 one

3 -3 two

What i want to do is the following, i want a new variable created that will give me the most occuring insurance if the gap is between -5 to 5 days.

id-Unique id of the individual

gap-gap in days

insurance-the different types of insurance

So my final result should look like

id gap insurance finalinsurance

1 -5 one one

1 2 one one

1 9 two one

2 1 three three

2 10 one three

3 5 four four

3 4 four four

3 2 one four

3 -3 two four

some more insight on the result:

ID 1 has 2 insurance that falls within my -5 to 5 days gap so i take a count of it, and i pick the most common one. In this case it is one so i populate the new variable with this answer.

ID 2 only one met my criteria so i went with the insurance , which is three.

ID 3, they all fall within my gap, so i take a count of the insurance input and go with the most common one, in this case insurance type four is the most common one.

Please any assistance will be appreciated. Thank you.

unison · Posted 10-31-2019 08:23 PM

Give this a shot.

1. Read in "raw_data" that you supplied.

2. Create a frequency table on id and insurance- call it "outfreq".

(be sure to filter raw_data down to observations with gaps between -5 and 5)

3. Sort raw_data for later merge.

4. Sort "outfreq" so that most frequent by id is on top.

5. De-duplicate "outfreq" - call it "mostfreq".

6. Merge "mostfreq" with raw_data (like a left join).

data raw_data;
input id gap insurance $;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one 
3 5 four
3 4 four
3 2 one
3 -3 two
;
run;

proc freq data = raw_data(where=(gap ge -5 and gap le 5)) noprint;
tables id*insurance / norow nocol nopercent nocum
out = outfreq(rename=count=finalinsurance drop=percent);
run; 

proc sort data=raw_data; 
	by id; 
run;
proc sort data=outfreq; 
	by id descending finalinsurance; 
run;
proc sort data=outfreq(keep=id insurance) 
	nodupkey 
	out=mostfreq(rename=insurance=finalinsurance); 
	by id; 
run;

data want;
	merge
		raw_data(in=a)
		mostfreq(in=b) 
	;
	by id;
	if a;
run;

-unison

View solution in original post

Diosfreak · Posted 10-31-2019 07:10 PM

My dataset looks like this

id gap insurance

1 -5 one

1 2 one

1 9 two

2 1 three

2 10 one

3 5 four

3 4 four

3 2 one

3 -3 two

What i want to do is the following, i want a new variable created that will give me the most occuring insurance if the gap is between -5 to 5 days.

id-Unique id of the individual

gap-gap in days

insurance-the different types of insurance

So my final result should look like

id gap insurance finalinsurance

1 -5 one one

1 2 one one

1 9 two one

2 1 three three

2 10 one three

3 5 four four

3 4 four four

3 2 one four

3 -3 two four

some more insight on the result:

ID 1 has 2 insurance that falls within my -5 to 5 days gap so i take a count of it, and i pick the most common one. In this case it is one so i populate the new variable with this answer.

ID 2 only one met my criteria so i went with the insurance , which is three.

ID 3, they all fall within my gap, so i take a count of the insurance input and go with the most common one, in this case insurance type four is the most common one.

Please any assistance will be appreciated. Thank you.

mkeintz · Posted 10-31-2019 08:28 PM

I suggest you remove this topic, or mark it as solved, since youi already have posted it in a more appropriate forum group.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Diosfreak · Posted 10-31-2019 11:26 PM

Yes thanks! Its my first time using this forum, i was not sure where to paste my question. Thanks again.

Kurt_Bremser · Posted 11-01-2019 04:14 AM

I merged the two questions.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Diosfreak · Posted 11-01-2019 11:02 AM

Thank you. I wanted to delete one, but I couldn't figure out how to..im new here! ><. Thanks again.

unison · Posted 10-31-2019 08:23 PM

Give this a shot.

1. Read in "raw_data" that you supplied.

2. Create a frequency table on id and insurance- call it "outfreq".

(be sure to filter raw_data down to observations with gaps between -5 and 5)

3. Sort raw_data for later merge.

4. Sort "outfreq" so that most frequent by id is on top.

5. De-duplicate "outfreq" - call it "mostfreq".

6. Merge "mostfreq" with raw_data (like a left join).

data raw_data;
input id gap insurance $;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one 
3 5 four
3 4 four
3 2 one
3 -3 two
;
run;

proc freq data = raw_data(where=(gap ge -5 and gap le 5)) noprint;
tables id*insurance / norow nocol nopercent nocum
out = outfreq(rename=count=finalinsurance drop=percent);
run; 

proc sort data=raw_data; 
	by id; 
run;
proc sort data=outfreq; 
	by id descending finalinsurance; 
run;
proc sort data=outfreq(keep=id insurance) 
	nodupkey 
	out=mostfreq(rename=insurance=finalinsurance); 
	by id; 
run;

data want;
	merge
		raw_data(in=a)
		mostfreq(in=b) 
	;
	by id;
	if a;
run;

-unison

Diosfreak · Posted 11-07-2019 01:02 PM

This worked great! Thank you very much for the prompt solution.

mkeintz · Posted 10-31-2019 08:26 PM

You first need to first get frequencies of INSURANCE, by ID, for records with gap between -5 and 5:

data have;
  input id     gap    insurance :$5. ;
datalines;
1    -5      one
1     2      one
1     9      two
2     1      three
2    10      one 
3     5      four
3     4      four
3     2      one
3    -3      two
run;

proc summary data=have (where=(gap between -5 and 5)) order=freq noprint nway; 
  by id;
  class insurance ;
  output out=t;
run;

The PROC SUMMARY will generate many requested statistics (just frequencies in this case) of every level of the class variable (insurance). And the "by id" tells SAS to do this separately for each ID (the data are assumed to be sorted by ID). The "order=freq" options tells proc summary not to output (to data set T) INSURANCE frequencies alphabetically, but in descending frequency order (again within each level of the ID variable). The NWAY option says don't output global frequency.

Now you need a way to combine this with the original dataset HAVE:

data want (drop=_ins);
  retain _ins '       ';
  set t (keep=id insurance)
      have (in=inhave);
  by id;
  if first.id=1 then _ins=insurance;
  else insurnace=_ins;
  if inhave=1;
run;

The program, for each ID, reads all the records in T (sorted by descending frequency), followed by all the records in HAVE. The "BY ID" statement also tells SAS to make temporary dummy variables FIRST.ID and LAST.ID telling you whether the record-in-hand is the first record for an ID (i.e. the first, highest-frequency, record in T, has first.id=1). The last.id=1 would be the final record for an ID in HAVE. They are zero otherwise.

So when the first, highest-frequency, record is in hand copy the INSURANCE value to the retained variable _INS. Otherwise copy _INS back to INSURANCE. Finally the subsetting IF (if inhave=1) tells SAS to keep only the observations starting out in HAVE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Diosfreak · Posted 11-07-2019 01:05 PM

This worked as well. THank you very much.

hashman · Posted 11-01-2019 02:05 PM

@Diosfreak:

Using the power of the hash object, this can be done in a single step:

data have ;                                                                                                                             
  input id gap insurance :$5. ;                                                                                                         
  cards ;                                                                                                                               
1    -5      one                                                                                                                        
1     2      one                                                                                                                        
1     9      two                                                                                                                        
2     1      three                                                                                                                      
2    10      one                                                                                                                        
3     5      four                                                                                                                       
3     4      four                                                                                                                       
3     2      one                                                                                                                        
3    -3      two                                                                                                                        
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h() ;                                                                                                                      
    dcl hiter i ("h") ;                                                                                                                 
    h.definekey  ("insurance") ;                                                                                                        
    h.definedata ("insurance", "_f") ;                                                                                                  
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  do _n_ = 1 by 1 until (last.id) ;                                                                                                     
    set have ;                                                                                                                          
    by id ;                                                                                                                             
    if not (-5 <= gap <= 5) then continue ;                                                                                             
    if h.find() ne 0 then _f = 1 ;                                                                                                      
    else                  _f + 1 ;                                                                                                      
    h.replace() ;                                                                                                                       
  end ;                                                                                                                                 
  do while (i.next() = 0) ;                                                                                                             
    if _f <= _fmax then continue ;                                                                                                      
    _fmax = _f ;                                                                                                                        
    finalinsurance = insurance ;                                                                                                        
  end ;                                                                                                                                 
  do _n_ = 1 to _n_ ;                                                                                                                   
    set have ;                                                                                                                          
    output ;                                                                                                                            
  end ;                                                                                                                                 
  h.clear() ;                                                                                                                           
run ;

How it works:

1. The next by-group by ID is read twice.

2. On the first pass, only the records with gap between -5 and 5 are considered.

3. If the insurance key-value not found in the hash table H (created beforehand at _N_=1), it is inserted into the table with _F=1. Otherwise, 1 is added to the table value of _F for this key-value. So, by the end of the by group, the table contains the unique values of insurance and corresponding frequencies _F.

4. After the by-group is this processed, the table is scanned using the hash iterator I to find the value of insurance with the highest frequency assigned to variable finalinsurance.

5. The same by-group (i.e. with the same ID value) is read again, and each record is written out with the value of finalinsurance attached.

6. The hash table is cleared of all the items, and the process is repeated from #1.

The program above assumes that the input file is sorted by ID; however it can be rewritten even without this assumption. For those learning programming with the SAS hash object it can be a nice exercise.

Kind regards

Paul D.

How to i put the most common input in a new variable

Re: How to i put the most common input in a new variable

How do i put most common input in a new variable

Re: How do i put most common input in a new variable

Re: How do i put most common input in a new variable

Re: How do i put most common input in a new variable

Re: How do i put most common input in a new variable

Re: How to i put the most common input in a new variable

Re: How to i put the most common input in a new variable

Re: How to i put the most common input in a new variable

Re: How to i put the most common input in a new variable

Re: How to i put the most common input in a new variable

Classroom Training Available!