BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Diosfreak
Fluorite | Level 6

My dataset looks like this

 

id     gap    insurance

1      -5      one

1      2       one

1     9        two

2     1        three

2    10      one 

3     5       four

3     4       four

3     2      one

3     -3       two

 

What i want to do is the following, i want a new variable created that will give me the most occuring insurance if the gap is between -5 to 5 days. 

id-Unique id of the individual

gap-gap in days

insurance-the different types of insurance

 

So my final result should look like 

 

 

 

id     gap    insurance   finalinsurance

1      -5      one              one

1      2       one              one

1     9        two               one

2     1        three            three

2    10      one               three

3     5       four              four

3     4       four              four

3     2      one               four

3     -3       two             four

 

some more insight on the result:

ID 1 has 2 insurance that falls within my -5 to 5 days gap so i take a count of it, and i pick the most common one. In this case it is one so i populate the new variable with this answer.

ID 2 only one met my criteria so i went with the insurance , which is three.

ID 3, they all fall within my gap, so i take a count of the insurance input and go with the most common one, in this case insurance type four is the most common one.

 

Please any assistance will be appreciated. Thank you. 



1 ACCEPTED SOLUTION

Accepted Solutions
unison
Lapis Lazuli | Level 10

Give this a shot.

1. Read in "raw_data" that you supplied.

2. Create a frequency table on id and insurance- call it "outfreq". 

     (be sure to filter raw_data down to observations with gaps between -5 and 5) 

3. Sort raw_data for later merge.

4. Sort "outfreq" so that most frequent by id is on top.

5. De-duplicate "outfreq" - call it "mostfreq".

6. Merge "mostfreq" with raw_data (like a left join).

 

data raw_data;
input id gap insurance $;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one 
3 5 four
3 4 four
3 2 one
3 -3 two
;
run;

proc freq data = raw_data(where=(gap ge -5 and gap le 5)) noprint;
tables id*insurance / norow nocol nopercent nocum
out = outfreq(rename=count=finalinsurance drop=percent);
run; 

proc sort data=raw_data; 
	by id; 
run;
proc sort data=outfreq; 
	by id descending finalinsurance; 
run;
proc sort data=outfreq(keep=id insurance) 
	nodupkey 
	out=mostfreq(rename=insurance=finalinsurance); 
	by id; 
run;

data want;
	merge
		raw_data(in=a)
		mostfreq(in=b) 
	;
	by id;
	if a;
run;

-unison

-unison

View solution in original post

10 REPLIES 10
Diosfreak
Fluorite | Level 6

My dataset looks like this

 

id     gap    insurance

1      -5      one

1      2       one

1     9        two

2     1        three

2    10      one 

3     5       four

3     4       four

3     2      one

3     -3       two

 

What i want to do is the following, i want a new variable created that will give me the most occuring insurance if the gap is between -5 to 5 days. 

id-Unique id of the individual

gap-gap in days

insurance-the different types of insurance

 

So my final result should look like 

 

 

 

id     gap    insurance   finalinsurance

1      -5      one              one

1      2       one              one

1     9        two               one

2     1        three            three

2    10      one               three

3     5       four              four

3     4       four              four

3     2      one               four

3     -3       two             four

 

some more insight on the result:

ID 1 has 2 insurance that falls within my -5 to 5 days gap so i take a count of it, and i pick the most common one. In this case it is one so i populate the new variable with this answer.

ID 2 only one met my criteria so i went with the insurance , which is three.

ID 3, they all fall within my gap, so i take a count of the insurance input and go with the most common one, in this case insurance type four is the most common one.

 

Please any assistance will be appreciated. Thank you. 

mkeintz
PROC Star

I suggest you remove this topic, or mark it as solved, since youi already have posted it in a more appropriate forum group.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Diosfreak
Fluorite | Level 6

Yes thanks! Its my first time using this forum, i was not sure where to paste my question. Thanks again. 

Diosfreak
Fluorite | Level 6

Thank you. I wanted to delete one, but I couldn't figure out how to..im new here! ><. Thanks again. 

unison
Lapis Lazuli | Level 10

Give this a shot.

1. Read in "raw_data" that you supplied.

2. Create a frequency table on id and insurance- call it "outfreq". 

     (be sure to filter raw_data down to observations with gaps between -5 and 5) 

3. Sort raw_data for later merge.

4. Sort "outfreq" so that most frequent by id is on top.

5. De-duplicate "outfreq" - call it "mostfreq".

6. Merge "mostfreq" with raw_data (like a left join).

 

data raw_data;
input id gap insurance $;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one 
3 5 four
3 4 four
3 2 one
3 -3 two
;
run;

proc freq data = raw_data(where=(gap ge -5 and gap le 5)) noprint;
tables id*insurance / norow nocol nopercent nocum
out = outfreq(rename=count=finalinsurance drop=percent);
run; 

proc sort data=raw_data; 
	by id; 
run;
proc sort data=outfreq; 
	by id descending finalinsurance; 
run;
proc sort data=outfreq(keep=id insurance) 
	nodupkey 
	out=mostfreq(rename=insurance=finalinsurance); 
	by id; 
run;

data want;
	merge
		raw_data(in=a)
		mostfreq(in=b) 
	;
	by id;
	if a;
run;

-unison

-unison
Diosfreak
Fluorite | Level 6

This worked great! Thank you very much for the prompt solution.

mkeintz
PROC Star

You first need to first get frequencies of INSURANCE, by ID, for records with gap between -5 and 5:

 

 

data have;
  input id     gap    insurance :$5. ;
datalines;
1    -5      one
1     2      one
1     9      two
2     1      three
2    10      one 
3     5      four
3     4      four
3     2      one
3    -3      two
run;

proc summary data=have (where=(gap between -5 and 5)) order=freq noprint nway; 
  by id;
  class insurance ;
  output out=t;
run;

The PROC SUMMARY will generate many requested statistics (just frequencies in this case) of every level of the class variable (insurance).  And the "by id" tells SAS to do this separately for each ID (the data are assumed to be sorted by ID).  The "order=freq" options tells proc summary not to output (to data set T)  INSURANCE frequencies alphabetically, but in descending frequency order (again within each level of the ID variable).  The NWAY option says don't output global frequency.

 

Now you need a way to combine this with the original dataset HAVE:

 

data want (drop=_ins);
  retain _ins '       ';
  set t (keep=id insurance)
      have (in=inhave);
  by id;
  if first.id=1 then _ins=insurance;
  else insurnace=_ins;
  if inhave=1;
run;

The program, for each ID, reads all the records in T  (sorted by descending frequency), followed by all the records in HAVE.  The "BY ID" statement also tells SAS to make temporary dummy variables FIRST.ID and LAST.ID telling you whether the record-in-hand is the first record for an ID (i.e. the first, highest-frequency, record in T, has first.id=1).  The last.id=1 would be the final record for an ID in HAVE.  They are zero otherwise.

 

So when the first, highest-frequency, record is in hand copy the INSURANCE value to the retained variable _INS.  Otherwise copy _INS back to INSURANCE.  Finally the subsetting IF  (if inhave=1) tells SAS to keep only the observations starting out in HAVE.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Diosfreak
Fluorite | Level 6

This worked as well. THank you very much.

hashman
Ammonite | Level 13

@Diosfreak:

Using the power of the hash object, this can be done in a single step:

data have ;                                                                                                                             
  input id gap insurance :$5. ;                                                                                                         
  cards ;                                                                                                                               
1    -5      one                                                                                                                        
1     2      one                                                                                                                        
1     9      two                                                                                                                        
2     1      three                                                                                                                      
2    10      one                                                                                                                        
3     5      four                                                                                                                       
3     4      four                                                                                                                       
3     2      one                                                                                                                        
3    -3      two                                                                                                                        
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h() ;                                                                                                                      
    dcl hiter i ("h") ;                                                                                                                 
    h.definekey  ("insurance") ;                                                                                                        
    h.definedata ("insurance", "_f") ;                                                                                                  
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  do _n_ = 1 by 1 until (last.id) ;                                                                                                     
    set have ;                                                                                                                          
    by id ;                                                                                                                             
    if not (-5 <= gap <= 5) then continue ;                                                                                             
    if h.find() ne 0 then _f = 1 ;                                                                                                      
    else                  _f + 1 ;                                                                                                      
    h.replace() ;                                                                                                                       
  end ;                                                                                                                                 
  do while (i.next() = 0) ;                                                                                                             
    if _f <= _fmax then continue ;                                                                                                      
    _fmax = _f ;                                                                                                                        
    finalinsurance = insurance ;                                                                                                        
  end ;                                                                                                                                 
  do _n_ = 1 to _n_ ;                                                                                                                   
    set have ;                                                                                                                          
    output ;                                                                                                                            
  end ;                                                                                                                                 
  h.clear() ;                                                                                                                           
run ;                                     

How it works:

1. The next by-group by ID is read twice.

2. On the first pass, only the records with gap between -5 and 5 are considered.

3. If the insurance key-value not found in the hash table H (created beforehand at _N_=1), it is inserted into the table with _F=1. Otherwise, 1 is added to the table value of _F for this key-value. So, by the end of the by group, the table contains the unique values of insurance and corresponding frequencies _F.

4. After the by-group is this processed, the table is scanned using the hash iterator I to find the value of insurance with the highest frequency assigned to variable finalinsurance.

5. The same by-group (i.e. with the same ID value) is read again, and each record is written out with the value of finalinsurance attached.

6. The hash table is cleared of all the items, and the process is repeated from #1.

 

The program above assumes that the input file is sorted by ID; however it can be rewritten even without this assumption. For those learning programming with the SAS hash object it can be a nice exercise.

 

Kind regards

Paul D.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 854 views
  • 3 likes
  • 5 in conversation