My dataset looks like this
id gap insurance
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one
3 5 four
3 4 four
3 2 one
3 -3 two
What i want to do is the following, i want a new variable created that will give me the most occuring insurance if the gap is between -5 to 5 days.
id-Unique id of the individual
gap-gap in days
insurance-the different types of insurance
So my final result should look like
id gap insurance finalinsurance
1 -5 one one
1 2 one one
1 9 two one
2 1 three three
2 10 one three
3 5 four four
3 4 four four
3 2 one four
3 -3 two four
some more insight on the result:
ID 1 has 2 insurance that falls within my -5 to 5 days gap so i take a count of it, and i pick the most common one. In this case it is one so i populate the new variable with this answer.
ID 2 only one met my criteria so i went with the insurance , which is three.
ID 3, they all fall within my gap, so i take a count of the insurance input and go with the most common one, in this case insurance type four is the most common one.
Please any assistance will be appreciated. Thank you.
Give this a shot.
1. Read in "raw_data" that you supplied.
2. Create a frequency table on id and insurance- call it "outfreq".
(be sure to filter raw_data down to observations with gaps between -5 and 5)
3. Sort raw_data for later merge.
4. Sort "outfreq" so that most frequent by id is on top.
5. De-duplicate "outfreq" - call it "mostfreq".
6. Merge "mostfreq" with raw_data (like a left join).
data raw_data;
input id gap insurance $;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one
3 5 four
3 4 four
3 2 one
3 -3 two
;
run;
proc freq data = raw_data(where=(gap ge -5 and gap le 5)) noprint;
tables id*insurance / norow nocol nopercent nocum
out = outfreq(rename=count=finalinsurance drop=percent);
run;
proc sort data=raw_data;
by id;
run;
proc sort data=outfreq;
by id descending finalinsurance;
run;
proc sort data=outfreq(keep=id insurance)
nodupkey
out=mostfreq(rename=insurance=finalinsurance);
by id;
run;
data want;
merge
raw_data(in=a)
mostfreq(in=b)
;
by id;
if a;
run;
-unison
My dataset looks like this
id gap insurance
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one
3 5 four
3 4 four
3 2 one
3 -3 two
What i want to do is the following, i want a new variable created that will give me the most occuring insurance if the gap is between -5 to 5 days.
id-Unique id of the individual
gap-gap in days
insurance-the different types of insurance
So my final result should look like
id gap insurance finalinsurance
1 -5 one one
1 2 one one
1 9 two one
2 1 three three
2 10 one three
3 5 four four
3 4 four four
3 2 one four
3 -3 two four
some more insight on the result:
ID 1 has 2 insurance that falls within my -5 to 5 days gap so i take a count of it, and i pick the most common one. In this case it is one so i populate the new variable with this answer.
ID 2 only one met my criteria so i went with the insurance , which is three.
ID 3, they all fall within my gap, so i take a count of the insurance input and go with the most common one, in this case insurance type four is the most common one.
Please any assistance will be appreciated. Thank you.
I suggest you remove this topic, or mark it as solved, since youi already have posted it in a more appropriate forum group.
Yes thanks! Its my first time using this forum, i was not sure where to paste my question. Thanks again.
Thank you. I wanted to delete one, but I couldn't figure out how to..im new here! ><. Thanks again.
Give this a shot.
1. Read in "raw_data" that you supplied.
2. Create a frequency table on id and insurance- call it "outfreq".
(be sure to filter raw_data down to observations with gaps between -5 and 5)
3. Sort raw_data for later merge.
4. Sort "outfreq" so that most frequent by id is on top.
5. De-duplicate "outfreq" - call it "mostfreq".
6. Merge "mostfreq" with raw_data (like a left join).
data raw_data;
input id gap insurance $;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one
3 5 four
3 4 four
3 2 one
3 -3 two
;
run;
proc freq data = raw_data(where=(gap ge -5 and gap le 5)) noprint;
tables id*insurance / norow nocol nopercent nocum
out = outfreq(rename=count=finalinsurance drop=percent);
run;
proc sort data=raw_data;
by id;
run;
proc sort data=outfreq;
by id descending finalinsurance;
run;
proc sort data=outfreq(keep=id insurance)
nodupkey
out=mostfreq(rename=insurance=finalinsurance);
by id;
run;
data want;
merge
raw_data(in=a)
mostfreq(in=b)
;
by id;
if a;
run;
-unison
This worked great! Thank you very much for the prompt solution.
You first need to first get frequencies of INSURANCE, by ID, for records with gap between -5 and 5:
data have;
input id gap insurance :$5. ;
datalines;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one
3 5 four
3 4 four
3 2 one
3 -3 two
run;
proc summary data=have (where=(gap between -5 and 5)) order=freq noprint nway;
by id;
class insurance ;
output out=t;
run;
The PROC SUMMARY will generate many requested statistics (just frequencies in this case) of every level of the class variable (insurance). And the "by id" tells SAS to do this separately for each ID (the data are assumed to be sorted by ID). The "order=freq" options tells proc summary not to output (to data set T) INSURANCE frequencies alphabetically, but in descending frequency order (again within each level of the ID variable). The NWAY option says don't output global frequency.
Now you need a way to combine this with the original dataset HAVE:
data want (drop=_ins);
retain _ins ' ';
set t (keep=id insurance)
have (in=inhave);
by id;
if first.id=1 then _ins=insurance;
else insurnace=_ins;
if inhave=1;
run;
The program, for each ID, reads all the records in T (sorted by descending frequency), followed by all the records in HAVE. The "BY ID" statement also tells SAS to make temporary dummy variables FIRST.ID and LAST.ID telling you whether the record-in-hand is the first record for an ID (i.e. the first, highest-frequency, record in T, has first.id=1). The last.id=1 would be the final record for an ID in HAVE. They are zero otherwise.
So when the first, highest-frequency, record is in hand copy the INSURANCE value to the retained variable _INS. Otherwise copy _INS back to INSURANCE. Finally the subsetting IF (if inhave=1) tells SAS to keep only the observations starting out in HAVE.
This worked as well. THank you very much.
Using the power of the hash object, this can be done in a single step:
data have ;
input id gap insurance :$5. ;
cards ;
1 -5 one
1 2 one
1 9 two
2 1 three
2 10 one
3 5 four
3 4 four
3 2 one
3 -3 two
run ;
data want (drop = _:) ;
if _n_ = 1 then do ;
dcl hash h() ;
dcl hiter i ("h") ;
h.definekey ("insurance") ;
h.definedata ("insurance", "_f") ;
h.definedone () ;
end ;
do _n_ = 1 by 1 until (last.id) ;
set have ;
by id ;
if not (-5 <= gap <= 5) then continue ;
if h.find() ne 0 then _f = 1 ;
else _f + 1 ;
h.replace() ;
end ;
do while (i.next() = 0) ;
if _f <= _fmax then continue ;
_fmax = _f ;
finalinsurance = insurance ;
end ;
do _n_ = 1 to _n_ ;
set have ;
output ;
end ;
h.clear() ;
run ;
How it works:
1. The next by-group by ID is read twice.
2. On the first pass, only the records with gap between -5 and 5 are considered.
3. If the insurance key-value not found in the hash table H (created beforehand at _N_=1), it is inserted into the table with _F=1. Otherwise, 1 is added to the table value of _F for this key-value. So, by the end of the by group, the table contains the unique values of insurance and corresponding frequencies _F.
4. After the by-group is this processed, the table is scanned using the hash iterator I to find the value of insurance with the highest frequency assigned to variable finalinsurance.
5. The same by-group (i.e. with the same ID value) is read again, and each record is written out with the value of finalinsurance attached.
6. The hash table is cleared of all the items, and the process is repeated from #1.
The program above assumes that the input file is sorted by ID; however it can be rewritten even without this assumption. For those learning programming with the SAS hash object it can be a nice exercise.
Kind regards
Paul D.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.