dropping data with low freq

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 10
Accepted Solution

dropping data with low freq

Hi all,

I have a dataset from which I wish to drop (for subsequent analysis) those values of the variables with only one observation, that is with a Freq count of 1. In following table there is a similar example. The values in bold are those I do not want to consider in my analysis. I appreciate any suggestion. Thanks

 

Trait Var1 Var2 ...Var300
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135


Accepted Solutions
Solution
‎04-04-2016 07:36 AM
Super Contributor
Posts: 305

Re: dropping data with low freq

Hello,

 

It need a little bit of refinement (not creating the macro variable allvars manually), but may be a solution:

 


data have;
infile datalines missover;
input Trait $ Var1 :$7. Var2 :$7. Var300 :$7.;
datalines;
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
run;

%let allvars=Var1 Var2 Var300;

%macro h;

%let i=1;



%do %while (%scan(&allvars,&i) ne);

%let var=%scan(&allvars,&i);
proc sql;

create table int as
select &var, count(&var) as no
from have
group by &var
having no eq 1;

quit;

data have;

set have;

if _n_=1 then
 do;
     declare hash i(dataset:'int');
    i.definekey("&var");
    i.definedone();

 end;
 
rc=i.check();

if rc=0 then call missing(&var);

run;

%let i=%eval(&i+1);
%end;


%mend h;

%h

View solution in original post


All Replies
Solution
‎04-04-2016 07:36 AM
Super Contributor
Posts: 305

Re: dropping data with low freq

Hello,

 

It need a little bit of refinement (not creating the macro variable allvars manually), but may be a solution:

 


data have;
infile datalines missover;
input Trait $ Var1 :$7. Var2 :$7. Var300 :$7.;
datalines;
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
run;

%let allvars=Var1 Var2 Var300;

%macro h;

%let i=1;



%do %while (%scan(&allvars,&i) ne);

%let var=%scan(&allvars,&i);
proc sql;

create table int as
select &var, count(&var) as no
from have
group by &var
having no eq 1;

quit;

data have;

set have;

if _n_=1 then
 do;
     declare hash i(dataset:'int');
    i.definekey("&var");
    i.definedone();

 end;
 
rc=i.check();

if rc=0 then call missing(&var);

run;

%let i=%eval(&i+1);
%end;


%mend h;

%h

Occasional Contributor
Posts: 10

Re: dropping data with low freq

Hi Loko, thank you for your macro, it worked well, one more thing, if I want to remove those variable with only one class, e.g. suppose a variable (look below) with only two classes, this one is going to stay with an unique class (170/170) after running the macro, how I could drop those variables from my data using the same macro?

 

Var4
170/170
170/170
170/170
178/178
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170

Grand Advisor
Posts: 17,325

Re: dropping data with low freq

Look at proc sort with nounique option. 

You'll have to run multiple iterations for each variable so maintaining a macro loop is probably still required. 

Trusted Advisor
Posts: 1,114

Re: dropping data with low freq

Hi @Fersal,

 

First of all, you should be sure that your analysis will still be valid after excluding values just because they happen to occur only once within a trait.

 

If your dataset contains only 7-8 observations per trait (as your sample data), isn't it quite likely that some valid values (in some of the 300 variables) will occur only once? On the other hand, if your real dataset has thousands of observations, some "improbable" values might occur 2, 3 or 4 times, or only once, just by chance.

 

What I'm trying to say is that the rule "exclude values with frequency count 1" is not necessarily sensible.

 

Maybe you can specify exclusion criteria based on the values themselves and apply these by means of (possibly user-defined) functions or formats.

 

Also, it could simplify the code if you transposed the data from wide to long format (esp. if the 300 variables have identical types and lengths).

Occasional Contributor
Posts: 10

Re: dropping data with low freq

 

Hi FreelanceReinhard,

Thanks for your comments, actually the original dataset is pretty big, and the number of observation is much higher than number of variables (not like the example I showed). I have had a convergence problem with this data when I try to fit them to a logit model (even after apply the Firth option), so there is not likelihood estimates for these variables, for that reason I want to know what happen if I remove those classes.

 

Regards

 

 

Grand Advisor
Posts: 17,325

Re: dropping data with low freq

I thought it was 0 cells that caused the issue, not 1's?

 

Respected Advisor
Posts: 4,606

Re: dropping data with low freq

If you still want to do it:

 

data have;
infile datalines truncover;
input Trait (Var1 - Var3) ($);
datalines;
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
;

data list;
set have;
obs = _n_;
array a{*} var:;
do i = 1 to dim(a);
    if not missing(a{i}) then do;
        value = a{i};
        output;
        end; 
    end;
keep Trait obs value;
run;

proc sql;
create table newList as
select Trait, obs, value
from list
group by Trait, value
having count(*) > 1
order by Trait, obs, value;
quit;

proc transpose data=newList out=want(drop=_name_ obs) prefix=var;
var value;
by Trait obs;
run;
PG
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 429 views
  • 2 likes
  • 5 in conversation