SAS Programming

SarahDew · Posted 05-04-2020 08:48 AM

From my not-so-clean data I want to count as correctly as possible the number of unique companies in the data. Each record has a company identifier number and a company name, however, the identifier can be missing. The identifier is assumed to be the most correct, but when missing I need to resort to the company name. I have managed to get the sum using many steps, but I am wondering if it can be done in less steps using "case when" or something else?

Mock data:

data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP 
7 . Electronics Co.
;
run;

The correct count would be 4 (2 based on comp_ID and 2 based on name).

My short attempt gives the wrong count:

proc sql;
create table count as
select distinct 
(case when comp_ID is not missing then count(distinct(comp_ID))
when comp_ID is missing then count(distinct(upcase(Name))) end) as n_comp_temp
from test;
quit;

proc sql;
create table sum as
select sum(n_comp_temp) as n_comp
from count;
quit;

The long way that works fine, but can most likely be reduced to less steps:

proc sql;
create table test2 as 
	select *,
	(case when comp_ID is not missing then 1
	when comp_ID is missing then 0 
	end) as ID_present
from test;
quit;

proc sql;
create table test3 as
select distinct ID_present
, count(distinct(comp_ID)) as n_IDs
, count(distinct(upcase(Name))) as n_names
from test2
group by ID_present;
quit;

proc sql;
create table n_IDs as
select n_IDs as n_comp
from test3
where ID_present = 1;
quit;

proc sql;
create table n_names as
select n_names as n_comp
from test3
where ID_present = 0;
quit;

data count;
set n_IDs n_names;
run;

proc sql;
create table count2 as
select sum(n_comp) as n_comp
from count;
quit;

PGStats · Posted 05-05-2020 12:48 AM

You could try:


proc sql;
select
	count(distinct case when missing(comp_id) then upcase(compress(name," ,.-")) else put(comp_id, best.) end) as n
from test;
quit;

PG

View solution in original post

novinosrin · Posted 05-04-2020 09:11 AM

HI @SarahDew Would this come close to meet your expectation?


data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP 
7 . Electronics Co.
;
run;
/*List report*/
proc sql;
create table want_list as
select distinct comp_id,compress(upcase(name)) as name,ifc(missing(comp_ID),'name','comp_id') as based_on
from test
group by based_on;
quit;
/*Count report*/
proc sql;
create table want_count as
select ifc(missing(comp_ID),'name','comp_id') as based_on, count( distinct compress(upcase(name))) as count
from test
group by based_on;
quit;

Tom · Posted 05-04-2020 09:34 AM

You should nest the COUNT() and CASE in the other order.

For example you could count distinct COMP_ID (which will ignore the missing values) and then count the distinct NAMEs only for records that have missing COMP_ID.

select
  count(distinct comp_id) as n_comp_ids
, count(distinct case when (missing(comp_id)) then name else ' ' end) as n_other_names
from test
;

mkeintz · Posted 05-05-2020 12:26 AM

What if your sample data had an 8th record, with no ID, but a name matching another record that did have a ID, such as below:

data test;
  input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP 
7 . Electronics Co.
8 . foodshop
run;

Then you can no longer do a simple count of distinct names for the records without id's. You would have to reduce it be the number of distinct names ~~also~~ NOT found in records with COMP_ID, as here:

data want (keep=N_:);
  set test (where=(comp_id^=.))
      test (where=(comp_id=.))
    end=end_of_testdata;

  if _n_=1 then do;
     declare hash hid ();  /*count unique company id's */
       hid.definekey('comp_id');
       hid.definedata('comp_id');
       hid.definedone();
     declare hash hname(); /*track unique names for records having company id's*/
       hname.definekey('name');
       hname.definedata('name');
       hname.definedone();
     declare hash noid (); /*track unique names not found in above */
       noid.definekey('name');
       noid.definedata('name');
       noid.definedone();
  end;
  name=upcase(name);
  if comp_id^=. then do;
     hid.replace();
     hname.replace();
  end;
  else if hname.check()^=0 then noid.replace();
  if end_of_testdata ;
  n_compid=hid.num_items;
  n_other=noid.num_items;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SarahDew · Posted 05-05-2020 08:48 AM

Thanks for pointing out this possible issue. I had a look at the data and indeed I can see some issues there too. I will look into it to see if I can further improve the count with your method.

PGStats · Posted 05-05-2020 12:48 AM

You could try:


proc sql;
select
	count(distinct case when missing(comp_id) then upcase(compress(name," ,.-")) else put(comp_id, best.) end) as n
from test;
quit;

PG

SarahDew · Posted 05-05-2020 08:50 AM

This seems to do exactly as asked and in the most concise way. I like it. Thanks

SAS Programming

Count distinct using case when

Re: Count distinct using case when

Re: Count distinct using case when

Re: Count distinct using case when

Re: Count distinct using case when

Re: Count distinct using case when

Re: Count distinct using case when

Re: Count distinct using case when

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...