BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SarahDew
Obsidian | Level 7

From my not-so-clean data I want to count as correctly as possible the number of unique companies in the data. Each record has a company identifier number and a company name, however, the identifier can be missing. The identifier is assumed to be the most correct, but when missing I need to resort to the company name. I have managed to get the sum using many steps, but I am wondering if it can be done in less steps using "case when" or something else? 

 

Mock data:

data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP 
7 . Electronics Co.
;
run;

 

The correct count would be 4 (2 based on comp_ID and 2 based on name).

 

 

My short attempt gives the wrong count:

proc sql;
create table count as
select distinct 
(case when comp_ID is not missing then count(distinct(comp_ID))
when comp_ID is missing then count(distinct(upcase(Name))) end) as n_comp_temp
from test;
quit;

proc sql;
create table sum as
select sum(n_comp_temp) as n_comp
from count;
quit;

 

 

The long way that works fine, but can most likely be reduced to less steps:

proc sql;
create table test2 as 
	select *,
	(case when comp_ID is not missing then 1
	when comp_ID is missing then 0 
	end) as ID_present
from test;
quit;

proc sql;
create table test3 as
select distinct ID_present
, count(distinct(comp_ID)) as n_IDs
, count(distinct(upcase(Name))) as n_names
from test2
group by ID_present;
quit;

proc sql;
create table n_IDs as
select n_IDs as n_comp
from test3
where ID_present = 1;
quit;

proc sql;
create table n_names as
select n_names as n_comp
from test3
where ID_present = 0;
quit;

data count;
set n_IDs n_names;
run;

proc sql;
create table count2 as
select sum(n_comp) as n_comp
from count;
quit;

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could try:

 


proc sql;
select
	count(distinct case when missing(comp_id) then upcase(compress(name," ,.-")) else put(comp_id, best.) end) as n
from test;
quit;
PG

View solution in original post

6 REPLIES 6
novinosrin
Tourmaline | Level 20

HI @SarahDew  Would this come close to meet your expectation?

 


data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP 
7 . Electronics Co.
;
run;
/*List report*/
proc sql;
create table want_list as
select distinct comp_id,compress(upcase(name)) as name,ifc(missing(comp_ID),'name','comp_id') as based_on
from test
group by based_on;
quit;
/*Count report*/
proc sql;
create table want_count as
select ifc(missing(comp_ID),'name','comp_id') as based_on, count( distinct compress(upcase(name))) as count
from test
group by based_on;
quit;
Tom
Super User Tom
Super User

You should nest the COUNT() and CASE in the other order.

For example you could count distinct COMP_ID (which will ignore the missing values) and then count the distinct NAMEs only for records that have missing COMP_ID.

 

select
  count(distinct comp_id) as n_comp_ids
, count(distinct case when (missing(comp_id)) then name else ' ' end) as n_other_names
from test
;
mkeintz
PROC Star

What if your sample data had an 8th record, with no ID, but a name matching another record that did have a ID, such as below:

 

data test;
  input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP 
7 . Electronics Co.
8 . foodshop
run;

Then you can no longer do a simple count of distinct names for the records without id's.  You would have to reduce it be the number of distinct names also NOT found in records with COMP_ID, as here:

 

data want (keep=N_:);
  set test (where=(comp_id^=.))
      test (where=(comp_id=.))
    end=end_of_testdata;

  if _n_=1 then do;
     declare hash hid ();  /*count unique company id's */
       hid.definekey('comp_id');
       hid.definedata('comp_id');
       hid.definedone();
     declare hash hname(); /*track unique names for records having company id's*/
       hname.definekey('name');
       hname.definedata('name');
       hname.definedone();
     declare hash noid (); /*track unique names not found in above */
       noid.definekey('name');
       noid.definedata('name');
       noid.definedone();
  end;
  name=upcase(name);
  if comp_id^=. then do;
     hid.replace();
     hname.replace();
  end;
  else if hname.check()^=0 then noid.replace();
  if end_of_testdata ;
  n_compid=hid.num_items;
  n_other=noid.num_items;
run;
 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SarahDew
Obsidian | Level 7

Thanks for pointing out this possible issue. I had a look at the data and indeed I can see some issues there too. I will look into it to see if I can further improve the count with your method.

PGStats
Opal | Level 21

You could try:

 


proc sql;
select
	count(distinct case when missing(comp_id) then upcase(compress(name," ,.-")) else put(comp_id, best.) end) as n
from test;
quit;
PG
SarahDew
Obsidian | Level 7
This seems to do exactly as asked and in the most concise way. I like it. Thanks

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 5422 views
  • 3 likes
  • 5 in conversation