- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
From my not-so-clean data I want to count as correctly as possible the number of unique companies in the data. Each record has a company identifier number and a company name, however, the identifier can be missing. The identifier is assumed to be the most correct, but when missing I need to resort to the company name. I have managed to get the sum using many steps, but I am wondering if it can be done in less steps using "case when" or something else?
Mock data:
data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP
7 . Electronics Co.
;
run;
The correct count would be 4 (2 based on comp_ID and 2 based on name).
My short attempt gives the wrong count:
proc sql;
create table count as
select distinct
(case when comp_ID is not missing then count(distinct(comp_ID))
when comp_ID is missing then count(distinct(upcase(Name))) end) as n_comp_temp
from test;
quit;
proc sql;
create table sum as
select sum(n_comp_temp) as n_comp
from count;
quit;
The long way that works fine, but can most likely be reduced to less steps:
proc sql;
create table test2 as
select *,
(case when comp_ID is not missing then 1
when comp_ID is missing then 0
end) as ID_present
from test;
quit;
proc sql;
create table test3 as
select distinct ID_present
, count(distinct(comp_ID)) as n_IDs
, count(distinct(upcase(Name))) as n_names
from test2
group by ID_present;
quit;
proc sql;
create table n_IDs as
select n_IDs as n_comp
from test3
where ID_present = 1;
quit;
proc sql;
create table n_names as
select n_names as n_comp
from test3
where ID_present = 0;
quit;
data count;
set n_IDs n_names;
run;
proc sql;
create table count2 as
select sum(n_comp) as n_comp
from count;
quit;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could try:
proc sql;
select
count(distinct case when missing(comp_id) then upcase(compress(name," ,.-")) else put(comp_id, best.) end) as n
from test;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
HI @SarahDew Would this come close to meet your expectation?
data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP
7 . Electronics Co.
;
run;
/*List report*/
proc sql;
create table want_list as
select distinct comp_id,compress(upcase(name)) as name,ifc(missing(comp_ID),'name','comp_id') as based_on
from test
group by based_on;
quit;
/*Count report*/
proc sql;
create table want_count as
select ifc(missing(comp_ID),'name','comp_id') as based_on, count( distinct compress(upcase(name))) as count
from test
group by based_on;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You should nest the COUNT() and CASE in the other order.
For example you could count distinct COMP_ID (which will ignore the missing values) and then count the distinct NAMEs only for records that have missing COMP_ID.
select
count(distinct comp_id) as n_comp_ids
, count(distinct case when (missing(comp_id)) then name else ' ' end) as n_other_names
from test
;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What if your sample data had an 8th record, with no ID, but a name matching another record that did have a ID, such as below:
data test;
input record comp_ID Name $15.;
datalines;
1 101 SPORTSSHOP
2 101 SportsShop
3 101 Sports Shop
4 102 FoodShop
5 . ClothesShop
6 . CLOTHESSHOP
7 . Electronics Co.
8 . foodshop
run;
Then you can no longer do a simple count of distinct names for the records without id's. You would have to reduce it be the number of distinct names also NOT found in records with COMP_ID, as here:
data want (keep=N_:);
set test (where=(comp_id^=.))
test (where=(comp_id=.))
end=end_of_testdata;
if _n_=1 then do;
declare hash hid (); /*count unique company id's */
hid.definekey('comp_id');
hid.definedata('comp_id');
hid.definedone();
declare hash hname(); /*track unique names for records having company id's*/
hname.definekey('name');
hname.definedata('name');
hname.definedone();
declare hash noid (); /*track unique names not found in above */
noid.definekey('name');
noid.definedata('name');
noid.definedone();
end;
name=upcase(name);
if comp_id^=. then do;
hid.replace();
hname.replace();
end;
else if hname.check()^=0 then noid.replace();
if end_of_testdata ;
n_compid=hid.num_items;
n_other=noid.num_items;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for pointing out this possible issue. I had a look at the data and indeed I can see some issues there too. I will look into it to see if I can further improve the count with your method.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could try:
proc sql;
select
count(distinct case when missing(comp_id) then upcase(compress(name," ,.-")) else put(comp_id, best.) end) as n
from test;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content