BookmarkSubscribeRSS Feed
France
Quartz | Level 8

Dear all,

 

how can I keep a group of data if one of them satisfy the requirement?

for example 

table 1

company_name, country

apple,US

apple, 

apple,UK

tesco,UK

sony,Japan 

 

I would like to keep all rows relevant with company 'apple' and 'Tesco' if a country recorded with them is UK

 

table 2

company_name, country

apple,US

apple, 

apple,UK

tesco,UK

 

could you please give me some suggestion about this? 

thanks in advance.

 

9 REPLIES 9
novinosrin
Tourmaline | Level 20

Assuming I understand your requirement-->

 


data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;
proc sql;
create table want(drop=t) as
select *,max(country='UK' and company_name in ('apple','tesco')) as t
from have
having t and company_name in ('apple','tesco');
quit;
France
Quartz | Level 8

Dear novinosrin,

 

Thank you for your advice.

 

However, my sample covers thousands of companies. 

 

besides, I would like to explain my requirements.

 

I expect to include all companies which are relevant with 'GB' country. and there are two types of revelations.

 

the first is that the 'company_name' is directly related to 'GB' country, such as 

company_name, country

Tesco, GB

 

I would like to include this row in the new dataset.

 

the second is, the 'company_name' is related with at least two country records(i.e., other country code or blank), and one of them is 'GB' country, for example,

company_name, country

Tesco,GB

Tesco,US

Tesco, 

 

I would like to include all these three rows in the new dataset because the company names in the second and third row are as same as the company name in the first row which recorded with 'GB' country.

 

Could you please give me suggestions based on this requirement?

many thanks in advance.

novinosrin
Tourmaline | Level 20

Hi @France  Ok, try this

 

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;

proc sql;
create table want(drop=t) as
select *,count(distinct country)>1 and sum(country='UK')>=1 as t
from have
group by company_name
having t or country='UK';
quit;

Logic:

1. check whether each company whether is present n the UK and is present in any other country

2. If 1 is true,output that

3. If the company is only present in UK and not in any other country, output that too 

 

If the above is correct, it's still piece of cake

novinosrin
Tourmaline | Level 20

Or Datastep

 


data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;
proc sort data=have;
by company_name;
run;

data want;
do until(last.company_name);
set have;
by company_name ;
if country='UK' then f=1;
end;
do until(last.company_name);
set have;
by company_name ;
if  f then output;
end;
drop f;
run;
novinosrin
Tourmaline | Level 20

Even simpler

 

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;
proc sort data=have;
by company_name;
run;

data want;
merge have(in=a where=(country='UK')) have(in=b);
by company_name;
if a and b ;
run;

or a join

 

proc sql;
create table want as
select a.*
from have a inner join  have(where=(country='UK')) b
on a.company_name=b.company_name;
quit;
mkeintz
PROC Star

If your data are already sorted by company_name (but the UK entry appears randomly within the duplicate company_name group, then (note I've sorted the sample data):

 

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
sony,Japan 
tesco,UK
;

data want;
  merge have (where=(country='UK') in=ingb) have;
  by company_name;
  if ingb;
run;

 

The MERGE does many-to-one matches.  So the where=(country='UK') single entry will match all entry for the same company_name.  And if there are multiple UK entries (say 2) for a company, then the last 2nd such entry will be matched against the 2nd and all subsequent entries for that company.

 

Also the merge statement will overwrite value in the left entry with values from the right entry.  So the UK values will be overwritten by the values coming from the right-hand have records.

 

Now if the data are not sorted by company_name, and sorting is expensive, then (date here is unsorted):

 

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
tesco,UK
sony,Japan 
apple,UK
;

data want;
  set have;
  if _n_=1 then do;
    declare hash h (dataset:"have (where=(country='UK'))");
	  h.definekey('company_name');
	  h.definedone();
  end;
  if h.check()=0;
run;

 

This program builds a hash object H, keyed on company_name (meaning you can do a table lookup by company_name).  But the only company_name values in H are those with country='UK'.   Note, the H object is retained across observations, so it is instantiated only once, namely when the first iteration of the data step is executed - at which time the entire dataset is processed once for building the hash object.

 

The h.check() method returns a zero only when successful, i.e. only when the incoming company_name is found in h.

 

Note this program effectively reads the data set HAVE twice - once to build H, and once via the normal data step processing sequence.  Even so, it's cheaper than sorting then data, followed by another data step - and possibly by another sort to reproduce original order, if needed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
sony,Japan 
tesco,UK
;

proc sql;
select *
 from have
  group by company_name
   having sum(country='UK') ne 0;
  quit;
PeterClemmensen
Tourmaline | Level 20

Would definitely go with a hash solution

 

data want;
  if _n_=1 then do;
     declare hash h (dataset:"have(where=(country='UK'))");
	  h.definekey('company_name');
	  h.definedone();
  end;

  set have;
  if h.check()=0;
run;

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi France

 

As I understand you, you want all records for a given company, if the company is associated with GB.

 

With respect for the suggestions already given, I would recommend a simple two-step approace: First find the relevant companies, and then get all records for these companies.

 

The following code wil give you that. If you want to include UK as well (if no GB record exists), just replace the where statements with the one in comments.

 

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
Tesco,GB
Tesco,US
Tesco, 
apple,US
apple, 
apple,UK
sony,Japan 
tesco,UK
;

proc sql;
	create table temp as
		select distinct company_name
		from have
		where country = 'GB';
	/*	where country = 'GB' or country = 'UK';*/

	create table want as
		select *
		from have
		where company_name in (select company_name from temp);
quit;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2717 views
  • 3 likes
  • 6 in conversation