Re: keep a rows which belongs to a group of data if one of them satisf...

France · Posted 01-04-2019 12:15 PM

Dear all,

how can I keep a group of data if one of them satisfy the requirement?

for example

table 1

company_name, country

apple,US

apple,

apple,UK

tesco,UK

sony,Japan

I would like to keep all rows relevant with company 'apple' and 'Tesco' if a country recorded with them is UK

table 2

company_name, country

apple,US

apple,

apple,UK

tesco,UK

could you please give me some suggestion about this?

thanks in advance.

novinosrin · Posted 01-04-2019 12:27 PM

Assuming I understand your requirement-->


data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;
proc sql;
create table want(drop=t) as
select *,max(country='UK' and company_name in ('apple','tesco')) as t
from have
having t and company_name in ('apple','tesco');
quit;

France · Posted 01-04-2019 01:45 PM

Dear novinosrin,

Thank you for your advice.

However, my sample covers thousands of companies.

besides, I would like to explain my requirements.

I expect to include all companies which are relevant with 'GB' country. and there are two types of revelations.

the first is that the 'company_name' is directly related to 'GB' country, such as

company_name, country

Tesco, GB

I would like to include this row in the new dataset.

the second is, the 'company_name' is related with at least two country records(i.e., other country code or blank), and one of them is 'GB' country, for example,

company_name, country

Tesco,GB

Tesco,US

Tesco,

I would like to include all these three rows in the new dataset because the company names in the second and third row are as same as the company name in the first row which recorded with 'GB' country.

Could you please give me suggestions based on this requirement?

many thanks in advance.

novinosrin · Posted 01-04-2019 02:00 PM

Hi @France Ok, try this

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;

proc sql;
create table want(drop=t) as
select *,count(distinct country)>1 and sum(country='UK')>=1 as t
from have
group by company_name
having t or country='UK';
quit;

Logic:

1. check whether each company whether is present n the UK and is present in any other country

2. If 1 is true,output that

3. If the company is only present in UK and not in any other country, output that too

If the above is correct, it's still piece of cake

novinosrin · Posted 01-04-2019 02:21 PM

Or Datastep


data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;
proc sort data=have;
by company_name;
run;

data want;
do until(last.company_name);
set have;
by company_name ;
if country='UK' then f=1;
end;
do until(last.company_name);
set have;
by company_name ;
if  f then output;
end;
drop f;
run;

novinosrin · Posted 01-04-2019 02:32 PM

Even simpler

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
tesco,UK
sony,Japan 
;
proc sort data=have;
by company_name;
run;

data want;
merge have(in=a where=(country='UK')) have(in=b);
by company_name;
if a and b ;
run;

or a join

proc sql;
create table want as
select a.*
from have a inner join  have(where=(country='UK')) b
on a.company_name=b.company_name;
quit;

mkeintz · Posted 01-04-2019 08:31 PM

If your data are already sorted by company_name (but the UK entry appears randomly within the duplicate company_name group, then (note I've sorted the sample data):

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
sony,Japan 
tesco,UK
;

data want;
  merge have (where=(country='UK') in=ingb) have;
  by company_name;
  if ingb;
run;

The MERGE does many-to-one matches. So the where=(country='UK') single entry will match all entry for the same company_name. And if there are multiple UK entries (say 2) for a company, then the last 2nd such entry will be matched against the 2nd and all subsequent entries for that company.

Also the merge statement will overwrite value in the left entry with values from the right entry. So the UK values will be overwritten by the values coming from the right-hand have records.

Now if the data are not sorted by company_name, and sorting is expensive, then (date here is unsorted):

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
tesco,UK
sony,Japan 
apple,UK
;

data want;
  set have;
  if _n_=1 then do;
    declare hash h (dataset:"have (where=(country='UK'))");
	  h.definekey('company_name');
	  h.definedone();
  end;
  if h.check()=0;
run;

This program builds a hash object H, keyed on company_name (meaning you can do a table lookup by company_name). But the only company_name values in H are those with country='UK'. Note, the H object is retained across observations, so it is instantiated only once, namely when the first iteration of the data step is executed - at which time the entire dataset is processed once for building the hash object.

The h.check() method returns a zero only when successful, i.e. only when the incoming company_name is found in h.

Note this program effectively reads the data set HAVE twice - once to build H, and once via the normal data step processing sequence. Even so, it's cheaper than sorting then data, followed by another data step - and possibly by another sort to reproduce original order, if needed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 01-05-2019 05:26 AM

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
apple,US
apple, 
apple,UK
sony,Japan 
tesco,UK
;

proc sql;
select *
 from have
  group by company_name
   having sum(country='UK') ne 0;
  quit;

PeterClemmensen · Posted 01-05-2019 07:23 AM

Would definitely go with a hash solution

data want;
  if _n_=1 then do;
     declare hash h (dataset:"have(where=(country='UK'))");
	  h.definekey('company_name');
	  h.definedone();
  end;

  set have;
  if h.check()=0;
run;

The DATA to DATA Step Macro
Blog: SASnrd

ErikLund_Jensen · Posted 01-05-2019 09:07 AM

Hi France

As I understand you, you want all records for a given company, if the company is associated with GB.

With respect for the suggestions already given, I would recommend a simple two-step approace: First find the relevant companies, and then get all records for these companies.

The following code wil give you that. If you want to include UK as well (if no GB record exists), just replace the where statements with the one in comments.

data have;
infile cards dsd truncover;
input company_name $ country $;
cards;
Tesco,GB
Tesco,US
Tesco, 
apple,US
apple, 
apple,UK
sony,Japan 
tesco,UK
;

proc sql;
	create table temp as
		select distinct company_name
		from have
		where country = 'GB';
	/*	where country = 'GB' or country = 'UK';*/

	create table want as
		select *
		from have
		where company_name in (select company_name from temp);
quit;

keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Re: keep a rows which belongs to a group of data if one of them satisfy the requirement

Classroom Training Available!