Merge Two Datasets using String Variable and Year

AbuChowdhury · Posted 02-15-2020 06:12 AM

Dear Experts,

I need to merge two datasets using the string variable (the name of the of the variable in the attached file is lead_managers_name) and year. Those names are clearly separated in the second dataset named "rank". Please note in the first dataset (named "main_dataset") that many observations have more than one names joined together. In those cases, the variable rank_lead will be the average of those names in that specific year.

Let me know if you need more information.

Best regards,

Abu

ed_sas_member · Posted 02-15-2020 08:48 AM

Hi @AbuChowdhury

Here is an attempt to do this:

/*** UNTESTED CODE ****/

data one;
	set lib.rank;
run;

data two;
	set lib.main_dataset;
run;

/* Put distinct 'units' of Name into macrovariables */
proc sql noprint;
	select distinct quote(tranwrd(tranwrd(strip(lead_managers_name),'.','\.'),' ','\s'),"'") into: units separated by "," from one;
	select count(distinct lead_managers_name) into: total_units from one;
quit;

/* Separate Names into its distinct 'units' (one unit per variable) */
data two2;
	set two;
	if lead_managers_name="Goldmen (xxx)" then lead_managers_name="A.S. Goldmen"; /* Data management*/
	id+1;
	array _name(&total_units) $ 50;
	array _unit(&total_units) $ 50 _temporary_ (&units);
	do i=1 to &total_units;
		if prxmatch(cats('/^.*(',_unit(i),').*$/'),lead_managers_name) then
		_name(i)=prxchange(cats('s/^.*(',_unit(i),').*$/$1/i'),i,lead_managers_name);
	end;
	drop i;

proc sort data=two2;
	by id lead_managers_name year;
run;
proc transpose data=two2 out=two2_tr (drop=_name_ where=(unit ne "") rename=(col1=unit));
	var _name:;
	by id lead_managers_name year;
run;

/* Merge with ranks */
proc sql;
	create table one_two as
	select b.*, a.rank_lead
	from one as a inner join two2_tr as b
	on a.lead_managers_name = b.unit and a.year=b.year;
quit;

/* Compute mean */
proc sql;
	create table want as
	select distinct lead_managers_name,year, avg(rank_lead) as rank_lead
	from one_two
	group by lead_managers_name, year;
quit;

/*final merge*/
proc sql;
	create table want_final as
	select a.*, b.rank_lead
	from two (drop=rank_lead) as a inner join want as b
	on a.lead_managers_name = b.lead_managers_name and a.year=b.year;
quit;

AbuChowdhury · Posted 02-15-2020 10:27 AM

Hi,

Thanks but I see that observations are removed when the variable lead_managers_name has the observations with multiple names together. For example, if you go to row 13 of "main_dataset", then you will see the observation for lead_managers_name variable as JP Morgan & Co IncCitigroup Global Markets IncBMO Capital Markets. Actually there are 3 names here: JP Morgan, Citigroup Global Markets Inc and BMO Capital Markets (you will see these names separately by the variable "lead_managers_name in "rank" dataset and their respective ranks by the variable "rank_lead").

By the way, why did you use Goldmen in the following datastep?

data two2;
set two;
if lead_managers_name="Goldmen (xxx)" then lead_managers_name="A.S. Goldmen"; /* Data management*/
id+1;
array _name(&total_units) $ 50;
array _unit(&total_units) $ 50 _temporary_ (&units);
do i=1 to &total_units;
if prxmatch(cats('/^.*(',_unit(i),').*$/'),lead_managers_name) then
_name(i)=prxchange(cats('s/^.*(',_unit(i),').*$/$1/i'),i,lead_managers_name);
end;
drop i;

ed_sas_member · Posted 02-15-2020 01:21 PM

Hi @AbuChowdhury

In dataset 'rank', JP Morgan is referenced as:

JP Morgan (JPM)

JP Morgan Securities Inc

However, we can find for example "JP Morgan & Co Inc" in the Main_dataset.

So there is no match possible.

I think you should first clean the dataset 'rank', by providing SAS with a list of 'suffix' to remove from the lead_managers_name (words in parenthesis, "& Co", "Inc.", ...) before running the code :

JP Morgan ~~(JPM)~~

JP Morgan Securities ~~Inc~~

See for example thread https://communities.sas.com/t5/SAS-Programming/Removing-postfix-e-g-Inc-Limited-in-corporate-names/m...

Regarding the code "if lead_managers_name="Goldmen (xxx)" then lead_managers_name="A.S. Goldmen"; /* Data management*/", it is an error (copy-paste from a previous code". You can delete it.

Best,

AbuChowdhury · Posted 02-15-2020 03:19 PM

Hi,

Thanks again. I was trying to remove the company suffix using the code in the link you provided. But those are not removing. Why can't I remove the suffix?

AbuChowdhury · Posted 02-15-2020 05:13 PM

Hi again,

Please have a look at the following code. Suffix is removed only for some companies.

data word_list;
	input word $20.;
	if findc(word,"/") then word=tranwrd(word,"/",".");
	datalines;
Inc.
Inc
& Co.
Ltd
Ltd.
& Co
& Co Inc
& Co Inc.
& Co, Inc.
& Co, Inc
& Co Ltd
& Co., Ltd
& Co Ltd.
& Co., Ltd.
& Co., Ltd.
L.P.
PLC
LP
LLC
Corp
;
run;

data one;
	input lead_managers_name $80.;
	datalines;
A. G. Becker Paribas Inc.
A. J. Michaels & Co., Ltd.
A. L. Havens Securities
A. M. Levine
A. T. Brod & Co.
A.R. Baron & Co., Inc.
A.S. Goldmen & Company
AB Capital Markets
ABD Securities
ABN AMRO Chicago Corp
ABN AMRO Incorporated
ABN AMRO Rothschild
ABN-AMRO Holding NV
Access Securities
Acciones y Valores de Mexico
Adams Harkness & Hill Inc
Adams, Cohen
Adams, James & Foor
Advest Inc
Aegis Capital
AG Edwards & Sons Inc
Agean Group Inc
AIB Capital Markets
Akroyd & Smithers Inc
Alan-Bush Brokerage
Alex Brown & Sons Inc
;
run;

proc sql noprint; create table xxx as
	select quote(trim(word),"'")
	into:word_list separated by ","
	from word_list
	order by countw(word) desc;
quit;

data want;
	set one;
	lead_managers_name = tranwrd(lead_managers_name,"/",".");
	array _list(&sqlobs) $ _temporary_ (&word_list);
	do i=1 to dim(_list);
		if find(lead_managers_name,_list(i),"i")>0 then do;
			lead_managers_name = prxchange(cats("s/(\(?\b",_list(i),"\b\)?\s*)$//i"),-1,lead_managers_name);
		end;
	end;
	drop i;
run;

ed_sas_member · Posted 02-16-2020 05:37 AM

Hi @AbuChowdhury

Here is the update code. In particular:

-> I have update the word_list with one suffix missing.

-> I have put a length of 20 to the _list array to avoid truncation (by default it is 😎 : e.g. & Co., Inc and not & Co., I

-> I have also updated the regex expression

NB: once you have a "clean" version of table 'rank' with no suffix after company names, I think you will also need add a data step to remove expressions in parenthesis.

E.g; J.P. Morgan (JPM) -> should become J.P. Morgan.

Otherwise, it will be hard to match company names with the second dataset.

Best,

data word_list;
	input word $20.;
	datalines;
Inc.
Inc
& Co.
Ltd
Ltd.
& Co., Inc
& Co., Ltd.
& Co., Ltd
& Co, Inc.
& Co, Inc
& Co., Inc.
& Co
& Co Inc
& Co Inc.
& Co Ltd
& Co Ltd.
L.P.
PLC
LP
LLC
Corp
;
run;

data one;
	input lead_managers_name $80.;
	datalines;
A. G. Becker Paribas Inc.
A. J. Michaels & Co., Ltd.
A. L. Havens Securities
A. M. Levine
A. T. Brod & Co.
A.R. Baron & Co., Inc.
A.S. Goldmen & Company
AB Capital Markets
ABD Securities
ABN AMRO Chicago Corp
ABN AMRO Incorporated
ABN AMRO Rothschild
ABN-AMRO Holding NV
Access Securities
Acciones y Valores de Mexico
Adams Harkness & Hill Inc
Adams, Cohen
Adams, James & Foor
Advest Inc
Aegis Capital
AG Edwards & Sons Inc
Agean Group Inc
AIB Capital Markets
Akroyd & Smithers Inc
Alan-Bush Brokerage
Alex Brown & Sons Inc
;
run;

proc sql noprint;
	select quote(trim(word),"'")
	into:word_list separated by ","
	from word_list
	order by length(word) desc , word desc;
quit;

data want;
	set one;
	array _list(&sqlobs) $ 20 _temporary_ (&word_list);
	do i=1 to dim(_list);
		if find(lead_managers_name,strip(_list(i)),"i")>0 then do;
			lead_managers_name = prxchange(cats("s/\s",strip(_list(i)),"\s*$//i"),-1,lead_managers_name);
			leave;
		end;
	end;
	drop i;
run;

AbuChowdhury · Posted 02-16-2020 08:56 AM

Hi,

Thanks again. Before runninf your latest code, I manually removed company name suffix. But still when I run the code for merging two datasets, I see that observations with multiple names tied together are removed. For example, if the observation for lead_managers_name variable in the "main_dataset" is like Goldman Sach & Co IncCitigroup Global Markets IncBMO Capital Markets in a year, then rank_lead will be average rank of these three companies. But your code removes this kind of observations.

AbuChowdhury · Posted 02-17-2020 07:28 AM

Hi again,

Thanks. After running your code, company suffix is removed but comma is not removed after the name. For example, if the name is like Covey & Company, Inc. then after running your code the result is Covey & Company,. So how to remove , after Covey & Company?

Tom · Posted 02-15-2020 01:28 PM

@AbuChowdhury wrote:

Dear Experts,

I need to merge two datasets using the string variable (the name of the of the variable in the attached file is lead_managers_name) and year. Those names are clearly separated in the second dataset named "rank". Please note in the first dataset (named "main_dataset") that many observations have more than one names joined together. In those cases, the variable rank_lead will be the average of those names in that specific year.

Let me know if you need more information.

Best regards,

Abu

It is not clear what you are having trouble with. Are you have trouble matching the observations between the two dataset? Why? Are the names not spelled the same? How does YEAR enter into the question.

Please show the data in the problem description. Also describe the data. Again how does YEAR enter into this. Does one dataset have one observation per name and the other one observation per name/year combination?

Also show output you are trying to get for you example input. This will help clarify what you are trying to describe. For example cannot take an average of a name so it is not clear what you want for RANK_LEAD. Do you mean you want take the mean value of RANK_LEAD? Over what set of values do you want to take the mean?

AbuChowdhury · Posted 02-15-2020 03:39 PM

Hi Tom,

The names are not spelled similarly in the two datasets. Further, in some observations, there are more names of more companies and these names are tied together. But names of the companies are clear in "rank" dataset.

Year is just the year. "rank" dataset has the observations based on names and rank of the companies on a yearly basis. The names of companies are same every year but the rank may be different in different years.

For an observation in "main_dataset", if there is only one company name, then rank of that company will come from "rank" dataset. However, for an observation in the "main_dataset", if there are many company names, then rank for that observation will be the average rank of those companies in that year from "rank" dataset. Problem is that when there are many company names for an observation in the "main_dataset", those names are tied together. For example, if you go to row 13 of "main_dataset", then you will see the observation for lead_managers_name variable as JP Morgan & Co IncCitigroup Global Markets IncBMO Capital Markets. Actually there are 3 names here: JP Morgan, Citigroup Global Markets Inc and BMO Capital Markets (you will see these names separately by the variable "lead_managers_name in "rank" dataset and their respective ranks by the variable "rank_lead").

Patrick · Posted 02-16-2020 11:38 PM

@AbuChowdhury wrote:

Hi Tom,

The names are not spelled similarly in the two datasets. Further, in some observations, there are more names of more companies and these names are tied together.

You're actually lucky with the "tied together" bit. Looking into your data the names are separated by a control character. Below code is an option how you can split them.

data main_dataset_2(drop=_i);
  set main_dataset;
  if 0 then lead_managers_name_2=lead_managers_name;
  do _i=1 by 1;
    lead_managers_name_2=scan(lead_managers_name,_i,,'c');
    if missing(lead_managers_name_2) then leave;
    output;
  end;  
run;

You will still have to clean up the names (the INC, LTD etc stuff) in both tables before joining.

AbuChowdhury · Posted 02-17-2020 04:59 AM

Hi Patrick,

Thanks a lot. It works.

Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

Re: Merge Two Datasets using String Variable and Year

SAS Innovate 2025: Call for Content

Classroom Training Available!