Re: create two different datasets based on the original dataset

France · Posted 12-04-2018 03:08 PM

Dear all,

How can I create dataset A and dataset B from an original dataset (call it dataset C)?

For example, here is the original dataset (dataset C):

Company name	Country	Matched BvD ID	Matched company name
02 MICRO
02 MICRO	TW
02 MICRO	US
1...		GB04165791	BH (CITY FORUM) LIMITED (Previous name: 1)
1...	GB	GB04165791	BH (CITY FORUM) LIMITED (Previous name: 1)
1...	US
21(TWO-ONE) COMPANY
21(TWO-ONE) COMPANY	JP
21(TWO-ONE) COMPANY	US
3-D MATRIX		JP4010001087940	3-D MATRIX,LTD.
3-D MATRIX	JP	JP4010001087940	3-D MATRIX,LTD.
3-D MATRIX	KR
3-D MATRIX	US	US138675448L	MATRIX 3D LLC

I would like dataset A to look like this:

Company name	Country	Matched BvD ID	Matched company name
1...		GB04165791	BH (CITY FORUM) LIMITED (Previous name: 1)
1...	GB	GB04165791	BH (CITY FORUM) LIMITED (Previous name: 1)
1...	US

In dataset A, each Company_name group has only one distinct Matched_company_name (here, BH (CITY FORUM) LIMITED (Previous name: 1)).

I would also like to create dataset B, like this:

Company name	Country	Matched BvD ID	Matched company name
3-D MATRIX		JP4010001087940	3-D MATRIX,LTD.
3-D MATRIX	JP	JP4010001087940	3-D MATRIX,LTD.
3-D MATRIX	KR
3-D MATRIX	US	US138675448L	MATRIX 3D LLC

In dataset B, each Company_name group has at least two distinct Matched_company_name values (here, 3-D MATRIX,LTD. and MATRIX 3D LLC).

I would like to exclude the observations whose Company_name is '02 MICRO' or '21(TWO-ONE) COMPANY', as neither of them has any Matched_company_name value.

Could you please give me some suggestions on how to do this?

PeterClemmensen · Posted 12-04-2018 03:47 PM

How do you determine a ‘match’? How similar should the strings be?

The DATA to DATA Step Macro
Blog: SASnrd

mkeintz · Posted 12-04-2018 04:08 PM

As I see it, you plan to ignore any company for which the matchname is always blank. Otherwise, every record for a company (including its blank-matchname records) goes to one of two datasets, depending on the number of unique (non-blank) matchnames for that company, right? If so:

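data want1 want2;
  /* matchname here stands for the Matched_company_name column. */
  /* Pass 1: read one company_name group and count its runs of */
  /* consecutive identical non-blank matchname values. nmatches is */
  /* not retained, so it starts as missing for every group. */
  do until (last.company_name);
    set have;
    by company_name matchname notsorted;
    if last.matchname and matchname^=' ' then nmatches=sum(nmatches,1);
  end;
  /* Pass 2: re-read the same group and route every record, blank or */
  /* not, by that count. All-blank groups (nmatches missing) are dropped. */
  do until (last.company_name);
    set have;
    by company_name ;
    if nmatches=1 then output want1; else
    if nmatches>1 then output want2;
  end;
  drop nmatches;
run;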

Notes:

This assumes your dataset is sorted by company_name.
Within each company_name group, the data are assumed to be sub-grouped (but not necessarily in sorted order) by matchname.
It also assumes that there is no blank matchname in the middle of a non-blank matchname group, i.e. the data don't synthetically generate more matchname groups than actually exist.
Again, if matchname is always blank for a company, then there is no output for it, per your example.
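
If the sub-grouping assumptions in the notes above do not hold (for example, the same matchname shows up in non-adjacent rows of a company), the run count will overstate the number of unique names. Here is a minimal order-independent sketch using PROC SQL. The dataset name have and the variable names company_name, country, bvd_id and matchname are assumptions chosen to mirror the program above, and the intermediate table counts is a hypothetical helper. The datalines are the sample rows from the original post.

```sas
/* Sample rows from the original post; '|' is used as the delimiter */
/* because one matched name contains a comma. Names are assumed.    */
data have;
  infile datalines dlm='|' dsd truncover;
  length company_name $40 country $2 bvd_id $20 matchname $60;
  input company_name country bvd_id matchname;
  datalines;
02 MICRO|||
02 MICRO|TW||
02 MICRO|US||
1...||GB04165791|BH (CITY FORUM) LIMITED (Previous name: 1)
1...|GB|GB04165791|BH (CITY FORUM) LIMITED (Previous name: 1)
1...|US||
21(TWO-ONE) COMPANY|||
21(TWO-ONE) COMPANY|JP||
21(TWO-ONE) COMPANY|US||
3-D MATRIX||JP4010001087940|3-D MATRIX,LTD.
3-D MATRIX|JP|JP4010001087940|3-D MATRIX,LTD.
3-D MATRIX|KR||
3-D MATRIX|US|US138675448L|MATRIX 3D LLC
;
run;

proc sql;
  /* COUNT(DISTINCT ...) skips missing (blank) values, so companies */
  /* with only blank matchnames get nmatches = 0.                   */
  create table counts as
    select company_name,
           count(distinct matchname) as nmatches
    from have
    group by company_name;

  /* Exactly one distinct non-blank matchname: all rows of the company. */
  create table want1 as
    select h.*
    from have as h inner join counts as c
      on h.company_name = c.company_name
    where c.nmatches = 1;

  /* Two or more distinct non-blank matchnames: all rows of the company. */
  create table want2 as
    select h.*
    from have as h inner join counts as c
      on h.company_name = c.company_name
    where c.nmatches > 1;
quit;
```

With the sample rows, want1 should hold the three '1...' records and want2 the four '3-D MATRIX' records. PROC SQL does not guarantee the original row order, so add an ORDER BY clause if the order matters.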

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

France · Posted 12-05-2018 08:36 AM

Dear mkeintz,

Thank you for your suggestion. Your description is exactly what I need. However, some Company_name groups that have only one unique (non-blank) Matched_company_name are also included in the dataset 'want2'.

I have attached a sample (1,000 observations). Would you mind checking it?

Thanks in advance.

mkeintz · Posted 12-05-2018 10:00 AM

I think you're the right person to check with your new sample. See if the program produces what you intend.

ChrisNZ · Posted 12-04-2018 04:42 PM

Creating new tables is seldom needed and even less often a good idea.

In your case, you can probably use BY-group processing:

proc XXX;
  by Company_Name;
  where Matched_Company_Name ne ' ';
run;
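
As one concrete instance of that pattern (a sketch only, not necessarily the procedure you will end up needing): PROC FREQ with the NLEVELS option reports, for each Company_Name BY group, how many distinct non-blank matched names it has. The dataset name have is an assumption, and the data must already be sorted by Company_Name.

```sas
/* Assumes a dataset named HAVE, sorted by Company_Name (assumed names). */
proc freq data=have nlevels;
  by Company_Name;
  /* Rows with a blank matched name are filtered out before counting. */
  where Matched_Company_Name ne ' ';
  /* NOPRINT suppresses the frequency tables; the NLEVELS table remains. */
  tables Matched_Company_Name / noprint;
run;
```

Companies whose matched name is always blank drop out entirely because of the WHERE statement.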

Would that work for you?

Why do some matched records have no matched value?

High-Performance SAS Coding - Third Edition

mkeintz · Posted 12-04-2018 04:59 PM

@ChrisNZ

I think the OP wanted to distinguish companies with more than one non-blank matchname value, so a simple WHERE statement would not likely capture it. Companies with nothing but blanks seem to be ignored in the required sample output, but otherwise blank records go to the same destination as the non-blank records.

I suspect the OP has his/her own data set that is (fuzzy?) matched by name against company data from Bureau van Dijk (the BVD_ID column). Sometimes this yields multiple possibilities, and there likely needs to be a good deal of further "disambiguation", or some sort of data consolidation.

One of the problems with data from BvD, as I recall, was that (unlike many other vendors of corporate data) it did not provide tracking from year to year when there was a spin-off or merger. So a user wanting a longer data history would have to find some other way (including historical name matching) to properly link different data "vintages". It's not a database that is friendly to historical research.
