How to compare and Identify the similar characters or words such as Na...

anasmalik13 · Posted 10-21-2020 10:23 PM

Hello Experts,

I have a dataset comprised of four variables.

Country Code displays different cities.

Biography displays detail of several individuals working in those different cities.

Names display the names of individuals whose working status is known in next column 'Status.

For example for Country AB, I have Biography of four people (whose Employment Status is Unknown) and Names of other four people (whose Employement Status is Known). I want to find the city-wise matching strings from Names column to Biography column:If names are martched with any of the names presented in Biography column then resulting would be their status.

The output should be as:

City Matched Name Status

AB Joseph T. Andrew Lawyer

Considering the folowing data table, can anyone guide me how to code it in SAS please?

City Code Biography Names Status

AB Mr. Joseph Andrew is hardworking person. Blanc Angel Teacher

AB Ms. Angela Blanc is genius. Albert Crew Doctor

AB Mrs. Sarah Greene was absent. Rubin Ervin Engineer

AB Mr. Rubin D. Elvin cleans the room. Joseph T. Andrew Lawyer

AC Ms. Luis Fernandez is a teacher. Luis V. Bob Doctor

AC Mr. Jose Antonio works from home. J. Antonio Teacher

AC Mr. Luis Van Bob comes late. Luis Fernandez Lawyer

AD Ms. Maria Jose travelled last week. Accent Andiron Worker

AD Mr. Philip Bursa will help you. Maria Jose Antonio Student

jimbarbour · Posted 10-21-2020 10:50 PM

For the data that you have posted, what would the results that you want look like?

Jim

anasmalik13 · Posted 10-21-2020 11:11 PM

I want to code the employment status variable. For this, I have details of several persons having unknown employment status and names of some individual with their known employment status. I want to identify the common names and assign their respective employment status. It is like I have details of a person in Bio column, that name can or cannot be found in Names column. However, if it is found
then I can say for country AB, the Joseph T. Andrew would be working as Lawyer.

andreas_lds · Posted 10-22-2020 12:51 AM

Please post the data you have in usable form and exactly what you expect as result - again a data step using datalines is the preferred way, but a table will do also. Then explain for each observation in the expected result why that row was added.

anasmalik13 · Posted 10-22-2020 03:16 AM

Hello Experts, I would be grateful if anyone can help me solving this issue.

I have the following two datasets.

In DATASET-I, I have two variables, one is City Codes (ED=Edmonton, TO= Toronto and CA= Calgary) and the second is Biography of different persons.

In DATASET-II, I have three variables, one is City Code (same as dataset-I), second is the Names and third is their respective Status.

DATASET-I DATASET-II

Code Biography Code Names Status

ED Mr. Joseph Andrew is hardworking person. ED Blanc Angela Teacher

ED Ms. Angela Blanc is genius. ED Albert Crew Doctor

ED Mrs. Sarah Greene was absent. ED Joseph T. Andrew Lawyer

ED Mr. Rubin D. Elvin cleans the room. TO Luis V. Bob Doctor

TO Ms. Luis Fernandez is a teacher. TO J. Antonio Teacher

TO Mr. Jose Antonio works from home. CA Accent Andiron Worker

TO Mr. Luis Van Bob comes late. CA Maria Jose Antonio Student

CA Ms. Maria Jose travelled last week.

CA Mr. Philip Bursa will help you.

What I want?

First. I want to compare the names available inDATASET-II with the names available in Biography of DATASET-I and identify the city-wise names those sound similar or common in both Datasets. The unmatched names will be skipped.

Second, I want to extract those city wise common names and assign their respective status, e.g., the output can be as:

DATAOUTPUT

Code Common Names Status

ED Joseph T. Andrew Lawyer

ED Angela Blank Teacher

TO Luis Van Bob Doctor

TO Jose Antonio Teacher

CA Maria Jose Antonio Studnet

Can anyone guide me how to code it in SAS please?

andreas_lds · Posted 10-22-2020 03:27 AM

Please don't double post questions!

jimbarbour · Posted 10-22-2020 04:05 AM

It will be more difficult to match -- reliably -- something like "Jose Antonio" with "J Antonio". However, I have written some code that will do basic text matching. It still needs a bit of tuning. If you are a good SAS coder, you can adjust the code as you see fit. I'll in include the code below. Otherwise, I'll try to spend some more time with it on Thursday or Friday.

Jim

Results, immediately below. Luis Van Bob is incorrect. That case needs more work.

Code:

DATA	Biographies;
	INPUT	City_Code	&	:	$2.
			Biography	&	:	$256.
			;
DATALINES;
ED                      Mr. Joseph Andrew is hardworking person.          
ED                      Ms. Angela Blanc is genius.                       
ED                      Mrs. Sarah Greene was absent.                     
ED                      Mr. Rubin D. Elvin cleans the room.               
TO                      Ms. Luis Fernandez is a teacher.                  
TO                      Mr. Jose Antonio works from home.                 
TO                      Mr. Luis Van Bob comes late.                      
CA                      Ms. Maria Jose travelled last week.               
CA                      Mr. Philip Bursa will help you.                   
;
RUN;

DATA	Name_And_Status;
	DROP	_:;
	RETAIN	_Max_Names	0;

	INPUT	City_Cd		&	:	$2.
			Name		&	:	$64.
			Status		&	:	$32.
			;

	_Rec_Cnt								+	1;
	Name_Cnt								=	COUNTW(Name, ' ');
	IF	Name_Cnt							>	_Max_Names	THEN
		_Max_Names							=	Name_Cnt;

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

DATALINES;
ED     Blanc Angela            Teacher
ED     Albert Crew             Doctor
ED     Rubin Elvin             Engineer
ED     Joseph T. Andrew        Lawyer
TO     Luis V. Bob             Doctor
TO     J. Antonio              Teacher
TO     Luis Fernandez          Lawyer
CA     Accent Andiron          Worker
CA     Maria Jose Antonio      Student
;
RUN;

DATA	Matched_Employment_Status;
	DROP	_:;
	DROP	City_Cd;
	DROP	Name;
	DROP	Status;
	DROP	Name_Cnt;

	ARRAY	Names		[&Rec_Cnt, &Max_Names]	$32	_TEMPORARY_;
	ARRAY	Statuses	[&Rec_Cnt]				$32	_TEMPORARY_;
	ARRAY	Codes		[&Rec_Cnt]				$2	_TEMPORARY_;
	ARRAY	Match		[&Rec_Cnt]				3	_TEMPORARY_;

	IF	_N_									=	1	THEN
		DO;
			DO	UNTIL	(_End_Of_Names);
				SET	Name_And_Status		END	=	_End_Of_Names;
				_Name						+	1;
				Statuses[_Name]				=	Status;
				Codes	[_Name]				=	City_Cd;
				DO	_i						=	1	TO	Name_Cnt;
					Names[_Name, _i]		=	COMPRESS(SCAN(Name, _i),,'KA');
*					PUTLOG	"&Nte1  "   _Name=   _i=  Names[_Name, _i]=  Statuses[_Name]=  Codes[_Name]=  ;
				END;
			END;
		END;

*	STOP;

	SET	Biographies;

	PUTLOG	"&Nte2  ";
	
	_Match_Cnt								=	0;
	_Token_Cnt								=	COUNTW(Biography, ' ');
	PUTLOG	"&Nte1  "  _N_=  _Token_Cnt=;

	DO	_i									=	1	TO	_Token_Cnt;
		DO	_j								=	1	TO	&Rec_Cnt;
			DO	_k							=	1	TO	&Max_Names;
				_Bio						=	COMPRESS(SCAN(Biography, _i),,'KA');
				IF	City_Code				=	Codes[_j]				AND
					SCAN(Biography, _i)		=	Names[_j, _k]			THEN
					DO;
						_Match_Cnt			+	1;
						Match[_Match_Cnt]	=	_j;
					END;
				PUTLOG	"&Nte2  "  _N_=  _i=  _j=  _k=  _Bio=  Statuses[_j]=  Names[_j, _k]=  _Match_Cnt=;
			END;
			IF	_Match_Cnt					>=	2						THEN
				DO;
					_Match					=	1;
					DO	_l					=	1	TO	(_Match_Cnt - 1);
						IF	Match[_l]		^=	Match[_l + 1]			THEN	
							_Match			=	1;
					END;
					IF	_Match											THEN
						DO;
							Employ_Status	=	Statuses[_j];
							_j				=	&Rec_Cnt;
							_i				=	_Token_Cnt;
						END;
				END;
		END;
	END;
RUN;

jimbarbour · Posted 10-22-2020 01:35 PM

OK, a little sleep, and I spotted the deficiency in my program. I now have it matching to the correct people:

As I say, though, when your Dataset II (the master definition of who has what status) has only a first initial, it's going to be difficult to match against. It will be difficult to distinguish a J. Antonio from a Maria Jose Antonio, although a secondary match on City Code helps tremendously. There is always the danger of a false positive, i.e. associating someone with a Status that is not really theirs.

I set my matching to require two matches. For example, for the name "Luis Van Bob," you could have up to three matches ("Luis", "Van", and "Bob"). There has to be a minimum of two matches for the program to consider the match to be a positive name match. When trying to match, for example, "Luis Fernandez" with "Luis Van Bob", the match will be = 1 because only "Luis" matches. The program does not consider this to be a name match. However "Luis V Bob" and "Luis Van Bob" would be = 2 ("Luis" and "Bob"), and therefore the program would consider this a name match.

You could still get false positives, but city matching helps much in this regard. For example, Jose Antonio and Maria Jose Antonio could easily be falsely matched, but since Jose Antonio is from Toronto and Maria Jose Antonio is from Calgary, they will not be matched.

WARNING: Even matching on both city and name may still give some false positives. For really common names, there absolutely could be two people with the same first and last names in a large city. Name matching is never 100%.

We can increase the number of matches if we use something like the COMPGED function or the SPEDIS function, but this also increases our chances of false positives. See https://support.sas.com/resources/papers/proceedings14/1674-2014.pdf

The SAS code is below.

Jim

DATA	Biographies;
	INPUT	City_Code	&	:	$2.
			Biography	&	:	$256.
			;
DATALINES;
ED   Mr. Joseph Andrew is hardworking person.          
ED   Ms. Angela Blanc is genius.                       
ED   Mrs. Sarah Greene was absent.                     
ED   Mr. Rubin D. Elvin cleans the room.               
TO   Ms. Luis Fernandez is a teacher.                  
TO   Mr. Jose Antonio works from home.                 
TO   Mr. Luis Van Bob comes late.                      
CA   Ms. Maria Jose travelled last week.               
CA   Mr. Philip Bursa will help you.                   
;
RUN;

**------------------------------------------------------------------------------**;

DATA	Name_And_Status;
	DROP	_:;
	RETAIN	_Max_Names	0;

	INPUT	City_Cd		&	:	$2.
			Name		&	:	$64.
			Status		&	:	$32.
			;

	_Rec_Cnt								+	1;
	Name_Cnt								=	COUNTW(Name, ' ');
	IF	Name_Cnt							>	_Max_Names	THEN
		_Max_Names							=	Name_Cnt;

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

DATALINES;
ED     Blanc Angela            Teacher
ED     Albert Crew             Doctor
ED     Rubin Elvin             Engineer
ED     Joseph T. Andrew        Lawyer
TO     Luis V. Bob             Doctor
TO     J. Antonio              Teacher
TO     Luis Fernandez          Lawyer
CA     Accent Andiron          Worker
CA     Maria Jose Antonio      Student
;
RUN;

**------------------------------------------------------------------------------**;

DATA	Matched_Employment_Status;
	DROP	_:;
	DROP	City_Cd;
	DROP	Name;
	DROP	Status;
	DROP	Name_Cnt;

	ARRAY	Names		[&Rec_Cnt, &Max_Names]	$32	_TEMPORARY_;
	ARRAY	Statuses	[&Rec_Cnt]				$32	_TEMPORARY_;
	ARRAY	Codes		[&Rec_Cnt]				$2	_TEMPORARY_;
	ARRAY	Match		[&Rec_Cnt]				3	_TEMPORARY_;

	IF	_N_									=	1	THEN
		DO;
			DO	UNTIL	(_End_Of_Names);
				SET	Name_And_Status		END	=	_End_Of_Names;
				_j							+	1;
				Statuses[_j]				=	Status;
				Codes	[_j]				=	City_Cd;
				DO	_k						=	1	TO	Name_Cnt;
					Names[_j, _k]			=	COMPRESS(SCAN(Name, _k),,'KA');
				END;
			END;
		END;

	SET	Biographies;

	CALL	MISSING	(of	Match[*]);
	_Token_Cnt								=	COUNTW(Biography, ' ');

	DO	_i									=	1	TO	_Token_Cnt;
		DO	_j								=	1	TO	&Rec_Cnt;
			DO	_k							=	1	TO	&Max_Names;
				_Bio						=	COMPRESS(SCAN(Biography, _i),,'KA');
				IF	City_Code				=	Codes[_j]				AND
					SCAN(Biography, _i)		=	Names[_j, _k]			THEN
					DO;
						Match[_j]			+	1;
					END;
			END;
		END;
	END;

	_Max_Match								=	0;

	DO	_j									=	1	TO	&Rec_Cnt;
		IF	Match[_j]						>=	2						THEN
			DO;
				IF	Match[_j]				>	_Max_Match				THEN
					DO;
						Employ_Status		=	Statuses[_j];
						_Max_Match			=	Match[_j];
					END;
			END;
	END;
RUN;

anasmalik13 · Posted 10-23-2020 02:02 AM

Thank you for guiding. The code works for the variables you have created but unfortunately I am stuck in creation of temporary arrays for the solution of my dataset. I find the following message in the log.

ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.

ERROR 76-322: Syntax error, statement will be ignored.

jimbarbour · Posted 10-23-2020 05:04 PM

@anasmalik13,

Please run the following code;

%PUT  NOTE:  &=Rec_Cnt  &=Max_Names;

Before the data step with this code runs:

ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;

Rec_Cnt and Max_Names need to be set to positive integers for the arrays to work. It looks like Rec_Cnt is not being set.

Please post the log, and please use the following buttons here on the forum for posting the log:

Jim

anasmalik13 · Posted 10-23-2020 05:18 PM

Thank you again for your prompt support. I am using SAS 9.4 but unfortunately, the code is still not working on it. It looks like I am commmitting any mistake but whta is that I dont know. Please see the following log as reference.
1    DATA    Matched_Employment_Status;
2       DROP    _:;
3       DROP    City_Cd;
4       DROP    Name;
5       DROP    Status;
6       DROP    Name_Cnt;
WARNING: Apparent symbolic reference REC_CNT not resolved.
WARNING: Apparent symbolic reference MAX_NAMES not resolved.
7   %PUT  NOTE:  &=Rec_Cnt  &=Max_Names;
NOTE:  Rec_Cnt  Max_Names
WARNING: Apparent symbolic reference REC_CNT not resolved.
WARNING: Apparent symbolic reference MAX_NAMES not resolved.
8   ARRAY   Names       [&Rec_Cnt, &Max_Names]  $32 _TEMPORARY_;
                          -
                          22
                          76
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.

ERROR 76-322: Syntax error, statement will be ignored.

jimbarbour · Posted 10-23-2020 05:27 PM

@anasmalik13,

You have to create those two macro variables when you read in the biographical information. They are created in this Data step:

DATA	Name_And_Status;
	DROP	_:;
	RETAIN	_Max_Names	0;

	INPUT	City_Cd		&	:	$2.
			Name		&	:	$64.
			Status		&	:	$32.
			;

	_Rec_Cnt								+	1;
	Name_Cnt								=	COUNTW(Name, ' ');
	IF	Name_Cnt							>	_Max_Names	THEN
		_Max_Names							=	Name_Cnt;

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

DATALINES;
ED     Blanc Angela            Teacher
ED     Albert Crew             Doctor
ED     Rubin Elvin             Engineer
ED     Joseph T. Andrew        Lawyer
TO     Luis V. Bob             Doctor
TO     J. Antonio              Teacher
TO     Luis Fernandez          Lawyer
CA     Accent Andiron          Worker
CA     Maria Jose Antonio      Student
;
RUN;

Here are the two key statements:

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

Without those two statements, the macro variables will not be created, and the code will not run in the main Data step. The arrays use the values stored in Rec_Cnt and Max_Names to dimension the arrays, and the subsequent SAS code uses them also to control the matching. Without those two macro variables, the code will never work.

ARRAY   Names       [&Rec_Cnt, &Max_Names]  $32 _TEMPORARY_;

Jim

anasmalik13 · Posted 10-23-2020 11:43 PM

Thank you for your support. I run the code and it did not show any error in the log nor it generated any result. It would be two hours that it is still processing it. I have two datasets with 1.6 million and 100k observations each. Is not any optimize solution?

jimbarbour · Posted 10-24-2020 12:11 AM

So your name and status dataset has 100K records? And you have 16 million biographies? How many words are in each biography?

These are very large files for this type of matching. I would want to break the files up otherwise it will take forever to run.

I would probably separate all files into individual files by city code. You don't want to match someone in Edmonton to someone in Vancouver. They're not one and the same. After you separate the files into files for each city, run a separate job for each major city. Small cities could be combined. Run a small test file first perhaps, but then run all of the jobs simultaneously (i.e. in parallel). Parallel processing is probably the only way to speed something of this magnitude up.

Again though, be careful. You are probably going to get multiple people with the exact same name in large cities. Matching by name is never 100%. It's always best to match on a unique ID number whenever possible. You probably already know this, but it bears repeating. Be careful and expect false matches.

Jim

anasmalik13 · Posted 10-25-2020 03:41 AM

Hey Jim, thank you for being so supportive. I have splitted the datasets into chunks of maximum 2-5k observations each, still SAS takes minimum 15 mins to process these observations. I have tried this code on three different datasets but unfortunately the required results are not generated. For example, I get two columns from the biography dataset (city code & biography) and the third matching column (employement status) which is found empty, as there is no match.

If you can try to look the issues, should I share my data?

How to compare and Identify the similar characters or words such as Names?

Re: How to compare and Identify the similar characters or words such as Names?

Re: How to compare and Identify the similar characters or words such as Names?

Re: How to compare and Identify the similar characters or words such as Names?

How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Re: How to compare and identify similar strings (i.e., Names) from two datasets?

Registration is open