BookmarkSubscribeRSS Feed
anasmalik13
Fluorite | Level 6

Hello Experts,

I have a dataset comprised of four variables.

           Country Code displays different cities.

           Biography displays detail of several individuals working in those different cities. 

           Names display the names of individuals whose working status is known in next column 'Status.

For example for Country AB, I have Biography of four people (whose Employment Status is Unknown) and Names of other four people (whose Employement Status is Known). I want to find the city-wise matching strings from Names column to Biography column:If names are martched with any of the names presented in Biography column then resulting would be their status.

The output should be as:

               City                     Matched Name                Status

               AB                     Joseph T. Andrew             Lawyer

 

Considering the folowing data table, can anyone guide me how to code it in SAS please?

 

City Code                          Biography                                                         Names                    Status

AB                      Mr. Joseph Andrew is hardworking person.            Blanc Angel                  Teacher

AB                      Ms. Angela Blanc is genius.                                    Albert Crew                   Doctor

AB                      Mrs. Sarah Greene was absent.                             Rubin Ervin                    Engineer

AB                      Mr. Rubin D. Elvin cleans the room.                       Joseph T. Andrew          Lawyer

AC                      Ms. Luis Fernandez is a teacher.                            Luis V. Bob                    Doctor

AC                      Mr. Jose Antonio works from home.                       J. Antonio                      Teacher

AC                      Mr. Luis Van Bob comes late.                                 Luis Fernandez              Lawyer

AD                      Ms. Maria Jose travelled last week.                       Accent Andiron               Worker

AD                      Mr. Philip Bursa will help you.                                Maria Jose Antonio        Student

17 REPLIES 17
jimbarbour
Meteorite | Level 14

For the data that you have posted, what would the results that you want look like?

 

Jim

anasmalik13
Fluorite | Level 6
I want to code the employment status variable. For this, I have details of several persons having unknown employment status and names of some individual with their known employment status. I want to identify the common names and assign their respective employment status. It is like I have details of a person in Bio column, that name can or cannot be found in Names column. However, if it is found
then I can say for country AB, the Joseph T. Andrew would be working as Lawyer.
andreas_lds
Jade | Level 19

Please post the data you have in usable form and exactly what you expect as result - again a data step using datalines is the preferred way, but a table will do also. Then explain for each observation in the expected result why that row was added.

anasmalik13
Fluorite | Level 6

Hello Experts, I would be grateful if anyone can help me solving this issue.

I have the following two datasets.

In DATASET-I, I have two variables, one is City Codes (ED=Edmonton, TO= Toronto and CA= Calgary) and the second is Biography of different persons.

In DATASET-II, I have three variables, one is City Code (same as dataset-I), second is the Names and third is their respective Status.

                            DATASET-I                                                                          DATASET-II

Code              Biography                                                 Code           Names                      Status

ED       Mr. Joseph Andrew is hardworking person.        ED          Blanc Angela                Teacher

ED       Ms. Angela Blanc is genius.                                ED          Albert Crew                  Doctor

ED       Mrs. Sarah Greene was absent.                         ED          Joseph T. Andrew         Lawyer

ED       Mr. Rubin D. Elvin cleans the room.                    TO          Luis V. Bob                   Doctor

TO       Ms. Luis Fernandez is a teacher.                        TO          J. Antonio                     Teacher

TO       Mr. Jose Antonio works from home.                    CA          Accent Andiron             Worker

TO       Mr. Luis Van Bob comes late.                              CA         Maria Jose Antonio       Student

CA       Ms. Maria Jose travelled last week.                                      

CA       Mr. Philip Bursa will help you.                                               

 

What I want?

First. I want to compare the names available inDATASET-II with the names available in Biography of DATASET-I and identify the city-wise names those sound similar or common in both Datasets. The unmatched names will be skipped.

Second, I want to extract those city wise common names and assign their respective status, e.g., the output can be as:

                                                DATAOUTPUT

               Code                     Common Names                              Status

                ED                        Joseph T. Andrew                            Lawyer

                ED                        Angela Blank                                   Teacher

                TO                        Luis Van Bob                                   Doctor

                TO                        Jose Antonio                                   Teacher

                CA                        Maria Jose Antonio                         Studnet

 

Can anyone guide me how to code it in SAS please?

 

jimbarbour
Meteorite | Level 14

It will be more difficult to match -- reliably -- something like "Jose Antonio" with "J Antonio".  However, I have written some code that will do basic text matching.  It still needs a bit of tuning.  If you are a good SAS coder, you can adjust the code as you see fit.  I'll in include the code below.  Otherwise, I'll try to spend some more time with it on Thursday or Friday.

 

Jim

 

Results, immediately below.  Luis Van Bob is incorrect.  That case needs more work.

jimbarbour_0-1603353838558.png

 

Code:

DATA	Biographies;
	INPUT	City_Code	&	:	$2.
			Biography	&	:	$256.
			;
DATALINES;
ED                      Mr. Joseph Andrew is hardworking person.          
ED                      Ms. Angela Blanc is genius.                       
ED                      Mrs. Sarah Greene was absent.                     
ED                      Mr. Rubin D. Elvin cleans the room.               
TO                      Ms. Luis Fernandez is a teacher.                  
TO                      Mr. Jose Antonio works from home.                 
TO                      Mr. Luis Van Bob comes late.                      
CA                      Ms. Maria Jose travelled last week.               
CA                      Mr. Philip Bursa will help you.                   
;
RUN;

DATA	Name_And_Status;
	DROP	_:;
	RETAIN	_Max_Names	0;

	INPUT	City_Cd		&	:	$2.
			Name		&	:	$64.
			Status		&	:	$32.
			;

	_Rec_Cnt								+	1;
	Name_Cnt								=	COUNTW(Name, ' ');
	IF	Name_Cnt							>	_Max_Names	THEN
		_Max_Names							=	Name_Cnt;

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

DATALINES;
ED     Blanc Angela            Teacher
ED     Albert Crew             Doctor
ED     Rubin Elvin             Engineer
ED     Joseph T. Andrew        Lawyer
TO     Luis V. Bob             Doctor
TO     J. Antonio              Teacher
TO     Luis Fernandez          Lawyer
CA     Accent Andiron          Worker
CA     Maria Jose Antonio      Student
;
RUN;

DATA	Matched_Employment_Status;
	DROP	_:;
	DROP	City_Cd;
	DROP	Name;
	DROP	Status;
	DROP	Name_Cnt;

	ARRAY	Names		[&Rec_Cnt, &Max_Names]	$32	_TEMPORARY_;
	ARRAY	Statuses	[&Rec_Cnt]				$32	_TEMPORARY_;
	ARRAY	Codes		[&Rec_Cnt]				$2	_TEMPORARY_;
	ARRAY	Match		[&Rec_Cnt]				3	_TEMPORARY_;

	IF	_N_									=	1	THEN
		DO;
			DO	UNTIL	(_End_Of_Names);
				SET	Name_And_Status		END	=	_End_Of_Names;
				_Name						+	1;
				Statuses[_Name]				=	Status;
				Codes	[_Name]				=	City_Cd;
				DO	_i						=	1	TO	Name_Cnt;
					Names[_Name, _i]		=	COMPRESS(SCAN(Name, _i),,'KA');
*					PUTLOG	"&Nte1  "   _Name=   _i=  Names[_Name, _i]=  Statuses[_Name]=  Codes[_Name]=  ;
				END;
			END;
		END;

*	STOP;

	SET	Biographies;

	PUTLOG	"&Nte2  ";
	
	_Match_Cnt								=	0;
	_Token_Cnt								=	COUNTW(Biography, ' ');
	PUTLOG	"&Nte1  "  _N_=  _Token_Cnt=;

	DO	_i									=	1	TO	_Token_Cnt;
		DO	_j								=	1	TO	&Rec_Cnt;
			DO	_k							=	1	TO	&Max_Names;
				_Bio						=	COMPRESS(SCAN(Biography, _i),,'KA');
				IF	City_Code				=	Codes[_j]				AND
					SCAN(Biography, _i)		=	Names[_j, _k]			THEN
					DO;
						_Match_Cnt			+	1;
						Match[_Match_Cnt]	=	_j;
					END;
				PUTLOG	"&Nte2  "  _N_=  _i=  _j=  _k=  _Bio=  Statuses[_j]=  Names[_j, _k]=  _Match_Cnt=;
			END;
			IF	_Match_Cnt					>=	2						THEN
				DO;
					_Match					=	1;
					DO	_l					=	1	TO	(_Match_Cnt - 1);
						IF	Match[_l]		^=	Match[_l + 1]			THEN	
							_Match			=	1;
					END;
					IF	_Match											THEN
						DO;
							Employ_Status	=	Statuses[_j];
							_j				=	&Rec_Cnt;
							_i				=	_Token_Cnt;
						END;
				END;
		END;
	END;
RUN;
jimbarbour
Meteorite | Level 14

OK, a little sleep, and I spotted the deficiency in my program.  I now have it matching to the correct people:

jimbarbour_0-1603387438586.png

 

As I say, though, when your Dataset II (the master definition of who has what status) has only a first initial, it's going to be difficult to match against.  It will be difficult to distinguish a J. Antonio from a Maria Jose Antonio, although a secondary match on City Code helps tremendously.  There is always the danger of a false positive, i.e. associating someone with a Status that is not really theirs.

 

I set my matching to require two matches.  For example, for the name "Luis Van Bob," you could have up to three matches ("Luis", "Van", and "Bob").  There has to be a minimum of two matches for the program to consider the match to be a positive name match.  When trying to match, for example, "Luis Fernandez" with "Luis Van Bob", the match will be = 1 because only "Luis" matches.  The program does not consider this to be a name match.  However "Luis V Bob" and "Luis Van Bob" would be = 2 ("Luis" and "Bob"), and therefore the program would consider this a name match. 

 

You could still get false positives, but city matching helps much in this regard.  For example, Jose Antonio and Maria Jose Antonio could easily be falsely matched, but since Jose Antonio is from Toronto and Maria Jose Antonio is from Calgary, they will not be matched.

 

WARNING:  Even matching on both city and name may still give some false positives.  For really common names, there absolutely could be two people with the same first and last names in a large city.  Name matching is never 100%.

 

We can increase the number of matches if we use something like the COMPGED function or the SPEDIS function, but this also increases our chances of false positives.  See https://support.sas.com/resources/papers/proceedings14/1674-2014.pdf

 

The SAS code is below.

 

Jim

 

DATA	Biographies;
	INPUT	City_Code	&	:	$2.
			Biography	&	:	$256.
			;
DATALINES;
ED   Mr. Joseph Andrew is hardworking person.          
ED   Ms. Angela Blanc is genius.                       
ED   Mrs. Sarah Greene was absent.                     
ED   Mr. Rubin D. Elvin cleans the room.               
TO   Ms. Luis Fernandez is a teacher.                  
TO   Mr. Jose Antonio works from home.                 
TO   Mr. Luis Van Bob comes late.                      
CA   Ms. Maria Jose travelled last week.               
CA   Mr. Philip Bursa will help you.                   
;
RUN;

**------------------------------------------------------------------------------**;

DATA	Name_And_Status;
	DROP	_:;
	RETAIN	_Max_Names	0;

	INPUT	City_Cd		&	:	$2.
			Name		&	:	$64.
			Status		&	:	$32.
			;

	_Rec_Cnt								+	1;
	Name_Cnt								=	COUNTW(Name, ' ');
	IF	Name_Cnt							>	_Max_Names	THEN
		_Max_Names							=	Name_Cnt;

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

DATALINES;
ED     Blanc Angela            Teacher
ED     Albert Crew             Doctor
ED     Rubin Elvin             Engineer
ED     Joseph T. Andrew        Lawyer
TO     Luis V. Bob             Doctor
TO     J. Antonio              Teacher
TO     Luis Fernandez          Lawyer
CA     Accent Andiron          Worker
CA     Maria Jose Antonio      Student
;
RUN;

**------------------------------------------------------------------------------**;

DATA	Matched_Employment_Status;
	DROP	_:;
	DROP	City_Cd;
	DROP	Name;
	DROP	Status;
	DROP	Name_Cnt;

	ARRAY	Names		[&Rec_Cnt, &Max_Names]	$32	_TEMPORARY_;
	ARRAY	Statuses	[&Rec_Cnt]				$32	_TEMPORARY_;
	ARRAY	Codes		[&Rec_Cnt]				$2	_TEMPORARY_;
	ARRAY	Match		[&Rec_Cnt]				3	_TEMPORARY_;

	IF	_N_									=	1	THEN
		DO;
			DO	UNTIL	(_End_Of_Names);
				SET	Name_And_Status		END	=	_End_Of_Names;
				_j							+	1;
				Statuses[_j]				=	Status;
				Codes	[_j]				=	City_Cd;
				DO	_k						=	1	TO	Name_Cnt;
					Names[_j, _k]			=	COMPRESS(SCAN(Name, _k),,'KA');
				END;
			END;
		END;

	SET	Biographies;

	CALL	MISSING	(of	Match[*]);
	_Token_Cnt								=	COUNTW(Biography, ' ');

	DO	_i									=	1	TO	_Token_Cnt;
		DO	_j								=	1	TO	&Rec_Cnt;
			DO	_k							=	1	TO	&Max_Names;
				_Bio						=	COMPRESS(SCAN(Biography, _i),,'KA');
				IF	City_Code				=	Codes[_j]				AND
					SCAN(Biography, _i)		=	Names[_j, _k]			THEN
					DO;
						Match[_j]			+	1;
					END;
			END;
		END;
	END;

	_Max_Match								=	0;

	DO	_j									=	1	TO	&Rec_Cnt;
		IF	Match[_j]						>=	2						THEN
			DO;
				IF	Match[_j]				>	_Max_Match				THEN
					DO;
						Employ_Status		=	Statuses[_j];
						_Max_Match			=	Match[_j];
					END;
			END;
	END;
RUN;

 

anasmalik13
Fluorite | Level 6

Thank you for guiding. The code works for the variables you have created but unfortunately I am stuck in creation of temporary arrays for the solution of my dataset. I find the following message in the log.

ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
                           -
                                 22
                                 76
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.

ERROR 76-322: Syntax error, statement will be ignored.

jimbarbour
Meteorite | Level 14

@anasmalik13,

 

Please run the following code;

%PUT  NOTE:  &=Rec_Cnt  &=Max_Names;

Before the data step with this code runs:

ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;

 

Rec_Cnt and Max_Names need to be set to positive integers for the arrays to work.  It looks like Rec_Cnt is not being set.

 

Please post the log, and please use the following buttons here on the forum for posting the log:

jimbarbour_0-1603487315957.png

 

Jim

anasmalik13
Fluorite | Level 6

 

 

Thank you again for your prompt support. I am using SAS 9.4 but unfortunately, the code is still not working on it. It looks like I am commmitting any mistake but whta is that I dont know. Please see the following log as reference.
1    DATA    Matched_Employment_Status;
2       DROP    _:;
3       DROP    City_Cd;
4       DROP    Name;
5       DROP    Status;
6       DROP    Name_Cnt;
WARNING: Apparent symbolic reference REC_CNT not resolved.
WARNING: Apparent symbolic reference MAX_NAMES not resolved.
7   %PUT  NOTE:  &=Rec_Cnt  &=Max_Names;
NOTE:  Rec_Cnt  Max_Names
WARNING: Apparent symbolic reference REC_CNT not resolved.
WARNING: Apparent symbolic reference MAX_NAMES not resolved.
8   ARRAY   Names       [&Rec_Cnt, &Max_Names]  $32 _TEMPORARY_;
                          -
                          22
                          76
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.

ERROR 76-322: Syntax error, statement will be ignored.

 

 

jimbarbour
Meteorite | Level 14

@anasmalik13,

 

You have to create those two macro variables when you read in the biographical information.  They are created in this Data step:

DATA	Name_And_Status;
	DROP	_:;
	RETAIN	_Max_Names	0;

	INPUT	City_Cd		&	:	$2.
			Name		&	:	$64.
			Status		&	:	$32.
			;

	_Rec_Cnt								+	1;
	Name_Cnt								=	COUNTW(Name, ' ');
	IF	Name_Cnt							>	_Max_Names	THEN
		_Max_Names							=	Name_Cnt;

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

DATALINES;
ED     Blanc Angela            Teacher
ED     Albert Crew             Doctor
ED     Rubin Elvin             Engineer
ED     Joseph T. Andrew        Lawyer
TO     Luis V. Bob             Doctor
TO     J. Antonio              Teacher
TO     Luis Fernandez          Lawyer
CA     Accent Andiron          Worker
CA     Maria Jose Antonio      Student
;
RUN;

Here are the two key statements:

	CALL	SYMPUTX('Rec_Cnt',		_Rec_Cnt, 	'G');
	CALL	SYMPUTX('Max_Names',	_Max_Names, 'G');

Without those two statements, the macro variables will not be created, and the code will not run in the main Data step.  The arrays use the values stored in Rec_Cnt and Max_Names to dimension the arrays, and the subsequent SAS code uses them also to control the matching.  Without those two macro variables, the code will never work.

ARRAY   Names       [&Rec_Cnt, &Max_Names]  $32 _TEMPORARY_;

 

Jim

 

anasmalik13
Fluorite | Level 6

Thank you for your support. I run the code and it did not show any error in the log nor it generated any result. It would be two hours that it is still processing it. I have two datasets with 1.6 million and 100k observations each. Is not any optimize solution?

jimbarbour
Meteorite | Level 14

So your name and status dataset has 100K records?  And you have 16 million biographies?  How many words are in each biography?

 

These are very large files for this type of matching.  I would want to break the files up otherwise it will take forever to run.

 

I would probably separate all files into individual files by city code.  You don't want to match someone in Edmonton to someone in Vancouver.  They're not one and the same.  After you separate the files into files for each city, run a separate job for each major city.  Small cities could be combined.  Run a small test file first perhaps, but then run all of the jobs simultaneously (i.e. in parallel).  Parallel processing is probably the only way to speed something of this magnitude up.

 

Again though, be careful.  You are probably going to get multiple people with the exact same name in large cities.  Matching by name is never 100%.  It's always best to match on a unique ID number whenever possible.  You probably already know this, but it bears repeating.  Be careful and expect false matches.

 

Jim

 

 

 

anasmalik13
Fluorite | Level 6

Hey Jim, thank you for being so supportive. I have splitted the datasets into chunks of maximum 2-5k observations each, still SAS takes minimum 15 mins to process these observations. I have tried this code on three different datasets but unfortunately the required results are not generated. For example, I get two columns from the biography dataset (city code & biography) and the third matching column (employement status) which is found empty, as there is no match.

If you can try to look the issues, should I share my data? 

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 17 replies
  • 2084 views
  • 11 likes
  • 3 in conversation