Hello Experts,
I have a dataset comprised of four variables.
Country Code displays different cities.
Biography displays detail of several individuals working in those different cities.
Names display the names of individuals whose working status is known in next column 'Status.
For example for Country AB, I have Biography of four people (whose Employment Status is Unknown) and Names of other four people (whose Employement Status is Known). I want to find the city-wise matching strings from Names column to Biography column:If names are martched with any of the names presented in Biography column then resulting would be their status.
The output should be as:
City Matched Name Status
AB Joseph T. Andrew Lawyer
Considering the folowing data table, can anyone guide me how to code it in SAS please?
City Code Biography Names Status
AB Mr. Joseph Andrew is hardworking person. Blanc Angel Teacher
AB Ms. Angela Blanc is genius. Albert Crew Doctor
AB Mrs. Sarah Greene was absent. Rubin Ervin Engineer
AB Mr. Rubin D. Elvin cleans the room. Joseph T. Andrew Lawyer
AC Ms. Luis Fernandez is a teacher. Luis V. Bob Doctor
AC Mr. Jose Antonio works from home. J. Antonio Teacher
AC Mr. Luis Van Bob comes late. Luis Fernandez Lawyer
AD Ms. Maria Jose travelled last week. Accent Andiron Worker
AD Mr. Philip Bursa will help you. Maria Jose Antonio Student
For the data that you have posted, what would the results that you want look like?
Jim
Please post the data you have in usable form and exactly what you expect as result - again a data step using datalines is the preferred way, but a table will do also. Then explain for each observation in the expected result why that row was added.
Hello Experts, I would be grateful if anyone can help me solving this issue.
I have the following two datasets.
In DATASET-I, I have two variables, one is City Codes (ED=Edmonton, TO= Toronto and CA= Calgary) and the second is Biography of different persons.
In DATASET-II, I have three variables, one is City Code (same as dataset-I), second is the Names and third is their respective Status.
DATASET-I DATASET-II
Code Biography Code Names Status
ED Mr. Joseph Andrew is hardworking person. ED Blanc Angela Teacher
ED Ms. Angela Blanc is genius. ED Albert Crew Doctor
ED Mrs. Sarah Greene was absent. ED Joseph T. Andrew Lawyer
ED Mr. Rubin D. Elvin cleans the room. TO Luis V. Bob Doctor
TO Ms. Luis Fernandez is a teacher. TO J. Antonio Teacher
TO Mr. Jose Antonio works from home. CA Accent Andiron Worker
TO Mr. Luis Van Bob comes late. CA Maria Jose Antonio Student
CA Ms. Maria Jose travelled last week.
CA Mr. Philip Bursa will help you.
What I want?
First. I want to compare the names available inDATASET-II with the names available in Biography of DATASET-I and identify the city-wise names those sound similar or common in both Datasets. The unmatched names will be skipped.
Second, I want to extract those city wise common names and assign their respective status, e.g., the output can be as:
DATAOUTPUT
Code Common Names Status
ED Joseph T. Andrew Lawyer
ED Angela Blank Teacher
TO Luis Van Bob Doctor
TO Jose Antonio Teacher
CA Maria Jose Antonio Studnet
Can anyone guide me how to code it in SAS please?
Please don't double post questions!
It will be more difficult to match -- reliably -- something like "Jose Antonio" with "J Antonio". However, I have written some code that will do basic text matching. It still needs a bit of tuning. If you are a good SAS coder, you can adjust the code as you see fit. I'll in include the code below. Otherwise, I'll try to spend some more time with it on Thursday or Friday.
Jim
Results, immediately below. Luis Van Bob is incorrect. That case needs more work.
Code:
DATA Biographies;
INPUT City_Code & : $2.
Biography & : $256.
;
DATALINES;
ED Mr. Joseph Andrew is hardworking person.
ED Ms. Angela Blanc is genius.
ED Mrs. Sarah Greene was absent.
ED Mr. Rubin D. Elvin cleans the room.
TO Ms. Luis Fernandez is a teacher.
TO Mr. Jose Antonio works from home.
TO Mr. Luis Van Bob comes late.
CA Ms. Maria Jose travelled last week.
CA Mr. Philip Bursa will help you.
;
RUN;
DATA Name_And_Status;
DROP _:;
RETAIN _Max_Names 0;
INPUT City_Cd & : $2.
Name & : $64.
Status & : $32.
;
_Rec_Cnt + 1;
Name_Cnt = COUNTW(Name, ' ');
IF Name_Cnt > _Max_Names THEN
_Max_Names = Name_Cnt;
CALL SYMPUTX('Rec_Cnt', _Rec_Cnt, 'G');
CALL SYMPUTX('Max_Names', _Max_Names, 'G');
DATALINES;
ED Blanc Angela Teacher
ED Albert Crew Doctor
ED Rubin Elvin Engineer
ED Joseph T. Andrew Lawyer
TO Luis V. Bob Doctor
TO J. Antonio Teacher
TO Luis Fernandez Lawyer
CA Accent Andiron Worker
CA Maria Jose Antonio Student
;
RUN;
DATA Matched_Employment_Status;
DROP _:;
DROP City_Cd;
DROP Name;
DROP Status;
DROP Name_Cnt;
ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
ARRAY Statuses [&Rec_Cnt] $32 _TEMPORARY_;
ARRAY Codes [&Rec_Cnt] $2 _TEMPORARY_;
ARRAY Match [&Rec_Cnt] 3 _TEMPORARY_;
IF _N_ = 1 THEN
DO;
DO UNTIL (_End_Of_Names);
SET Name_And_Status END = _End_Of_Names;
_Name + 1;
Statuses[_Name] = Status;
Codes [_Name] = City_Cd;
DO _i = 1 TO Name_Cnt;
Names[_Name, _i] = COMPRESS(SCAN(Name, _i),,'KA');
* PUTLOG "&Nte1 " _Name= _i= Names[_Name, _i]= Statuses[_Name]= Codes[_Name]= ;
END;
END;
END;
* STOP;
SET Biographies;
PUTLOG "&Nte2 ";
_Match_Cnt = 0;
_Token_Cnt = COUNTW(Biography, ' ');
PUTLOG "&Nte1 " _N_= _Token_Cnt=;
DO _i = 1 TO _Token_Cnt;
DO _j = 1 TO &Rec_Cnt;
DO _k = 1 TO &Max_Names;
_Bio = COMPRESS(SCAN(Biography, _i),,'KA');
IF City_Code = Codes[_j] AND
SCAN(Biography, _i) = Names[_j, _k] THEN
DO;
_Match_Cnt + 1;
Match[_Match_Cnt] = _j;
END;
PUTLOG "&Nte2 " _N_= _i= _j= _k= _Bio= Statuses[_j]= Names[_j, _k]= _Match_Cnt=;
END;
IF _Match_Cnt >= 2 THEN
DO;
_Match = 1;
DO _l = 1 TO (_Match_Cnt - 1);
IF Match[_l] ^= Match[_l + 1] THEN
_Match = 1;
END;
IF _Match THEN
DO;
Employ_Status = Statuses[_j];
_j = &Rec_Cnt;
_i = _Token_Cnt;
END;
END;
END;
END;
RUN;
OK, a little sleep, and I spotted the deficiency in my program. I now have it matching to the correct people:
As I say, though, when your Dataset II (the master definition of who has what status) has only a first initial, it's going to be difficult to match against. It will be difficult to distinguish a J. Antonio from a Maria Jose Antonio, although a secondary match on City Code helps tremendously. There is always the danger of a false positive, i.e. associating someone with a Status that is not really theirs.
I set my matching to require two matches. For example, for the name "Luis Van Bob," you could have up to three matches ("Luis", "Van", and "Bob"). There has to be a minimum of two matches for the program to consider the match to be a positive name match. When trying to match, for example, "Luis Fernandez" with "Luis Van Bob", the match will be = 1 because only "Luis" matches. The program does not consider this to be a name match. However "Luis V Bob" and "Luis Van Bob" would be = 2 ("Luis" and "Bob"), and therefore the program would consider this a name match.
You could still get false positives, but city matching helps much in this regard. For example, Jose Antonio and Maria Jose Antonio could easily be falsely matched, but since Jose Antonio is from Toronto and Maria Jose Antonio is from Calgary, they will not be matched.
WARNING: Even matching on both city and name may still give some false positives. For really common names, there absolutely could be two people with the same first and last names in a large city. Name matching is never 100%.
We can increase the number of matches if we use something like the COMPGED function or the SPEDIS function, but this also increases our chances of false positives. See https://support.sas.com/resources/papers/proceedings14/1674-2014.pdf
The SAS code is below.
Jim
DATA Biographies;
INPUT City_Code & : $2.
Biography & : $256.
;
DATALINES;
ED Mr. Joseph Andrew is hardworking person.
ED Ms. Angela Blanc is genius.
ED Mrs. Sarah Greene was absent.
ED Mr. Rubin D. Elvin cleans the room.
TO Ms. Luis Fernandez is a teacher.
TO Mr. Jose Antonio works from home.
TO Mr. Luis Van Bob comes late.
CA Ms. Maria Jose travelled last week.
CA Mr. Philip Bursa will help you.
;
RUN;
**------------------------------------------------------------------------------**;
DATA Name_And_Status;
DROP _:;
RETAIN _Max_Names 0;
INPUT City_Cd & : $2.
Name & : $64.
Status & : $32.
;
_Rec_Cnt + 1;
Name_Cnt = COUNTW(Name, ' ');
IF Name_Cnt > _Max_Names THEN
_Max_Names = Name_Cnt;
CALL SYMPUTX('Rec_Cnt', _Rec_Cnt, 'G');
CALL SYMPUTX('Max_Names', _Max_Names, 'G');
DATALINES;
ED Blanc Angela Teacher
ED Albert Crew Doctor
ED Rubin Elvin Engineer
ED Joseph T. Andrew Lawyer
TO Luis V. Bob Doctor
TO J. Antonio Teacher
TO Luis Fernandez Lawyer
CA Accent Andiron Worker
CA Maria Jose Antonio Student
;
RUN;
**------------------------------------------------------------------------------**;
DATA Matched_Employment_Status;
DROP _:;
DROP City_Cd;
DROP Name;
DROP Status;
DROP Name_Cnt;
ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
ARRAY Statuses [&Rec_Cnt] $32 _TEMPORARY_;
ARRAY Codes [&Rec_Cnt] $2 _TEMPORARY_;
ARRAY Match [&Rec_Cnt] 3 _TEMPORARY_;
IF _N_ = 1 THEN
DO;
DO UNTIL (_End_Of_Names);
SET Name_And_Status END = _End_Of_Names;
_j + 1;
Statuses[_j] = Status;
Codes [_j] = City_Cd;
DO _k = 1 TO Name_Cnt;
Names[_j, _k] = COMPRESS(SCAN(Name, _k),,'KA');
END;
END;
END;
SET Biographies;
CALL MISSING (of Match[*]);
_Token_Cnt = COUNTW(Biography, ' ');
DO _i = 1 TO _Token_Cnt;
DO _j = 1 TO &Rec_Cnt;
DO _k = 1 TO &Max_Names;
_Bio = COMPRESS(SCAN(Biography, _i),,'KA');
IF City_Code = Codes[_j] AND
SCAN(Biography, _i) = Names[_j, _k] THEN
DO;
Match[_j] + 1;
END;
END;
END;
END;
_Max_Match = 0;
DO _j = 1 TO &Rec_Cnt;
IF Match[_j] >= 2 THEN
DO;
IF Match[_j] > _Max_Match THEN
DO;
Employ_Status = Statuses[_j];
_Max_Match = Match[_j];
END;
END;
END;
RUN;
Thank you for guiding. The code works for the variables you have created but unfortunately I am stuck in creation of temporary arrays for the solution of my dataset. I find the following message in the log.
ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.
ERROR 76-322: Syntax error, statement will be ignored.
Please run the following code;
%PUT NOTE: &=Rec_Cnt &=Max_Names;
Before the data step with this code runs:
ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
Rec_Cnt and Max_Names need to be set to positive integers for the arrays to work. It looks like Rec_Cnt is not being set.
Please post the log, and please use the following buttons here on the forum for posting the log:
Jim
Thank you again for your prompt support. I am using SAS 9.4 but unfortunately, the code is still not working on it. It looks like I am commmitting any mistake but whta is that I dont know. Please see the following log as reference.
1 DATA Matched_Employment_Status;
2 DROP _:;
3 DROP City_Cd;
4 DROP Name;
5 DROP Status;
6 DROP Name_Cnt;
WARNING: Apparent symbolic reference REC_CNT not resolved.
WARNING: Apparent symbolic reference MAX_NAMES not resolved.
7 %PUT NOTE: &=Rec_Cnt &=Max_Names;
NOTE: Rec_Cnt Max_Names
WARNING: Apparent symbolic reference REC_CNT not resolved.
WARNING: Apparent symbolic reference MAX_NAMES not resolved.
8 ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: an integer constant, *.
ERROR 76-322: Syntax error, statement will be ignored.
You have to create those two macro variables when you read in the biographical information. They are created in this Data step:
DATA Name_And_Status;
DROP _:;
RETAIN _Max_Names 0;
INPUT City_Cd & : $2.
Name & : $64.
Status & : $32.
;
_Rec_Cnt + 1;
Name_Cnt = COUNTW(Name, ' ');
IF Name_Cnt > _Max_Names THEN
_Max_Names = Name_Cnt;
CALL SYMPUTX('Rec_Cnt', _Rec_Cnt, 'G');
CALL SYMPUTX('Max_Names', _Max_Names, 'G');
DATALINES;
ED Blanc Angela Teacher
ED Albert Crew Doctor
ED Rubin Elvin Engineer
ED Joseph T. Andrew Lawyer
TO Luis V. Bob Doctor
TO J. Antonio Teacher
TO Luis Fernandez Lawyer
CA Accent Andiron Worker
CA Maria Jose Antonio Student
;
RUN;
Here are the two key statements:
CALL SYMPUTX('Rec_Cnt', _Rec_Cnt, 'G');
CALL SYMPUTX('Max_Names', _Max_Names, 'G');
Without those two statements, the macro variables will not be created, and the code will not run in the main Data step. The arrays use the values stored in Rec_Cnt and Max_Names to dimension the arrays, and the subsequent SAS code uses them also to control the matching. Without those two macro variables, the code will never work.
ARRAY Names [&Rec_Cnt, &Max_Names] $32 _TEMPORARY_;
Jim
Thank you for your support. I run the code and it did not show any error in the log nor it generated any result. It would be two hours that it is still processing it. I have two datasets with 1.6 million and 100k observations each. Is not any optimize solution?
So your name and status dataset has 100K records? And you have 16 million biographies? How many words are in each biography?
These are very large files for this type of matching. I would want to break the files up otherwise it will take forever to run.
I would probably separate all files into individual files by city code. You don't want to match someone in Edmonton to someone in Vancouver. They're not one and the same. After you separate the files into files for each city, run a separate job for each major city. Small cities could be combined. Run a small test file first perhaps, but then run all of the jobs simultaneously (i.e. in parallel). Parallel processing is probably the only way to speed something of this magnitude up.
Again though, be careful. You are probably going to get multiple people with the exact same name in large cities. Matching by name is never 100%. It's always best to match on a unique ID number whenever possible. You probably already know this, but it bears repeating. Be careful and expect false matches.
Jim
Hey Jim, thank you for being so supportive. I have splitted the datasets into chunks of maximum 2-5k observations each, still SAS takes minimum 15 mins to process these observations. I have tried this code on three different datasets but unfortunately the required results are not generated. For example, I get two columns from the biography dataset (city code & biography) and the third matching column (employement status) which is found empty, as there is no match.
If you can try to look the issues, should I share my data?
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.