Solved: Re: How to delete observation with specific character and know where S...

Phil_NZ · Posted 01-19-2021 09:50 PM

Hi SAS Users!

Today while I try to delete some observations from my data, I face a filtering screen that I need to delete the observation if part of the NAME variable of the observation contains "PREF" or "TST UNITS" or "XXXXX". Can you please give me a hint to delete these observations or better to keep the observations of which NAME does not contain these words?

It is because I want the code to fit in the current code of mine as below:

data screen1;
   set arg_merge2;
   if TYPE_1 = "EQ" 
   /*SCREEN 1*/
   and INDC3 not in ('UTILS', 'BANKS', 'FINSV', 'RLEST', 'INSUR')
   /*SCREEN 2*/
   and MAJOR="Y" 
   /*SCREEN 3*/
   and 
/* deleting name condition*/
;
run;

And these words can exist in any place of the name, for example, APPLE PREF JAB or APPLE JAB PREF or PREF APPLE JAB.

And another further question if it is resolvable, is there any code to see what observation from dataset arg_merge2 has been deleted by SAS after a specific condition for example after the condition if TYPE_1 = "EQ"

Warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

ballardw · Posted 01-21-2021 06:12 PM

1. How come I retain the observations NOT contain any of these words rather than delete the observations containing any of these words (to make it compatible with my previous code)

I am not sure I understand this question.

I would have to see the other code for direct discussion.

The NOT in might work. But you will have to parse every single word and test that it is not in your list of values to reject. The stopping rule for when they are found and what do might be interesting to "not delete".

2. In the code above, I did not use upcase function because all the observations of ENAME are in upcase already. Apart from that, I also did not use function strip because the 'r' modifier in FINDW also satisfies the "removing leading or trailing blank" already. I am not sure if they are correct?

If you test the code on data and it works, go for it. The different functions are provided for different reasons. If you need word boundaries honored then those are the functions to use and needed options.

3. Regarding the word 'PF' in the array, can I use a double quotation to search for it? for example, in the data, I have such an example

I think what you are asking is "how do I search for a literal value that contains quotes".

The ancient and generally easiest is to place the value with quotes inside different quotes. I assume by your example you meant to search for 'B' as I find not example of 'PF'. You could have the value in the array definition as double quotes outside single quotes to include single quoted value: "'B'" Hard to see in the main message window.

data junk;
   x = "'B'";
   z = "A longer string containing 'B' in the middle.";
   q = findw(z,x);
run;

A slightly more esoteric version is to use 3 single quotes on either side:

data junk;
   x = '''B''';
   z = "A longer string containing 'B' in the middle.";
   q = findw(z,x);
   put "An example of using two quotes "" to display one";
run;

In many places two adjacent quotes will be treated as one for some purposes. The above will write a line to the log with one double quote character.

This technique is sometimes needed when you get into strings with mixed quotes that might otherwise create unbalanced quotes.

In the following code I would suggest adding a LEAVE instruction in the loop. If you have multiple words that match the current would write the same record multiple times to the deleted data set.

data screen1    (drop=row)
     deleted (keep=row)
;
   set arg_merge2;
   row=_n_;
   array delwords {84} $ 15 _temporary_ ('DUPLICATE', 'DUPL', 'DUP', 'DUPE','DULP', 'DUPLI',
'1000DUPL','XSQ','XET','ADR','GDR','PREFERRED','PF','PFD','PREF','PF','WARRANT','WARRANTS','WTS','WTS2','WARRT',
'DEB','DB','DCB','DEBT','DEBENTURES','DEBENTURE','RLST IT','INVESTMENT TRUST','INV TST','UNIT TRUST','UNT TST',
'TRUST UNITS','TST UNITS','TRUST UNIT','TST UNIT','UT','IT.','IT','500','BOND','DEFER','DEP','DEPY','ELKS','ETF','FUND','FD','IDX','INDEX','LP','MIPS','MITS','MITT','MPS','NIKKEI','NOTE',
'PERQS','PINES','PRTF','PTNS','PTSHP','QUIBS','QUIDS','RATE','RCPTS',
'RECEIPTS','REIT','RETUR','SCORE','SPDR','STRYPES','TOPRS','UNIT','UNT',
'UTS','WTS','XXXXX','YIELD','YLD','EXPIRED','EXPD','EXPIRY','EXPY');

do i= 1 to dim(delwords);
 if findw(ENAME,delword[i],'','eir') >0 then do; /* e:return the order of word in the string     i: ignore case letter     r: removes traling or leading delimiters*/
      output deleted;
      delete;
      Leave;
   end;
   else output screen1;
run;

View solution in original post

Reeza · Posted 01-19-2021 10:47 PM

Have you looked into the FIND and/or INDEX functions?

data screen1;
set arg_merge2;
if TYPE_1 = "EQ" 
/*SCREEN 1*/
and INDC3 not in ('UTILS', 'BANKS', 'FINSV', 'RLEST', 'INSUR')
/*SCREEN 2*/
and MAJOR="Y" 
/*SCREEN 3*/
and 
find(name, "PREF")=0 & find(name, "TST UNITS")=0 & find(name, "XXXXX")=0
;
run;

Phil_NZ · Posted 01-20-2021 12:12 AM

Hi @Reeza !

Thank you for your answer!

Can I ask is there any way to shorten the code like find (name, "PREF", "TST UNITS", "XXXXX")=0. I asked this question because there are around 40 words like that I need to delete.

Warm regards!

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

ballardw · Posted 01-20-2021 01:37 AM

I'll show one approach to the second bit first to "track" which ones are deleted.

First simplify your requirement because I'm not sure I understand exactly what your first requirement actually may be.

Suppose for simplicity sake you want to delete when the value of X is 1 , 3 or 8.

data have;
   input x;
datalines;
1
27
44
16
22
8
;

data want    (drop=row)
     deleted (keep=row)
;
   set have;
   row=_n_;
   if x in (1,3,8) then do;
      output deleted;
      delete;
   end;
   else output want;
run;

_n_ is a SAS automatic temporary variable that keeps track of the iterations of the data step. In a simple Set situation like this it counts rows.

By explicitly using an output <dataset name> we can specify which of multiple data sets to write to.

The above will write only the Row number to the Deleted data set when the delete condition is met and everything else to the Want data set.

Optionally you could write all the data by removing either the drop or keep data set options., or subset the variables by adding to the Keep= or Drop= .

Now why I am not sure I understand your requirement.

First is the phrase : "if part of the NAME variable of the observation contains". Does this mean it must occur in the first 10 characters, middle 15 to 20th, or some other specific 'part' of the variable? If not, then the word "part" is not needed.

Second is when you start talking about the order it starts looking like you think that would be important. If order is not important than just membership if fine for meeting your requirement.

Third does word case matter? Is 'PREF' to cause deletion but not 'Pref" or 'pREF';

Fourth does just the string or an exact word matter: If PREF is found in "Preference" should it be deleted?

Since you show at least one condition that contains a SPACE in the middle, do you have any other non-letter characters? This goes into some of the possible choices of functions because if you require a word, only the target value, to be found we need to know which delimiters may be used to indicate words. For example your "TST UNITS" means that a very common delimiter of space cannot be used as a delimiter.

Some actual examples that behave like your real data would be nice.

Below is an example that assumes 1) the target string is of mixed case 2) you want find the target strings anywhere in the variable, not as an exact word and demonstrates finding any of multiple possible strings.

The extension of this particular approach would be increase the number of elements in the search array, make sure the length of the character setting, the $ 15, is long enough to hold the longest search string, and add all the search targets to the temporary array values as quoted comma delimited strings.

Use of the STRIP function with the array value s[i] is needed because otherwise trailing spaces to the defined length of the array variable would be used in the search criteria, i.e. "NOW " not just "NOW".

data example;
   x='Now is the time for all good men to come to the aid of their country.';
   array s {5} $ 15 _temporary_ ('NOW', 'TIME', 'FRED', 'CO','ALL GOOD');
   do i= 1 to dim(s);
      if index(upcase(x),strip(s[i]) ) > 0 then do;
         word=s[i];
         put Word= 'found in x';
      end;
   end;
run;

This example shows searching with the entire list. If you only need to know that once one is found you are done then instead of

      if index(upcase(x),strip(s[i]) ) > 0 then do;
         word=s[i];
         put Word= 'found in x';
      end;

you could use

      if index(upcase(x),strip(s[i]) ) > 0 then do;
           output deleted;
           delete;
           leave;
      end;

following the sort of output to the Deleted data set from the first example.

The Leave instruction in a do loop means to quit early and resume outside of the loop.

I included the example value of "CO" to show that partial matches are possible. So your exact values, both the target and searched string need to be examined if that is not desirable for CO to match Country. If not then more specific examples of what your data is may be needed. The INDEX function ignores any rules like "words" or punctuation, it just finds the string if present.

Phil_NZ · Posted 01-20-2021 03:18 PM

Dear @ballardw

First of all, by the highest respect, thank you for your dedicated explanation! I found a lot of new things from your suggestion.

1. It is the first time I know that we deal with two datasets in one data statement.

data want    (drop=row)
     deleted (keep=row)
;

and the way you track the deleted row by creating the dataset deleted and using the output statement to track is very straightforward to me.

However, I have a concern about the dataset deleted. I ran your code and it works well, just of my curiosity. Why the variable "row" appears in the dataset "want" (in case I don't active the function drop)? whether it is because when we run any data step, there are two automatically created variables called "_n_" AND "_ERROR_" which are hidden if you don't call them? And in this case, you assign them as variable "row". Please correct me if I understand you incorrectly.

2. Regarding the ambiguous description. Firstly, I am truly sorry that the way I deliver my idea is not that clear.

Regarding your first bullet, there is no fixed order of the "NAME". The word "part" here means that I want to exclude the observation where the name of this observation contains some specific "word".

Second, you are right, I just want to interpret it clearly that I did not confuse others about my idea, so yeah, just membership it then

Third, the word case matter, however, the situation can be simplified for the uppercase only because all observations of variable NAME are presented under uppercase format.

Fourth, just the string or exact word, meaning that "PREF" is found in "preference" will not be deleted.

Fifth, thank you for pointing it out about non-letter character point @ballardw . Yes, I have even number, quotation marks, as quick picture as below (column ENAME)

As mentioned by @ballardw , I also attached the data file, the variable NAME is at the "ENAME" column

I am wondering if you can do me a favor by adjusting the code based on the updated information.

Thank you in advance and best regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

ballardw · Posted 01-20-2021 03:56 PM

There are actually quite a number of potential automatic temporary variables depending on the code you use.

Why the variable "row" appears in the dataset "want" (in case I don't active the function drop)? whether it is because when we run any data step, there are two automatically created variables called "_n_" AND "_ERROR_" which are hidden if you don't call them? And in this case, you assign them as variable "row". Please correct me if I understand you incorrectly.

By default all data sets would have all of the non-temporary variables used in a data step in the output.

I use data set options to control which set has which variables. If you were to use the Keep or Drop statement then those statements apply to all the data sets, again by default. If you do not create a regular variable with the current value of any temporary variable then you can't see the values in the data sets.

Fourth, just the string or exact word, meaning that "PREF" is found in "preference" will not be deleted.

Fifth, thank you for pointing it out about non-letter character point @ballardw . Yes, I have even number, quotation marks, as quick picture as below

The statement that substrings aren't to be considered, the "PREF" in "PREFERENCE" means that the relatively simple INDEX function isn't going to work. INDEXW and FINDW functions find things on "word boundaries". So when you mix in values in the "word" with spaces or other delimiters other headaches arise.

Please check the documentation on functions FINDW and INDEXW and try making some changes your self.

I don't open XLSX files from sources I don't know.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

Phil_NZ · Posted 01-21-2021 05:30 PM

Hi @ballardw and SAS Users, I am back

Thank you for your suggestion yesterday, I did quite a bit of searches and end up with something as below

First off, regarding FINDW and INDEXW, I prefer to use FINDW because it is more updated compared to INDEXW.

So, I have a simple code to delete the observation of ENAME that contains any of specific words as below

data screen1;
set arg_merge2;

 array delwords {84} $ 15 _temporary_ ('DUPLICATE', 'DUPL', 'DUP', 'DUPE','DULP', 'DUPLI',
'1000DUPL','XSQ','XET','ADR','GDR','PREFERRED','PF','PFD','PREF',
''PF''
/*note here*/
,'WARRANT','WARRANTS','WTS','WTS2','WARRT',
'DEB','DB','DCB','DEBT','DEBENTURES','DEBENTURE','RLST IT','INVESTMENT TRUST','INV TST','UNIT TRUST','UNT TST',
'TRUST UNITS','TST UNITS','TRUST UNIT','TST UNIT','UT','IT.','IT','500','BOND','DEFER','DEP','DEPY','ELKS','ETF',
'FUND','FD','IDX','INDEX','LP','MIPS','MITS','MITT','MPS','NIKKEI','NOTE',
'PERQS','PINES','PRTF','PTNS','PTSHP','QUIBS','QUIDS','RATE','RCPTS',
'RECEIPTS','REIT','RETUR','SCORE','SPDR','STRYPES','TOPRS','UNIT','UNT',
'UTS','WTS','XXXXX','YIELD','YLD','EXPIRED','EXPD','EXPIRY','EXPY');

do i= 1 to dim(delwords);
  if findw(ENAME,delword[i],'','eir') >0 then delete;
end;
run;

And how to track

data screen1    (drop=row)
     deleted (keep=row)
;
   set arg_merge2;
   row=_n_;
   array delwords {84} $ 15 _temporary_ ('DUPLICATE', 'DUPL', 'DUP', 'DUPE','DULP', 'DUPLI',
'1000DUPL','XSQ','XET','ADR','GDR','PREFERRED','PF','PFD','PREF','PF','WARRANT','WARRANTS','WTS','WTS2','WARRT',
'DEB','DB','DCB','DEBT','DEBENTURES','DEBENTURE','RLST IT','INVESTMENT TRUST','INV TST','UNIT TRUST','UNT TST',
'TRUST UNITS','TST UNITS','TRUST UNIT','TST UNIT','UT','IT.','IT','500','BOND','DEFER','DEP','DEPY','ELKS','ETF','FUND','FD','IDX','INDEX','LP','MIPS','MITS','MITT','MPS','NIKKEI','NOTE',
'PERQS','PINES','PRTF','PTNS','PTSHP','QUIBS','QUIDS','RATE','RCPTS',
'RECEIPTS','REIT','RETUR','SCORE','SPDR','STRYPES','TOPRS','UNIT','UNT',
'UTS','WTS','XXXXX','YIELD','YLD','EXPIRED','EXPD','EXPIRY','EXPY');

do i= 1 to dim(delwords);
 if findw(ENAME,delword[i],'','eir') >0 then do;
 /* e:return the order of word in the string
     i: ignore case letter
     r: removes traling or leading delimiters
*/
      output deleted;
      delete;
   end;
   else output screen1;
run;

I am not sure if the written code is correct?

If they are correct, can I ask three further questions?

1. How come I retain the observations NOT contain any of these words rather than delete the observations containing any of these words (to make it compatible with my previous code)

2. In the code above, I did not use upcase function because all the observations of ENAME are in upcase already. Apart from that, I also did not use function strip because the 'r' modifier in FINDW also satisfies the "removing leading or trailing blank" already. I am not sure if they are correct?

3. Regarding the word 'PF' in the array, can I use a double quotation to search for it? for example, in the data, I have such an example

should I search for 'B' or ''B''.

4. I also add the data under the SAS data file under sas7bdat format, hopefully, that this dataset can satisfy your requirement

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

ballardw · Posted 01-21-2021 06:12 PM

1. How come I retain the observations NOT contain any of these words rather than delete the observations containing any of these words (to make it compatible with my previous code)

I am not sure I understand this question.

I would have to see the other code for direct discussion.

The NOT in might work. But you will have to parse every single word and test that it is not in your list of values to reject. The stopping rule for when they are found and what do might be interesting to "not delete".

2. In the code above, I did not use upcase function because all the observations of ENAME are in upcase already. Apart from that, I also did not use function strip because the 'r' modifier in FINDW also satisfies the "removing leading or trailing blank" already. I am not sure if they are correct?

If you test the code on data and it works, go for it. The different functions are provided for different reasons. If you need word boundaries honored then those are the functions to use and needed options.

3. Regarding the word 'PF' in the array, can I use a double quotation to search for it? for example, in the data, I have such an example

I think what you are asking is "how do I search for a literal value that contains quotes".

The ancient and generally easiest is to place the value with quotes inside different quotes. I assume by your example you meant to search for 'B' as I find not example of 'PF'. You could have the value in the array definition as double quotes outside single quotes to include single quoted value: "'B'" Hard to see in the main message window.

data junk;
   x = "'B'";
   z = "A longer string containing 'B' in the middle.";
   q = findw(z,x);
run;

A slightly more esoteric version is to use 3 single quotes on either side:

data junk;
   x = '''B''';
   z = "A longer string containing 'B' in the middle.";
   q = findw(z,x);
   put "An example of using two quotes "" to display one";
run;

In many places two adjacent quotes will be treated as one for some purposes. The above will write a line to the log with one double quote character.

This technique is sometimes needed when you get into strings with mixed quotes that might otherwise create unbalanced quotes.

In the following code I would suggest adding a LEAVE instruction in the loop. If you have multiple words that match the current would write the same record multiple times to the deleted data set.

data screen1    (drop=row)
     deleted (keep=row)
;
   set arg_merge2;
   row=_n_;
   array delwords {84} $ 15 _temporary_ ('DUPLICATE', 'DUPL', 'DUP', 'DUPE','DULP', 'DUPLI',
'1000DUPL','XSQ','XET','ADR','GDR','PREFERRED','PF','PFD','PREF','PF','WARRANT','WARRANTS','WTS','WTS2','WARRT',
'DEB','DB','DCB','DEBT','DEBENTURES','DEBENTURE','RLST IT','INVESTMENT TRUST','INV TST','UNIT TRUST','UNT TST',
'TRUST UNITS','TST UNITS','TRUST UNIT','TST UNIT','UT','IT.','IT','500','BOND','DEFER','DEP','DEPY','ELKS','ETF','FUND','FD','IDX','INDEX','LP','MIPS','MITS','MITT','MPS','NIKKEI','NOTE',
'PERQS','PINES','PRTF','PTNS','PTSHP','QUIBS','QUIDS','RATE','RCPTS',
'RECEIPTS','REIT','RETUR','SCORE','SPDR','STRYPES','TOPRS','UNIT','UNT',
'UTS','WTS','XXXXX','YIELD','YLD','EXPIRED','EXPD','EXPIRY','EXPY');

do i= 1 to dim(delwords);
 if findw(ENAME,delword[i],'','eir') >0 then do; /* e:return the order of word in the string     i: ignore case letter     r: removes traling or leading delimiters*/
      output deleted;
      delete;
      Leave;
   end;
   else output screen1;
run;

How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Re: How to delete observation with specific character and know where SAS delete these observations

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!