CountW not working correctly across string

Accepted Solution Solved
Reply
Super Contributor
Posts: 418
Accepted Solution

CountW not working correctly across string

Hello everyone. I am trying to write a program that will read the first row of data within a csv and then prodcue all of the variable names into a dataset. I have come up with the following code (with help from other users).

data aswer;

infile "g:\yourdatafile" dsd lrecl=32000 truncover obs=1;

length varnumCSV 8 name $2000.;

input @;do varnumcsv=1 to countw(_infile_,',','MQ');

input name @;

output;

end;

run;

The issue is the following. I have some csv files that have quotation marks within their values. Since I have the Q modifier specified, these files are not reading the correct number of distinct observations and thus are wrong.

Example

Data file Line 1:

Loannumber, John's comments, processed answer,client,type

The value of the countw(_infile_,',','MQ') in the above example is only giving 1 because we have an unmatched quotation mark.  I am not sure how to get around this problem, because if I have a comma within  comment I need it to NOT be seperated, however if I have a quotation mark I need it to not count... AkA I need the below results for my data.

Data file Line 1:

Loannumber, John's comments, processed answer,client,type

Answer 5

Data file 2 Line 1:

Loannumber, Johns didnt do this, comments, processed answer,client,type

Answer 5

The second portion of the data is all within one Cell in Excel. Therefore I want to ignore that comma.

Basically I need to ignore commas inside of single excel (csv) cells, however I need to also remove all quotations marks so the 'Q' option in the countw works.

Does anyone have any ideas on this that could prove helpful?

Thanks!

Brandon


Accepted Solutions
Solution
‎11-06-2013 04:23 PM
Super Contributor
Posts: 253

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Why not use DSD to do the work for you?  That's supposed to be how countw/MQ works, but if it doesn't, DSD certainly does.

data test;

infile datalines dlm=',' dsd;

input @@;

words = countw(_infile_,',','MQ');

input a  b $ c d e;

put _all_;

datalines;

1,Hello,3,4,5

1,"Hi, Jack",3,4,5

1,Don't,3,4,5

;;;;

run;

So, use the first row (instead of normally firstobs=2, obs=1) and just input it iteratively - like so:

data test;

infile datalines dlm=',' dsd end=eof ; *normally put obs=1 here, not doing so for demonstration purposes;

input name $ @@;

put _all_;

datalines;

1,Hello,3,4,5

1,"Hi, Jack",3,4,5

1,Don't,3,4,5

;;;;

run;

Also, if you're confident you know how MQ options work on countw (I think so, but don't have time to check), open a track with tech support - that doesn't look like proper behavior to me.

View solution in original post


All Replies
Super User
Posts: 5,499

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Do you have the option of saving your spreadsheet as a tab-delimited file instead of a comma-delimited file?

PROC Star
Posts: 7,468

Re: CountW not working correctly across string

Posted in reply to Anotherdream

If the answer to your second example is really 6, then I think that the following code accomplishes what you expect:

data aswer;

infile cards dsd lrecl=32000 truncover;

length varnumCSV 8 name $2000.;

input @;

do varnumcsv=1 to countw(_infile_,',','T');

  input name @;

  output;

end;

cards;

Loannumber, John's comments, processed answer,client,type

Loannumber, Johns didnt do this, comments, processed answer,client,type

;

run;

Super Contributor
Posts: 418

Re: CountW not working correctly across string

Hello Arthur. THe answer to the second question is actually 5, the comment in bold (Johns didn't do this, comment) is actually one field within the csv file.

Hiya astounding. Sadly no I cannot change the data inputs to tab delimited.

The only solution I have found is to compress ' out of the first line of the file in the read in, and then do the analysis the way I have it.... I'm not sure I like this solution tho

PROC Star
Posts: 7,468

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Please explain!  How can one know that comments is attached to the earlier part of the string?

Solution
‎11-06-2013 04:23 PM
Super Contributor
Posts: 253

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Why not use DSD to do the work for you?  That's supposed to be how countw/MQ works, but if it doesn't, DSD certainly does.

data test;

infile datalines dlm=',' dsd;

input @@;

words = countw(_infile_,',','MQ');

input a  b $ c d e;

put _all_;

datalines;

1,Hello,3,4,5

1,"Hi, Jack",3,4,5

1,Don't,3,4,5

;;;;

run;

So, use the first row (instead of normally firstobs=2, obs=1) and just input it iteratively - like so:

data test;

infile datalines dlm=',' dsd end=eof ; *normally put obs=1 here, not doing so for demonstration purposes;

input name $ @@;

put _all_;

datalines;

1,Hello,3,4,5

1,"Hi, Jack",3,4,5

1,Don't,3,4,5

;;;;

run;

Also, if you're confident you know how MQ options work on countw (I think so, but don't have time to check), open a track with tech support - that doesn't look like proper behavior to me.

Super Contributor
Posts: 418

Re: CountW not working correctly across string

Posted in reply to snoopy369

Sorry ARthur, the lines (in the csv file when opened in a text) are as below.

Loannumber, John's comments, processed answer,client,type

Loannumber, "Johns didnt do this, comments", processed answer,client,type

So SAS is smart enough to know the second comma isn't actually a comma because it's in double quotes. However the system is NOT putting the John's comments in double quotes. Therefore it is thinking that the single quote is an invalid quote string, and causing the program to fail.

Interestingly enough,even if I do a proc import on this datafile it crashes my SAS.

Snoopy I have dsd specified within my infile statement.

Super Contributor
Posts: 339

Re: CountW not working correctly across string

Posted in reply to Anotherdream

You could use a clever (aka difficult to write) PERL Regex and loop with call prxnext to read all "cells" as one row of your output. It should work for so long as unbalanced quotations are different quotes (so you could have "John's" not cause failures but "He said: "hey you!"" would get tedious)

trying to think of the regex and I'll get back to you with something to try later

ok regex try1

myregex=PRXPARSE('m/([^",]*,)|("[^"]*",)/');

logic: "capture anything that's not a dquote or a comma followed by a comma" OR "capture a dquote, followed by anything that's not a dquote(including commas) followed by a dquote and then by a comma".

How to use it - first, you will need to pad a comma at the end of your input string or else the PRX will fail to read the last variable and it's an easier fix than doing some additionnal conditionning

data aswer;

infile "g:\yourdatafile" dsd lrecl=32000 truncover obs=1;

length varnumCSV 8 name $2000.;

input @;

name=trim(name)||","; /* padded comma for last variable name*/

/* initiate PRXNEXT params and parse the regex */

myregex=PRXPARSE('m/([^",]*,)|("[^"]*",)/');

start=1;

stop=-1;

/* */

do until(pos=0);

     call prxnext(myregex, start, stop, name, pos, len);

     varname=strip(substr(name, pos, len-1)); /* len-1 to remove comma */

     output;

end;

run;

Above is my first attempt at providing a regex but it's likely missing some tweaks as it felt too simple. I'll test it real quick with some datalines set.

Super Contributor
Posts: 253

Re: CountW not working correctly across string

Posted in reply to Vince28_Statcan

Rather than add an additional comma, just add [,$] to check for end of string.

I don't have time to flesh out the full regex solution, as the general solution is really complex.  It uses lookbehind to deal with the embedded commas, if I remember correctly.  It should be available in a google search, though.

Super Contributor
Posts: 339

Re: CountW not working correctly across string

Posted in reply to snoopy369

[,$] shouldn't work as within [] classes, special character constructs lose their sense and are just plain character (although maybe if $ is specifically the last char of the class it works but I wouldn't think so since ^ isn't a start of string check but a NOT operator if it's the very first char of a class).

Furthermore, my issue is that of using PRXNEXT so I'd have to add a third and 4rth OR groups with the same matching groups () as above but with a $ ending instead of a , further complicating understanding of the regex to non-familiar users. I.e. for educational purposes, I think adding a comma is simpler -_-.

CSV files are more standard than just unformated files so the approach to handling DQuotes/commas is somewhat simplified. My PRX might not cover all the cases as I mentionned but it shouldn't need too crazy tweaks to handle the OPs task.

Super Contributor
Posts: 339

Re: CountW not working correctly across string

Posted in reply to Vince28_Statcan

Alright, after testing a few things and improving to remove the double quotes from output and not outputting and empty row here's how I would do it:

data have;
input @1 varnum 1. @3 name $75.;
datalines;
1 Loannumber, John's comments, processed answer,client,type                 
2 Loannumber, "Johns didnt do this, comments", processed answer,client,type 
;
run;

data answer;
*infile "g:\yourdatafile" dsd lrecl=32000 truncover obs=1;
*length varnumCSV 8 name $2000.;
*input @;
set have;
name=trim(name)||","; /* padded comma for last variable name*/
/* initiate PRXNEXT params and parse the regex */
myregex=PRXPARSE('m/([^",]*,)|("[^"]*",)/');
start=1;
stop=-1;
/* */
do until(pos=0);
     call prxnext(myregex, start, stop, name, pos, len);
     varname=strip(substrn(name, pos, len-1)); /* len-1 to remove comma */
  if prxmatch('m/("[^"]*")/', trim(varname))>0 then varname=substr(varname, 2, length(varname)-2);
  if varname NE "" then output;
end;
run;

You will probably need to play with empty spacing for the datalines statement to be read appropriately but the output turns out great. If you have any case that fails, just provide a string example that fails and I can modify the regex

Vince

Super Contributor
Posts: 253

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Did you read my answer?  I'm not saying 'include DSD in your infile', I'm saying to use DSD's properties to parse your strings.  Your use of DSD didn't accomplish anything from the point of view of string parsing, since you're parsing _infile_, which means DSD was pointless (it only affects things when you actually use input to read in your variables).

Super Contributor
Posts: 418

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Hiya Snoopy, I did read your answer, however it doesn't make sense to me. When I try to run your first section of code, I get 2 as the answer for words in the third line. THis is the exact problem I am trying to avoid as I want the answer to be 5 here, so I don't see how that helps.

When I run the second block I don't get a count at all, so not sure what you are trying to do here.

So I am confused by what you mean "use the properties of dsd to parse your string" .

The properties of DSD are, It allows two consectutive delimiters to represent a missing value, and it allows you to use delimiters within a double quoted field.

Aka "George Bush, Jr" would still recognize as one value.

The exact problem I have here is that I have a field that has a single quote ('), but the OVERALL field within the csv is not getting double quotes. THerefore sas things the single quote is the start of a quoted field, and all of the delimiters inside of it are no longer remaining as delimiters because of the DSD option.

however I need the DSD option because I do have fields that also have delimieters within it.


Hope that makes sense.

I will look into the regex solutions!

Super Contributor
Posts: 253

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Why do you care about the count?  My solution  (the first is not a 'solution', it's just a comparison showing countw doesn't work) gives you your result dataset without needing to count.

Super Contributor
Posts: 418

Re: CountW not working correctly across string

Posted in reply to Anotherdream

Oh I got ya. Sorry I am still a very novice at SAS, so I frankly don't understand how your second code is working. I didn't know that dsd would only work with an input statement and not an infile statement (especially since I had an input statement in my original code, so I'm still not sure why this is any different).

It does appear that this is working, all I would need to do is add an ordering variables to the process (the order matters, but that's easy).

Your answer is 100% correct, I will mark it and look into how this works (I'm guessing it has to do with the @@, but again am new to this still).

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 25 replies
  • 660 views
  • 0 likes
  • 6 in conversation