BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
jacko1801
Obsidian | Level 7

I am trying to count words within the 'name' column of some data that i have already cleaned up. 

 

I know i need to use COUNTW but all the examples use datalines and not infile or set commands. 

 

I am new to all this and this is for a university assessment. 

 

Can anyone give me an idea on how i can count the words within a singular column within the table instead of all the words within the table. 

 

This is the code i have so far and it does work but shows me all the columns together including the headings and i cannot work out why. 

 

 

 PROC IMPORT 
  DATAFILE = '/home/cw2/imdb-videogames.csv'
  OUT = Videogame_data
  DBMS = CSV 
  REPLACE;
RUN;

DATA Videogame_data;
  SET Videogame_data;
  IF NOT MISSING(rating) AND NOT MISSING(votes) AND NOT MISSING(certificate);
RUN;

PROC SORT DATA=Videogame_data OUT=temp_videogamedata NODUPKEY;
  BY name;
RUN;

DATA Videogame_data;
  SET temp_videogamedata;
RUN;

data Title_Word_Count;
infile 'temp_Videogamedata';
   length blanks mp 8;
   input string $char60.;
   Number_of_Words=countw(string);
   blanks=countw(string, ' ');
   mp=countw(string, , 'mp');
;
run;

proc print noobs data=Title_Word_Count;
run;

 

The whole point to this is to count the number of words for each row of the column 'name' then create a scatterplot and find the average number of words in name (we have to use a graph for every section we create)

 

Thank you in advance for your help

 CountW function 

1 ACCEPTED SOLUTION

Accepted Solutions
jacko1801
Obsidian | Level 7

Thank you, i don't know how really but this seems to now work and shows me the correct data. 

 

Again all we have been taught is simple PROC steps and that is it and the lecturer does not aid us if we get stuck on aspects he is expecting from us.  

 

just fyi this is what i now have and works perfectly for the assessment, i want to thank you all @PaigeMiller @Kurt_Bremser @Tom for your help. Everyone has to start somewhere right?

 

PROC IMPORT 
  DATAFILE = '/home/cw2/imdb-videogames.csv'
  OUT = Videogame_data
  DBMS = CSV 
  REPLACE;
RUN;

DATA Videogame_data;
  SET Videogame_data;
  IF NOT MISSING(rating) AND NOT MISSING(votes) AND NOT MISSING(certificate);
RUN;

PROC SORT DATA=Videogame_data OUT=temp_videogamedata NODUPKEY;
  BY name;
RUN;

DATA Videogame_data;
  SET temp_videogamedata;
RUN;


data titles;
set Videogame_data; 
words = countw(name,' ');
run;

PROC PRINT data=titles;
var name words;
run;

Ive got to admit that i think the example on the SAS helpbook is what confused me the most so again thank you for your help 🙂

View solution in original post

12 REPLIES 12
PaigeMiller
Diamond | Level 26

When code doesn't work, show us the log for the step where it isn't working. Please copy the log for that step (all the lines in the log for that step) and paste it into the window that appears when you click on the </> icon.

 

PaigeMiller_0-1663012019648.png

--
Paige Miller
jacko1801
Obsidian | Level 7

Screenshot 2023-04-15 at 12.05.10.png

It isn't giving any errors however it is counting all strings across all columns of the table its meant to be giving me here are the two images of the output data - the first image is the cleaned up data (below this text) and the second is the CountW (above this text) - i just want it to count the name column not all rows of data. if that makes sense. 

Screenshot 2023-04-15 at 12.06.48.png

 

Thank you again 

PaigeMiller
Diamond | Level 26

Your explanation still isn't sufficient.

 

I do not understand how the 2nd screen capture and the first screen capture are related to one another. You don't even have the same variables in the two screen captures.

 

Regarding the first screen capture, where you show a variable "Number_of_Words", what is the problem? What is wrong? Please explain.

--
Paige Miller
jacko1801
Obsidian | Level 7

Okay so for our assessment we need to do an analysis that has to be fully automated (so he said that you click run once and all the data, graphs and tables have to show up), this is group work and we are all doing different analysis's on the same data set.

 

The first aspect we had to do was clean up the data using SAS, so the first image is the video game data cleaned up - basically all empty cells and duplication's deleted.

 

From then I need to analyse the average number of words of a video game title so, i need to do a COUNTW (from my understanding) to count the amount of words within the string. I only need to count the words within the 'name' column not the entire table. The variable of "Number_of_Words" is the string column where the amount of words within the string is presented.

 

I need to know how I can select the name column from the cleaned up data called "temp_videogamedata" and do a COUNTW only for the column called "name", as then i will create a scatterplot from the number of words within each video game name and find the average number of words per name.

 

The images are the cleaned up data and the image with "Number_of_Words" is the result i'm getting back from the COUNTW coding

 

I hope this now makes a bit more sense, i'm new to all of this and not something i intend to pursue after my university career and my university lecturer is no help with this at all.

 

PaigeMiller
Diamond | Level 26

The images are the cleaned up data and the image with "Number_of_Words" is the result i'm getting back from the COUNTW coding

 

I previously requested you explain what is wrong with "Number_of_Words". I don't see an explanation of what is wrong.

 

In addition, it does not clear things up, it simply confuses me more, when your code uses

 

Number_of_Words=countw(string);

 

but your text says you want to count the number of words for variable NAME.

--
Paige Miller
jacko1801
Obsidian | Level 7

I don't see anything wrong with "Number_of_Words" i just have used the example within the SAS helpbook and changed it from default to Number_of_Words  

 

My understanding of where it says Number_of_Words was it is just the title for that column? i thought this as it still gives the result that is partially correct - it give me the number of words it's seeing in the string column.

 

I'm not saying the process is wrong as it is displaying the right thing however:

 

Within the String Column is has the data from all the columns of the cleaned up data therefore instead of just counting the words within the name column its counting all the words across all columns per row  

 

It is even counting the headings of the columns as having 9 words (the first result of the image below)  

 

I need this to just count the column name within temp_videogamedata - where this is showing me all the columns of temp_videogamedata  

 

Number_of_Words from my understanding is just the heading of the column - within the helpbook it was called "default"  

 

Again i apologise im really trying to explain this the best i can and i can see it becoming frustrating.

Screenshot 2023-04-15 141643.png

 

This is the example i am using to help me:

data test;
   length default blanks mp 8;
   input string $char60.;
   default=countw(string);
   blanks=countw(string, ' ');
   mp=countw(string, , 'mp');
   datalines;
The quick brown fox jumps over the lazy dog.
2+2=4
//unix/path/names/use/slashes
\\Windows\Path\Names\Use\Backslashes
;
run;

proc print noobs data=test;
run;

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lefunctionsref/p18xi2516ihygyn1qg1b1nby326k.h...

 

i have just taken out the bits i didn't think i need and change bits - there are no example of how to do a COUNTW that isn't from datalines and then focusing only on one column within a dataset  

 

PaigeMiller
Diamond | Level 26

How about this:

 

data Title_Word_Count;
   set temp_Videogamedata;
   length blanks mp 8;
   input string $char60.;
   Number_of_Words=countw(string);
   blanks=countw(string, ' ');
   mp=countw(string, , 'mp');
run;
--
Paige Miller
jacko1801
Obsidian | Level 7

i'm getting this in the log: 

 

69         data Title_Word_Count;
 70            set temp_Videogamedata;
 ERROR: File WORK.TEMP_VIDEOGAMEDATA.DATA does not exist.
 71            length blanks mp 8;
 72            input string $char60.;
 73            Number_of_Words=countw(string);
 74            blanks=countw(string, ' ');
 75            mp=countw(string, , 'mp');
 76         run;
 
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.TITLE_WORD_COUNT may be incomplete.  When this step was stopped there were 0 observations and 4 
          variables.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              641.96k
       OS Memory           22692.00k
       Timestamp           04/15/2023 01:25:20 PM
       Step Count                        38  Switch Count  2
       Page Faults                       0
       Page Reclaims                     105
       Page Swaps                        0
       Voluntary Context Switches        12
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           264

and this in the results:

Screenshot 2023-04-15 142902.png

 

So it still is collecting all data from every column instead of just the "name" column   

Kurt_Bremser
Super User
infile 'temp_Videogamedata';

This will search for an external file (not a SAS dataset) called temp_Videogamedata (no extension) in the current working directory of your SAS session (most likely the directory from where SAS was started by the WorkspaceServer script). In the context of your other code, I am quite confident that this has never worked.

You cannot use a SAS dataset in this way (using INFILE); if you want to use the physical name instead of library.dataset (in a SET statement or DATA= option!), you need to supply the fully qualified path and the .sas7bdat extension.

jacko1801
Obsidian | Level 7

okay i get what you mean, however it has been working otherwise it wouldnt give me any results as im only running the coding for the COUNTW -

 

Even if i change it to infile  '/home/cw2/imdb-videogames.csv';

 

It still is putting all the columns data in the string column - how do i just make it collect the column "name"

 

as ive tried the select method with variables 

and ive tried just putting it in brackets after the infile statements

and ive tried the COL= method

 

None just focus on the name column 

 

So how would i write a COUNTW Code using the infile statement of the original csv file but only counting the words in the "name" column OR how would i write the code for temp_videogamedata only counting the words in the "name" column.

 

Thank you for your help again  

Tom
Super User Tom
Super User

@jacko1801 wrote:

okay i get what you mean, however it has been working otherwise it wouldnt give me any results as im only running the coding for the COUNTW -

 

Even if i change it to infile  '/home/cw2/imdb-videogames.csv';

 

It still is putting all the columns data in the string column - how do i just make it collect the column "name"

 

as ive tried the select method with variables 

and ive tried just putting it in brackets after the infile statements

and ive tried the COL= method

 

None just focus on the name column 

 

So how would i write a COUNTW Code using the infile statement of the original csv file but only counting the words in the "name" column OR how would i write the code for temp_videogamedata only counting the words in the "name" column.

 

Thank you for your help again  


You seem to be obsessing on the wrong issue.

First get the data into a SAS dataset.  If you have CSV file then you can LOOK at the file and see what variables it should have because a CSV file is just a plain old text file.  While looking at it make sure it is actually a comma separated values file and not some other type of delimited file (or perhaps not a delimited file at all).

 

Once you have a dataset with a variable named NAME then you can use the COUNTW() function on it.  Let's assume that have succeeded in created a dataset named HAVE that contains the character variable NAME then you could use a data step like this to create a new dataset named WANT with a new variable name WORDS that counts how many words are in NAME when using only spaces as the delimiter between the words.

 

data want;
  set have;
  words =countw(name,' ');
run;
jacko1801
Obsidian | Level 7

Thank you, i don't know how really but this seems to now work and shows me the correct data. 

 

Again all we have been taught is simple PROC steps and that is it and the lecturer does not aid us if we get stuck on aspects he is expecting from us.  

 

just fyi this is what i now have and works perfectly for the assessment, i want to thank you all @PaigeMiller @Kurt_Bremser @Tom for your help. Everyone has to start somewhere right?

 

PROC IMPORT 
  DATAFILE = '/home/cw2/imdb-videogames.csv'
  OUT = Videogame_data
  DBMS = CSV 
  REPLACE;
RUN;

DATA Videogame_data;
  SET Videogame_data;
  IF NOT MISSING(rating) AND NOT MISSING(votes) AND NOT MISSING(certificate);
RUN;

PROC SORT DATA=Videogame_data OUT=temp_videogamedata NODUPKEY;
  BY name;
RUN;

DATA Videogame_data;
  SET temp_videogamedata;
RUN;


data titles;
set Videogame_data; 
words = countw(name,' ');
run;

PROC PRINT data=titles;
var name words;
run;

Ive got to admit that i think the example on the SAS helpbook is what confused me the most so again thank you for your help 🙂

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 1560 views
  • 1 like
  • 4 in conversation