DATA Step, Macro, Functions and more

Import not cleaned text string column into a dataset

Reply
Frequent Contributor
Posts: 84

Import not cleaned text string column into a dataset

Hi,

 

How to import special characters (HIdden quotes,double quotes, next line) string around 2000 length in one of the column in csv file.(attached) 

 

Cust_id trans_type gender trans_date trans_amt Comments
11201 Entertainment F ######## $50.65 I Talked to customer, but he is not interested
11205 Grocery m ######## $90.00 1. he is not wiling
but he is will continue in future
2. he is good
11206 Travel m ######## $23.00 "not wiiling to continue" regards Diip
11206 Travel m ######## $23.00 "not talked


@@ regards hiip
Super User
Posts: 17,840

Re: Import not cleaned text string column into a dataset

Yuck. 

This is one of the few cases where I say keep the data in Excel since it will keep,relevant content in a single cell. 

 

Note this assumes you started with an Excel file since going to Excel from a CSV will have the same issues. 

Frequent Contributor
Posts: 84

Re: Import not cleaned text string column into a dataset

unfortunately we dont have access to import excel files. we need to convert either csv or text file. 

 

this is just sample data i have unstructured data with so much cooments wecan't controal that column..i need to load this data into a dataset.do you have any recommendations how can i import?

 

 

Super Contributor
Posts: 376

Re: Import not cleaned text string column into a dataset

Hi,

 

I created the attached txt file, simulating an export from Excel to CSV.

 

I imported it with this code:

 

data test;
   infile "\\UNC path to my file\test.txt" delimiter="09"x termstr="|" dsd;
   length col1 col2 $1000;
   input col1 col2;
run;
proc print;
run;

If your data has embedded CRLF, then use termstr to change the character that SAS thinks marks the end-of-record.  

 

Note:  usually Excel doesn't use CRLF to embed line feeds, but only LF (I think it's "0A"x).  So, you may not have to use termstr= or the pipe symbol to mark your end-of-record.

 

I have a Powershell script which runs a query against an Excel file and returns it as a CSV stream.  You can use an infile pipe to dynamically read the stream in a data step.  

 

You could modify it to write out your custom termstr character at the end of each line (something like SELECT [*,"|"] FROM [Sheet1$]).  

 

Or, use another technique, such as adding the end-of-record character in your Excel file itself.  

 

If you want to try the Powershell script, I don't have time to support further questions, but Google and the SAS documentation are your friend.

 

Here's the link:  https://github.com/scottbass/Powershell/blob/master/Scripts/Query-Excel.ps1

 

Hope this helps...

Super User
Posts: 9,682

Re: Import not cleaned text string column into a dataset

You are luck. The termstr is CRLF . Others are CR . after I checked by NotePad++ .




filename x '/folders/myfolders/comments.csv'  termstr=crlf;
proc import datafile=x out=want dbms=csv replace;run;

Super User
Posts: 9,682

Re: Import not cleaned text string column into a dataset

You are luck. The termstr is CRLF . Others are CR . after I checked by NotePad++ .




filename x '/folders/myfolders/comments.csv'  termstr=crlf;
proc import datafile=x out=want dbms=csv replace;
 guessingrows=32767;
run;

Super User
Super User
Posts: 6,502

Re: Import not cleaned text string column into a dataset

I don't see any attached file.

Why do you say that you cannot read the Excel file?  SAS can directly read both XLS and XLSX files.

libname mydata xlsx 'myfile.xlsx';
proc copy inlib=mydata outlib=work;
run;

If you have a CSV file generated by Excel then you can probably read it into SAS by making sure to tell SAS to use CRLF as the end of line markers.  That way it will read the embedded CR that Excel normally uses for line breaks inside of cells.

data want ;
  infile 'myfile.csv' dsd truncover lrecl=100000 truncover firstobs=2 termstr=crlf;
  length col1-col6 $32767;
  input col1-col6;
run;
Frequent Contributor
Posts: 84

Re: Import not cleaned text string column into a dataset

i have successfully able to import the file. now i want to strip the words in text. below code works fine for small datset.

 

however i have large datset it's taking too much time.. anyone have any suggestions to improve the code. Thanks

 

data have1;
set have(keep=comments);
count =countw(comments," ");
run;

 

data have2(drop=count i);
set have1;
do i=1 to count;
words = scan(comments,i ," ");
output;
end;
run;

 

Super User
Posts: 9,682

Re: Import not cleaned text string column into a dataset

You can combine these two data step into one :


 
data have2(drop=count i);
set have;
do i=1 to countw(comments," ");
words = scan(comments,i ," ");
output;
end;
run;

Frequent Contributor
Posts: 84

Re: Import not cleaned text string column into a dataset

This is not reducing my run time.

 

Can we write a similar program using PRX fncitons. that would help in running time?

Super User
Posts: 9,682

Re: Import not cleaned text string column into a dataset

I don't think PRX functions could reduce your time. The fast efficient way I could think is using IML code. But that need lots of memory if you have big table.
Super User
Posts: 17,840

Re: Import not cleaned text string column into a dataset

Define takes too much time?

Is this a process that needs to be repeated every week? Month?

Frequent Contributor
Posts: 84

Re: Import not cleaned text string column into a dataset

the dataset i have around 500k comments...i want to strip the comments into separate words..then i need to count how many times each word apperead. the code used above is taking too much time and generating around 100gb data..so i want  to reduce the time and space.

 

after some research i came to know that i have intially assigend length for coments variable $5000 so that all comments is loaded proerly into my dataset. i think this is taking my dataset around 100gb

 

below is the the output i am getting after running the code, lf you see for every word its taking comments ..is there anyway we can improve the code. or how can i know max length of the variable before loading to dataset so that i can assign that length to the variable? or can i get the similar results another way? I hope you understand the issue. Thanks

 

comments words
I Talked to customer I
I Talked to customer Talked
I Talked to customer to
I Talked to customer customer
Super User
Posts: 17,840

Re: Import not cleaned text string column into a dataset

Have you considered SAS text analytics rather than using Base SAS?

Frequent Contributor
Posts: 84

Re: Import not cleaned text string column into a dataset

dont have license to SAS text analytics Smiley Sad

Ask a Question
Discussion stats
  • 19 replies
  • 945 views
  • 9 likes
  • 5 in conversation