Hi,
How can I import a CSV file (attached) where one column contains strings of around 2,000 characters with special characters (hidden quotes, double quotes, embedded line breaks)?
Cust_id | trans_type | gender | trans_date | trans_amt | Comments |
11201 | Entertainment | F | ######## | $50.65 | I Talked to customer, but he is not interested |
11205 | Grocery | m | ######## | $90.00 | 1. he is not wiling but he is will continue in future 2. he is good |
11206 | Travel | m | ######## | $23.00 | "not wiiling to continue" regards Diip |
11206 | Travel | m | ######## | $23.00 | "not talked @@ regards hiip |
Yuck.
This is one of the few cases where I'd say keep the data in Excel, since Excel will keep the relevant content in a single cell.
Note this assumes you started with an Excel file, since going to Excel from a CSV will hit the same issues.
Unfortunately we don't have access to import Excel files; we need to convert to either CSV or text files.
This is just sample data. I have unstructured data with a lot of comments, and we can't control that column. I need to load this data into a dataset. Do you have any recommendations for how to import it?
Hi,
I created the attached txt file, simulating an export from Excel to CSV.
I imported it with this code:
data test;
  infile "\\UNC path to my file\test.txt" delimiter="09"x termstr="|" dsd;
  length col1 col2 $1000;
  input col1 col2;
run;

proc print;
run;
If your data has embedded CRLF, then use termstr to change the character that SAS thinks marks the end-of-record.
Note: usually Excel doesn't use CRLF to embed line feeds, but only LF (I think it's "0A"x). So, you may not have to use termstr= or the pipe symbol to mark your end-of-record.
I have a Powershell script which runs a query against an Excel file and returns it as a CSV stream. You can use an infile pipe to dynamically read the stream in a data step.
You could modify it to write out your custom termstr character at the end of each line (something like SELECT [*,"|"] FROM [Sheet1$]).
Or, use another technique, such as adding the end-of-record character in your Excel file itself.
If you want to try the Powershell script, I don't have time to support further questions, but Google and the SAS documentation are your friends.
Here's the link: https://github.com/scottbass/Powershell/blob/master/Scripts/Query-Excel.ps1
Hope this helps...
You are in luck. The termstr is CRLF; the other line breaks are CR (I checked with Notepad++).

filename x '/folders/myfolders/comments.csv' termstr=crlf;
proc import datafile=x out=want dbms=csv replace;
  guessingrows=32767;
run;
I don't see any attached file.
Why do you say that you cannot read the Excel file? SAS can directly read both XLS and XLSX files.
libname mydata xlsx 'myfile.xlsx';
proc copy inlib=mydata outlib=work;
run;
If you have a CSV file generated by Excel then you can probably read it into SAS by making sure to tell SAS to use CRLF as the end of line markers. That way it will read the embedded CR that Excel normally uses for line breaks inside of cells.
data want;
  infile 'myfile.csv' dsd truncover lrecl=100000 firstobs=2 termstr=crlf;
  length col1-col6 $32767;
  input col1-col6;
run;
I have successfully imported the file. Now I want to split the text into individual words. The code below works fine for a small dataset,
but my dataset is large and it is taking too much time. Does anyone have suggestions to improve the code? Thanks
data have1;
  set have(keep=comments);
  count = countw(comments," ");
run;

data have2(drop=count i);
  set have1;
  do i=1 to count;
    words = scan(comments,i," ");
    output;
  end;
run;
You can combine these two data steps into one:

data have2(drop=i);
  set have;
  do i=1 to countw(comments," ");
    words = scan(comments,i," ");
    output;
  end;
run;
This is not reducing my run time.
Could we write a similar program using PRX functions? Would that help the run time?
Can you define "takes too much time"?
Is this a process that needs to be repeated every week? Month?
The dataset has around 500k comments. I want to split the comments into separate words, then count how many times each word appeared. The code above is taking too much time and generating around 100 GB of data, so I want to reduce both the run time and the space.
After some research I realized I initially assigned a length of $5000 to the comments variable so that all comments would load properly into my dataset; I think that is why my dataset is around 100 GB.
Below is the output I am getting after running the code. As you can see, every word row repeats the full comment. Is there any way to improve the code? Or how can I find the maximum length of the variable before loading, so I can assign just that length? Or can I get the same results another way? I hope you understand the issue. Thanks
comments | words |
I Talked to customer | I |
I Talked to customer | Talked |
I Talked to customer | to |
I Talked to customer | customer |
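On the max-length question, here is a minimal sketch (assuming your dataset is named have and the variable is comments): measure the longest value first, then use that number as the length on a second pass.

proc sql noprint;
  /* find the longest non-blank comment */
  select max(lengthn(comments)) into :maxlen trimmed
  from have;
quit;
%put Longest comment is &maxlen characters;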
Have you considered SAS text analytics rather than using Base SAS?
Don't have a license for SAS text analytics 😞
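In Base SAS, a sketch of the word-frequency count (assuming the dataset is have and the variable is comments). Dropping comments before output keeps the intermediate dataset small, and a modest length for words avoids carrying the $5000 comment length on every row; let PROC FREQ do the counting:

data words(keep=words);
  set have;
  length words $50;          /* long enough for a single word, not a whole comment */
  do i = 1 to countw(comments, " ");
    words = scan(comments, i, " ");
    output;                  /* one row per word, comments variable is dropped */
  end;
run;

proc freq data=words order=freq noprint;
  tables words / out=wordcounts(drop=percent);
run;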