BookmarkSubscribeRSS Feed
kumarK
Quartz | Level 8

Hi,

 

How to import special characters (HIdden quotes,double quotes, next line) string around 2000 length in one of the column in csv file.(attached) 

 

Cust_id trans_type gender trans_date trans_amt Comments
11201 Entertainment F ######## $50.65 I Talked to customer, but he is not interested
11205 Grocery m ######## $90.00 1. he is not wiling
but he is will continue in future
2. he is good
11206 Travel m ######## $23.00 "not wiiling to continue" regards Diip
11206 Travel m ######## $23.00 "not talked


@@ regards hiip
19 REPLIES 19
Reeza
Super User

Yuck. 

This is one of the few cases where I say keep the data in Excel since it will keep,relevant content in a single cell. 

 

Note this assumes you started with an Excel file since going to Excel from a CSV will have the same issues. 

kumarK
Quartz | Level 8

unfortunately we dont have access to import excel files. we need to convert either csv or text file. 

 

this is just sample data i have unstructured data with so much cooments wecan't controal that column..i need to load this data into a dataset.do you have any recommendations how can i import?

 

 

ScottBass
Rhodochrosite | Level 12

Hi,

 

I created the attached txt file, simulating an export from Excel to CSV.

 

I imported it with this code:

 

data test;
   infile "\\UNC path to my file\test.txt" delimiter="09"x termstr="|" dsd;
   length col1 col2 $1000;
   input col1 col2;
run;
proc print;
run;

If your data has embedded CRLF, then use termstr to change the character that SAS thinks marks the end-of-record.  

 

Note:  usually Excel doesn't use CRLF to embed line feeds, but only LF (I think it's "0A"x).  So, you may not have to use termstr= or the pipe symbol to mark your end-of-record.

 

I have a Powershell script which runs a query against an Excel file and returns it as a CSV stream.  You can use an infile pipe to dynamically read the stream in a data step.  

 

You could modify it to write out your custom termstr character at the end of each line (something like SELECT [*,"|"] FROM [Sheet1$]).  

 

Or, use another technique, such as adding the end-of-record character in your Excel file itself.  

 

If you want to try the Powershell script, I don't have time to support further questions, but Google and the SAS documentation are your friend.

 

Here's the link:  https://github.com/scottbass/Powershell/blob/master/Scripts/Query-Excel.ps1

 

Hope this helps...


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Ksharp
Super User
You are luck. The termstr is CRLF . Others are CR . after I checked by NotePad++ .




filename x '/folders/myfolders/comments.csv'  termstr=crlf;
proc import datafile=x out=want dbms=csv replace;run;

Ksharp
Super User
You are luck. The termstr is CRLF . Others are CR . after I checked by NotePad++ .




filename x '/folders/myfolders/comments.csv'  termstr=crlf;
proc import datafile=x out=want dbms=csv replace;
 guessingrows=32767;
run;

Tom
Super User Tom
Super User

I don't see any attached file.

Why do you say that you cannot read the Excel file?  SAS can directly read both XLS and XLSX files.

libname mydata xlsx 'myfile.xlsx';
proc copy inlib=mydata outlib=work;
run;

If you have a CSV file generated by Excel then you can probably read it into SAS by making sure to tell SAS to use CRLF as the end of line markers.  That way it will read the embedded CR that Excel normally uses for line breaks inside of cells.

data want ;
  infile 'myfile.csv' dsd truncover lrecl=100000 truncover firstobs=2 termstr=crlf;
  length col1-col6 $32767;
  input col1-col6;
run;
kumarK
Quartz | Level 8

i have successfully able to import the file. now i want to strip the words in text. below code works fine for small datset.

 

however i have large datset it's taking too much time.. anyone have any suggestions to improve the code. Thanks

 

data have1;
set have(keep=comments);
count =countw(comments," ");
run;

 

data have2(drop=count i);
set have1;
do i=1 to count;
words = scan(comments,i ," ");
output;
end;
run;

 

Ksharp
Super User
You can combine these two data step into one :


 
data have2(drop=count i);
set have;
do i=1 to countw(comments," ");
words = scan(comments,i ," ");
output;
end;
run;

kumarK
Quartz | Level 8

This is not reducing my run time.

 

Can we write a similar program using PRX fncitons. that would help in running time?

Ksharp
Super User
I don't think PRX functions could reduce your time. The fast efficient way I could think is using IML code. But that need lots of memory if you have big table.
Reeza
Super User

Define takes too much time?

Is this a process that needs to be repeated every week? Month?

kumarK
Quartz | Level 8

the dataset i have around 500k comments...i want to strip the comments into separate words..then i need to count how many times each word apperead. the code used above is taking too much time and generating around 100gb data..so i want  to reduce the time and space.

 

after some research i came to know that i have intially assigend length for coments variable $5000 so that all comments is loaded proerly into my dataset. i think this is taking my dataset around 100gb

 

below is the the output i am getting after running the code, lf you see for every word its taking comments ..is there anyway we can improve the code. or how can i know max length of the variable before loading to dataset so that i can assign that length to the variable? or can i get the similar results another way? I hope you understand the issue. Thanks

 

comments words
I Talked to customer I
I Talked to customer Talked
I Talked to customer to
I Talked to customer customer
Reeza
Super User

Have you considered SAS text analytics rather than using Base SAS?

kumarK
Quartz | Level 8

dont have license to SAS text analytics 😞

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 19 replies
  • 2335 views
  • 9 likes
  • 5 in conversation