Hello SAS Community,
I was trying to import the below file after saving it as .xlsx but kept getting the below error while importing the file to SAS Enterprise Miner 14.1. I have tried the following but did not help:
1. Tried saving to older version of excel, .xls.
2. Copy & paste the data to a new worksheet.
3. imported the file as .csv but only got 7 observations.
The original csv file was downloaded from Kaggle webpage: https://www.kaggle.com/nikhileswarkomati/suicide-watch
Could anybody help? Many thanks in advanced for this. 🙂
I would suggest starting over, downloading the file as CSV and then reading that file.
If you open CSV files in Excel and save them you can change contents, sometimes quite significantly.
Thank you for your reply. i have tried that too but did not work neither. 😣
Do NOT open a CSV file with EXCEL. It will change the contents on you.
That file is not data. From the header line there appear to be only three fields, but the longest lines are over 70 thousands bytes long. Also it looks like the first field is supposed to be a record number or identifier, but some of the lines (see lines 6,000 and 20,000 below) do not start with a digit string followed by a comma. So most likely there are extra line feeds inserted into some of the fields.
101 data _null_; 102 infile 'c:\downloads\Suicide_Detection.csv.zip' zip member='*' lrecl=1000000 end=eof termstr=lf; 103 input; 104 if _N_ in (1,2,3,6000,20000) or eof then list; 105 run; NOTE: The infile 'c:\downloads\Suicide_Detection.csv.zip' is: Filename=c:\downloads\Suicide_Detection.csv.zip, Member Name=* RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 ,text,class 11 2 2,"Ex Wife Threatening SuicideRecently I left my wife for good because she has cheated on me twice a 101 nd lied to me so much that I have decided to refuse to go back to her. As of a few days ago, she beg 201 an threatening suicide. I have tirelessly spent these paat few days talking her out of it and she ke 301 eps hesitating because she wants to believe I'll come back. I know a lot of people will threaten thi 401 s in order to get their way, but what happens if she really does? What do I do and how am I supposed 501 to handle her death on my hands? I still love my wife but I cannot deal with getting cheated on aga 601 in and constantly feeling insecure. I'm worried today may be the day she does it and I hope so much 701 it doesn't happen.",suicide 727 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 3 3,Am I weird I don't get affected by compliments if it's coming from someone I know irl but I feel r 101 eally good when internet strangers do it,non-suicide 152 6000 chest pain 10 20000 0 1025022 348110,I still haven't beaten the first boss in Hollow Knight. I've only fought it a few times and I 101 always die really early in the fight. I'm terrible at this game y'all. :(,non-suicide 186 NOTE: 1025022 records were read from the infile 'c:\downloads\Suicide_Detection.csv.zip'. The minimum record length was 0. The maximum record length was 72599.
So the first thing to do is fix the embedded linefeeds. You can use this %replace_crlf() macro.
Then perhaps the data mining tool can read really long strings? Does it use VIYA or SAS?
filename csv temp;
%replace_crlf('c:\downloads\Suicide_Detection.csv.zip' zip member='*',csv)
If you have to read it with SAS then you will need to read the lines into multiple records of shorter strings. The maximum length of a character variable in SAS is only 32,767 bytes. So you cannot read that text into a single field.
To tease out the "TYPE" field at the end of those long lines will be a little tricky.
data suicide ;
infile csv dsd truncover length=ll firstobs=2 termstr=lf lrecl=1000000;
length recno length index 8 string $500 ;
input recno @;
length=ll ;
n=ceil(length/500);
do index=1 to n;
input string $char500. @;
output;
end;
run;
data type;
set suicide;
where index>=(n-1);
by recno;
length type $12 string2 $1000;
string2=cats(lag(string),string);
if last.recno;
type=scan(string2,-1,',');
keep recno type;
run;
data suicide;
merge type suicide;
by recno;
run;
Results:
NOTE: The infile CSV is: Filename=C:\Users\ABERNA~1\AppData\Local\Temp\1\SAS Temporary Files\_TD12972_AMRL20L6F1E4992_\#LN00061, RECFM=V,LRECL=1000000, File Size (bytes)=167696307, Last Modified=13Feb2022:23:26:56, Create Time=13Feb2022:23:26:23 NOTE: 232074 records were read from the infile CSV. The minimum record length was 17. The maximum record length was 117797. NOTE: The data set WORK.SUICIDE has 465570 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.68 seconds cpu time 0.54 seconds 104 105 data type; 106 set suicide; 107 where index>=(n-1); 108 by recno; 109 length type $12 string2 $1000; 110 string2=cats(lag(string),string); 111 if last.recno; 112 type=scan(string2,-1,','); 113 keep recno type; 114 run; NOTE: There were 321711 observations read from the data set WORK.SUICIDE. WHERE index>=(n-1); NOTE: The data set WORK.TYPE has 232074 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.47 seconds cpu time 0.45 seconds 115 116 data suicide; 117 merge type suicide; 118 by recno; 119 run; NOTE: There were 232074 observations read from the data set WORK.TYPE. NOTE: There were 465570 observations read from the data set WORK.SUICIDE. NOTE: The data set WORK.SUICIDE has 465570 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.61 seconds cpu time 0.50 seconds 120 121 122 123 proc freq data=suicide; 124 where index=1; 125 tables type; 126 run; NOTE: There were 232074 observations read from the data set WORK.SUICIDE. WHERE index=1; NOTE: PROCEDURE FREQ used (Total process time): real time 0.40 seconds cpu time 0.17 seconds
The FREQ Procedure Cumulative Cumulative type Frequency Percent Frequency Percent ---------------------------------------------------------------- non-suicide 116037 50.00 116037 50.00 suicide 116037 50.00 232074 100.00
Now you at least have the strings in SAS variables. How to make any sense of them is up to you.
I will leave as an exercise how to remove the type from the ends of the last one or two values of STRING for each RECNO.
Thank you Tom for the sharing. Let me try to clean those rows and see if it works. 🤞
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.
Find more tutorials on the SAS Users YouTube channel.