Hello everyone,
I am trying to run a regression on the data file from here: https://www.kaggle.com/tsarkov90/crime-in-russia-20032020
However, I have run into a problem. I tried to import the file using:
proc import datafile="<path>"
out=crime
dbms=csv
replace;
getnames=yes;
run;
The problem is that it seems to swap month and day: I get the first 12 days of January and then the next year. How can I fix this problem?
Also, for further analysis I'd like to retain only month/year. How can I do this?
Many thanks, cheers
@svzplayer wrote:
Alright, I have followed your suggestion writing:
data crime;
infile "<path>" DLM=',' FIRSTOBS=2;
input month $ Total_crimes Serious Huge_damage Ecological Terrorism
Extremism Murder Harm_to_health Rape Theft Vehicle_theft
Fraud_scam Hooligan Drugs Weapons;
format month DDMMYY10.;
run;
And... another problem.
I attach two pictures: the first shows the results with proc import and the second with this data step. It seems that all years are cut off after the first 2 digits.
<sorry for such a long reply, however I could not upload either .png or .jpg>
You did not tell SAS to read the value as a date, and a simple $ input such as input month $ reads only 8 characters, which is why the year is cut off.
Try this data step:
data crime;
infile "<path>" DLM=',' FIRSTOBS=2;
informat month ddmmyy10.;
input month Total_crimes Serious Huge_damage Ecological Terrorism
Extremism Murder Harm_to_health Rape Theft Vehicle_theft
Fraud_scam Hooligan Drugs Weapons;
format month DDMMYY10.;
run;
And as I mentioned in my other post, you may want to use the YYMMN6. format.
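To make the difference concrete, here is a minimal sketch that reads the same kind of value both ways. The data set names demo_char and demo_date and the inline rows are made up for illustration and are not taken from the Kaggle file; only the dd/mm/yyyy layout is assumed from the thread.

```sas
/* Character list input: month gets the default length of 8,   */
/* so "01/01/2003" is stored as "01/01/20".                     */
data demo_char;
    infile datalines dlm=',' truncover;
    input month $ Total_crimes;
datalines;
01/01/2003,100
01/02/2003,200
;
run;

/* Date informat: month is read as a numeric SAS date.          */
data demo_date;
    infile datalines dlm=',' truncover;
    informat month ddmmyy10.;   /* day first, then month, then 4-digit year */
    input month Total_crimes;
    format month yymmn6.;       /* display year and month only */
datalines;
01/01/2003,100
01/02/2003,200
;
run;

proc print data=demo_char; run;
proc print data=demo_date; run;
```

Comparing the two printouts should show the truncated text in the first data set and year-month values in the second.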
Thank you very much!
How dates of the form xx/yy/zz are read by default, that is, whether xx is treated as the month or the day of the month (and yy as the other), depends on the current setting of your DATESTYLE option.
You can check what your current setting is with
proc options option=datestyle; run;
The log will show something like
DATESTYLE=MDY Specifies the sequence of month, day, and year when ANYDTDTE, ANYDTDTM, or ANYDTTME informat data is ambiguous.
Or you might see DMY for the order.
If the order in the data differs from your setting, then Proc Import will swap day and month relative to what is intended.
You can fix this in a couple of ways.
Set the option to the desired order with options datestyle=MDY; (or DMY), whichever is needed.
Don't forget to set the DATESTYLE option back to its previous value afterwards, or other things may misbehave.
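A rough sketch of that save, set and restore sequence follows. The macro variable name old_datestyle is hypothetical, and it has not been verified here that Proc Import actually honors the setting for this particular file, so check the informat in the log or with proc contents afterwards.

```sas
/* Remember the current setting as DATESTYLE=value */
%let old_datestyle = %sysfunc(getoption(datestyle, keyword));

/* The data has the day first */
options datestyle=dmy;

proc import datafile="<path>"
    out=crime
    dbms=csv
    replace;
    getnames=yes;
run;

/* Put the option back to what it was before */
options &old_datestyle;
```

If the option was originally set to LOCALE, restoring the saved value may pin it to the resolved order instead, so it may be simpler to note the original value from proc options and type it back by hand.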
Or, Proc Import will have written data step code to the log. Copy that code and clean it up, removing line numbers and such. Then find the INFORMAT statement for the variable(s) of interest and change it to read the data properly. I can't see the data at the link you provided, so I would guess that an informat of either MMDDYY10. or DDMMYY10. (depending on whether month or day comes first) might work.
There is no need to change the date values once created in order to do analysis by year and month only. You can assign a format to a variable, and the formatted values will create groups usable by any of the SAS analysis procedures. Likely candidates would be YYMMN6. to create groups like 201906 (June 2019) or YYMON7. if you want something like 2019JUN displayed.
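As an illustration of grouping by format, here is a small sketch. It assumes a data set named crime in which month is already a numeric SAS date and Total_crimes is numeric, as in the data steps earlier in the thread; the choice of Proc Means and the SUM statistic is only an example.

```sas
proc means data=crime sum;
    class month;              /* groups are formed from the formatted values */
    format month yymmn6.;     /* one group per year-month, e.g. 201906 */
    var Total_crimes;
run;
```

The stored dates are untouched; swapping the format for YYMON7. or YEAR4. would change the grouping without another data step.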
Sorry if the link is inaccessible 😞
Yes, the date is in the format dd/mm/yyyy, so I guess it's DDMMYY10. Here is a funny thing: I followed your instructions and the log shows that I have set DMY. However, when I run proc contents, it still shows MMDDYY10. for that variable. In a reply to a previous answer, I attached results from proc import and the data step. I hope it can help somehow, because I have run out of ideas...