My data set is very messy such as the data starts from 8 row, lots space , lots comas, 55 variables and 7 lines for a title.
Before I use import the data , do I have to clean it? or I import it first and then clean the data in SAS.
and I have lots errors.
proc import datafile ='/home/inkyung/repos/lasso/data/0001.csv'
out=tax00
dbms=csv replace;
delimiter=',';
getnames=yes;
guessingrows=1000;
run;
proc print data=tax00;
run;
This depends on your data and how much you want SAS to do. For one, you can use DSD and MISSOVER if you use a DATA step. The DSD option will help if you have several commas in a row that signify missing values. This paper is a good intro to some options.
How does the code you have now work? If you still need help, can you post a small example of 0001.csv, including some of the "worst" examples you are seeing? Maybe including what looks wrong in the output would help.
It doesn't work.
The data set is as follows:
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Table 3: STATE TAX COLLECTIONS BY STATE AND TYPE OF TAX,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,(thousands of dollars),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Year: 2000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,"Quarter: 1 (January, February, March)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Revision 1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,U.S. Total,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,Florida,Georgia,Hawaii,Idaho,Illinois,Indiana,Iowa,Kansas,Kentucky,Louisiana,Maine,Maryland,Massachusetts,Michigan,Minnesota,Mississippi,Missouri,Montana,Nebraska,Nevada,New Hampshire,New Jersey,New Mexico,New York,North Carolina,North Dakota,Ohio,Oklahoma,Oregon,Pennsylvania,Rhode Island,South Carolina,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming,Wash. DC
,,,Excludes D.C.,,,Revised,Revised,,,Revised,Revised,Revised,,,Revised,Revised,Revised,,,Revised,Revised,,,,,,,Revised,Est.,,Revised,,,Revised,,,,,Revised,,,,,Revised,,,,Revised,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Total Taxes,,133987401,1554648,204757,1922171,1135907,20206217,1614872,2219791,533972,6664802,3295666,860653,539018,6142447,2492170,1363449,1104084,1726561,1494620,583932,2267078,4055999,6827675,3719413,930371,1879749,317540,722445,626221,287934,4029776,831934,11762936,3695397,274030,5757836,1354611,1307216,6049117,508208,1325561,190918,1925580,6503198,867827,341961,2786879,3131704,765184,3069464,213902,628645
Thanks for providing the sample. Will you be reading in many of these, or just the one?
You can certainly do this with a data step, though it will probably be best to read in line-by-line. You can use the scan function, for example, to parse the input strings. The DATA step will allow you to output to different data sets, so you could output state names to a temporary set, the numeric values to another, and append them. This is difficult because your data doesn't follow typical CSV patterns.
However, it may be easier to reformat the set, especially if you are only reading in the one set. I copied and pasted the data you posted into Excel. I went to Data >> Text to Columns and selected the options of delimeter, comma. Then I formatted it like this:
State Note Total Taxes Year Quarter Revision U.S. Total Excludes D.C. 133987401 2000 1 1 Alabama 1554648 2000 1 1 Alaska 204757 2000 1 1 Arizona Revised 1922171 2000 1 1 Arkansas Revised 1135907 2000 1 1 California 20206217 2000 1 1 Colorado 1614872 2000 1 1 Connecticut Revised 2219791 2000 1 1 Delaware Revised 533972 2000 1 1 Florida Revised 6664802 2000 1 1 Georgia 3295666 2000 1 1 Hawaii 860653 2000 1 1 Idaho Revised 539018 2000 1 1 Illinois Revised 6142447 2000 1 1 Indiana Revised 2492170 2000 1 1 Iowa 1363449 2000 1 1 Kansas 1104084 2000 1 1 Kentucky Revised 1726561 2000 1 1 Louisiana Revised 1494620 2000 1 1 Maine 583932 2000 1 1 Maryland 2267078 2000 1 1 Massachusetts 4055999 2000 1 1 Michigan 6827675 2000 1 1 Minnesota 3719413 2000 1 1 Mississippi 930371 2000 1 1 Missouri Revised 1879749 2000 1 1 Montana Est. 317540 2000 1 1 Nebraska 722445 2000 1 1 Nevada Revised 626221 2000 1 1 New Hampshire 287934 2000 1 1 New Jersey 4029776 2000 1 1 New Mexico Revised 831934 2000 1 1 New York 11762936 2000 1 1 North Carolina 3695397 2000 1 1 North Dakota 274030 2000 1 1 Ohio 5757836 2000 1 1 Oklahoma Revised 1354611 2000 1 1 Oregon 1307216 2000 1 1 Pennsylvania 6049117 2000 1 1 Rhode Island 508208 2000 1 1 South Carolina 1325561 2000 1 1 South Dakota Revised 190918 2000 1 1 Tennessee 1925580 2000 1 1 Texas 6503198 2000 1 1 Utah 867827 2000 1 1 Vermont Revised 341961 2000 1 1 Virginia 2786879 2000 1 1 Washington 3131704 2000 1 1 West Virginia 765184 2000 1 1 Wisconsin 3069464 2000 1 1 Wyoming 213902 2000 1 1 Wash. DC 628645 2000 1 1
The formatting looks off here, but you should be able to copy into Excel. Then PROC import is trivial. Or format it yourself to suit your purpose.
I have to merge 20 files, so I tried to do, I failed it. I don't want to use EXCEL. Usually, I use the R, I call the text itself from raw data, I can edit them such as delete several rows and column, fix the space etc. I want to find that function in SAS to edit dataset.
If you go through one of the import wizards you will note that you have an option to indicate which line the data actually starts on. The option DATAROW can say to start reading the data on a specific row. Add
datarow=8;
The issue you may have is that we don't have a similar option to tell which row to read the variable names. Getnames assumes the variable names are on row 1. So you may gett messages about invalid variable names and the first row of your title will likely get truncated to 32 characters and used as a varaible name.
Sometimes the best thing to do use is proc import to generate code you can modify. The log will contain a data step that attempts to read the data using the information you provided to Proc Import. Copy it into the editor and modify to your hearts content. You may do things like search and replace to change the variable names, modify informats and formats, add labels and even the first step of any data modifications.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.