05-16-2013 03:34 PM
Sorry for the newbie question; I have always struggled with INPUT statements and the combination of IF-THEN logic.
Okay, basically I want to read a row from a data file only if the second variable on the line (determined by a delimiter) is equal to a specific value. Example data and desired output are below. In the example I only want to keep a row if variable=1999.
1 ,1999,JIBBERISH YAY
3 ,1999,Im doing great thanks.
I only want to keep key1 values 1 and 3, and not key 2. I know this can be done with an IF statement after the INPUT statement (like below); however, from my understanding this would actually read the entire row into the PDV and then only write it if the condition is met. I remember reading somewhere that there is a way to read in just the second field and then decide whether to read in the rest of the fields.
Inefficient (I believe):
infile "mydatalocation" dsd delimiter=',' missover lrecl=32000;
informat key1 8.;
informat variable 8.;
informat other $2000.;
format key1 8.;
format variable 8.;
format other $2000.;
input key1 variable other $;
if variable = 1999;
If anyone can link the proper answer, or a paper that explains this ability I would greatly appreciate it.
05-16-2013 03:40 PM
One thing I need to add is that the first column can be (and is) of varying length (it's actually a varchar field). Sometimes it is 1 character long, sometimes 20, etc., so that needs to be taken into account.
05-16-2013 04:24 PM
Assuming your program works as is, only a slight change is needed. This statement gets replaced:
input key1 variable other $;
Instead, switch to:
input key1 variable @;
if variable = 1999;
input other $;
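Putting the pieces together, a minimal sketch of the whole step might look like this (the filename, variable names, and the value 1999 are all taken from the original post):

```sas
data want;
   infile "mydatalocation" dsd dlm=',' missover lrecl=32000;
   informat key1 8. variable 8. other $2000.;
   input key1 variable @;   /* trailing @ holds the line for the next INPUT */
   if variable = 1999;      /* subsetting IF: unwanted rows never finish reading */
   input other $;           /* executed only for the rows that are kept */
run;
```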
05-16-2013 04:30 PM
Okay, Astounding, you brought up a question I had on this. Since I am using list input (delimited data, not column input), do I have to read in all variables BEFORE my key variable in order for this to work?
AKA if I had 50 variables and THEN my key variable (and each variable is of varying length), then I have to read in those first 51 variables before determining whether I need to read in the other 250+? This is still more efficient than reading in all 250+, of course, but is there any way to read in JUST that second (or in this case 51st) variable by making use of the SCAN function or something like that?
For example, maybe I could read the entire line in one INPUT statement, then search for the 51st variable with a SCAN function, and then, if that variable=1999, read in all of the variables? This is more along the lines of what I was thinking, and I am wondering if it is possible.
05-16-2013 04:41 PM
Well, list input does increase the work. If you were using column input, for example, you could read in just the one variable you need for subsetting:
input @6 variable @;
input @1 .....;
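For illustration, a complete sketch of that column-input approach (the column positions below are made up for the example; with column input they would have to be fixed and known):

```sas
data want;
   infile "mydatalocation" truncover lrecl=32000;
   input @6 variable 4. @;     /* read only the field needed for subsetting */
   if variable = 1999;
   input @1 key1 2. @6 variable 4. @11 other $2000.;  /* re-read from column 1 */
run;
```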
If you are certain that you will never see this phrase elsewhere, you could try:
if scan(_infile_, ',1999,');
input key variable other $;
But this only works if the characters ,1999, would never appear elsewhere in the data other than as a value for variable.
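A sketch of that idea using INDEX, which is the function that actually searches a string for a substring (SCAN extracts delimited words instead); the same caveat applies that ',1999,' must not occur anywhere else on the line:

```sas
data want;
   infile "mydatalocation" dsd dlm=',' missover lrecl=32000;
   input @;                           /* load the line into _infile_ and hold it */
   if index(_infile_, ',1999,');      /* cheap substring test before any parsing */
   input key variable other :$2000.;  /* pointer is still at column 1 */
run;
```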
If you had to read to the 51st variable, you could try:
length dummy $ 1;
input dummy dummy dummy ... dummy dummy dummy variable @;
There's probably a way to abbreviate that.
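One way to abbreviate it is a numbered variable list with a grouped informat list; this is a sketch assuming exactly 50 fields precede the key variable (the DUM names are hypothetical placeholders):

```sas
data want;
   infile "mydatalocation" dsd dlm=',' missover lrecl=32000;
   length dum1-dum50 $ 1;
   input (dum1-dum50) (:$1.) variable @;  /* skim past the first 50 fields */
   if variable = 1999;
   input other :$2000.;                   /* read the rest only when needed */
   drop dum1-dum50;
run;
```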
05-16-2013 04:47 PM
Do you know of a paper that would explain the use of the _infile_ option? I have seen several members of this website post solutions using this method, and I never learned it (I taught myself from one book, however, so there is LOTS I am still missing).
For example, if I wanted to read an entire line as one variable, could I just say ...
05-16-2013 04:52 PM
I don't know of a paper, but it can't be that hard to dig up. Technically, _infile_ is not an option ... it's a copy of the entire raw data line that gets updated each time an INPUT statement executes. The key option here is the trailing @ at the end of the INPUT statement: it holds the current line of data so that subsequent INPUT statements can read from the same line. The line is released when the DATA step reaches the end of its programming statements.
What is relatively new (not sure in what release of the software) is the ability to use _infile_ in functions and to actually change its value within the DATA step.
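For example (a sketch, reusing the filename and filter value from earlier in the thread), you can both test _infile_ in a function and rewrite it before parsing:

```sas
data want;
   infile "mydatalocation" dsd dlm=',' missover lrecl=32000;
   input @;                         /* fill _infile_ and hold the line   */
   if index(_infile_, ',1999,');    /* use _infile_ in a function        */
   _infile_ = upcase(_infile_);     /* or even change its value in place */
   input key1 variable other :$2000.;
run;
```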
05-16-2013 05:05 PM
Ah, so _infile_ basically holds the entire data line, ignoring delimiters, etc. Using that knowledge I was able to write the following code...
infile "MYDATA.txt" dlm='|' missover dsd lrecl=32000 firstobs=2;
input @;
if index(_infile_, '201206');
input Var1 : $12.
      Var2 : $200. ;
And this appears to have worked. However, it only resulted in a 30% time savings, even though I am only reading in 10% of the total data. I assume this is because, even when using _infile_, SAS still has to read the entire line of data, which takes a large amount of time. (However, I would assume that when using _infile_ SAS does not write the skipped fields to the PDV, which is why it is still sped up a bit.)
Can anyone let me know if my questions make sense, or whether in general I am understanding this correctly?
Thanks again, community!
05-16-2013 05:12 PM
Not sure you would expect much improvement in time. You still have to move the whole input file from disk to memory for SAS to be able to read the lines. The improvement from skipping a couple of INPUT statements will be trivial compared to the I/O time.
Any improvement in time you see is probably from not writing as many records to the output dataset. So your original program is probably the simplest.
05-16-2013 05:16 PM
_infile_ is still just a copy of the data line ... I could be wrong here but I've never seen it described as part of the PDV. You might save a small amount of time by getting rid of one function:
Glad you found the right syntax ... I was thinking INDEX but typing SCAN earlier.
05-16-2013 05:13 PM
So, Astounding, my above query worked 100%. However, I tried it with a new dataset (where the key IS in the second column) and it still worked. Then I tried the step you had above of reading in only the first two variables, but when I do, all of the data appears to be read incorrectly. Below I will paste my code, along with example data. Can anyone see what I am doing incorrectly? Is it the combination of the trailing @ with the colon modifier?
infile "MYDATA.txt" dlm='|' missover dsd lrecl=32000 firstobs=2;
input ID_loan : $12.
      period  : 8. @;
input ID_Loan : $12.
      period  : 8.
      Var3    : $300. ;
When this is run it returns zero data, and SAS throws an error for every row, saying "Invalid data for Loan_age".
Hiya Tom. So it's not possible for SAS to grab only the first section of a line (say, the first two variables) and then, from that, determine whether it needs to read the rest of the line in order to save time? I had thought the first approach had to read ALL of the variables into the PDV, which, if we could avoid it, would save a ton of time (since it is very, very long comment data).
05-16-2013 05:18 PM
In the program that is not working, the key concept is that the INPUT statement tracks your current location on the line. When the first INPUT statement uses the trailing @, a pointer marks where that INPUT statement left off reading the data. So the second INPUT statement merely continues where the first one left off:
input Var3 : $300. ...;
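So a corrected version of the failing step would read the remaining variables without re-reading the first two (the filter value 201206 is taken from the discussion in this thread):

```sas
data want;
   infile "MYDATA.txt" dlm='|' truncover dsd lrecl=32000 firstobs=2;
   input ID_loan :$12. period :8. @;  /* pointer now sits just past PERIOD */
   if period = 201206;
   input Var3 :$300.;                 /* continues from the pointer; no re-read */
run;
```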
05-16-2013 05:25 PM
Moving 300 bytes from one piece of memory (_INFILE_) to another piece of memory (VAR3) takes a trivial amount of time. You could do that thousands of times in a fraction of the time it takes to read one block of data off the disk.
In your example I think that perhaps FIRSTOBS=2 might be skipping the one row of data with 201206 as the period.
Also, use TRUNCOVER instead of MISSOVER. I have never seen a situation where MISSOVER is the correct option.
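The difference shows up when a line ends before a formatted field is filled: MISSOVER sets the variable to missing, while TRUNCOVER keeps whatever characters were there. A small sketch to compare the two:

```sas
data demo;
   infile datalines truncover;  /* rerun with MISSOVER to compare */
   input x $10.;                /* the line 'abc' is shorter than 10 columns */
   put x=;                      /* TRUNCOVER: x=abc   MISSOVER: x is blank */
   datalines;
abc
;
run;
```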
05-17-2013 04:11 PM
I guess my question, Tom, is: shouldn't it be possible for SAS to read in only the first two fields (not the entire _infile_ line) and then, based upon that data, determine how to read the rest of the line?
Thus, if you had 1,000 variables, you would save the processing time of reading 998 of them 90% of the time (if only 10% of your data had 201206).
I had thought this was possible within SAS, based upon prior postings.
05-17-2013 05:18 PM
To read a text file you have to read the file sequentially. Think of it like a cassette tape (or a spool of thread): to get to the next part of the tape you have to read past everything in between. Whether SAS does anything with the in-between parts or not, the operating system (and SAS's internal file buffers) still have to read all of the file off the disk.
You could skip the 1,000 conversions of the characters in the line into variables (say, converting '100' to the equivalent floating-point representation), but the processor time saved would be trivial compared to the total time it takes to read the file off the disk. If you were running on a multi-user system that was CPU-bound, you might see a significant wall-clock improvement, but most SAS jobs are limited by input/output speed, not CPU speed.