About Anotherdream

Anotherdream · ‎01-29-2015

Hello everyone. I have multiple SAS programs that need to be able to modify the same data set at the same time. One process might take 10-15 seconds to update the data set, while the other takes 5-7 seconds. However, if process A is running and process B then tries to run, process B fails on its update. I would like some piece of Base SAS code that lets me check whether the data set is currently being modified by anything else and, if so, wait X seconds and then check again. I tried the code below, which I found in a PDF; however, it did not work, and whenever I run it my Base SAS session just crashes (shuts down).

%MACRO ACCESS;
  %LET ERR = 99;
  %DO %WHILE(&ERR > 4);
    PROC SORT DATA=LIB.DAT1 OUT=LIB.DAT1;
      BY KEY;
    RUN;
    %LET ERR = &SYSERR;
  %END;
%MEND;
%ACCESS;

Does anyone know how I would accomplish this goal without making this into a SQL table and then doing update/insert statements with the isolation level set to "serializable"? Thanks all
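One possible direction, sketched under assumptions rather than offered as a tested fix: Base SAS has a LOCK statement that requests exclusive access to a data set and reports the outcome in the automatic macro variable SYSLCKRC (0 means the lock was obtained, a positive value means it was not). The macro names TRYLOCK and UPDATE_DAT1, their parameters, and the WORK.NEW_ROWS data set are hypothetical, and how LOCK behaves can depend on the engine and operating environment.

```sas
/* Hypothetical helper: ask for an exclusive lock, pausing between attempts. */
%macro trylock(member=, tries=10, wait=5);
  %local i rc;
  %let i = 0;
  %do %until (&syslckrc <= 0 or &i >= &tries);
    %let i = %eval(&i + 1);
    lock &member;                            /* sets SYSLCKRC; 0 = lock obtained */
    %if &syslckrc > 0 %then
      %let rc = %sysfunc(sleep(&wait, 1));   /* pause &wait seconds, then retry */
  %end;
%mend trylock;

/* Hypothetical update step that only runs while the lock is held. */
%macro update_dat1;
  %trylock(member=LIB.DAT1, tries=12, wait=5)
  %if &syslckrc <= 0 %then %do;
    proc append base=LIB.DAT1 data=WORK.NEW_ROWS;
    run;
    lock LIB.DAT1 clear;                     /* release the lock for other processes */
  %end;
  %else %put WARNING: LIB.DAT1 is still in use, update skipped.;
%mend update_dat1;

%update_dat1
```

Unlike the %DO %WHILE loop in the question, this gives up after a fixed number of attempts, so a data set that never becomes free cannot keep the session spinning indefinitely.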

Anotherdream · ‎01-22-2015

DC: I think you might have missed the point of my question. I don't want to have to know the format of the incoming string at the start of the function. I actually want this file to error out when the value is "41974", or to be smart enough to convert it to the correct date. My question was more along the lines of "is there a date informat that is smart enough not to read this number the way SAS is reading it?" There might not be.

Arthur Tabachneck: Unfortunately this is part of a production process, so I really need a specific informat so that any format errors land in the log I am outputting. The reason is that I have code that writes the SAS log to a text file, and I then scan that text file for any formatting errors raised by the infile statement. Theoretically I could make the change that you noted, if I could figure out how to flag errors into a log (or data set) somehow. Example: if someone put the value "heyman" into the date field, I would want SAS to error on it and for me to be able to flag the error. I might have to read in all of the dates as character strings as step 1, and then do your trick with added error flagging in a secondary data step. I will do a little bit of testing to see if this works for me. Regardless of whether it does or not, that's a very nifty solution. Thanks! I also especially like the trick of subtracting 21916 (days between 12/31/1899 and 1/1/1960).
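A rough sketch of the two-step idea described above (read the field as text, then convert and flag), not a tested production step. The data set name, the variable names, the flag values and the regular expressions are all assumptions made for illustration; the `??` modifier on INPUT suppresses the invalid-data note so the step can record the problem itself.

```sas
data dates_checked;
  infile datalines truncover;
  input date_txt $20.;
  length problem $20;

  /* Looks like mm/dd/yyyy: read it with the expected informat */
  if prxmatch('/^\d{1,2}\/\d{1,2}\/\d{4}$/', strip(date_txt)) then
    date1 = input(date_txt, ?? mmddyy10.);

  /* Five bare digits: assume an Excel serial number and shift it to a SAS date */
  else if prxmatch('/^\d{5}$/', strip(date_txt)) then do;
    date1 = input(date_txt, 5.) - 21916;
    problem = 'excel serial';
  end;

  /* Anything that did not convert is flagged instead of silently misread */
  if missing(date1) then problem = 'unreadable';

  format date1 date9.;
  datalines;
03/18/2015
41974
heyman
;
run;
```

With these three sample rows, 03/18/2015 is read normally, 41974 becomes 01DEC2014 and is flagged as an Excel serial, and heyman is left missing and flagged as unreadable. The PROBLEM column can then be scanned in place of the log.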

Anotherdream · ‎01-22-2015

Hello everyone. I recently ran into a 'feature' of SAS that I had never noticed in a little over 2 years of using the system. Having seen the error, I am not sure how to handle it going forward, so any help and ideas would be greatly appreciated! Basically I am reading one field, called "date1", from a csv file. I read the field with "informat Date1 MMDDYY10." because I expect the date to come in a known format such as 03/18/2015. However, one day the users sent me a file with the following in the date field: "41974". What happened is that the users got their data from Excel and accidentally formatted the date field as "numeric", and 41,974 is the number of days between 1-1-1900 and 12-1-2014. However, when SAS read this number in, it actually produced the value "April 19 1974" (it read the digits as 4/19/74). I was pretty shocked at this, but looking back I realize why it would do this. However, what informat would I use to get around this problem? I can't use MMDDYY10. anymore, since it incorrectly reads numbers like the one above as dates. Any and all help would be appreciated!

Anotherdream · ‎12-30-2014

Gergely Bathó: I have marked your answer as helpful. If I could give multiple correct answers I would, as yours works exactly as needed!

Reeza: I am marking your answer as correct. What I ended up doing was building a PowerShell script and then calling it from within SAS using a "call system" command. It got quite tricky because one of the strings I was trying to replace was "LoanNumber" (with the double quotes), so figuring out how to escape the quotes for the PowerShell script was no small task, but I got it figured out (I think). I might post a separate question on this one if I find my solution isn't working. Thanks all!

Anotherdream · ‎12-29-2014

Would it be possible to do something like the following? Since "BorrowerName:" is a string of length 13, could I loop over the file moving one "space" at a time (or byte, I'm not sure what to call it, but basically the same as using lrecl=1 and recfm=N), while using a "moving window" of length 13? Say the file looked like the one below.

Example File: Hey how are you doing? I'm good. Did you happen to get that BorrowerName: I did it's Shannon FakeGirl.

Could I create a loop that would look at the first 13 characters first, i.e. 'Hey how are y', and compare them to "BorrowerName:"? If the 13-character string isn't equal to "BorrowerName:", then keep the first character of the window; otherwise replace the whole window with nothing. So the code would next move on to "ey how are yo". Since this is not equal to "BorrowerName:", it would keep the "e" that it started with. However, if the window ever was equal to "BorrowerName:", then replace that string with nothing, jump the pointer forward 13 'bytes' and continue the loop. Does this even sound possible in SAS? Unfortunately my skills in this area of SAS are pretty newbie, and I'm not sure how to even begin. I know another route could be to break the _infile_ buffer into multiple variables dynamically, making each one of length $32767, until you reach the size of the file and then stop. However, I have no idea how to control for the string occurring at the break point between variables. Meaning the first variable ending at $32767 might cut the required string in half, and if that occurs I can't replace the string. Thanks again, any other ideas are greatly appreciated.

Anotherdream · ‎12-29-2014

Hello everyone. I have been given files which are in XML format and are quite large. A very important note about these files is that each one is ONE giant string, much longer than 32,000 characters. Because XML doesn't need line feeds / carriage returns to make sense, the maker of the files didn't put any in. However, they did put some information within the files that I need to remove before the files are processed into my production database (I am not legally allowed to include it). How would I open a file in SAS, remove all occurrences of a string and replace them with a blank, if the file content is too long to be held in a single variable? I tried the method below; however, the output file has nothing in it (it makes a blank output file). I do note from the log that the input record has a length of 79,293. This is much larger than the $32000. newvar variable, so I'm guessing the string being too long is what is causing the problem. The log has no other errors noted. Please note that I put the general "BorrowerName:" as the string I want to replace, but this can be replaced with anything. It's actually not important to the question.

data _null_;
  infile "c:\xmlfiles\inputxml.xml" end=eof lrecl=1000000;
  file "c:\xmlfiles\temp.xml" lrecl=1000000 RECFM=N;
  informat newvar $32000.;
  input;
  newvar=transtrn(_infile_,' BorrowerName:',trimn(''));
  put newvar;
run;

I also tried the same thing without making a new variable, just overwriting the _infile_ variable itself, like below. When I do this the new file has data; however, it is a 100% match to the original file. The transtrn doesn't actually do anything, as the output file still has the "BorrowerName:" string within it. It looks like that step was basically skipped.

data _null_;
  infile "c:\xmlfiles\inputxml.xml" end=eof lrecl=1000000;
  file "c:\xmlfiles\temp.xml" lrecl=1000000 RECFM=N;
  input;
  _infile_=transtrn(_infile_,' BorrowerName:',trimn(''));
  put _infile_;
run;

Please note that BOTH methods show a 100% clean log, and I am not sure what to do next. Thanks all and let me know if I can add any info to help!

Anotherdream · ‎12-12-2014

Hello everyone. Sorry for the long delay. This question was basically tabled due to other priorities, but I figured now would be a good time to open it back up.

Richard: My purpose in trying to dynamically determine a date informat is basically to save people who are new SAS programmers from having to figure out which date informat to use, because there are many of them and they can be confusing. This would also result in more code reusability. If we could always read a csv file with all variables as character strings, and then determine which variables are dates, this code could dynamically read dates in correctly. This would cut A LOT of coding time, as having to know the correct date informat for every variable is a HUGE time sink for new SAS users, especially those coming from SQL, where date conversion is done inherently for you (in SQL '2014-01-31 and 01/31/2014 and Jan 1 2014' are all the same thing and are converted by default for you; this is basically what I'm trying to replicate).

I agree with what you're saying and I think this is the path I'm going to have to go down. I will have to look at a data set, keep all the variables that are 'dates' (in string form) and then try to determine how the strings are set up. My next logical question would be: what is the best way to do this? I'm thinking I would loop over each variable and, for any one variable, keep the first observation that is non-null. Then attempt several date informats using the input function. Any informat that works and gives a correct numeric value will be the one used. (Note I realize this is a problem with European dates, 01/05/2014 vs 05/01/2014.) The problem I have next is what to do about string dates that are not in any kind of recognizable SAS format. Dates such as "jan 1 2014:04:03:023.203" or "2014-01-03 1:25 PM" or "2014-01-03 12:32:35.025 AM". I imagine we would have to create our own "informats" (pieces of code that would read in these values), test whether the values are converted correctly, and if so stop at that point? This one is a fun brain teaser. Thanks all for your continued help.
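The "try several informats and keep the first one that works" idea could look roughly like the sketch below. It is only an illustration: the data set and variable names are made up, the three informats and the order they are tried in are arbitrary choices, and nothing here resolves the 01/05/2014 versus 05/01/2014 ambiguity, since the first informat that succeeds simply wins.

```sas
data guess_informat;
  infile datalines truncover;
  input txt $40.;
  length informat_used $10;

  /* ?? keeps failed attempts out of the log and leaves _ERROR_ unset */
  d = input(txt, ?? yymmdd10.);
  if not missing(d) then informat_used = 'yymmdd10.';
  else do;
    d = input(txt, ?? mmddyy10.);
    if not missing(d) then informat_used = 'mmddyy10.';
    else do;
      d = input(txt, ?? date9.);
      if not missing(d) then informat_used = 'date9.';
    end;
  end;

  /* informat_used stays blank when none of the candidates could read the value */
  format d yymmdd10.;
  datalines;
2014-01-31
01/31/2014
31JAN2014
;
run;
```

A string with a time part, such as "2014-01-03 1:25 PM", would have only its first 10 characters read by yymmdd10., so the time would be dropped silently; values like that would still need their own handling.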

Anotherdream · ‎10-13-2014

Hey again @Reeza and @ballardw. Thanks very much for all of your help! I feel like I will have taken an undergraduate class in statistics by the end of this, and your patience is extremely appreciated. I did some research into the Bonferroni method and the problem of multiple comparisons in statistical testing, and I had a quick question. Doesn't the problem of multiple comparisons theoretically apply to any group of statistical tests, regardless of whether they came from the same population or not? Example: if you wanted to test 100 null hypotheses, all of which were unrelated, you could either sample one population and ask 100 different questions, or sample 100 independent populations and ask each one question. In both cases, each test has a 5% chance of a type I error (assuming 95% confidence), so on average you'd expect 5 of the tests to incorrectly reject a true null hypothesis, and the probability of at least 1 test incorrectly rejecting the null would be over 99.3% (I just used the Poisson distribution here, 1 - P(0 events given a mean of 5)). So from what I read, I think everything implies that the idea of the correction factor applies equally to one sample or to many independent samples. Is that a correct statement to make in your opinion? Thanks again
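The 99.3% figure can be checked with a few lines of arithmetic. The sketch below assumes all 100 null hypotheses are true and the tests are independent; the variable names are made up, and the Bonferroni line is included only as the per-test level that method would use, not as something taken from the post.

```sas
data _null_;
  k     = 100;    /* number of tests */
  alpha = 0.05;   /* per-test type I error rate */

  p_exact   = 1 - (1 - alpha)**k;   /* P(at least one false rejection), exact */
  p_poisson = 1 - exp(-k * alpha);  /* Poisson approximation with mean k*alpha = 5 */
  alpha_bonf = alpha / k;           /* Bonferroni per-test level for 5% overall */

  put p_exact= 8.5 p_poisson= 8.5 alpha_bonf=;
run;
```

The exact value is about 0.9941 and the Poisson approximation about 0.9933, so both agree with "over 99.3%".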

Anotherdream · ‎10-10-2014

Hey Reeza. So you're saying that if you take a group of 437 randomly selected people and ask them two questions, you cannot make two independent hypothesis tests (one for each question) at a predetermined confidence level? For example, suppose we came in with the following two independent null hypotheses. First null hypothesis: "less than 10% of the population likes tv". Second null hypothesis: "less than 21% of the population likes red hair". You're saying we shouldn't design a survey that asks people the two questions 1) "do you like tv" and 2) "do you like red hair", because we couldn't use the results of these questions to test both null hypotheses separately, each at a 95% confidence level. That's what the company is trying to do, but you're saying it's wrong, correct? I'll try to research the algorithm you suggested; however, can you provide any more insight so I can make sure I understood you correctly? Thanks very much!

Anotherdream · ‎10-10-2014

Ah, so your counterpoint is that we accepted a 5% chance of an error occurring on every test (which is very true). So by definition an error occurring on one test could affect the results of another test, but we have to be okay with that. I guess that's just hard for me to understand, for the following reason. Suppose you randomly sampled 437 loans for test #1 and 437 loans for test #2 out of 100,000,000 loans, and test #1 was a bad sample (one of the samples where the confidence interval doesn't contain the true mean), which will happen 5% of the time. That doesn't imply that test #2 is also an error, even if the two measures are perfectly correlated, because different loans will likely be sampled for test #1 and test #2. However, in our example they are both an error by definition (the same loans in both tests, with perfect correlation). To me this implies the methodology is wrong, because the chance of error is actually greater than 5% when you are sharing data between tests. Do you see where I am coming from?

Anotherdream · ‎10-10-2014

Hey BallardW. Your responses are helping very much, so thanks in advance. I do want to test the sentence "among all respondents YY percent dislike radio" and, independently, the sentence "among all respondents XX percent dislike television", and specifically not those who dislike radio who also dislike television (or vice versa). And I understand that if the sample is truly random then the sample mean approaches the true population mean as n gets larger. The part I'm struggling with is as follows. By definition a random sample can be 'abnormal', meaning it was an outlier. (Using a 95% confidence level to build a confidence interval from a random sample gives you a bounded estimate of the true population mean. However, it implies only 95% of all such intervals would actually contain the true population parameter, therefore it's quite possible we got one of the 5% samples.) Let's say for example our sample says that 33% of people like television, with a 95% confidence interval of 31-35%, but the TRUE value in the population was 36.8%. Then our sample that gave us 33% was one of the 2.5% abnormalities and the sample was just 'unlucky' (while still being random). Now assume that everyone who likes television also likes radio. Our sample estimate was 3.8 percentage points under the true value for television likers, therefore wouldn't we also underestimate the population proportion of people who like radio by 3.8 points (since the one test is perfectly correlated with the other)? Does the statement I made above make sense? That is the confusion I am struggling with. Thanks a bunch

Anotherdream · ‎10-10-2014

Got ya. I actually see your point about survey research. I guess my "gut reaction" was as follows. If 1,000 people are all asked 6 questions, you technically have 6 statistical tests, each of which has a sample size of 1,000. Ex: if one question was 'do you like television', you should be able to take the 1,000 people sampled and then say "X% of the population likes television, with such and such confidence interval". You could then do the same for the other questions being asked. However, if you got a 'bad sample' (sometimes the true population mean doesn't fall within your confidence interval, by the fundamental nature of statistics), and other questions are correlated with this question, then a problem might arise. Example: what if another question asked is "do you like radio"? Maybe there is a high correlation between not liking radio and not liking television, and maybe the 1,000 people you sampled were a collection of people that was not 'normal' for the population and didn't like television. Well then your question on radio is now also biased. This is what went through my brain. Am I wrong in an assumption I've made above? (By the way, thank you very much for your help.) Or perhaps my assumption is correct and this is a fundamental flaw in not performing 6 independent random samples, one for each survey question. You get a much smaller sample size, but accept some unreported correlation between the questions being asked?

Anotherdream · ‎10-10-2014

Hey Ballard. There is a very large cost to performing work on the loans sampled. Meaning the auditor has to audit each loan, and that costs hundreds if not thousands of dollars per loan. So yes, the minimum sample is of vast importance. I agree with question 2. I don't know if that matters in this case, however, which is kind of why I'm stuck. Can you explain your answer to question 5 a bit more? I know that of the 15 buckets, a loan WILL usually contain information for about 14-15 of them. So the density of sharing is VERY large. I "know" I can get fewer than 1,000 loans sampled (probably closer to 500 or so) if loans are allowed to fall into more than one bucket. However, my question is: "is it acceptable statistically to sample 500 loans and then push each of the 500 sampled into 15 different tests IF the original request was to perform a statistically significant random sample for 15 different criteria (buckets) on loans that can be put into each bucket, each of which will use the two-tailed normal with the above assumptions (which gives a required sample size of 437)?" So basically my question is: if I have 15 tests to perform, all of which require a sample size of 437, but 1 item (a "loan") can satisfy all 15 tests, can I simply sample 437 loans randomly from the population and then perform each of the 15 tests on these loans? I can do this from a process standpoint, it works conceptually and it will result in a much smaller sample size, but is there any problem statistically with doing this? Hope that makes sense

Anotherdream · ‎10-10-2014

Hello. I was presented with a 'sampling solution' by someone else and was asked if there was anything wrong with the solution. The issue is I have no idea. The solution "feels" wrong, but I cannot find any problem with it logically. Can anyone take a peek and let me know if there are any issues with the designed solution?

Background: Basically a company wants to perform 15+ statistical tests on a population, and for each test they want to use a specific distribution (let's assume normal) with a 5% assumed error rate, a 2% margin of error and 95% confidence (population size of 10,000, with the finite population correction factor). Each test will then require 437 loans. HOWEVER, one loan can be used in multiple tests. An example of a test would be Test 1 = "% of loans where the person's name was not misspelled" and Test 2 = "% of loans with balance under 50,000". One loan CAN have both a balance and a name, so it can fit into both buckets. However, not all loans will fit into all buckets (some loans might not have a person's name, for example). The company that came to the auditors sees that the number of loans required is 15 * 437, or 6,555 (15 independent random samples of the 437 noted above). They are only willing to do the work if the sample size is ~1,000 loans at maximum. To allow for this, the auditing company comes up with the following solution.

Solution that I question: First, take a random sample of X loans (437). Since one loan can fall into multiple buckets, look at each of the 437 loans and split them into the required buckets if they have the attributes associated with that bucket (ex: person name and loan balance). In our example above, the one loan would go into both the Name test and the Balance test. Therefore this loan is 1 out of the 437 needed in EACH bucket. Then, per bucket, once you get 437 loans you are done. However, many of the 15 buckets will not have 437 loans, because it is highly likely that fewer than 437 of the 437 selected loans will apply to a particular test bucket. For example, maybe only 200 loans have a recorded person's name, so the other 237 cannot be used in the Test 1 bucket. At this point, find out how many loans you need to fill out 437 per test, and simply sample that many more loans per test. Meaning if Test 1 was undersampled by 237 loans, sample 237 more loans from the population that meet your requirements for Test 1. Then repeat the process for Test 2, etc. By doing this, your original sample of 437 loans can be used across multiple tests and you would only fill in missing loans. In addition, the company says each test is still a statistically random sample and would hold up to third-party scrutiny.

Question 1) What mistakes (if any) did the auditing company make?
Question 2) Is the sample a simple random sample? Is it a random sample at all?
Question 3) Is there anything mistaken with this methodology, assuming the company just needs a random sample and not a simple random sample?
Question 4) Do the associated samples still obtain the required 95% confidence and 2% margin of error for each of the 15 tests, even though loans were shared between tests?
Question 5) If the solution given is incorrect, is there any solution that will allow for 15 tests at the required specifications with a total of fewer than 1,000 loans?

Please let me know if my question did not make sense!
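For reference, the 437 figure can be reproduced with the usual normal-approximation sample size for a proportion followed by a finite population correction. The post does not say which formula was used, so treat the sketch below as one plausible reconstruction; the variable names are made up.

```sas
data _null_;
  z        = probit(0.975);  /* two-sided 95% confidence, about 1.96 */
  p        = 0.05;           /* assumed error rate */
  e        = 0.02;           /* margin of error */
  pop_size = 10000;          /* population size */

  n0    = z**2 * p * (1 - p) / e**2;               /* infinite-population size, about 456 */
  n_fpc = ceil(n0 / (1 + (n0 - 1) / pop_size));    /* with finite population correction */
  total = 15 * n_fpc;                              /* 15 fully independent samples */

  put n0= 8.2 n_fpc= total=;
run;
```

Under these assumptions n_fpc comes out at 437 and total at 6,555, which is the number the ~1,000-loan budget has to be weighed against.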

Anotherdream · ‎10-02-2014

Ah astounding you beat me right to it! I also think what you are asking for is exactly accomplished by Astounding... I actually came to the exact same solution, a little late tho.

Online Status	Offline
Date Last Visited	‎11-29-2018 11:29 AM

Re: How to get rid of negative in front of zero

Re: How to get rid of negative in front of zero

Re: consecutive flagging

Re: Using retain an multiply down a collumn

Re: How do you run a stored procedure using PROC SQL?

Re: How do you run a stored procedure using PROC SQL?

Re: X command and call system not working correctly

Re: X command and call system not working correctly

Re: X command and call system not working correctly

Re: X command and call system not working correctly

Re: Macro paramater has special characters in it '('

Re: Urgent question

Get data into sql server more efficiently?

Re: Date from SAS showing 1/1/1960

Re: Urgent question

Re: Proc SQL Delete, using a join

Locking of SAS dataset from multiple processes

Re: Date informat MMDDYY10. incorrect?

Date informat MMDDYY10. incorrect?

Re: Replace all occurance of string in File with string > 32,000

Re: Replace all occurance of string in File with string > 32,000

Replace all occurance of string in File with string > 32,000

Re: Dynamically determine date informat from file

Re: Question on Sampling broken into groups After Sample

Re: Question on Sampling broken into groups After Sample

Re: Question on Sampling broken into groups After Sample

Re: Question on Sampling broken into groups After Sample

Re: Question on Sampling broken into groups After Sample

Re: Question on Sampling broken into groups After Sample

Question on Sampling broken into groups After Sample

Re: How to create year series?