
Why do 30% of all positive integers start with 1?

 
 
g day
 
Reply Mon 29 Mar, 2004 12:09 am
A test for the number theorists that may lurk here!

Given a positive integer may start with 1, 2, 3, ... or 9, you might think there'd be a 1/9 (approximately 11%) chance that any random positive integer has a first digit of 1, or that there would be just as many numbers starting with 9 as starting with 1. But numbers with 9 as a first digit are the rarest, occurring only 4.6% of the time.

PS

Something that insurance actuaries and forensic auditors were slow to catch onto as they looked for evidence that folk were cooking the books. But eventually number theorists told them of this result and why it occurs, and now they have a brand new series of tests for spotting cases of fraud.

PPS

The frequencies of the first digit across all positive integers are, to one decimal place:

1 as the starting digit occurs 30.1% of the time in all positive integers
2 as the starting digit occurs 17.6%
3 as the starting digit occurs 12.5%
4 as the starting digit occurs 9.7%
5 as the starting digit occurs 7.9%
6 as the starting digit occurs 6.7%
7 as the starting digit occurs 5.8%
8 as the starting digit occurs 5.1%
9 as the starting digit occurs 4.6%

You'd see this trend if you correctly wrote a C program like:

Code:
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The full digits of the five big primes were elided in the original post;
   these stand-ins are large primes that fit in an unsigned long long
   (they are NOT the originals). */
#define Prime1 18446744073709551557ULL
#define Prime2 9223372036854775783ULL
#define Prime3 2305843009213693951ULL
#define Prime4 4294967291ULL
#define Prime5 2147483647ULL

unsigned long long i = 0ULL, counter[10] = {0ULL};

unsigned long long very_rand(void) /* even more random number generator */
{
    return rand() % Prime1 % Prime2 % Prime3 % Prime4 % Prime5;
}

void print_results(int sig)
{
    int j;

    (void)sig;
    signal(SIGINT, SIG_IGN); /* ignore stacked interrupts */

    printf("\nResults\n");
    for (j = 0; j < 10; j++)
        printf(" Digit %1d occurs with frequency %5.2f%%\n",
               j, i ? 100.0 * counter[j] / i : 0.0);

    signal(SIGINT, print_results); /* re-arm this handler */
}

int main(void)
{
    char buffer[64];

    srand((unsigned)time(NULL)); /* seed the random number generator */

    signal(SIGINT, print_results); /* if user presses CTRL-C, print interim results */

    for (i = 0ULL; i < ULLONG_MAX; i++) {
        sprintf(buffer, "%llu", very_rand()); /* count the first digit of a random number */
        counter[buffer[0] - '0']++;
    }

    print_results(0); /* print out your results */
    return 0;
}


Note very_rand() returns an unsigned long that has been modded (%) with five big primes to make it extremely random! The modding (taking the remainder from an integer division) makes the random number act as if it was generated randomly from a pool of about 4 billion raised to the fifth power!
Type: Discussion • Score: 1 • Views: 3,778 • Replies: 53

 
Terry
 
  1  
Reply Mon 29 Mar, 2004 12:20 am
if you are talking about a subset of the positive integers, such as those between 1 and 150, yes, more will start with 1 than with any other digit. The same is true for any set of integers that is limited in value.
0 Replies
 
SCoates
 
  1  
Reply Mon 29 Mar, 2004 12:24 am
I'm curious for an explanation. I'm missing the logic. I understand Terry's point, but not yours, g__day. What set of numbers are we looking at?
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 01:24 am
An infinite set; it works best with a very, very large (tending to infinite) set of positive integers, actually!

If you sample 1,000 random positive integers you will see this trend developing very quickly - converging to these values with very great accuracy.

PS

I found it hard to swallow when I first heard it (1982), so I went to the Pure Maths department to get a really good random number generator - simply take a decent random number generator that creates very big numbers and mod it by 5 very big primes (in decreasing order). This trick gives you a random number drawn from a range the size of the product of all 5 primes! Generate a lot of random numbers this way and start counting the first digit of each, say, 30-digit number that appears - the trend I gave you will start appearing within a thousand numbers. By 10,000 numbers it will be accurate to 3 decimal places!

PPS

The proof that it doesn't converge to 1/9 for every digit is to suppose it does. Then you start counting numbers in tranches (groups) of, say, 100 at a time, and you get the expected convergence pretty quickly - for a while. But if you start counting in tranches of 50 numbers at a time you see a very different pattern of convergence, and if you count by 2s or 3s or 491s or groups of 777 etc. you get wildly different patterns of convergence.

Only a logarithmic grouping shows stable convergence.

Number frequency for our (HINT) base 10 number system, which starts at (bigger HINT) 0 and spreads out to infinity, shows that the expected likelihood of k being the first digit of any truly random integer is:

Probability k is the first digit of any positive integer = log((k+1)/k), as per the table I posted above. So digit 1 appears log((1+1)/1) = log 2 = 30.1...% of the time.

Think of it like the odometer of a car - each reading spends the longest stretch with a leading 1 before it flips over to the next bigger sequence.
0 Replies
 
Thomas
 
  1  
Reply Mon 29 Mar, 2004 07:33 am
I agree with Terry. The frequency of numbers that start with 1 depends on the maximum value allowed for the random numbers. For example, if the range goes from 0 to 199, more than half of all numbers will start with a one. If it goes from 0 to 999, every first digit is equally likely. There is always _some_ maximum number, and depending on what it is, the result will be somewhere in between -- as it has turned out in your experiment.

Note that the logic of this argument doesn't change if you increase the size of the maximum possible number by an order of magnitude. If your maximum is 1999, you have over 50% ones again. If it's 9999, you have a constant distribution. Your routine ensures that your maximum random number is many orders of magnitude large, but that doesn't matter in this context.
0 Replies
 
Craven de Kere
 
  1  
Reply Mon 29 Mar, 2004 07:56 am
Both Thomas and Terry miss the point through use of calculated maximums instead of random ones.

With a randomly defined maximum g__day is right.

Let's use a very easy example to make it clear. We'll base it off of Thomas's easiest set.

Picking a random number up to 9999 as the maximum will, in the overwhelming majority of cases, result in a maximum for which the spirit of g__day's point (not the exact 30%, because we are using Thomas's very limited set) is borne out.

Terry and Thomas fail to consider the aleatory ceilings themselves.

The whole point of this stat is that across diverse enough datasets the axiom is rendered true.

To intentionally pick out the ceilings that disprove the axiom is silly.

Yes, as Thomas states, it "depends on the maximum". But define the maximum randomly (instead of intentionally defining it to dispute the law) and it will probably (a quick calculation says a 90% chance) result in a maximum that makes the spirit of the axiom (one being more probable than the others) true.

Over enough datasets it comes to about 30%
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 07:59 am
Check the thread title - "Why do ALL positive integers..."

Not why a finite subset of positive integers, counted up to some random ceiling, does. You're answering the wrong question!

You need to show that for all positive integers (an infinite set) the first-digit frequency within this infinite set actually converges, and what exactly it converges to for each digit from 1 to 9. I have given you the formula but not the complete working.

And Thomas, your logic is partially correct, but you've only considered some finite set of numbers - which avoids the core problem! If you calculate the distribution of the first digits in groups of N as N heads to infinity, you get a very different answer than if you count the first digits in groups of 3N numbers, or 2N, or 17N, as N heads towards infinity.

By your POV the count size should potentially vary my final result significantly. I have run this program for a week of mainframe computer time and the results match number theory exactly to 12 decimal places.

Look at my random number generator again. ULONG_MAX is 4,294,967,295, modded by similar sized primes - I am generating a number that should be random within a range of about 1.4615016356294910843912741403576e+48 (remember there are only 10 ^ 23 stars in the entire Universe), so it's not a bad set for a program that a PC can run. A supercomputer in gigaflops might only do a hundred billion calculations a second * 80K seconds in a day * 7 days in a week, or order 10 ^ 16 samples - the barest sliver of my set of 10 ^ 48 numbers.

If you run this test for 10 minutes you start seeing convergence precisely to theory - check back in an hour, then two hours, then a day, then two days, and all you see is changes in the 4th, 5th, and 6th decimal places the longer you let it run!


PS

I was expecting someone to quote Benford's Law on the logarithmic distribution of numbers, apparently first discovered by astronomer/mathematician S. Newcomb in 1881!

http://plus.maths.org/issue9/features/benford/
0 Replies
 
Thomas
 
  1  
Reply Mon 29 Mar, 2004 08:21 am
Craven de Kere wrote:
Both Thomas and Terry miss the point through use of calculated maximums instead of random ones. With a randomly defined maximum g__day is right. Terry and Thomas fail to consider the aleatory ceilings themselves.


Actually I wrote:
There is always _some_ maximum number, and depending on what it is, it will be somewhere in between -- as it has turned out in your experiment.


Where is the disagreement? If every maximum number produces something between an even distribution and one in which fifty-odd percent of all numbers start with 1, this is consistent with g_day's observation, whatever maximum number he uses, and whether he averages over randomly chosen maxima or not. Which was my point, and I think it was also Terry's.
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 08:24 am
As I said - this doesn't even show there will be convergence to any number - which is the necessary first part of the proof as your maximum grows to infinity.

The actual proof shows you need exactly a logarithmic size bucket as you're counting towards infinity to ever get convergence in an infinite series sum or proof by induction. And it's the logarithmic size bucket that shows how the log rule applies!
0 Replies
 
Thomas
 
  1  
Reply Mon 29 Mar, 2004 08:28 am
g__day wrote:
Check the thread title - "Why do ALL positive integers..."

You didn't run a simulation on all positive integers -- however large your sample is, it is vanishingly tiny compared to the set of all positive integers. The computer program implies that you stop counting at some very large number. But that neglects an essential feature of the set of all integers: you never stop counting. It's what they teach you in algebra 101: Be very, very careful with generalizations about infinite sets!
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 08:37 am
Ah, but I already had the mathematical proof! I wanted to see if a simple program would validate this counter-intuitive theorem and its proof - and it did!

I asked if anyone knew what to expect and why - so far folk have commented mostly off topic without explaining the why, or pointing out that there is a mathematical theorem - proved - that shows this unlikely law is true!

The proof stands with or without my little program.

Be very, very careful arguing mathematics with a mathematician Smile

You made a point above about this being the case with constant or normal distributions - but this is incorrect - it's the case only with probabilistic distributions.

Quote:
Note that for numbers drawn from many distributions, for example IQ scores, human heights or other variables following normal distributions, the law is not valid. However, if one "mixes" numbers from those distributions, for example by taking numbers from newspaper articles, Benford's law reappears. This can be proven mathematically: if one repeatedly "randomly" chooses a probability distribution and then randomly chooses a number according to that distribution, the resulting list of numbers will obey Benford's law.


from http://en.wikipedia.org/wiki/Benford%27s_law

You should realise that a high statistical likelihood can be achieved if you simply hit the generators of a set, not the entire infinite set. If a systematic pattern applies to the generator sequence it holds for the entire set - otherwise programs like Cayley couldn't have hoped to model groups with 10 ^ 68 elements accurately, as they have for over 25 years now!
0 Replies
 
Thomas
 
  1  
Reply Mon 29 Mar, 2004 08:51 am
You have given a proof that it doesn't converge to an even distribution, which is true. You have not given any proof that it converges towards your distribution. Which isn't surprising, because the series doesn't converge at all as you increase the maximum random number allowed. Given that it doesn't converge, you have no one distribution that describes the whole set of the integers.
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 09:01 am
Wrong again! Sorry if I sound offensive - but remember I asked if folk here knew the theorem and its results.

I mentioned an outline of the proof - I haven't given it - that would take 3 foolscap pages!

I mentioned it doesn't converge with a linear size group that heads towards infinity - but it absolutely does converge with a logarithmic size number group - you seem to miss that critical point (and no, I didn't cite the proof, I merely outlined how it could be shown positively to converge).

To put you out of your misery, look up Ted Hill - his 1996 proof of this theorem is about the best I have seen. But my lecturer Andrew Glass had a decent (but little recognised) proof of this law way earlier, in 1982!
0 Replies
 
Thomas
 
  1  
Reply Mon 29 Mar, 2004 09:26 am
From your source:

Wikipedia wrote:
Benford's law, also called the first digit law, states that in lists of numbers from many real-life sources of data, the leading digit 1 occurs much more often than the others (namely about 30% of the time).


The key here is "from many real-life sources of data". In real life sources of data, we know there is a limit to the maximum possible number, but we don't know what it is. An appropriately weighted average of the maximum numbers, with the weight designed to reflect our ignorance about what the maximum is, will give you the distribution you outlined in your initial posts. No disagreement so far.

But if you generalize this to _all_ integers, without any constraint, the distributions just don't converge towards anything as you raise the maximum allowed number. The sequence of distributions you get as you increase this maximum one by one just isn't a Cauchy sequence -- not under any standard metric for comparing distributions. Is it your opinion that it is? In that case, what is the metric you use to compare those distributions?
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 09:42 am
The "many real life sources" wording simply reflects empirical evidence (versus necessary and sufficient proof). The theorem has been exhaustively proved for years now. I politely suggest you read it rather than making consistent stabs in the wrong direction:

Thomas wrote:
But if you generalize this to _all_ integers, without any constraint, the distributions just don't converge towards anything as you raise the maximum allowed number.


Sigh, the distributions do converge when counted logarithmically as you head to infinity - this is how the proof works. I simply don't know how to say that to you in any simpler way! This is a fact - not an assumption - it's called Hill's Law!

The proof is about second year Pure Maths level - of about the same difficulty as showing any number can be factorised into only one unique multiplication of primes. Statistics perhaps surprisingly doesn't even come into the proof.

There are no distribution curves, no F or p tests; it's an absolute proof as part of Number Theory in Pure Mathematics! Or more correctly, "Hill's Law is really the 'Distribution of Distributions'" http://www.fortunecity.com/emachines/e11/86/one.html

You want to see a Cauchy sequence - count by groups of powers towards infinity - let the group size follow even the simplest power series and you see rapid convergence - showing convergence is occurring, and around a logarithmic base.

I can't understand why you'd make this assumption when there's an absolute mathematical proof.
0 Replies
 
Craven de Kere
 
  1  
Reply Mon 29 Mar, 2004 11:04 am
Thomas wrote:

Where is the disagreement? If every maximum number produces something between an even distribution and one in which fifty-odd percent of all numbers start with 1, this is consistent with g_day's observation, whatever maximum number he uses, and whether he averages over randomly chosen maxima or not. Which was my point, and I think it was also Terry's.


Thomas, I'll try my best to explain it and then I'll let it go, because it's a very very simple bit of math and, well, if I can't do it justice I'll have to give up. I have no formal math training and there are others who can explain it better.

1) The point of this lil rule is that the average dataset will bear it out. Not one selected subset or another.

It's not specific sets but the average set.

2) Both yourself and Terry say that if the ceiling is X 1 is more prevalent and if the ceiling is Y one is just as common as all the others.

Thing is, with a random ceiling, there's a 90% chance that 1 will be more common than the others.

That's why, on the whole this stands. And why pointing out that there can be sets that have 1 just as common as all the others is really pointless.

Both Terry and yourself are correct, you can indeed find specific datasets where 1 is no more prevalent than any other number.

But those sets constitute 10% of the available datasets, and that's why this axiom is true.

Both g_day and myself are saying the same thing. You guys missed the point even while your statements were not false.
0 Replies
 
patiodog
 
  1  
Reply Mon 29 Mar, 2004 04:33 pm
Thomas's recent objection, which I second, is to do with the initial statement of this thread: the fact that the numbers are drawn from real-world data sets is significant. It must be significant, if it's on wikipedia, right? Wink

Measured parameters are human constructs, and every data set reports on parameters. It doesn't surprise me at all that, in general, the way in which we tend to measure things skews the findings of this study. In describing time-of-day the way we do in the U.S., for instance, fully a third of all possible times begin with one. It's a product of the system of measurement we use. (Where's that thread on path dependency?)

Which is not to challenge the usefulness of the theorem by any stretch of the imagination. Just that taking integers randomly from real data sets and taking integers randomly from all possible positive integers are very different actions.
0 Replies
 
g day
 
  1  
Reply Mon 29 Mar, 2004 08:51 pm
What are we measuring that is skewing what study? Could you be more precise?

How can we measure anything other than real world data for empirical studies?

The theorem is proved - case closed!
0 Replies
 
Thomas
 
  1  
Reply Tue 30 Mar, 2004 02:07 am
g__day wrote:
What are we measuring that is skewing what study? Could you be more concise?

We implicitly assume that the data we are measuring is scale invariant. This is a good approximation for many data sets, and the assumption leads to the distribution of numerical values you have given in your initial post. But it's nevertheless an assumption we make about the data, and it's not necessarily correct for any given dataset. Since Benford's law depends on this assumption, you overstate your case when you claim that this is a law about integer numbers and nothing else. For what it's worth, an Amazon search for "Benford's law" yields several statistics books, but no number theory books. This suggests that the law is not about numbers per se, but about real world data measured in terms of numbers.
0 Replies
 
kitchenpete
 
  1  
Reply Tue 30 Mar, 2004 06:24 am
Bookmarking - as I have a professional interest.

For once, I've come across something here which (a) I don't fully understand and (b) may add to my work.

I'm a forensic accountant and I have heard of Benford's law being applied to searches for numbers fraudulently generated to deceive manual review. Certain forensic software, used to detect suspicious transactions, depends on this principle.

I'll need to spend more time understanding this before I can add to the debate, however!
0 Replies
 
 
