Statistical Formulas for Programmers (2013) (evanmiller.org)

240 points by Tomte 112 days ago | 39 comments

armanboyaci 111 days ago [-]

>Being able to apply statistics is like having a secret superpower.

I totally agree with this sentence. But if you ask for my opinion, merely knowing a list of statistical formulas is not very helpful. Most of the time, people don't remember the underlying assumptions, so there is a fair chance they will use the formulas in inappropriate situations.

I recommend watching these two YouTube videos. The presenters advocate using simulation/bootstrapping/shuffling methods instead of memorizing formulas.

Jake Vanderplas - Statistics for Hackers https://www.youtube.com/watch?v=Iq9DzN6mvYA

John Rauser - Statistics Without the Agonizing Pain https://www.youtube.com/watch?v=5Dnw46eC-0o

mont_tag 111 days ago [-]

IIRC, Jake's video inspired the example section in the Python random module docs. It takes about 15 minutes with those examples to learn how to put Jake's ideas into practice. https://docs.python.org/3/library/random.html#examples .
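
As a rough illustration of the bootstrapping idea mentioned above, here is a minimal sketch of a percentile bootstrap for a mean, using only the standard library. The sample values, the 90% level, the resample count, and the function name `bootstrap_mean_ci` are assumptions made for this example; none of them come from the videos or the docs.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=10_000, confidence=0.90, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Resample with replacement, same size as the data, and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    low = means[int(tail * n_resamples)]
    high = means[int((1 - tail) * n_resamples) - 1]
    return low, high

# Made-up sample, for illustration only.
sample = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
low, high = bootstrap_mean_ci(sample)
print(f"sample mean {statistics.fmean(sample):.1f}, "
      f"90% interval roughly ({low:.1f}, {high:.1f})")
```

The interval comes straight from the spread of the resampled means, so no formula for the standard error is needed.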

Terr_ 111 days ago [-]

> The presenters advocate using simulation/bootstrapping/shuffling methods instead of memorizing formulas.

Yeah, I often find it much easier to make a little Python script to do 10,000 Monte Carlo trials, as opposed to "properly" working things out and then not even being confident enough in my result anyway.
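
A script of the kind described here might look like the following: a minimal sketch of a shuffling (permutation) test for a difference in means, run for 10,000 trials. The two groups are invented numbers and `permutation_p_value` is a hypothetical helper name, not something from the talks.

```python
import random
import statistics

def permutation_p_value(a, b, n_trials=10_000, seed=0):
    """Share of random relabelings whose mean gap is at least the observed gap."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(pooled)  # pretend the group labels carry no information
        gap = abs(statistics.fmean(pooled[:len(a)])
                  - statistics.fmean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return hits / n_trials

# Invented measurements for two groups, for illustration only.
group_a = [31, 28, 35, 40, 33, 29, 37]
group_b = [26, 30, 24, 27, 32, 25, 28]
p = permutation_p_value(group_a, group_b)
print(f"estimated two-sided p-value: {p:.3f}")
```

The underlying assumption still matters: shuffling is only meaningful if the observations would be exchangeable between groups when there is no real difference.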

asdff 111 days ago [-]

It makes no sense to memorize the formulas when most any statistical formula you'd actually use has a package or three that can run it in a way that's already probably reasonably benchmarked and not prone to you fat fingering some error rolling your own.

dapperdrake 111 days ago [-]

Assumptions are the part that matters.

asdff 108 days ago [-]

What, assuming the package is correct? Sure, it could be wrong in its implementation, but if one doesn't trust that the community of data science nerds would have noticed the storied louvain package (or whatever else) being incorrect for years, one could simulate expected results and compare them against the tool's output.

wodenokoto 111 days ago [-]

While I really liked the video by vanderplas, I did return to it after a year or two and paused every time he presented a problem and then tried to solve it using for loops and thinking hard.

I barely succeeded in any of it. So at that point just look up the formula instead of bootstrapping.

I’ll give the second one a shot too.

mcphage 112 days ago [-]

The article "How Not To Sort By Average Rating" by the same author (and also linked in this article) is really good, and definitely changed my thinking about any kind of "sort by best to worst" list: https://www.evanmiller.org/how-not-to-sort-by-average-rating...

vismit2000 111 days ago [-]

Covered by 3b1b some years ago: https://youtu.be/8idr1WZ1A7Q

mcphage 110 days ago [-]

Those were good, thanks! It's a shame he never released part 3, though :-(

Joker_vD 112 days ago [-]

Hm. I wonder how well would "Score = [Positive ratings] / ([Total ratings] + 1)" fare.

mcphage 111 days ago [-]

It'll help some, but I don't think enough: it's way, way easier to get a good score on a small number of ratings than on a large number. And the number of ratings spans several orders of magnitude. For instance, on Amazon you can do a search and get back products with fewer than 10 ratings alongside products with over 10,000 ratings.
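
To make the comparison concrete, here is a small sketch that scores three made-up products with the "+1" formula proposed above and with the lower bound of a Wilson score interval, which, as far as I recall, is what the linked article recommends. The rating counts and the choice of z = 1.96 (roughly 95% confidence) are assumptions for illustration.

```python
import math

def plus_one_score(positive, total):
    """The proposal from this thread: positive / (total + 1)."""
    return positive / (total + 1)

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the positive fraction."""
    if total == 0:
        return 0.0
    p = positive / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom

# (positive ratings, total ratings) for three invented products.
for positive, total in [(5, 5), (9, 10), (9_000, 10_000)]:
    print(f"{positive}/{total}: "
          f"plus-one {plus_one_score(positive, total):.3f}, "
          f"Wilson lower bound {wilson_lower_bound(positive, total):.3f}")
```

With these numbers the "+1" score still ranks a perfect 5-of-5 above 9-of-10, while the Wilson lower bound ranks it below, and both put the product with 10,000 ratings on top.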

Terr_ 111 days ago [-]

I think I avoid imposter syndrome in some areas, but Not Enough Real Math is definitely a weak spot.

When people start talking about eigenvalues, I'm just a business-rule caveman with a little discrete-math unga bunga.

This kind of statistical stuff falls somewhere in-between.

MrLeap 111 days ago [-]

Eigenvalues are a topic in linear algebra. They're coefficients you can put in front of some matrices or vectors that change their magnitude.

Linear Algebra was the most useful and fun math class I took in college. Highly recommended if you ever wanna do gamedev. It's more approachable than you probably think.

For me, when people start talking about differential equations, specifically the symbols you'll see in a wikipedia article about Navier Stokes equations, I'm just a business-rule caveman with a little linear algebra zug zug.

vector_spaces 111 days ago [-]

> Eigenvalues are a topic in linear algebra. They're coefficients you can put in front of some matrices or vectors that change their magnitude.

Multiplying a vector or a matrix by any nonunit scalar changes its magnitude (hence scalar!! i.e. something that scales). Not all scalars are eigenvalues. So this isn't quite right.

Think about it geometrically instead. A linear operator transforms a space. Geometrically the transformation can be one or more of stretching, compressing, or rotating (taking shearing to be a kind of stretching). The directions in the space which remain the same other than having been scaled by some factor are the eigenvectors of the transformation. The scaling factor of one of those such directions is its eigenvalue.
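
The geometric description can be checked numerically. Below is a minimal pure-Python sketch for the 2-D case: it applies a matrix to a vector and reports the scaling factor only when the direction is unchanged. The matrix `A` and the helper names are chosen for illustration and are not from the comment.

```python
def apply(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def scaling_factor(matrix, vector):
    """Return the eigenvalue if the nonzero 2-D `vector` is only scaled, else None."""
    image = apply(matrix, vector)
    # In 2-D, a zero cross product means image and vector lie on the same line.
    cross = image[0] * vector[1] - image[1] * vector[0]
    if cross != 0:
        return None
    i = 0 if vector[0] != 0 else 1  # pick a nonzero component to divide by
    return image[i] / vector[i]

A = [[2, 1],
     [1, 2]]

for v in ([1, 1], [1, -1], [1, 0]):
    print(v, "->", apply(A, v), "scaling factor:", scaling_factor(A, v))
```

For this matrix the directions [1, 1] and [1, -1] are only scaled, so they are eigenvectors, while [1, 0] is turned to point somewhere else and is not.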

Elucalidavah 111 days ago [-]

> when people start talking about differential equations

It's not like you are going to solve those analytically.

Implement a couple of numerical solvers for things like Navier–Stokes and you'll see that differential equations are just obscenely compressed code.
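
As a much smaller stand-in for that exercise, here is a sketch of the explicit Euler method applied to y' = -y, an equation chosen only because its exact solution is known. It is nowhere near a Navier–Stokes solver; the function name `euler`, the step count, and the test equation are assumptions for illustration.

```python
import math

def euler(f, y0, t0, t1, steps):
    """Explicit Euler: advance y' = f(t, y) from t0 to t1 in equal steps."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y += h * f(t, y)  # follow the current slope for one small step
        t += h
    return y

# y' = -y with y(0) = 1 has the exact solution y(t) = exp(-t).
approx = euler(lambda t, y: -y, y0=1.0, t0=0.0, t1=1.0, steps=1000)
exact = math.exp(-1.0)
print(f"Euler: {approx:.5f}, exact: {exact:.5f}")
```

The one-line equation unpacks into a loop: the "compression" is that the notation leaves the stepping implicit.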

roenxi 111 days ago [-]

Studying more statistics is often clever. Although in this case Mr. Miller led with the most important part: if there are two numbers (like 7 and 5) in a statistical context, they might be the same number. That throws a lot of people into such a tailspin that they never really recover after making the obvious mistake of thinking they are different.

The powerful heuristic for the less technically inclined is to say "well, this evidence isn't conclusive until someone who knows statistics has tried to shoot it down".

cess11 111 days ago [-]

I'm probably like you, haven't taken the time as an adult to really rub these things in, but I find it helpful to sometimes throw up a P5.js and do some high school or early uni exercises graphically. JS can be a trivial language and doesn't get in the way and it gives a draw loop for free, including a global frame count you can pull from when you need some integers to juggle.

For me it teaches differently than trying to follow video lectures.

mportela 111 days ago [-]

Then definitely watch 3Blue1Brown's video on eigenvalues and eigenvectors. [1] That's when it clicked for me! His entire series on Linear Algebra is incredibly well produced.

[1] https://youtube.com/watch?v=PFDu9oVAE-g

bob1029 111 days ago [-]

I'd add z-score (standard score) to your tool belt. The ability to identify or reject outliers is invaluable when trying to stabilize real-world business processes.

For example, if you are building heuristics that determine if a customer's bank account is "reasonably active", you may not want to consider very small transactions unless that is typical activity for a given customer.
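
A minimal sketch of that kind of z-score filter, using the standard library. The transaction amounts and the cutoff of 2 standard deviations are assumptions picked for the example, not a recommendation.

```python
import statistics

def z_scores(values):
    """Standard score of each value: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Invented transaction amounts for one customer; 0.01 is a tiny test payment.
amounts = [52.0, 48.5, 61.0, 55.2, 49.9, 58.3, 0.01, 53.7]

# Hypothetical cutoff: treat anything more than 2 standard deviations out as atypical.
outliers = [a for a, z in zip(amounts, z_scores(amounts)) if abs(z) > 2]
print("atypical for this customer:", outliers)
```

One caveat: the outlier itself inflates the standard deviation, so with small samples a single extreme value can only reach a limited z-score, and the cutoff needs to be chosen with that in mind.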

mont_tag 111 days ago [-]

Another simple tool that gives you superpowers is a Q-Q plot. https://en.wikipedia.org/wiki/Q–Q_plot

Toenex 111 days ago [-]

Personally always loved me a Bland-Altman plot (https://en.m.wikipedia.org/wiki/Bland-Altman_plot)

TheHideout 111 days ago [-]

FYI, using this stuff without understanding Test Power is dangerous and can lead to making bad decisions with false confidence.

gpderetta 111 days ago [-]

Also: "Common statistical tests are linear models (or: how to teach stats)"[1]. Also also, bootstrapping is a superpower.

[1] https://lindeloev.github.io/tests-as-linear/

snitzr 111 days ago [-]

Why isn't 7 greater than 5?

DeepSeaTortoise 111 days ago [-]

Statistics gave him the superpower of predicting the future:

https://knowyourmeme.com/memes/fight-club-57-movie

glitchc 111 days ago [-]

That's hilarious!

avg_dev 111 days ago [-]

yes, and informative. i was looking at the article and i thought everything made sense but i could tell i was missing something about this line...

senkora 111 days ago [-]

Treat them as two draws from possibly different, independent distributions.

The question is whether the distribution that drew 7 “stochastically dominates” the distribution that drew 5. You may or may not be able to conclude that based on the available data and assumptions about the distributions.

https://en.m.wikipedia.org/wiki/Stochastic_dominance

For example, if you assume that the two distributions are approximately normal with very small variances, then you can probably conclude that the distribution that drew 7 stochastically dominates the distribution that drew 5. But if you assume that the variances are large, then you probably can’t conclude that.
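
A quick simulation shows how much the variance assumption matters. This sketch assumes, purely for illustration, that the true means really are 7 and 5, that both distributions are normal with the same standard deviation, and that draws are independent; `prob_order_flips` is a hypothetical helper name.

```python
import random

def prob_order_flips(mean_hi, mean_lo, sd, n_trials=20_000, seed=0):
    """Estimate how often a draw from the lower-mean normal exceeds one from the higher-mean normal."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    flips = sum(
        rng.gauss(mean_lo, sd) > rng.gauss(mean_hi, sd)
        for _ in range(n_trials)
    )
    return flips / n_trials

small = prob_order_flips(7, 5, sd=0.1)
large = prob_order_flips(7, 5, sd=5.0)
print(f"sd=0.1: order flips in {small:.1%} of trials")
print(f"sd=5.0: order flips in {large:.1%} of trials")
```

With a tiny standard deviation the order essentially never flips; with a large one it flips in a sizeable share of trials, which is why a single 7 and a single 5 cannot settle the question on their own.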

dlivingston 111 days ago [-]

Sounds like you should read the article. :)

Kidding. The idea is that there may be some statistical uncertainty associated with the measurement of 7, and also of 5, and so the "real" value of 7 may actually be less than the "real" value of 5.

cmdrmac 111 days ago [-]

This is certainly a very useful resource - even for a seasoned data scientist!

curtisszmania 111 days ago [-]

[dead]

hmcamp 111 days ago [-]

[flagged]

extrememacaroni 112 days ago [-]

[flagged]

Jtsummers 112 days ago [-]

Converting the math in here to code isn't very hard.

Hussell 111 days ago [-]

The statisticians have a bunch of tricks to transform the formulas into more-easily computable forms, e.g. calculate both the average and the standard deviation in a single pass through the data instead of one pass to calculate the average and a second to calculate the standard deviation. Converting the math in here to efficient code isn't very easy.

glitchc 111 days ago [-]

You mean Welford's algorithm. Since code was requested:

https://jonisalonen.com/2013/deriving-welfords-method-for-co...
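
For readers who want the single-pass idea in code, here is a minimal sketch of Welford's method for the mean and sample standard deviation. The function name and the test data are assumptions for illustration; the cross-check against the `statistics` module is only there to show that the two approaches agree.

```python
import math
import statistics

def mean_and_stdev(values):
    """Single-pass (Welford) mean and sample standard deviation."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean         # deviation from the old mean
        mean += delta / count    # update the mean
        m2 += delta * (x - mean) # old deviation times new deviation
    if count < 2:
        raise ValueError("need at least two values")
    return mean, math.sqrt(m2 / (count - 1))

# Made-up data; any iterable works, including a generator read only once.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd = mean_and_stdev(iter(data))
print(mean, sd, statistics.stdev(data))
```

Because the loop never revisits earlier values, it works on streams, and it avoids the cancellation problems of the naive "sum of squares minus square of sum" shortcut.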

danhau 111 days ago [-]

Someone should write a „math notation for programmers“ article. Certainly would help me anyway.

Jtsummers 111 days ago [-]

https://news.ycombinator.com/item?id=28493031

There are others like this out there.

theWreckluse 112 days ago [-]

And also it's something programmers need to be skilled at.

dlivingston 111 days ago [-]

Yes. But also, probably just about every language will have a module with these functions. Python's NumPy and SciPy should have all of these built-in.

kqr 111 days ago [-]

...except it depends on knowledge of the t distribution but has no information on how to approximate it.

It is a good frequentist's toolbox, but it is not immediately translatable to code, no.
