To put it in plain mathematical language, ZIP codes are not defined as polygons [0]. The consequence is that performing any analysis with an assumption that ZIP codes are polygons is bound to be error-prone.
That's not the important problem and there's a simple solution with ZCTAs.
The big problem is zip codes are defined in terms of convenient postal routes and aren't suitable for most geospatial analysis. Census units, as the article explains, are a much better choice.
BiteCode_dev 6 days ago [-]
You can't ask a customer for their census unit on purchase, though.
mcphage 7 days ago [-]
> The consequence is that performing any analysis with an assumption that ZIP codes are polygons is bound to be error-prone.
Yeah, but any analysis you're likely to perform is approximate enough that the fact that ZIP codes aren't polygons is basically a rounding error.
Plus, it's a lot easier to get ZIP codes, and they're more reliably correct, so you might still get better results, than you would going with another indicator that is either (a) less reliable or (b) less available.
mattforrest 7 days ago [-]
They aren’t reliably correct, actually. The boundaries that the Census publishes are called Zip Code Tabulation Areas, which are approximations of zip codes and include overlaps.
wombatpm 7 days ago [-]
ZCTA5 roughly corresponds to the area of a 5 digit zip code. Problem is there are large areas of the west that have no permanent residents and no mail delivery. Plus they change over time.
mholt 7 days ago [-]
Yeah. ZIP codes are sets in the abstract-dimensional space of carrier delivery points. I suppose you could think of them as lines, but definitely not polygons.
cogman10 7 days ago [-]
Zip codes (in the US) are machine readable numbers a mail sorter can use to send a parcel to the right delivery truck for final delivery. In the US, they represent the hierarchy of postal centers with the most significant digit representing the primary hub for a region and the smallest number the actual post office that will be in charge of delivering the letter (or truck if you do the extended post code).
They don't represent geography at all, they represent the organizational structure of USPS.
They work by making the address on a letter almost meaningless. For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.
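A minimal sketch of the digit hierarchy described above, assuming the commonly cited breakdown of a 5-digit ZIP (first digit for a broad national area, first three digits for a sectional center, last two for the local delivery area). The names `ZipParts` and `split_zip` are made up for illustration and are not a USPS API.

```python
from typing import NamedTuple


class ZipParts(NamedTuple):
    national_area: str     # first digit: broad group of states (assumed breakdown)
    sectional_center: str  # first three digits: regional sorting facility prefix
    delivery_area: str     # last two digits: local post office / delivery area


def split_zip(zip_code: str) -> ZipParts:
    """Split a 5-digit ZIP string into its routing levels."""
    if len(zip_code) != 5 or not (zip_code.isascii() and zip_code.isdigit()):
        raise ValueError(f"expected a 5-digit ZIP string, got {zip_code!r}")
    return ZipParts(zip_code[0], zip_code[:3], zip_code[3:])
```

Note that the split says something about sorting facilities, not about where a boundary sits on a map, which is the point being made in this comment.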
mywittyname 7 days ago [-]
> For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.
A 5+4 formatted ZIP code maps to just a handful of addresses. In cities with larger populations, the +4 could map to a single building, and in more sparsely populated places, it might include houses on a handful of roads.
For smaller datasets, ZIP+4 might as well be a unique household identifier. I just checked a 10 million address database and 60% of entries had a unique ZIP+4, so one other bit of PII would be enough to be a 99.99% unique identifier per person.
With a geo-coded ZIP+4 database, you could locate people with a precision that's proportional to the population density of their region.
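The kind of check described above (what share of rows carry a ZIP+4 that no other row shares) could be sketched like this. The function name and the sample codes are hypothetical; the real database and its schema are not shown in the comment.

```python
from collections import Counter


def unique_zip4_share(zip4_codes):
    """Return the fraction of rows whose ZIP+4 value appears exactly once."""
    codes = list(zip4_codes)
    if not codes:
        return 0.0
    counts = Counter(codes)
    unique_rows = sum(1 for code in codes if counts[code] == 1)
    return unique_rows / len(codes)


# Made-up sample: two rows share a ZIP+4, two rows are unique.
sample = ["12345-0001", "12345-0001", "12345-0002", "54321-1234"]
share = unique_zip4_share(sample)
```

A high share on a real address table would suggest the ZIP+4 column is close to a household identifier and should be handled as PII.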
mattforrest 7 days ago [-]
Yeah, but we have that already in the census hierarchy. Plus you have to pay to access Zip+4 geospatial data, and it changes, sometimes as frequently as quarterly.
Spivak 7 days ago [-]
Right but this ends up being a good approximation for geography because the reality of logistics is that you end up doing a cute n-ary search of the geography. When you know the regional hub you can say for certain a huge chunk of the US the zip code doesn't represent. And then you keep n-secting. Sometimes the land-mass you get at the end is specific enough for your uses.
You're not going to wind up with a situation where zip codes with the same regional marker end up on different coasts.
makeitdouble 7 days ago [-]
> You're not going to wind up with a situation where zip codes with the same regional marker end up on different coasts.
Couldn't this happen for military or proxy codes (PO boxes or other) ?
mattforrest 7 days ago [-]
Just use a spatial query. That’s what they are made for.
alsodumb 7 days ago [-]
I agree that they weren't explicitly meant to represent geography, but implicitly they do, right? Are there cases where this is violated?
In other words, is it safe to assume that any entity in a zip code is less than x distance away from the closest entity in the same zip code?
perrygeo 7 days ago [-]
> less than x distance away
zip codes don't even need to be contiguous. It's a mail delivery route, not a polygon.
There are 5 cases where the assumption is violated:
- Non-contiguous areas
- Zip codes that are a single point (some big companies get their own zip with a single mailbox, e.g. GE in Schenectady, NY is zip 12345)
- Zip codes that are a single line (highway-based delivery routes)
- Overlapping boundaries (since mail routes are linear, choosing a polygon representation is arbitrary and often not unique in space)
- Residents of some zip codes are not stationary (e.g. houseboats)
In short, asking questions about the area of a zip code is a category error - zip codes do not have a uniform representation in space. And we should be highly skeptical of any geospatial analysis that assumes polygons.
I write this as someone who grew up in the ZIP code 09180
maxerickson 6 days ago [-]
They do provide a location with whatever error bars on it.
What they do not have is any sort of spatial consistency, they are a convenience for mail sorting. So if you start analyzing patterns across zip codes, you are pulling in information that is likely useless for or harmful to answering your question.
makeitdouble 7 days ago [-]
It might be true, but does it help if the x varies from "on a nearby mountain" to "within a street block", and you sometimes have all inhabitants closer to another zip code than their own?
mattforrest 7 days ago [-]
Well put
jonas21 7 days ago [-]
ZIP codes are an emergent property of the mail delivery system. While the author might consider this a bad thing, this makes them "good enough" on multiple axes in practice. They tend to be:
- Well-known (everybody knows their zip code)
- Easily extracted (they're part of every address, no geocoding required)
- Uniform-enough (not perfect, but in most cases close)
- Granular-enough
- Contiguous-enough by travel time
Notably, the alternatives the author proposes all fail on one or more of these:
- Census units: almost nobody knows what census tract they live in, and it can be non-trivial to map from address to tract
- Spatial cells: uneven distribution of population, and arbitrary division of space (boundaries pass right through buildings), and definitely nobody knows what S2 or H3 cell they live in.
- Address: this option doesn't even make sense. Yes, you can geocode addresses, but you still need to aggregate by something.
ericrallen 7 days ago [-]
This is a tangent, but addresses are also way more complicated than most people realize - especially if you’re relying on a user to input a correct address or if you need to support multiple countries, somewhere with unique addresses like Queens[0], or you need to differentiate between units of a specific street address that uses something other than unit numbers for a unit designation.
At that point you need something like Smarty[1] to validate and parse addresses.
Just last week I had to deal with the fact that my house has the wrong address in multiple databases because things changed when an interstate went in 40-something years ago. It's not a big change--main st. vs N main st. but it was enough to mess up various things. Not as much as when I moved in 30 years ago but still enough to be wrong in old town and telco records. Took me a couple of days to get a permit issued to get electrical hooked back up after a fire as a result because apparently some town clerk insisted the address wasn't valid.
jwnacnud 6 days ago [-]
Here is a little-known (but very useful) piece of information.
The US Postal Service has a team of people that handle address updates. This team is localized to different regions so that they generally are aware of local nuances. If you need to talk to the USPS about getting an address issue resolved, simply go to this USPS AMS site and enter your zip code to find the team that handles addresses in that area:
If they don't answer, leave a message. They have helped me thousands of times in my last 14 years working with address validations.
ghaff 6 days ago [-]
The USPS has always been correct since I moved in. It’s been local records and the telcos that have been the problem.
And in this case the fire companies had no problem finding my house in spite of the incorrect information in town records. As you suggest the field people on the ground generally know what the ground truth is.
rented_mule 7 days ago [-]
An annoyance for me is that I've yet to see any address validator get my current home address right. They all insist my address is on the road that leads to my road rather than my actual road. It's understandable that they can't be 100% accurate given the scale / complexity of addresses.
Most sites/apps will let me override the validator, but a few won't. The most common ones that insist on using the wrong address are financial institutions that say the law requires them to have my proper physical address and therefore they go with the (incorrectly) validated version.
USPS does not do home delivery in our area, and UPS/FedEx/etc. usually figure it out given that street numbers alone uniquely identify properties in our town.
killjoywashere 7 days ago [-]
Same! My wife ran a business from home during the pandemic and we actually went through the effort to work with Google Maps (they called us) to get it on the map. And of course USPS has no problem. But our address was originally a federal building with a letter, still only has a letter, no number, and there are now all sorts of work-arounds floating around on how to resolve addresses in our neighborhood. What's wild is the Post Office is literally down the street from our house, and our house predates the founding of most of the big delivery services, which all manage to deliver to us, given their preferred incantation. If I can't get the shipper to pass the right incantation to their shipping service, shenanigans ensue. My (least?) favorite was an item that went across the Pacific Ocean 3 times over the course of 3 months.
jonathanoliver 7 days ago [-]
I just replied to an earlier message on this thread with the same offer:
I’d love to have you email your mailing address to support@smarty.com with a link to this HN thread. We may be able to help fix some of this.
jonathanoliver 7 days ago [-]
Send your address to support@smarty.com and link to this HN thread. I’ll keep an eye out for it. I’d love to see what our system does with your address.
We have non-postal addresses and a lot of other mechanisms to help here. We also have contacts at the USPS and others to help fix addresses.
bob1029 7 days ago [-]
Addresses are a huge ordeal in banking. Easily one of the most tortured domain types when it comes to edge cases and integration pain.
Every customer I've worked with insisted on having all addresses run through the USPS verification API so they could get their bulk mailing discounts.
Even if you get the delivery/cost side under control, you still have to make sure you are talking about the right address from a logical perspective. Mailing, physical, seasonal, etc. address types add a whole extra dimension of fun.
nitwit005 7 days ago [-]
Yes, unfortunately, their assertion that everyone knows their zip code is wrong. People often write a neighboring code, and the post office just delivers it.
Similar issues for city name, of course.
steezeburger 7 days ago [-]
This sounds like the person doesn't know the receiver's zip code. Why are you extending that to not knowing their own zip code? Are they mailing something to themselves?
wisty 7 days ago [-]
People more or less mail themselves parcels all the time, with online delivery.
steezeburger 7 days ago [-]
Ha you make an excellent point actually. I wasn't even thinking of that.
toast0 7 days ago [-]
People often give out their mailing address, and may be misinformed about their zip code.
If you get close enough, it usually gets handled in the local sort, but not always.
On cities, the mailing address city really is the name of the post office that handles your delivery route. Often there's a relationship with the city you live in, but there's cases both ways --- I used to live outside city limits, we had a census designated place name, a municipal sanitary district and had a fire department at one time... but never a post office, so our mailing address used the nearby city name, where our post office resided. The place name had an incorporated city on the other side of the state, so using that wouldn't be great.
Nowadays, post offices often have a list of alternative place names, so where I live now, I can pick between the incorporated city name, the nearby large city where a post office that processes all my mail is located, or any of the numerous small post offices that once served my city.
rascul 7 days ago [-]
> On cities, the mailing address city really is the name of the post office that handles your delivery route.
Bigger cities can have multiple post offices and zip codes with the same mail address city.
tbrownaw 7 days ago [-]
I will occasionally still try to use the zip code for my old work address (from about a year before covid) when what I want is my home address.
VWWHFSfQ 7 days ago [-]
Very common in NYC. People will use "New York, NY", "Queens, NY", or "Astoria, NY" interchangeably and the post office will still just deliver it to the same place.
ericrallen 7 days ago [-]
This is sort of apocryphal - and also anecdotal because I have my own personal experience living in an annexed Boston neighborhood to draw on - but in a lot of the towns/neighborhoods that have been annexed by Boston, people still use the neighborhood name[1] as the city name because you are more likely to get your package when you indicate which “Washington St,” “Boylston St,” etc. you actually live at.
According to one commenter on the subject:
It doesn't matter, as long as the zip code is correct
They know their ZIP code far, far better than any other plausible geographic cell.
jonathanoliver 7 days ago [-]
Thanks for the shout-out. Founder of Smarty here.
Regarding the article, it really depends on the use case whether to use ZIP Code (TM), postal code, Canada Post Forward Sortation Area, lat/lon, Census Bureau block and tract, etc.
As has been noted, the ZIP Code is often good enough for aggregating data together and can be a good first step if you don’t know where to start.
ellisv 7 days ago [-]
There are point process models, but, yes, it's much more common to want to aggregate to a spatial area.
Another consideration is what kind of reference information is available at different spatial units. There are plenty of Census Bureau data available by ZCTA but some data may only be available at other aggregate units. Zip Codes are often used as political boundaries.
I'd also mention the "best" areal unit depends on the data. There is a well known phenomenon called the modifiable areal unit problem in which spatial effects appear and vanish at different spatial resolutions. It can sort of be thought of as a spatial variation of the ecological fallacy.
raphman 7 days ago [-]
One more advantage: ZIP codes are a good trade-off if you want to gather anonymous data in a survey or provide anonymized data to an outside entity. For example, we recently conducted a survey on mobility patterns within our university. To offer respondents a reasonable amount of anonymity, we just asked for their (German) ZIP code and the location of their primary workplace.
This allows us to determine the distance and approximate route people would take between home and university campus - to a degree that is sufficient for our goals.
mattforrest 7 days ago [-]
Well you hit on all the points that discuss the compromises that zip codes offer. Just because you have them in your data doesn't mean that they can produce anything useful. You are correct that no one knows what their census unit is (if you are thinking of someone entering this on a website), but collecting location or address will be a lot better.
Fact is a lot of web data contains a zip but if you can collect something better it will usually render better results. Unless you are analyzing shipments then that is fine.
JumpCrisscross 7 days ago [-]
Would add that there are network effects with zip code data. If you collect H3 data, you have fewer sources with which to join.
walrus01 7 days ago [-]
In terms of "good enough", a Canadian postal code, broadly equivalent to a zip code, is much more granular and can often identify an individual apartment building, or single city block. Plenty of large office buildings in major Canadian cities also have their own postal code.
The functionality of it is closer to the "Zip+4" with extension used to have a more granular routing of physical mail for USPS.
Sure, and in the States, ZIP+4 could once nail my postal location to a subset of 4 (of a group of 16) mailboxes within a particular set of entry doors on a particular apartment building.
But broadly speaking, nobody knows what their ZIP+4 is, while I imagine that most people in Canada know their postal code by heart.
It is interesting.
bluGill 7 days ago [-]
The plus four changes all the time so it isn't feasible to know it. The use is that large mailers can get a discount by looking it up and presorting mail. If the mail coming into my post office has my mail next to my next door neighbor's, that saves them a lot of time.
kstrauser 7 days ago [-]
Is that still true? I would imagine any reasonably modern computer could map every physical address in a huge region to a (route number, stop number) pair. I wouldn't think the +4 would add a lot of value anymore.
bluGill 7 days ago [-]
Sorting everything outgoing by where it goes on the truck is valuable. Sure, computers can sort, but these are physical things, so mechanical limits apply.
throw0101c 7 days ago [-]
> In terms of "good enough", a Canadian postal code, broadly equivalent to a zip code, is much more granular and can often identify an individual apartment building, or single city block.
To the point that StatCan and other agencies have rules on the number of characters that are collected/disseminated with other data to make sure it's not too identifying:
Yeah but Zip+4 represent a collection of houses not a polygon so not useful for aggregations or statistical work
michaelmrose 7 days ago [-]
If you are worrying about address at all instead of tax or legal jurisdiction, it's probable that you as a business have a physical presence. You can probably correlate better by predicting which location a given address would likely interact with, if you don't already know from prior purchases/interactions which one they normally use. I would suggest actual purchase data followed by travel time.
Zip and distance as the crow flies often gives shit data. My zip suggests I'm off in bum fuck and since I'm on the puget sound things that are relatively near as the crow flies can actually be hours away.
stevage 7 days ago [-]
> Easily extracted (they're part of every address, no geocoding required)
That's only true if you can also access the spatial boundaries of the zipcodes themselves.
In Australia, this turns out not to be true: the postal system considers their boundaries to be commercial confidential information and doesn't share them. The best we can do is the Australian Bureau of Statistics' approximations of them, which they dub "postal areas".
Also, "use a different grid" is only masking the problem, not actually fixing it.
The real problem is ever using an average without also specifying some sort of bounds. For median-based data, this probably means the upper and lower quartiles (or possibly other percentiles); for mean-based data, this probably means standard deviation.
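As a rough illustration of reporting an average together with its bounds, here is a sketch using only Python's standard `statistics` module. The function name `summarize` and the sample values are invented for the example.

```python
import statistics


def summarize(values):
    """Report median with quartiles, and mean with sample standard deviation.

    Needs at least two values, since quartiles and stdev are undefined for one.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


# Made-up per-area values, e.g. one number per ZIP code or grid cell.
summary = summarize([1, 2, 3, 4, 5, 6, 7])
```

Publishing `q1`/`q3` next to the median (or `stdev` next to the mean) lets a reader see how much variation the chosen areal unit is hiding, whichever grid is used.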
MathCodeLove 6 days ago [-]
On the note of census units, the only reason we all know our zipcode is because we have to know it. If census units were used as frequently as zip’s I imagine they would quickly become more widely known as well.
hinkley 7 days ago [-]
Contiguous enough by data travel time as well. A few people will get 5 ms more latency than the exact optimal route, but it’s not like your routes are exactly optimal anyway.
And don’t forget sales tax. Which is state + county + city
kstrauser 7 days ago [-]
... + special entertainment district + business renovation area + exception + exception + exception + ...
jpjoi 7 days ago [-]
Zip codes are just weird to use for anything other than mail in general because they’re set up based off infrastructure.
I've noticed more and more super/hypermarkets started asking for your zip/postal code sometime during self-checkout. I'm guessing they use these as approximations of where people travel from, so they can evaluate whether to open more stores closer to popular areas, or something like that. Pretty sure there are more use cases for postal codes too.
kjellsbells 7 days ago [-]
Postcodes are very useful (but not perfect) proxies for household socioeconomic status, which is useful for marketing and sales analysis.
That data linked with the payment method that the register collects pretty much gives the store exactly who you are and where you live even if you chose not to sign up to the store's loyalty program.
paraboli 7 days ago [-]
They use bulk mail to send out flyers, coupons, and can use zip codes to AB test these.
Spivak 7 days ago [-]
Wait until you find out that this is the same way phones used to work. The number was the row/column the operator needed to plug your line into.
throw0101c 7 days ago [-]
CGP Grey recently posted a video on Zip codes, "The Hidden Pattern in Post Codes":
That's what I was thinking of earlier, the succinct version is "your address is where mail needs to go, the zip code is how to get it there". Or in other words, the zip code is the address(es) of the sorting centers and post offices to the destination.
hammock 7 days ago [-]
Great article. Zip codes can be super expedient. But you have to be self-aware that for many use cases they function WORSE than a random grid, because they have built-in aggregation of a central post office (and surroundings) with a certain radius of rural/less dense surroundings.
So for example, if you are sorting “rural zips” vs “urban zips” it will only take you so far, and may actually be harmful.
Same goes with MSAs/DMAs (media markets). These have to be used for buying media, but for geospatial analysis they are suboptimal for the same reasons.
Easiest way to dip your toe into the water of something better is to start with A-D census counties.
If you want to learn a bit more, there was a recent, really good Planet Money episode[1] about this exact same topic. They focus on the problems that you might face when using zip code for demographic analysis.
H3 is awesome here! What I don't think many people realize is that H3 cells and normal geographic data (like zips) are not mutually exclusive. You can take zip outlines, and find all the h3 cells within them and allocate your metric accordingly (population, income, etc).
This makes joining disparate data sources quite easy. And this also lets you do all sorts of cool stuff like aggregations, smoothing, flow modeling, etc.
We do some geospatial stuff and I wrote a polars plugin to help with this a while back [1].
They also only have one type of neighbor. Square grids have 2 neighbor types. Triangular grids have 3.
hammock 7 days ago [-]
Makes perfect sense. Thanks both
mannyv 7 days ago [-]
Zip codes, zctas, and tiger/line are good enough for what most people need. Maybe you can find an edge by using something more granular...but I'm not sure what edge you'd be looking to get with geodata. Maybe for real estate trends and/or market analysis?
clutchdude 7 days ago [-]
I agree.
Reading their alternatives, it strikes me as "ZCTAs are the worst form of small area aggregation except for all others."
It's not a great geography to use, but it is quite useful if you know its limitations and inaccuracies when you get into it. Stuff like multipolygon entities, island-polys, etc. aren't fun to resolve but can be accounted for.
Add on that ZCTAs will historically follow some sort of actual boundary (rivers/highways/etc.), so they can tell a story in a way Census tracts can't.
temporallobe 7 days ago [-]
I don’t really see the problem but at the same time I understand it’s not a perfect solution. I used to do geospatial work using ESRI products, and ZIP code polygons are very useful because people are often interested in things inside of zip codes, especially partial 3-4 digit zip code areas, but they are occasionally non-contiguous so you can end up with strange results, visually and mathematically (for example how do you find the “center” of a non-contiguous shape? You don’t).
Edit: I wanted to point out that I recall that ESRI maps used to come “out of the box” with zip code polygon layers. While I agree they are technically not polygons in the strictest sense, they often are or they are fully closed shapes or close enough to it - and even if they are missing a few nodes to make a complete polygon, whoever did the digitizing probably manually closed the loop so to speak. Remember, geospatial maps are used for many different purposes, likely none of them having anything to do with postal routes, so in that sense they are “good enough” for most purposes.
SOLAR_FIELDS 7 days ago [-]
There actually are algorithms that compute the centroid of multi part polygons. So you can in fact find the center of a non-contiguous shape. Now, whether that centroid actually has value in a real world application, I’m not sure.
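One such approach, sketched under simplifying assumptions (planar x/y coordinates, simple rings, no holes), is to take the area-weighted average of each part's centroid using the shoelace formula. The function names are illustrative and do not come from any particular GIS package.

```python
def ring_area_centroid(ring):
    """Return (area, cx, cy) for a simple polygon ring given as (x, y) tuples."""
    twice_area = cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate ring with zero area")
    # Dividing by the signed area keeps the result correct for either winding order.
    return abs(twice_area) / 2, cx / (3 * twice_area), cy / (3 * twice_area)


def multipart_centroid(parts):
    """Area-weighted centroid of a multi-part polygon (list of rings)."""
    total = sum_x = sum_y = 0.0
    for ring in parts:
        area, cx, cy = ring_area_centroid(ring)
        total += area
        sum_x += area * cx
        sum_y += area * cy
    return sum_x / total, sum_y / total


# Two unit squares three units apart: the centroid lands in the gap between them.
two_squares = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(3, 0), (4, 0), (4, 1), (3, 1)],
]
center = multipart_centroid(two_squares)
```

The example makes the caveat concrete: the computed point is well defined, but it falls outside both parts, so whether it is useful depends on the application.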
Anon84 7 days ago [-]
This is an example of the well known Modifiable Areal Unit problem: https://en.wikipedia.org/wiki/Modifiable_areal_unit_problem In general, your statistics depend on how you define your areas and you will get different pictures with different definitions.
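A toy, one-dimensional illustration of the effect, with made-up observations and a hypothetical `zone_means` helper: the same points are averaged under two different zone widths, and a hot spot that is obvious with one zoning disappears with the other.

```python
from collections import defaultdict
from statistics import mean


def zone_means(observations, zone_width):
    """Group (position, value) pairs into fixed-width zones and average each zone."""
    zones = defaultdict(list)
    for position, value in observations:
        zones[int(position // zone_width)].append(value)
    return {zone: mean(values) for zone, values in sorted(zones.items())}


# Invented data: a cluster of high values at positions 2 and 3.
observations = [(0, 1), (1, 1), (2, 9), (3, 9), (4, 1), (5, 1)]

narrow = zone_means(observations, zone_width=2)  # hot spot isolated in its own zone
wide = zone_means(observations, zone_width=3)    # hot spot split across two zones
```

With width 2 the middle zone stands out sharply; with width 3 both zones have the same mean and the pattern is gone, even though the underlying data did not change.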
paulddraper 7 days ago [-]
People use ZIP codes because they have ZIP codes.
No one has census blocks.
And coordinates can work but lack some inherent advantages, such as human readability and a semblance of pop density normalization.
eterevsky 7 days ago [-]
ZIP codes are a simple approximation that does its job well enough in most cases.
The alternatives that the author suggests are much more complicated, both in terms of the implementation and in terms of convincing the user to give you their full address.
freyfogle 7 days ago [-]
There are many problems with zip codes / postal codes but the biggest two we see are:
a. Excel treats them as numbers instead of strings of digits and thus drops the leading 0
b. Developers make assumptions about postal codes based on how they work (or more usually how the developer incorrectly thinks they work) in their own country and these assumptions absolutely do NOT hold in other countries.
Until very recently I naively assumed that the area of a given zip code would be entirely within the area of some single city or town which would then be entirely within the area of a single county.
It was quite a rude awakening working with software that tries to apply the correct local taxes to a given address and finding that the statement “A given X can contain multiple Y” is true for every possible combination of zip, city, and county.
vikingerik 7 days ago [-]
The post office has adapted in turn to that Excel problem. They know it happens. If a parcel has only four digits for the zip code, they'll treat it as a leading zero for routing and delivery.
freyfogle 7 days ago [-]
yes, of course. The question is have all the developers writing their own code to deal with zip codes also adapted?
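For developers handling this themselves, a minimal sketch of restoring dropped leading zeros on US 5-digit ZIPs might look like the following. `normalize_zip` is a hypothetical helper, and it assumes the input is a plain 5-digit US code (no ZIP+4, no non-US postal codes).

```python
def normalize_zip(value) -> str:
    """Return a 5-character ZIP string, restoring leading zeros a spreadsheet dropped."""
    text = str(value).strip()
    if not (text.isascii() and text.isdigit()) or len(text) > 5:
        raise ValueError(f"not a 5-digit US ZIP code: {value!r}")
    return text.zfill(5)  # left-pad with zeros up to five digits


fixed = [normalize_zip(v) for v in (2134, "501", "12345")]
```

The safer fix is upstream: store ZIP codes as text from the start so there is nothing to repair, and do not apply this padding to postal codes from other countries.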
paganel 7 days ago [-]
Also, "everybody" knows their zip-code/postal-code is mostly an American/British thing, I still remember my British former boss asking me about my zip-code about 20 years ago (I live in Romania, we were implementing the first google-maps-based mashup in this country) and me answering that I have no idea, and that no-one around these parts really knows his/her postal-code. We do know our address, though, or used to, before we had smart-phones.
dhunter_mn 7 days ago [-]
I used to work for a company that basically merged USPS and Census Bureau data on a monthly basis. The output would be a roadbase that was optimized for address ranges on road segments. ZIP Codes were extra fun to work with.
Zamicol 7 days ago [-]
I wrote the blackout system for Comcast TV scheduling. My understanding was that blackouts were used mostly for sports where games need to be available in one area and not others. Contractually, they were required to use zip codes, so I used the US Post office's zip code data to enforce blackouts.
cwmoore 6 days ago [-]
Need a regex[2] using a trie to match valid state-zipcode pairs[1] for webpages likely to contain valid addresses?
Well funny story, some twenty something years ago I actually worked on an election cycle volunteer infra thing in France, and living in Paris which is department 75 and therefore 750xx the prefecture being 75000 I assumed it was neatly hierarchical 75004 won't be far away from 75003 (true)... The French thing being orderly and rational.
I didn't need much precision so truncating seemed an easy way to group stuff.
Oh the surprise. I never again made such assumptions, let's just say I should have gotten a clue from Corsica being 2A and 2B.
jwnacnud 6 days ago [-]
Amen! ZIP Codes (if referring to the US Postal Service) were only ever made for the purpose of sorting and delivering mail more efficiently. ZIP Codes serve the purposes of the US Postal Service, full stop. They don't respect any political or geographical boundaries - and they change at the whim of the USPS, as they were created and are maintained to suit their needs.
If used for other purposes they fall short.
mmmlinux 7 days ago [-]
Can anyone tell me why I have to enter both my city / state and a zip code? Shouldn't one or the other of those plus my street address be enough information?
sophacles 7 days ago [-]
Several posts in this thread have linked the recent CGP Grey video on the topic, and it addresses this question better than I can. It's pretty interesting, actually.
ubermonkey 7 days ago [-]
Web devs not using a good library that will populate the former from the latter?
jayknight 7 days ago [-]
Some libraries will insist that my address is in a different city because my zip code spans the border. I mean if my mail has the other city it still gets to me, but for anything other than mail, they now have the wrong city for me.
mmmlinux 7 days ago [-]
Does it matter if the "city" is wrong if your street address + zip code is unique?
jayknight 7 days ago [-]
It depends on what they're doing with it. But mostly probably not.
stevage 7 days ago [-]
I still miss carto.com (originally CartoDB).
It was a really useful platform for uploading spatial data with a decent range of visualisation tools that didn't need code. You could do SQL if you wanted.
Then they got rid of the free tier, and set the cheapest tier at (iirc) $150 USD per month. And that was the end of that.
I've seen a few attempts like this, like loc8 and google's plus codes. Is there any advantage to Digipin over existing solutions other than avoiding splitting major cities into very different codes? None stood out to me from that document. The description is written pretty well.
Always sad when these schemes don't include a check digit in them though, even if the layout of this one gets typo'd codes pretty close to their intended destination.
talkingtab 6 days ago [-]
There should be a desensationalizer for titles on HN. When I read this it tricks my brain:
Stop
using
zip codes
BANG! Why should I stop using zip codes? Must read!!!
And then you go to the old (2019) page, that appears to be filled with useless clicks and arguments that appear biased.
"Do not use zip codes for geospatial analysis."
JackFr 7 days ago [-]
When doing your first ML project, zip codes are unsurpassed in providing a set of hand written digits to train on.
It's so well written and informative that I completely didn't mind the "and here's how to do it in Carto" bit in the middle. Instead I thought they earned it.
trgn 7 days ago [-]
First the mercator projection, now they're coming after the zip codes.
nancyp 7 days ago [-]
Instead of zip use the following?
Use Addresses
Use Census Units
Use your own Spatial Index
Why not lat, long?
ajfriend 7 days ago [-]
It depends on if you want to model a point or an area. lat/lng gives you a point, but you often want an area to, for example, count how many people are in that area. A spatial index like H3 provides a grid of area units.
HappMacDonald 7 days ago [-]
But so do lat long ranges.
ajfriend 7 days ago [-]
You can use those if they work for your application. One downside would be that you're storing 4 numbers compared to a single `int64` index with H3.
You also have to decide how you'll do that binning. Can bins overlap? What do you do at the poles? H3 provides some reasonable default choices for you, so you don't have to worry about that part of your solution design.
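To make those design choices concrete, here is a minimal sketch of the do-it-yourself alternative: a rectangular lat/lng grid packed into one integer. This is not H3; the function names, the default cell size, and the 111.32 km-per-degree constant are assumptions for illustration only.

```python
import math

KM_PER_DEGREE = 111.32  # assumed approximate length of one degree at the equator


def grid_cell_id(lat, lon, cell_deg=1.0):
    """Pack a rectangular lat/lon bin into a single integer key."""
    n_cols = round(360 / cell_deg)
    n_rows = round(180 / cell_deg)
    # Choice 1: the north pole (lat == 90) is clamped into the top row.
    row = min(int((lat + 90.0) / cell_deg), n_rows - 1)
    # Choice 2: longitude wraps, so +180 and -180 land in the same column.
    col = int(((lon + 180.0) % 360.0) / cell_deg) % n_cols
    return row * n_cols + col


def cell_width_km(lat, cell_deg=1.0):
    """East-west width of a cell; it shrinks toward the poles."""
    return KM_PER_DEGREE * cell_deg * math.cos(math.radians(lat))


seattle_cell = grid_cell_id(47.6, -122.3)
```

Bins here never overlap, but their ground area varies with latitude (a one-degree cell is about half as wide at 60 degrees as at the equator), which is one of the things you have to decide whether you can live with.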
ww520 7 days ago [-]
Lat/lon is a spherical coordinate system, so calculations are more complicated.
Btw, I recently needed to compute the shortest distance from a point to a line defined by two points, all in lat/lon. Does anyone have a lead on how to do it?
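One common spherical approach to this kind of question is the cross-track distance formula. The sketch below is a minimal version, assuming a perfectly spherical Earth (mean radius 6371.0088 km) and a segment much shorter than a quarter of the globe; the function names are made up for illustration, and an ellipsoidal model would need a geodesic library instead.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical model)


def _angular_distance(lat1, lon1, lat2, lon2):
    """Great-circle angle in radians between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def _initial_bearing(lat1, lon1, lat2, lon2):
    """Initial bearing in radians from point 1 to point 2 (inputs in radians)."""
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    return math.atan2(y, x)


def distance_to_segment_km(point, start, end):
    """Shortest distance from point to the great-circle segment start-end.

    Each argument is a (lat, lon) pair in degrees.
    """
    p = [math.radians(v) for v in point]
    a = [math.radians(v) for v in start]
    b = [math.radians(v) for v in end]

    d_ab = _angular_distance(*a, *b)
    d_ap = _angular_distance(*a, *p)
    if d_ab == 0.0:
        return d_ap * EARTH_RADIUS_KM  # degenerate segment: just a point

    diff = _initial_bearing(*a, *p) - _initial_bearing(*a, *b)
    if math.cos(diff) < 0:
        return d_ap * EARTH_RADIUS_KM  # closest spot is the start point

    cross = math.asin(math.sin(d_ap) * math.sin(diff))  # signed cross-track angle
    ratio = math.cos(d_ap) / math.cos(cross)
    along = math.acos(max(-1.0, min(1.0, ratio)))  # along-track angle from start
    if along > d_ab:
        return _angular_distance(*b, *p) * EARTH_RADIUS_KM  # past the end point
    return abs(cross) * EARTH_RADIUS_KM
```

For example, `distance_to_segment_km((1, 5), (0, 0), (0, 10))` measures from a point one degree north of a segment lying on the equator. For short distances, projecting to a local planar CRS and using ordinary 2D geometry is a simpler alternative.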
agtech_andy 7 days ago [-]
Zip codes are great for anything with delivery logistics.
Anything else is a loose correlation at best, that will likely change over time.
NelsonMinar 7 days ago [-]
It's remarkable how few of the comments posted here are informed by having read the article.
zuhayeer 7 days ago [-]
This is interesting since zip codes came up in consideration for how we built out our pay choropleth map in the US: https://levels.fyi/heatmap
Though ultimately it was far too granular (for example the Bay Area would be so many different zip codes). Instead we went with Nielsen's DMA (Designated Market Area) mappings within the US to abstract aggregated data a bit better. And of course this DMA dataset also had a different original use case. It was used for TV / media market surveys so it has some weird vestiges. Some regions are grouped very far and wide (you'll notice there's a bit of Denver within Nevada and it's just a remnant of how it used to be categorized), but it still provides a bit of a broader level grouping than something acute like zip code.
We've also been considering using Combined Statistical Areas using population instead. This is something that is under way, and in the interim we've considered charting styles that don't necessarily need borders (for example this bubble map: https://www.levels.fyi/bubble-plot/europe/). The benefit with DMAs is that it offers full border coverage of the entire US whereas some hubs can still be missing from CSAs if relying on a population threshold. But the plan is to create some of our own regional definitions and borders using our own submissions combined with population. Will be an interesting project.
Very different use case -- ZIPs/ZCTAs have some semblance of population normalization
ajfriend 7 days ago [-]
If you care about that and have a data source, you can add, for example, population density per H3 cell as part of your analysis. That has the additional benefit of denoting this quantity of interest explicitly, rather than some implicitly assumed correlation which may not be true.
ingenieroariel 7 days ago [-]
Hey AJ, this is almost on topic, do you know of a more up to date version of the dataset you used on the blog post release for H3 v4.0.0 [1]? They stopped updating in Oct 2023. Thanks!
[1] https://data.humdata.org/dataset/kontur-population-dataset
ajfriend 7 days ago [-]
I don't. And maybe I should have emphasized "and have a data source" more, since it's doing a lot of the heavy lifting in my statement :)
mattforrest 7 days ago [-]
Not necessarily true. Population isn't balanced at all between many ZIP codes. Census units are.
ellisv 7 days ago [-]
Absolutely this. Use other Census areal units if you can and ZCTAs only if you have to.
diggan 7 days ago [-]
What H3 do I belong to if my house is split between three different ones, pretty much equally? Any/all of them?
maxmouchet 7 days ago [-]
You take a smaller H3 :-) The maximum area of a resolution 15 H3 is 1 square meter, so unlikely to split a house in two.
hammock 7 days ago [-]
What is the benefit of H3 over a rectangular grid?
lacoolj 7 days ago [-]
For anyone curious, here is the official US Gov list of ZIP codes in CSV with lots of helpful related data (longitude, latitude, etc.)
[0]: https://manifold.net/doc/mfd8/zip_codes_are_not_areas.htm
They don't represent geography at all, they represent the organizational structure of USPS.
They work by making the address on a letter almost meaningless. For some smaller population zip codes you can practically just put the name and zip code down and achieve delivery.
A 5+4 formatted ZIP code maps to just a handful of addresses. In cities with larger populations, the +4 could map to a single building, and in more sparsely populated places, it might include houses on a handful of roads.
For smaller datasets, ZIP+4 might as well be a unique household identifier. I just checked a 10 million address database and 60% of entries had a unique ZIP+4, so one other bit of PII would be enough to be a 99.99% unique identifier per person.
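A check like that is only a few lines. Here is a minimal sketch of one way to measure it, where the function name and the sample values are made up and "unique" is assumed to mean that a ZIP+4 appears exactly once in the dataset:

```python
from collections import Counter


def unique_zip4_share(zip4_values):
    """Fraction of entries whose ZIP+4 appears exactly once in the dataset."""
    values = list(zip4_values)
    if not values:
        return 0.0
    counts = Counter(values)
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(values)


# Hypothetical sample: two entries share a ZIP+4, three do not.
sample = ["12345-0001", "12345-0001", "12345-0002", "09180-1234", "09180-9999"]
share = unique_zip4_share(sample)
```

In the made-up sample, three of the five entries have a ZIP+4 nobody else shares, so the share is 0.6.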
With a geo-coded ZIP+4 database, you could locate people with a precision that's proportional to the population density of their region.
You're not going to wind up with a situation where zip codes with the same regional marker end up on different coasts.
Couldn't this happen for military or proxy codes (PO boxes or other) ?
In other words, is it safe to assume that any entity in a zip code is less than x distance away from the closest entity in the same zip code?
zip codes don't even need to be contiguous. It's a mail delivery route, not a polygon.
There are 5 cases where the assumption is violated:
- Non-contiguous areas
- Zip codes that are a single point (some big companies get their own zip with a single mailbox, e.g. GE in Schenectady, NY is zip 12345)
- Zip codes that are a single line (highway-based delivery routes)
- Overlapping boundaries (since mail routes are linear, choosing a polygon representation is arbitrary and often not unique in space)
- Residents of some zip codes are not stationary (e.g. houseboats)
In short, asking questions about the area of a zip code is a category error - zip codes do not have a uniform representation in space. And we should be highly skeptical of any geospatial analysis that assumes polygons.
Please see: https://opencagedata.com/guides/how-to-think-about-postcodes...
I write this as someone who grew up in the ZIP code 09180
What they do not have is any sort of spatial consistency, they are a convenience for mail sorting. So if you start analyzing patterns across zip codes, you are pulling in information that is likely useless for or harmful to answering your question.
- Well-known (everybody knows their zip code)
- Easily extracted (they're part of every address, no geocoding required)
- Uniform-enough (not perfect, but in most cases close)
- Granular-enough
- Contiguous-enough by travel time
Notably, the alternatives the author proposes all fail on one or more of these:
- Census units: almost nobody knows what census tract they live in, and it can be non-trivial to map from address to tract
- Spatial cells: uneven distribution of population, and arbitrary division of space (boundaries pass right through buildings), and definitely nobody knows what S2 or H3 cell they live in.
- Address: this option doesn't even make sense. Yes, you can geocode addresses, but you still need to aggregate by something.
At that point you need something like Smarty[1] to validate and parse addresses.
[0]: https://stackoverflow.com/questions/2783155/how-to-distingui...
[1]: https://www.smarty.com/
The US Postal Service has a team of people that handle address updates. This team is localized to different regions so that they generally are aware of local nuances. If you need to talk to the USPS about getting an address issue resolved, simply go to this USPS AMS site and enter your ZIP code to find the team that handles addresses in that area:
https://postalpro.usps.com/ppro-tools/address-management-sys...
If they don't answer, leave a message. They have helped me thousands of times in my last 14 years working with address validations.
And in this case the fire companies had no problem finding my house in spite of the incorrect information in town records. As you suggest the field people on the ground generally know what the ground truth is.
Most sites/apps will let me override the validator, but a few won't. The most common ones that insist on using the wrong address are financial institutions that say the law requires them to have my proper physical address and therefore they go with the (incorrectly) validated version.
USPS does not do home delivery in our area, and UPS/FedEx/etc. usually figure it out given that street numbers alone uniquely identify properties in our town.
I’d love to have you email your mailing address to support@smarty.com with a link to this HN thread. We may be able to help fix some of this.
We have non-postal addresses and a lot of other mechanisms to help here. We also have contacts at the USPS and others to help fix addresses.
Every customer I've worked with insisted on having all addresses run through the USPS verification API so they could get their bulk mailing discounts.
Even if you get the delivery/cost side under control, you still have to make sure you are talking about the right address from a logical perspective. Mailing, physical, seasonal, etc. address types add a whole extra dimension of fun.
Similar issues for city name, of course.
If you get close enough, it usually gets handled in the local sort, but not always.
On cities, the mailing address city really is the name of the post office that handles your delivery route. Often there's a relationship with the city you live in, but there's cases both ways --- I used to live outside city limits, we had a census designated place name, a municipal sanitary district and had a fire department at one time... but never a post office, so our mailing address used the nearby city name, where our post office resided. The place name had an incorporated city on the other side of the state, so using that wouldn't be great.
Nowadays, post offices often have a list of alternative place names, so where I live now, I can pick between the incorporated city name, the nearby large city where a post office that processes all my mail is located, or any of the numerous small post offices that once served my city.
Bigger cities can have multiple post offices and zip codes with the same mail address city.
According to one commenter on the subject:
[0]: https://www.city-data.com/forum/boston/601106-mailing-addres...
[1]: https://www.city-data.com/forum/boston/601106-mailing-addres...
Regarding the article, it really depends on the use case whether to use ZIP Code (TM), postal code, Canada Post Forward Sortation Area, lat/lon, Census Bureau block and tract, etc.
As has been noted, the ZIP Code is often good enough for aggregating data together and can be a good first step if you don’t know where to start.
Another consideration is what kind of reference information is available at different spatial units. There are plenty of Census Bureau data available by ZCTA but some data may only be available at other aggregate units. Zip Codes are often used as political boundaries.
I'd also mention the "best" areal unit depends on the data. There is a well known phenomenon called the modifiable areal unit problem in which spatial effects appear and vanish at different spatial resolutions. It can sort of be thought of as a spatial variation of the ecological fallacy.
Fact is, a lot of web data contains a zip, but if you can collect something better it will usually render better results. Unless you are analyzing shipments, in which case zip is fine.
The functionality of it is closer to the "Zip+4" with extension used to have a more granular routing of physical mail for USPS.
https://www.canadapost-postescanada.ca/cpc/en/support/articl...
https://en.wikipedia.org/wiki/Postal_codes_in_Canada
But broadly speaking, nobody knows what their ZIP+4 is, while I imagine that most people in Canada know their postal code by heart.
It is interesting.
To the point that StatCan and other agencies have rules on the number of characters that are collected/disseminated with other data to make sure it's not too identifying:
* https://www.canada.ca/en/government/system/digital-governmen...
* https://www12.statcan.gc.ca/nhs-enm/2011/ref/DQ-QD/guide_2-e...
Zip and distance as the crow flies often gives shit data. My zip suggests I'm off in bum fuck and since I'm on the puget sound things that are relatively near as the crow flies can actually be hours away.
That's only true if you can also access the spatial boundaries of the zipcodes themselves.
In Australia, this turns out not to be true: the postal system considers their boundaries to be commercial confidential information and doesn't share them. The best we can do is the Australian Bureau of Statistics' approximations of them, which they dub "postal areas".
The real problem is ever using an average without also specifying some sort of bounds. For median-based data, this probably means the upper and lower quartiles (or possibly other percentiles); for mean-based data, this probably means standard deviation.
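As a rough illustration of that point, here is a minimal sketch using only Python's standard `statistics` module. The function name and the two sample groups are made up; the groups share the same mean and median but have very different spreads.

```python
import statistics


def summarize(values):
    """Report a central value together with a measure of spread."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Two hypothetical areas with identical averages.
tight = summarize([50, 50, 50, 50, 50])
wide = summarize([10, 30, 50, 70, 90])
```

Reporting only the mean or the median would make the two groups look identical; the quartiles and the standard deviation are what tell them apart.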
And don’t forget sales tax. Which is state + county + city
CGP Grey has a great video on this: https://m.youtube.com/watch?v=1K5oDtVAYzk
That data linked with the payment method that the register collects pretty much gives the store exactly who you are and where you live even if you chose not to sign up to the store's loyalty program.
* https://www.youtube.com/watch?v=1K5oDtVAYzk
So for example, if you are sorting “rural zips” vs “urban zips” it will only take you so far, and may actually be harmful.
Same goes with MSAs/DMAs (media markets). These have to be used for buying media, but for geospatial analysis they are suboptimal for the same reasons.
Easiest way to dip your toe into the water of something better is to start with A-D census counties.
[1]: https://www.npr.org/2025/01/08/1223466587/zip-code-history
This makes joining disparate data sources quite easy. And this also lets you do all sorts of cool stuff like aggregations, smoothing, flow modeling, etc.
We do some geospatial stuff and I wrote a polars plugin to help with this a while back [1].
[1] https://github.com/Filimoa/polars-h3
Reading their alternatives, it strikes me with "ZCTA's are the worst form of small area aggregation except for all others."
It's not a great geography to use, but it is quite useful if you know its limitations and inaccuracies when you get into it. Stuff like multipolygon entities, island-polys, etc. aren't fun to resolve but can be accounted for.
Add on that ZCTAs will historically follow some sort of actual boundary (rivers/highways/etc.), so they can tell a story in a way Census tracts can't.
Edit: I wanted to point out that I recall that ESRI maps used to come “out of the box” with zip code polygon layers. While I agree they are technically not polygons in the strictest sense, they often are or they are fully closed shapes or close enough to it - and even if they are missing a few nodes to make a complete polygon, whoever did the digitizing probably manually closed the loop so to speak. Remember, geospatial maps are used for many different purposes, likely none of them having anything to do with postal routes, so in that sense they are “good enough” for most purposes.
No one has census blocks.
And coordinates can work but lack some inherent advantages, such as human readability and a semblance of pop density normalization.
The alternatives that the author suggests are much more complicated, both in terms of the implementation and in terms of convincing the user to give you their full address.
a. Excel treats them as numbers instead of strings of digits and thus drops the leading 0
b. Developers make assumptions about postal codes based on how they work (or more usually how the developer incorrectly thinks they work) in their own country and these assumptions absolutely do NOT hold in other countries.
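For point (a), the usual defence is to never let the postal code become a number in the first place. A minimal Python sketch, assuming US five-digit ZIPs only (the helper name `normalize_us_zip` and the sample rows are made up); per point (b), this padding rule should not be applied to other countries' codes.

```python
import csv
import io


def normalize_us_zip(raw):
    """Restore leading zeros that a spreadsheet dropped from a 5-digit US ZIP."""
    text = str(raw).strip()
    if text.isdigit() and len(text) <= 5:
        return text.zfill(5)
    return text  # leave ZIP+4, non-US codes, and anything else untouched


# The csv module reads every field as a string, so "09180" keeps its zero.
sample = io.StringIO("name,zip\nAlice,09180\nBob,2134\nCarol,K1A 0B1\n")
rows = [{**row, "zip": normalize_us_zip(row["zip"])} for row in csv.DictReader(sample)]
```

Here Bob's ZIP, already damaged by a spreadsheet round trip, is padded back to five digits, while Carol's Canadian postal code passes through unchanged.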
A relevant guide to geocoding and postal codes: https://opencagedata.com/guides/how-to-think-about-postcodes...
Until very recently I naively assumed that the area of a given zip code would be entirely within the area of some single city or town which would then be entirely within the area of a single county.
It was quite a rude awakening working with software that tries to apply the correct local taxes to a given address and finding that the statement “A given X can contain multiple Y” is true for every possible combination of zip, city, and county.
[1] https://techbio.org/wiki/Addresses/finding-addresses-in-webp...
[2] https://techbio.org/wiki/Addresses/zipcode-trie-regex
I didn't need much precision so truncating seemed an easy way to group stuff.
Oh the surprise. I never again made such assumptions, let's just say I should have gotten a clue from Corsica being 2A and 2B.
Which is derived from longitude and latitude..
https://www.npr.org/2004/04/01/1805651/post-office-calls-for...
I do like this map from the article though and the granularity you can get with zip code when zooming: https://clausa.app.carto.com/map/29fd0873-64cb-42a6-a90d-c83...
GeoJSON data for the map borders: https://github.com/PublicaMundi/MappingAPI/blob/master/data/...
Nielsen DMA regions: https://blocks.roadtolarissa.com/simzou/6459889
http://federalgovernmentzipcodes.us/free-zipcode-database-Pr...