Stop Publishing Garbage Data, It's Embarrassing (successfulsoftware.net)
stared 22 minutes ago [-]
I dislike the premise. I mean, good data is wonderful.

But if institutions are expected to release clean data or nothing, almost always you get the latter.

What is important is to offer as much methodology and as many caveats as possible, even informally. There is a difference between "data covers 72% of companies registered in..." and expecting the data to be complete and authoritative when it is not.

(Source: 10 years ago I worked a lot with official data. Like all data, it required cleaning.)

chaps 24 minutes ago [-]
I have mixed feelings about this. On one hand, yeah, stop publishing garbage data, but as a FOIA nerd... I'll take the data in whatever state it's in. I'm not personally going to be able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data knowing that it has garbage data within? Hell no. Instead, we should learn and cultivate techniques to work with shit data. Should I attempt to clean it? Sure. But it becomes a liability problem very, very quickly.
hermitcrab 18 minutes ago [-]
So you expect the 1000s of people trying to use the fuel price data to each individually clean and validate it, rather than the supplier doing it?
chaps 16 minutes ago [-]
What...?
torginus 36 minutes ago [-]
Data and metrics are 90% of what upper management sees of your project. You might not care about it, and treat it as an afterthought, but organizationally it's almost the most important thing about it.

People who don't heed this advice get to discover it for themselves (I sure did)

If you can't make the data convincing, you'll lose all trust, and nobody will do business with you.

agent_anuj 23 minutes ago [-]
It is not just embarrassing, it can potentially kill your demo, project, or even product, as users will first look at the data and then at the tech behind it. If the data is wrong, they assume the tech doesn't work. I never took data seriously in my demos during the first 10 years of my career, and no wonder the audience rejected most of my work even though it was backed by solid platforms.
GMoromisato 31 minutes ago [-]
Clean data is expensive--as in, it takes real human labor to obtain clean data.

One problem is that you can't just focus on outliers. Whatever pattern-matching you use to spot outliers will end up introducing a bias in the data. You need to check all the data, not just the data that "looks wrong". And that's expensive.

In clinical drug trials, we have the concept of SDV--Source Data Verification. Someone checks every data point against the official source record, usually a medical chart. We track the % of data points that have been verified. For important data (e.g., Adverse Events), the goal is to get SDV to 100%.

As you can imagine, this is expensive.

Will LLMs help to make this cheaper? I don't know, but if we can give this tedious, detail-oriented work to a machine, I would love it.

hermitcrab 25 minutes ago [-]
>Clean data is expensive--as in, it takes real human labor to obtain clean data.

Yes, data can contain subtle errors that are expensive and difficult to find. But the 2nd error in the article was so obvious that a bright 10 year old would probably have spotted it.

gdulli 22 minutes ago [-]
Why would you give this sort of work to a machine that can't be responsibly used without checking its output anyway?
GMoromisato 15 minutes ago [-]
It's not obvious to me that LLMs can't be made reliable.
Phlogistique 29 minutes ago [-]
I think it's better to publish the garbage data than not to publish it, though. I would worry about complaining too much, lest they just decide to stop publishing it because it creates bad PR.
nick__m 23 minutes ago [-]
As long as the garbage data is authentic and the method used to produce it is adequately detailed, I agree with you that: "it's better to publish the garbage data than to not publish it"

But fake data, or garbage data without the method, is better left unpublished!

hermitcrab 22 minutes ago [-]
Hard disagree on that. They just need a basic smell test before they put it out.
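A "basic smell test" for a pump-price feed could be as simple as a plausibility band. This is only a sketch; the thresholds and field format are assumptions, not anything from the actual Fuel Finder feed.

```python
# Hypothetical smell test for UK pump prices, in pence per litre.
# Anything outside a plausible band is held back for human review
# rather than published as-is. The 80-300 band is an assumption.
def smells_ok(price_ppl: float) -> bool:
    """True if a price in pence/litre looks plausible."""
    return 80.0 <= price_ppl <= 300.0

prices = [142.9, 1429.0, 0.0, 155.7]  # 1429.0 looks like a misplaced decimal
suspect = [p for p in prices if not smells_ok(p)]
print(suspect)  # [1429.0, 0.0]
```

A check this crude won't catch subtle errors, but it would have caught a value ten times too large before it reached a chart.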
Tempest1981 18 minutes ago [-]
Agree. Maybe just add a Disclaimer.md file.
albert_e 18 minutes ago [-]
Concluding passage:

> Authors should have their work proof read

Agreed.

Opening passage:

> A quick plot of the latitude and longitude shows some clear outliners

"outliners"

Ouch!

hermitcrab 13 minutes ago [-]
OP here. Ouch indeed. I did actually get it proofread. But that was missed. I can't fire my proofreader, as we are married. ;0)

Now fixed.

rdiddly 10 minutes ago [-]
Not fixed at this hour
hermitcrab 4 minutes ago [-]
You might need to do a refresh.
mlaretallack 31 minutes ago [-]
I saw the RAC one this morning and thought I was misreading the graph, as why would the RAC publish such an obvious mistake?

I have written my own Home Assistant custom component for the UK fuel finder data, and yes, the data really is that bad.

alias_neo 15 minutes ago [-]
I was looking at that RAC chart this morning. Given it's Sunday, and I was reading before my morning coffee, I'm not ashamed to say it took me a good few seconds of zooming in and out to realise they'd used a decimal point where a comma should have been.

Easy typo to make, but seriously, does no one even take a cursory look at the charts when publishing articles like this? The chart looks _obviously_ wrong, so imagine how many are only slightly wrong and are missed.

The fuel prices one could surely be solved with a tiny bit of validation; are the coordinates even within a reasonable range? Fortunately, in the UK, it's really easy to tell which is latitude and which is longitude due to one of them being within a digit or two of zero on either side.
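That range check can be sketched in a few lines. The bounds below are rough approximations for Great Britain, chosen for illustration; the point is that UK longitudes sit near zero while latitudes sit near fifty, so a swapped pair fails instantly.

```python
# Minimal sketch: sanity-check UK station coordinates before publishing.
# Bounds are approximate for Great Britain (Scilly to Shetland,
# west coast of Ireland to East Anglia) and are assumptions.
UK_LAT = (49.8, 60.9)
UK_LON = (-8.7, 1.8)

def valid_uk_coords(lat: float, lon: float) -> bool:
    """True if (lat, lon) could plausibly be a UK location."""
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]

print(valid_uk_coords(51.5, -0.12))  # True  (London)
print(valid_uk_coords(-0.12, 51.5))  # False (lat/lon swapped)
```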

hermitcrab 46 minutes ago [-]
If you are putting out data without doing even the most basic validation, then you should be ashamed.
ramon156 38 minutes ago [-]
What about most of Show HN's projects nowadays? Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed?

What about people who don't know how their own code works? Despite it working flawlessly? I'm asking because I don't really know.

Calazon 34 minutes ago [-]
> Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed?

Yes.

akudha 21 minutes ago [-]
How is it fair to compare a Show HN project with official government datasets? People depend on government datasets, multi-billion dollar businesses are built on top of them. A show HN project is typically someone building it in a weekend. They’re not even remotely in the same league.

Sure, it is expensive to check every number, but at least some of it can be automated and flagged for human review, no? Swapped lat/long numbers, for example.
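Flagging likely swaps for review, rather than silently fixing or dropping them, could look something like this. It's a sketch with hypothetical record fields and approximate UK bounds, not the actual dataset schema.

```python
# Hypothetical sketch: auto-flag records whose lat/lon look swapped,
# so a human reviews a short list instead of checking every row.
# UK bounds are approximate assumptions.
def looks_uk(lat: float, lon: float) -> bool:
    return 49.8 <= lat <= 60.9 and -8.7 <= lon <= 1.8

def flag_swapped(records: list[dict]) -> list[dict]:
    """Return records that fail as-is but pass with lat/lon exchanged."""
    return [r for r in records
            if not looks_uk(r["lat"], r["lon"])
            and looks_uk(r["lon"], r["lat"])]

stations = [
    {"id": 1, "lat": 51.5, "lon": -0.12},  # fine
    {"id": 2, "lat": -0.12, "lon": 51.5},  # looks swapped
]
print([r["id"] for r in flag_swapped(stations)])  # [2]
```

The key design choice is that the machine only narrows the search; a person still signs off on each flagged row, which sidesteps the bias problem of auto-"correcting" outliers.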

add-sub-mul-div 34 minutes ago [-]
This has become a spam site for AI shovelware projects that are nearly always posted by accounts with no activity here outside of self promotion.
subscribed 14 minutes ago [-]
If they publish a lie they should be ashamed, even if their lie is orders of magnitude less impactful.

And if someone publishes flawless code but has no idea how it works, it's not their code, quite clearly, and they should be ashamed if they claim it is.

It's just, like, my opinion, but I like it :)

hermitcrab 32 minutes ago [-]
>Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed?

Yes. Lying is bad, even if some people are trying hard to normalise it.

>What about people who don't know how their own code works? Despite it working flawlessly?

I think that is fine, as long as you aren't making untrue claims.
