The Tragedy of Google Books (2017) (theatlantic.com)
philipkglass 1 hours ago [-]
These Google scans are also available in the HathiTrust [1], an organization built from the big academic libraries that participated in early book digitization efforts. The HathiTrust is better about letting the public read books that have actually fallen into the public domain. I have found many books that are "snippet view" only on Google Books but freely visible on HathiTrust.

If you are a student or researcher at one of the participating HathiTrust institutions, you can also get access to scans of books that are still in copyright.

The one advantage Google Books still has is that its search tools are much faster and sometimes better, so it can be useful to search for phrases or topics on Google Books and then jump over to HathiTrust to read specific books surfaced by the search.

[1] https://www.hathitrust.org/

caseysoftware 4 minutes ago [-]
I worked at the Library of Congress on their Digital Preservation Project, circa 2001-2003. The stated goal was to "digitize all of the Library's collections" and while most people think of books, I was in the Motion Picture Broadcast and Recorded Sound Division.

In our collection were Thomas Edison's first motion pictures, wire spool recordings from reporters at D-Day, and LPs of some of the greatest musicians of all time. And that was just our Division. Others - like American Heritage - had photos from the US Civil War and more.

Anyway, while the rights information is one big, ugly, tangled web, the other side is the hardware to read the formats. Much of the media is fragile and/or dangerous to use, so you have to be exceptionally careful. Then you have to document all the settings you used, because imagine that three months from now you learn some filter you used was wrong or the hardware was misconfigured: you need to go back and understand what was affected and how.

Cool space. I wish I'd worked there longer.

caseysoftware 31 seconds ago [-]
Also.. it was fun learning the answer to "what is the work?"

If you have an LP or wire spool recording, the audio is the key, obvious work. But then you have the album cover, the spool case, and the physical condition of the media. Being able to see an album cover or read a reporter's notes/labeling is almost as important as the audio.

yonran 26 minutes ago [-]
> Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.

I was at Google on a team adjacent to Dan Clancy when he was most excited about the Authors’ Guild negotiations to publish orphan works and create a portal to pay copyright holders who signed up. I recall that one opponent he was frustrated with was Brewster Kahle of the Internet Archive, who filed a jealous amicus brief complaining that the Authors’ Guild settlement would not grant him access to publishing orphan works too. In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing, which is what actually happened in the 15 years since then.

Since then, of course, Brewster Kahle launched an e-library without legal authorization anyway, which will probably be the death of the current organization that runs the Internet Archive. Tragic all around.

Zigurd 1 hours ago [-]
O'Reilly, for whom I've been a lead author and co-author, did this: https://www.oreilly.com/pub/pr/1042

They call it Founder's Copyright. They also use Creative Commons. The goal is to make out-of-print books available at no cost.

card_zero 58 minutes ago [-]
> A complete list of available titles is at www.oreilly.com/openbook

Exciting!

Follows link

Link no longer exists, gets O'Reilly front page instead

"Introducing the AI Academy, Help your entire org put GenAI to work"

Thanks O'Reilly.

MollyRealized 53 minutes ago [-]
It's okay, I'll just check the Wayb--shit
ErikAugust 46 minutes ago [-]
“Page had always wanted to digitize books. Way back in 1996, the student project that eventually became Google—a “crawler” that would ingest documents and rank them for relevance against a user’s query—was actually conceived as part of an effort “to develop the enabling technologies for a single, integrated and universal digital library.” The idea was that in the future, once all books were digitized, you’d be able to map the citations among them, see which books got cited the most, and use that data to give better search results to library patrons. But books still lived mostly on paper. Page and his research partner, Sergey Brin, developed their popularity-contest-by-citation idea using pages from the World Wide Web.”

Larry Page had some cool ideas… can’t imagine Books will ever be resurrected, unfortunately.

dekhn 40 minutes ago [-]
He really wanted to digitize all of them to provide reference and training data for early language models (well before LLMs, transformers, etc).

He also had a plan (with George Church) to build enormous warehouses holding large-scale biology research infrastructure right next to Google data centers, because most biology research is done at locations that have reached their limit on computational/storage capacity.

Larry had many good ideas but he struggled to get the majority of them off the ground. For example, when Trump was president and invited all the major tech leaders, Larry came with a plan to upgrade the US electrical system with long-range DC.

svilen_dobrev 1 hours ago [-]
This seems to be the fate of knowledge/content that stays in institutions which were built with the idea of collecting it and growing it, but have turned into walled gardens/crypts of sorts. Rot/rust and be forgotten.

A very cynical and dark view is that the new things/people need that oblivion in order to feel great, so they don't have to be compared with the old, greater ones. Rewriting history as the current powers-that-be see fit is easier this way.

Or maybe it's just collective stupidity? Or societal immaturity?

(I am coming from a completely different killed project on a different continent, but the idea is the same.)

kyleee 33 minutes ago [-]
I think you are on to something. People frequently don’t want to grapple with and understand what has been done before; they prefer to just wing it and move forward on their own.
xipho 1 hours ago [-]
A huge proportion of this corpus is found in the Hathi Trust (see https://www.hathitrust.org/the-collection/). We have had a grant to crawl and derive an index on it via their supercomputing resources. I'm sure they are looking to LLM proposals, though they are exceedingly careful about the copyright issues.

https://www.hathitrust.org/

jsemrau 48 minutes ago [-]
>I'm sure they are looking to LLM proposals

Well, it is a use case for this challenge https://www.kaggle.com/competitions/gemini-long-context

fredgrott 1 hours ago [-]
Thank you, as some of us were looking for something to replace the digital book library part of archive.org.
thayne 2 hours ago [-]
IMO if a work is out of print (or equivalent depending on the medium) for more than a few years, it should be released into the public domain. Or maybe into something like the public domain, but with attribution required.
kps 54 minutes ago [-]
Like trademark: Use it or lose it.

(The reality is that publishers would put lazy photocopies up for sale at ten zillion dollars apiece.)

pfdietz 19 minutes ago [-]
So, e-books are either immediately out of print, or never out of print?
tightbookkeeper 5 minutes ago [-]
What if we applied the simple test that the book was originally published on paper and no other printings have occurred (digital or paper)?
giraffe_lady 1 hours ago [-]
Then every book will be immediately out of print after its initial run, while the not-quite-a-cartel of publishers all decline to print it until it hits the point where they no longer have to pay the author.
Jtsummers 60 minutes ago [-]
Then the publisher loses out on exclusive publishing rights and also loses money. It's in their interests to keep it in print so long as it's a profitable book, even if they have to pay some percentage to the author. Once it goes into public domain every publisher can reprint it and the original publisher has to compete with them on price.
giraffe_lady 49 minutes ago [-]
ok
submeta 42 minutes ago [-]
With Library Genesis, who needs Google Books anymore? I buy books physically to support the author/s and download an epub version from said site to my Kindle. The physical books I hardly read; they are for my shelf. I love the feeling of printed books, but I read in bed, and it’s easier to hold an ebook. Also I read when I commute. It’s lighter to have my Kindle Oasis with me with tons of books on it.
ghaff 17 minutes ago [-]
There’s the everything available online for free mindset. But, yes, I’ve basically donated all my books that were in the public domain. And, in general, have been massively purging my book collection of stuff I won’t realistically read again.
Animats 50 minutes ago [-]
We need a Copyright Term Reduction Act.

It's time. 50 years, renewal is possible but expensive.

pvg 2 hours ago [-]
datadrivenangel 2 hours ago [-]
Thanks Paul!
pvg 1 hours ago [-]
Wrong number, I'm afraid.
senkora 2 hours ago [-]
I’m sure the lawyers will eventually figure out a way to train an LLM on them.
datadrivenangel 2 hours ago [-]
They probably already have! It seems like an amazing training dataset even if you can't share source data.
amelius 50 minutes ago [-]
How do you train an LLM such that it is guaranteed to never regurgitate its training data?
anoncow 1 hours ago [-]
Sad and criminal.
2OEH8eoCRo0 1 hours ago [-]
The tragedy is that Google is tasked with this at all. It would be cool if public libraries could work together on a massive public digital library. This shouldn't be Google's responsibility.
Jtsummers 58 minutes ago [-]
Google wasn't tasked (by a third party) with this, they chose to do it.
LisaDziuba 2 hours ago [-]
[dead]
pluc 2 hours ago [-]
[flagged]
andrewstuart 2 hours ago [-]
Google must be tempted to put them in an LLM.
bborud 1 hours ago [-]
It would surprise me greatly if they haven't already.