When we started Amazon, this was precisely what I wanted to do, but using Library of Congress triple classifications instead of ISBN.
It turned out to be impossible because the data provider (a mixture of Baker & Taylor (book distributors) and Books In Print) munged the triple classification into a single string, so you could not find the boundaries reliably.
Had to abandon the idea before I even really got started on it, and it would certainly have been challenging to do this sort of "flythrough" in the 1994-1995 version of "the web".
Kudos!
dredmorbius 3 days ago [-]
What are you referring to as the LoC triple classification?
I've spent quite some time looking at both the LoC Classification and the LoC Subject Headings. Sadly the LoC don't make either freely available in a useful machine-readable form, though it's possible to play games with the PDF versions. I'd been impressed by a few aspects of this; one point that particularly sticks in my mind is that the state-law section of the Classification shows a very nonuniform density of classifications amongst states. If memory serves, NY and CA are by far the most complex, with PA a somewhat distant third, and many of the "flyover" states having almost absurdly simple classifications, often quite similar. I suspect that this reflects the underlying statutory, regulatory, and judicial / caselaw complexity.
Another interesting historical factoid is that the classification and its alphabetic top-level segmentation apparently spring directly from Thomas Jefferson's personal library, which formed the origin of the LoC itself.
For those interested, there's a lot of history of the development and enlargement of the Classification in the annual reports of the Librarian of Congress to Congress, which are available at Hathi Trust.
Classification: <https://www.loc.gov/catdir/cpso/lcco/>
Subject headings: <https://id.loc.gov/authorities/subjects.html>
Annual reports:
- Recent: <https://www.loc.gov/about/reports-and-budgets/annual-reports...>
- Historical archive to ~1866: <https://catalog.hathitrust.org/Record/000072049>
Never knew about LoC book Classification till now; based on what I read I'd call it a failed US-wide attempt to standardize US collections (not international ones).
Neat as it is, it's not free to access ($; why??), it's not used outside the US(/Canada), it's not used as a standard by US booksellers or libraries, and it's anglocentric as noted in [0] (an alternative being the Harvard–Yenching Classification, for Chinese books). Also, it's disappointing that, as you say, the states vary greatly in applying that segmentation.
[0]: https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...
The LoC classifications are, so far as I'm aware, free from distribution restrictions under copyright law, being works of the US government, and to that extent it's legal to distribute them for free.
However the LoC doesn't provide machine-readable data for free so far as I'm aware.
You can acquire the entire Classification and Subject Headings as PDF files (also WordPerfect (!!!) and MS Word, possibly some other formats), though that needs some pretty complex parsing to convert to a structured data format.
(I've not tried the WP files, though those might be more amenable to conversion.)
As far as "why" goes, presumably some misguided government revenue-generating and/or business-self-interest legislation and/or regulation, e.g., library service providers who offer LoC Class/SH data, who prefer not to have free competition. (I'm speculating, I don't know this for a fact, though it seems fairly likely.)
https://www.loc.gov/cds/
https://www.loc.gov/cds/classweb/
smcin 3 days ago [-]
But you can't access Classification Web without a $$$ subscription plan!
(from $375 for Single User up to $1900 for 26+ Concurrent Users).
(Aaron Swartz would object. You can access US patent data for free, but not LoC Classification Web)
dredmorbius 3 days ago [-]
Pretty much my point.
I should look into the terms/conditions for that.
smcin 3 days ago [-]
People won't care why it isn't freely accessible; without that it's not going to displace ISBN (or other non-US classifications), not even inside the US, and certainly not outside it.
I'm actually genuinely surprised it isn't freely accessible; Aaron Swartz (RIP) might have gone to war over that, and might have won that war in the court of public opinion.
Hey has anyone in the govt trained an LLM on it? Given title, author, keywords, abstract, etc. predict which LoC triple classification (or Dewey Decimal Classification, or Harvard–Yenching Classification, or Chinese Library Classification, or New Classification Scheme for Chinese Libraries (NCL in Taiwan)) a book would have? That would be neat, and a good way to proliferate its use instead of ISBN. (But the US govt would still assert copyright over the LoC classification.)
dredmorbius 2 days ago [-]
ISBN and LoC Classification serve totally different purposes.
ISBN identifies a specific publication, which may or may not be a distinct work. In practice, a given published work (say, identified by an author, title, publication date, and language) might have several ISBNs associated with it, for trade hardcover, trade paperback, library edition, large print, Braille, audio book, etc. On account of how ISBNs are issued, the principal organisation is by country and publisher. This also means that the same author/title/pubdate tuple may well have widely varying ISBNs for, say, the US, Canadian, UK, Australian, NZ, and other countries' versions of the same English-language text.
There are other similar identifiers such as the LoC's publication number (issued sequentially by year), the OCLC's identifier, or (for journal publications) DOI. Each of these simply identifies a distinct publication without providing any significant classification function.[1]
The LoC Classification, as the name suggests, organises a book within a subject-based ontology. Whilst different editions, formats, and/or national versions of a book might have distinct LoC Classifications, those will be tightly coupled and most of the sequence will be shared amongst those books. The LoC Classification can be used to identify substantively related material, e.g., books on economics, history, military science, religion, or whatever, in ways which ISBN simply cannot.
As I've noted, the Classification is freely available, as PDFs, WordPerfect, and MS Word files, at the URLs I'd given previously. Those aren't particularly useful as machine-readable structured formats, however.
________________________________
Notes:
1. Weasel-word "significant" included as those identifiers provide some classification, but generally by year, publisher, publication, etc., and not specifically classifying the work itself.
dredmorbius 3 days ago [-]
Answering separately:
...I'd call it a failed US-wide attempt to standardize...
The LOC Classification is a system for organising a printed (and largely bound) corpus on physical shelves/stacks. That is, any given document can occupy at most one and only one location, and two or more documents cannot occupy the same location, if they comprise separate bound volumes or other similar formats (audio recordings, video recordings, maps, microfilm/microfiche, etc.).
For digital records, this constraint isn't as significant (your filesystem devs will want to ensure against multiple nonduplicate records having the same physical address, but database references and indices are less constrained).
The Subject Headings provide ways of describing works in standardised ways. Think of it as strongly similar to a tagging system, but with far more thought and history behind it. ("Folksonomy" is a term often applied to tagging systems, with some parts both of appreciation and frustration.)
Where a given work has one and only one call number fitting within the LoC classification, using additional standardised classifications such as Cutter Codes, author and publication date, etc., works typically have multiple Subject Headings. Originally the SH's were used to create cross-references in physical card catalogues. Now they provide look-up affordances in computerised catalogues (e.g., WorldCat, or independently-derived catalogues at universities or the Library of Congress itself). You'll typically find a list of LoC SH's on the copyright page of a book along with the LoC call number.
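To make the one-call-number / many-subject-headings distinction concrete, here is a rough sketch of a catalogue record as data. The field names and the sample values are invented for illustration; this is not the LoC's or any real library system's schema.

    // Hypothetical catalogue record (field names and values invented for illustration).
    interface CatalogueRecord {
      title: string;
      author: string;
      year: number;
      callNumber: string;        // exactly one: determines the single shelf position
      subjectHeadings: string[]; // typically several: each is a separate access point
    }

    const record: CatalogueRecord = {
      title: "Japanese Water Gardens (illustrative)",
      author: "Doe, Jane",
      year: 2003,
      callNumber: "SB423 .D64 2003",            // class/subclass + book number + year (made up)
      subjectHeadings: ["Water gardens", "Gardens, Japanese", "Aquatic plants"],
    };

    // A card catalogue or OPAC effectively inverts the headings into an index:
    const byHeading = new Map<string, CatalogueRecord[]>();
    for (const h of record.subjectHeadings) {
      byHeading.set(h, [...(byHeading.get(h) ?? []), record]);
    }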
Back to the Classification: there are many criticisms raised about LoC's effort, or others (e.g., Dewey Decimal, which incidentally is not free and is subject to copyright and possibly other IP, with some amusing case history). What critics often ignore is that classifications specifically address problems of item storage and retrieval, and as such, are governed by what is within the collection, who seeks to use that material, and how. In the case of state legal classifications, absent further experience with both that section of the classification (section K of the LoC Classification) and works within it, I strongly suspect that the complexity variation is a reflection of the underlying differences in state law (as noted above) and those wishing to reference it. That is, NY, CA, and PA probably have far greater complexity and far more demanding researchers, necessitating a corresponding complexity of their subsections of that classification, than do, say, Wyoming, North Dakota, and South Dakota (among the three smallest sections of state law by my rather faltering recollection).
There are peculiarities in both the Dewey and LoC classifications, particularly in such areas as history (LoC allocates two alphabetic letters, E and F, to "History of the Americas"), geography, religion, etc. In the case of Dewey, Religion (200) is divided into General Religion (200--209), Philosophy and Theory of Religion (210--219), and then the 220s through 280s cover various aspects of Christianity. All other religions get stuffed into the 290s. Cringe.
Going through LoC's Geography and History sections one finds some interesting discontinuities particularly following 1914--1917, 1920, 1939--1945, and 1990. There are of course earlier discontinuities, but the Classification's general outline was specified in the early 1800s, and largely settled by the late 1800s / early 20th century. Both the Classification and Subject Headings note many revisions and deprecated / superseding terms. Some of that might attract the attention of the present Administration, come to think of it, which would be particularly unfortunate.
The fact that the LoC's Classification and SH both have evident and reasonably-well-functioning revision and expansion processes and procedures actually seems to me a major strength of both systems. It's not an absolute argument for their adoption, but it's one which suggests strong consideration, in addition to the extant wide usage, enormous corpus catalogued, and supporting infrastructure.
PaulDavisThe1st 3 days ago [-]
30 years ago, I knew barely any more about library science than I do now, and I know basically nothing now. The idea was that of a dewy eyed (pun intentional) idealist who wanted to build an online experience similar to wandering into the gardening section at <your favorite large bookstore> and then dialing down the water garden part and then the Japanese water garden part.
> the LOC Classification is a system for organising a printed (and largely bound) corpus on physical shelves/stacks. That is, any given document can occupy at most one and only one location, and two or more documents cannot occupy the same location,
The last part of this is not really true. The LoC classification does not identify a unique slot on any shelf, bin or other storage system. It identifies a zone or region of such a storage system where items with this classification could be placed. There can be as many books on Japanese water gardening as authors care to produce - this has no impact on the classification of those books. The only result of the numbers increasing is that some instances of a storage system that utilized this classification (e.g. some bookstores) would need to physically grow.
dredmorbius 2 days ago [-]
The Classification doesn't establish unique positions, no, but it serves as the backbone on which those unique call numbers are generated. First the subject and sub-subject classifications, then specific identifiers generally based on title, author, and publication date.
But the detail of the Classification serves the needs and interests of librarians and readers in that you'll, for fairly obvious reasons, need more detail where there are more works, less where there are fewer, and of course changes to reality, as any good contributing editor to The Hitchhiker's Guide to the Galaxy (the reference, not D. Adams's charming account loosely linked to it) can tell you, play havoc with pre-ordained organisational schemes.
The LoC Classification is itself only one of these. There are other library classifications, as well as a number of interesting ontologies dating back to Aristotle and including both Bacons, Diderot, encyclopedists of various stripes, and more.
CRConrad 20 hours ago [-]
> What are you referring to as the LoC triple classification?
Lines of actually working code; lines of commented-out inactive code; lines of explanatory comments. HTH!
Naah, gotcha, the other "LoC"... But only got it on about the third occurrence.
ilamont 3 days ago [-]
> a mixture of Baker & Taylor (book distributors)
Having dealt with Baker & Taylor in the past, this doesn't surprise me in the least. It was one of the most technologically backwards companies I've ever dealt with. Purchase orders and reconciliations were still managed with paper, PDFs, and emails as of early 2020 (when I closed my account). I think at one point they even had me faxing documents in.
PaulDavisThe1st 2 days ago [-]
A bit tangential but one of my favorite early amzn stories is when a small group from Ingram (at the time, the other major US book distributor) came to visit us in person (they were not very far away ... by design).
It was clear that they were utterly gobsmacked that a team of 3 or 4 people could have done what we had done in the time that we did it. They had apparently contemplated getting into online retail directly, but saw two big problems: (a) legal and moral pushback from publishers who relied on Ingram just being a distributor, and (b) the technological challenge. I think at the time their IT staff numbered about 20 or so. They just couldn't believe what they were seeing.
Good times (there weren't very many of those for me in the first 14 months) :)
layer8 4 days ago [-]
It’s not uncommon for an ISBN to have been assigned multiple times to different books [0]. Thus “all books in ISBN space” may be an overstatement.
There’s also the problem of books with invalid ISBNs, i.e. where the check digit doesn’t match the rest of the ISBN, but where correcting the check digit would match a different book. These books would be outside of the ISBN space assumed by the blog post.
[0] https://scis.edublogs.org/2017/09/28/the-dreaded-case-of-dup...
And possibly not even assigned at all. I looked at the lowest known ISBNs for Czech publishers and a different color stood out: no, https://books.google.cz/books?vid=ISBN9788000000015&redir_es... is not a correct ISBN, I'd say :-) (But I don't know if the book includes such obviously-fake ISBN, or the error is just in Google Books data.)
Finnucane 4 days ago [-]
Publishers buy blocks of ISBNs based on expected need; how they actually assign them may be arbitrary.
rsecora 4 days ago [-]
Impressive presentation.
Note: The presentation reflects the contents of Anna's archive exclusively, rather than the entire ISBN catalog. There is a discernible bias towards a limited range of languages, due to Anna's collection bias to those languages. The sections marked in black represent the missing entries in the archive.
phiresky 4 days ago [-]
That's not entirely accurate, since AA has a separate database for books they have as files, and one for books they only know the metadata of. The metadata database comes from various sources and as far as I know is pretty complete.
Black should mostly be sections that have no assigned books
bloak 3 days ago [-]
I found some books which are available from dozens of online bookshops but which are not in this visualisation. Perhaps they're not yet in any library that feeds into worldcat.org, though some of them were about five years old.
phiresky 3 days ago [-]
Did you search by title or ISBN? If you search by title, the search goes through Google Books, which is very incomplete (since I didn't build a search database myself). If you put in an ISBN13 directly, you'll find a lot more books are included (I'd say you can only find 10-30% of books via the Google Books API)
It's a bit misleading I guess the way I added that feature.
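To illustrate the Google Books dependence, here is a small sketch of an ISBN lookup against the public Google Books volumes endpoint (endpoint and response fields written from memory, so treat the shapes as assumptions and check the API docs); when it returns no items, that's the incompleteness described above.

    // Look up a single ISBN-13 via the public Google Books volumes API.
    async function lookupIsbn(isbn13: string): Promise<string | undefined> {
      const url = `https://www.googleapis.com/books/v1/volumes?q=isbn:${isbn13}`;
      const res = await fetch(url);
      if (!res.ok) return undefined;
      const data: any = await res.json();
      // "items" is often missing entirely for perfectly real ISBNs.
      return data.items?.[0]?.volumeInfo?.title;
    }

    // "9781234567897" is just a syntactically valid placeholder, not a real book.
    lookupIsbn("9781234567897").then((t) => console.log(t ?? "not found"));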
keepamovin 4 days ago [-]
Wow, that is really cool. What an amazing passion project and what an incredible resource!
Zooming in you can see the titles and the barcodes, and hovering gets you a book cover and details. Incredible, everything you could want!
Some improvement ideas: checkbox to hide the floating white panel at top left, and the thing at top right. Because I really like to "immerse" in these visualizations, those floaters lift you out of that experience to some extent, limiting fun and functionality for me a bit.
robwwilliams 3 days ago [-]
Ah, this is a perfect application for Microsoft Silverlight PivotViewer, a terrific web interface we used for neuroimaging until Microsoft pulled the plug.
There is an awe-inspiring TED talk by Gary W. Flake demonstrating its use.
https://m.youtube.com/watch?v=LT_x9s67yWA
And here is our IEEE paper from 2011.
Really sorry this is not a web standard.
https://www.dropbox.com/scl/fi/bl8zkjs3y47q3377hh3ya/Yan_Wil...
When you zoom in it's book shelves! That's so cool
There are more cool submissions here https://software.annas-archive.li/AnnaArchivist/annas-archiv...
Mine is at https://isbnviz.pages.dev
Out of all the VR vapourware, a real life infinite library or infinite museum is the one thing that could conceivably get me dropping cash.
It would be far more interesting as a project which tried to make all legitimately available downloadable texts accessible, say as an interface to:
https://onlinebooks.library.upenn.edu/
MeteorMarc 4 days ago [-]
Possible improvement: paperback and bound editions are shown next to each other, but look the same. Do not know about the e-books.
greenie_beans 4 days ago [-]
those would be totally different isbns. to connect the related editions you'd probably need to get something like the FRBR records for each work and idk if anna's archive has related books like that?
Found the presentation a little overwhelming in the current format. Took a bit to realize the preset part in the upper left actually led to further dataviz vectors like AA (yes/no), rarity, and Google Books inclusion. However, it offers a lot in terms of the visualization and data depth available. Also liked https://archive.anarchy.cool/blog/all-isbns.html#visualizing for the region clustering look.
The preset year part was neat though in and of itself just for looking at how active certain regions and areas have been in publishing. Poland's been really active lately. Norway looks very quiet by comparison. China looks like they ramped up around 2005 and have published huge amounts in the last decade.
The United States has got some weird stuff too. I'd never heard of them, yet Blackstone Audio, Blurb Inc., and Draft2Digital put out huge numbers of ISBNs.
phiresky 3 days ago [-]
It is admittedly pretty noisy, which is somewhat intentional because the focus was on high data density. Here's a bit more minimalistic view (less color, only one text level simultaneously):
https://phiresky.github.io/isbn-visualization/?dataset=all&g...
It could probably be tweaked further to not show some of the texts (the N publishers part), less stuff on hover, etc.
araes 3 days ago [-]
It's ok. Not an issue, since it was meeting the request of the competition, not some commenter on the web.
On the data density part, I actually noticed most of the covers for the books are there too, which was kind of a cool bit. Not sure if it's feasible, yet it would be neat to show them as if they were the color of their binding in their pictures.
Makes me almost want a Skyrim style version of this idea, where they're all little 3D books on their 3D shelves, and you can wander down the library aisles by section. Click a book like in Skyrim and put it in your inventory or similar. Thought this mod [1] especially was one of the coolest community additions to Skyrim when it came out. Also in the "not sure if it's feasible" category.
[1] Book Covers Skyrim, https://www.nexusmods.com/skyrimspecialedition/mods/901?tab=...
That's a really nice view! Made me think of this map of IP addresses arranged as a Hilbert curve I saw in this Tom7 video: (the rest of the video is wildly good if you haven't seen it)
https://youtu.be/JcJSW7Rprio?si=wzFq4p61qYmpT59x&t=360
pfedak 3 days ago [-]
I think you can reasonably think about the flight path by modeling the movement on the hyperbolic upper half plane (x would be the position along the linear path between endpoints, y the side length of the viewport).
I considered two metrics that ended up being equivalent. First, minimizing loaded tiles assuming a hierarchical tiled map. The cost of moving x horizontally is just x/y tiles, using y as the side length of the viewport. Zooming from y_0 to y_1 loads abs(log_2(y_1/y_0)) tiles, which is consistent with ds = dy/y. Together this is just ds^2 = (dx^2 + dy^2)/y^2, exactly the upper-half-plane metric.
Alternatively, you could think of minimizing the "optical flow" of the viewport in some sense. This actually works out to the same metric up to scaling - panning by x without zooming, everything is just displaced by x/y (i.e. the shift as a fraction of the viewport). Zooming by a factor k moves a pixel at (u,v) to (k*u,k*v), a displacement of (u,v)*(k-1). If we go from a side length of y to y+dy, this is (u,v)*dy/y, so depending how exactly we average the displacements this is some constant times dy/y.
Then the geodesics you want are just circular arcs with centers at y=0 (i.e. on the boundary), although you need to do a little work to compute the motion along the curve. Once you have the arc, from θ_0 to θ_1, the total time should come from integrating ds = r·dθ/y = dθ/sin(θ), so to be exact you'd have to invert t = ln(csc(θ)-cot(θ)), so it's probably better to approximate. edit: this works out to θ = 2·atan(e^t) (equivalently atan2(2·e^t, 1-e^(2t))), which is not so bad at all.
Comparing with the "blub space" logic, I think the effective metric there is ds^2 = dz^2 + (z+1)^2 dx^2, polar coordinates where z=1/y is the zoom level, which (using dz=dy/y^2) works out to ds^2 = dy^2/y^4 + dx^2*(1/y^2 + ...). I guess this means the existing implementation spends much more time panning at high zoom levels compared to the hyperbolic model, since zooming from 4x to 2x costs twice as much as 2x to 1x despite being visually the same.
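For anyone who wants to play with the hyperbolic version, here is a minimal sketch of the geodesic in the parametrization above (x = pan position, y = viewport side length). Purely illustrative; this is not the project's actual camera code.

    // Point on the hyperbolic geodesic between a and b at normalized time t in [0,1].
    // Uniform steps in t give constant speed under ds^2 = (dx^2 + dy^2)/y^2.
    type P = { x: number; y: number };

    function geodesicPoint(a: P, b: P, t: number): P {
      if (Math.abs(a.x - b.x) < 1e-12) {
        // Pure zoom: the geodesic is a vertical line, exponential in y.
        return { x: a.x, y: a.y * Math.pow(b.y / a.y, t) };
      }
      // Semicircle centred on the x-axis through both points.
      const c = (a.x * a.x + a.y * a.y - b.x * b.x - b.y * b.y) / (2 * (a.x - b.x));
      const r = Math.hypot(a.x - c, a.y);
      // Arc-length parameter u(θ) = ln(tan(θ/2)), hence θ(u) = 2·atan(e^u).
      const u = (p: P) => Math.log(Math.tan(Math.atan2(p.y, p.x - c) / 2));
      const s = u(a) + t * (u(b) - u(a));
      const theta = 2 * Math.atan(Math.exp(s));
      return { x: c + r * Math.cos(theta), y: r * Math.sin(theta) };
    }

    // Example: fly to a target 1000 viewport-widths away, sampled at 5 steps.
    const from: P = { x: 0, y: 1 }, to: P = { x: 1000, y: 1 };
    for (let i = 0; i <= 4; i++) console.log(geodesicPoint(from, to, i / 4));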
pfedak 3 days ago [-]
Actually playing around with it the behavior was very different from what I expected - there was much more zooming. Turns out I missed some parts of the zoom code:
Their zoom actually is my "y" rather than a scale factor, so the metric is ds^2 = dy^2 + (C-y)^2 dx^2 where C is a bit more than the maximal zoom level. There is some special handling for cases where their curve would want to zoom out further.
Normalizing to the same cost to pan all the way zoomed out (zoom=1), their cost for panning is basically flat once you are very zoomed in, and more than the hyperbolic model when relatively zoomed out. I think this contributes to short distances feeling like the viewport is moving very fast (very little advantage to zooming out) vs basically zooming out all the way over larger distances (intermediate zoom levels are penalized, so you might as well go almost all the way).
SrTobi 3 days ago [-]
Hi, I was the one nerdsniped :) In the end I don't think blub space is the best way to do the whole zoom thing, but I was intrigued by the idea and had already spent too much time on it, and the result turned out quite good.
The problem is twofold: which path should we take through the zoom level, x, and y, and how fast should we move at any given point (and here "moving" includes zooming in/out as well).
That's what the blub space would have been cool for, because it combines speed and path into one.
So when you move linearly with constant speed through the blub space you move at different speeds at different zoom levels in normal space, and also the path and speed changes are smooth.
Unfortunately that turned out not to work quite as well... even though the flight path was alright (although not perfect), the movement speeds were not what we wanted...
I think that comes from the fact that blub space is a linear combination of speed and the z component. So if you move with speed s at ground level (let's say z=1) you move with speed z at zoom level z (higher z means more zoomed out). But as you pointed out, normal zoom behaviour is quadratic, so at zoom level z you move with speed z². But I think there is no way to map this behaviour to a Euclidean 2d/3d space (or at least I didn't find one; I can't really prove right now that it's not possible xD)
So to fix the movement speed we basically sample the flight path and just move along it according to the zoom level at different points on the curve... Basically, even though there are durations in the flight path calculation, they get overwritten by TimeInterpolatingTrajectory, which is doing all the heavy work for the speed.
For the path... maybe a quadratic form with something like x^4 with some tweaking would have been better, but the behaviour we had was good enough :) Maybe the question we should ask is not about the interesting properties of non-euclidean spaces, but what makes a flightpath+speed look good
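For what it's worth, the resample-then-retime idea can be sketched in a few lines. The names and the speed(z) = z² law here are assumptions for illustration, not the actual TimeInterpolatingTrajectory code:

    // Assign a time to each sample of a flight path so that ground-space speed
    // scales with the square of the local zoom level (zoomed out = move faster).
    type Sample = { x: number; y: number; z: number }; // z = zoom level, larger = further out

    function assignTimes(samples: Sample[]): number[] {
      const times = [0];
      for (let i = 1; i < samples.length; i++) {
        const a = samples[i - 1], b = samples[i];
        // Ground-space pan distance, plus a crude term so pure-zoom segments
        // still take some time.
        const dist = Math.hypot(b.x - a.x, b.y - a.y) + Math.abs(b.z - a.z);
        const zMid = (a.z + b.z) / 2;
        times.push(times[i - 1] + dist / (zMid * zMid)); // speed ~ z^2 (assumed)
      }
      const total = times[times.length - 1] || 1;
      return times.map((t) => t / total); // normalize to one unit of animation time
    }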
pfedak 3 days ago [-]
The nice thing about deciding on a distance metric is that it gives you both a path (geodesics) and the speed, and if you trust your distance metric it should be perceptually constant velocity. I agree it's non-euclidean, I think the hyperbolic geometry description works pretty well (and has the advantage of well-studied geodesics).
I did finally find the duration logic when I was trying to recreate the path; I made this shader to try to compare:
https://www.shadertoy.com/view/l3KBRd
phiresky 14 hours ago [-]
Wow, amazing how you managed to convert that into a compact shader, including a representation of the visualization!
casey2 3 days ago [-]
Regarding ISBN: the first section consists of a 3-digit number issued by GS1; they have only issued 978 and 979. All other sections are issued by the International ISBN Agency.
The second section identifies a country, geographical region or language area; it consists of a 1-5 digit number. The third section, up to 7 digits, is given on request of a publisher to the ISBN agency; larger publishers (publishers with a large expected output) are given smaller numbers (as they get more digits to play with in the 4th section). The fourth, up to 6 digits, is given to "identify a specific edition, of a publication by a specific publisher in a particular format". The last section is a single check digit, chosen so that the first 12 digits weighted alternately 1, 3, 1, 3, ... plus the check digit sum to a multiple of 10.
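In plain code the check-digit rule reads roughly like this (sketch; the example number is the commonly used sample ISBN 978-3-16-148410-0):

    // ISBN-13 (EAN-13) check digit: weight the first 12 digits 1,3,1,3,...,
    // then pick the digit that brings the weighted sum to a multiple of 10.
    function isbn13CheckDigit(first12: string): number {
      const sum = [...first12].reduce(
        (acc, ch, i) => acc + Number(ch) * (i % 2 === 0 ? 1 : 3), 0);
      return (10 - (sum % 10)) % 10;
    }

    console.log(isbn13CheckDigit("978316148410")); // 0, i.e. 978-3-16-148410-0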
From this visualization it's most apparent that the publishers "Create Space" (aka Great UNpublished, BookSurge) and "Forgotten Books" should have been given a small number for the third section. (Though in my opinion self-published editions and low-value spam shouldn't get an ISBN at all, or rather they should be with the other independently published work @9798.)
They also gave Google tons of space but it appears none of it has been used as of yet.
zellyn 4 days ago [-]
This really drives home how scattershot organizing books by publisher is. Try searching for "Harry Potter and the Goblet of Fire" and clicking on each of the results in turn: they're nowhere near each other.
Or try "That Hideous Strength" by "C.S. Lewis" vs "Clive Staples Lewis", and suddenly you're arcing across a huge linear separation.
Still, given that that's what we use, this visualization is lovely. Imagine if you could open every book and read it…
Finnucane 4 days ago [-]
Why would you expect otherwise? Titles are assigned ISBNs by publishers as they are being published. Books published simultaneously as a set might have sequential numbers, but otherwise not. Books separated by a year or more are not going to have related numbers. It's an inventory tracking mechanism, it has no other meaning.
I know this isn't an AMA, but may I ask, how is running a publishing house working out for you? From the outside, having a small publishing house sounds like an uphill battle on all fronts. What is the main driver to become a publisher - a hobby turned into a profession?
bambax 3 days ago [-]
It's mostly still a hobby. I publish very few books and could do without being a "publisher". But being one allows me to exist in the official French databases where traditional bookshops can find my books and order with one click. Without that, they would have to search on Google, find a phone number or an email address, and they probably wouldn't even bother.
https://i.imgur.com/mhw6Mub.png
It's not a recipe for getting rich! But it works for me (and costs almost nothing).
phiresky 3 days ago [-]
Huh, that text (and the barcodes) is very offset from where it should be. Would you mind sharing what OS and browser you are using, and whether this text weirdness was temporary or there all the time?
bambax 3 days ago [-]
This was on Chrome 109.0.5414.120, last available version for Windows 7. (And yes I know, I know, it's very bad to still run Win7 ;-)
Jun8 4 days ago [-]
Great description of the ISBN format and visualization. TIL that the 978- prefix was “Bookland”, i.e. a fictional country prefix that may be thought of as “Earth”. It has expanded to 979-, which was originally “Musicland”.
This probably means that in the (hopefully near) future where we have extraterrestrial publishing (most likely on the Moon or Mars) we’ll need another prefix.
quink 3 days ago [-]
Not really. The 978 prefix, prepended to what was previously the ISBN-10 namespace (along with a recalculation of the checksum), puts most books into the EAN-13 namespace. EAN is meant for unique identifiers (“Numbers”) of “Articles” in “Europe”. Later that got changed to “International”, but most still prefer the acronym EAN.
So 978 really is Bookland, as it used to be, and Earth, but the EAN-13 namespace as a whole really does refer to Earth as well. That said, the extraterrestrials can get a prefix just the same?
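The conversion itself is mechanical; a quick sketch (assuming a well-formed ISBN-10, hyphens optional):

    // ISBN-10 -> ISBN-13 ("Bookland"): drop the old check digit, prefix 978,
    // then recompute the EAN-13 check digit over the new 12 digits.
    function isbn10to13(isbn10: string): string {
      const core = isbn10.replace(/-/g, "").slice(0, 9); // group + publisher + title part
      const first12 = "978" + core;
      const sum = [...first12].reduce(
        (acc, ch, i) => acc + Number(ch) * (i % 2 === 0 ? 1 : 3), 0);
      return first12 + String((10 - (sum % 10)) % 10);
    }

    console.log(isbn10to13("0-306-40615-2")); // "9780306406157"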
ks2048 3 days ago [-]
Does anyone see where the raw data is downloaded? I see this [1], but it looks like it might just be the list of ISBNs and not the titles. I suppose following the build instructions for this page [2] would do it, but I'd rather not install these js tools.
[1] (Gitlab page) https://software.annas-archive.li/AnnaArchivist/annas-archiv...
[2] https://github.com/phiresky/isbn-visualization
This is a wonderful submission to Anna's archive [1]. I really love people pushing the boundaries of shadow source initiatives that benefit all of us, especially providing great code and design. I can't emphasize enough the net plus that open source, BitTorrent, and shadow libraries have had in the world. You can also make the case that LLMs wouldn't have been possible without shadow libraries; there's just no way of getting enough data to learn.
https://software.annas-archive.li/AnnaArchivist/annas-archiv...
Although I don't know if this was the winning entry or not
https://news.ycombinator.com/item?id=42652577 - Visualizing All ISBNs (2025-01-10, 139 comments)
Things I love:
- How every book has a title & link to Google Books
- Information density - You can see various publishers, sparse areas of the grid, and more
- Visualization of empty space -- great work at making it look like a big bookshelf!
Improvements?
- Instead of 2 floating panels, collapse to 1
- After clicking a book, the tooltip should disappear once you zoom out/move locations!
- Sort options (by year, by publisher name)
- Midlevel visualization - I feel like at the second zoom level (groupings of publishers), there's little information that it provides besides the names and relative sparsity (so we can remove the ISBN-related stuff on every shelf). Also, since the shelves have a fixed width, I can know there are 20 publishers, so no need! If we declutter, it'll make for a really nice physical experience!
youssefabdelm 4 days ago [-]
Does anyone know if there's an API where I could plug in ISBN and get all the libraries in the world that have that book?
I know Worldcat has something like this when you search for a book, but the API, I assume, is only for library institutions, and I'm neither a library nor an institution.
artninja1988 4 days ago [-]
Did they get the bounty?
vallode 4 days ago [-]
I believe the bounty is closed but they haven't announced the winner(s).
IOUnix 4 days ago [-]
This would be absolutely incredible to incorporate into VR. You could create such an intuitive organizational method adding a 3rd dimension for displaying.
pbronez 4 days ago [-]
Super cool. Love that you can zoom all the way in and suddenly it looks like a bookshelf.
When I got down to the individual book level, I found several that didn’t have any metadata- not even a title. There are hyperlinks to look up the ISBN on Google books or World Cat, and in the cases I tried WorldCat had the data.
So… why not bring the worldcat data into the dataset?
Hnrobert42 4 days ago [-]
I didn't appreciate the difficulty of the Fly to Book path calculation until I read the description!
karunamurti 1 days ago [-]
Nice, I found my book in the rack somewhere.
jaakl 3 days ago [-]
One more zoom LoD please: to the actual pages of the books!
fnord77 4 days ago [-]
there's a massive block under "German Language" that's almost entirely English
https://i.imgur.com/LKDuTJP.png
Good find, I think those would be books written in English published by German publishers. The blog post discusses how ISBNs are allocated ... specifically the ones you picture are published by Springer, which is a German company that publishes in the English language.
Considering a specific example: "Forecasting Catastrophic Events in Technology, Nature and Medicine". The website's use of "Group 978-3: German language" is a bit of a misnomer, if they had said "Group 978-3: German issued" or "German publisher" it would be clearer to users.
phiresky 4 days ago [-]
The names of the groups are taken directly from the official allocation XML from the International ISBN Organization. This is what they call it, I assume to distinguish it from "Germany". Maybe "German language publishers" would be appropriate.
siddharthgoel88 4 days ago [-]
What an amazing visualization.
Ekaros 4 days ago [-]
No wonder search did not work for one book I tried. First it did not have the prefix and then the information on either the book or the database had different numbers...
Searched for some books and didn't find them. I assume this isn't a complete list of all USA-based books with ISBNs published?
phiresky 4 days ago [-]
Books visible should be fairly complete as far as I know. But the search is pretty limited (and dependent on your location) because that uses the Google Books API. If you put in an ISBN13 directly, that should work more reliably.
sinuhe69 4 days ago [-]
I wonder if one day we will have an AI that reads, summarizes and catalogues all the published books? A super librarian :) Imagine being able to ask questions like: "What have they written about AI in the 21st century?". Even better: "What did people not think of when they pursued AGI in the 21st century, which later led to their extinction?" ;)
pillefitz 4 days ago [-]
Since most foundational models have been trained on illegally acquired books, this info should be already baked in.
sinuhe69 3 days ago [-]
They had access to only a tiny bit of the entire written words of the world. Not all books are available in electronic formats. You can see from the visualization in the article that we don't even have the titles for a lot of published books with ISBNs. And even so, books with ISBNs comprise only a fraction of all the books ever written. Not to mention books in various (minor) languages of the world.
soheil 3 days ago [-]
Needs a better minimap that moves as the map is zoomed in.
maCDzP 3 days ago [-]
I am guessing this is going to win the bounty by Anna’s Archive?
godber 4 days ago [-]
This is really exceptional work, and still works on my ten year old iPad. Great job!
tekkk 4 days ago [-]
Wow, what a cool little project, congratulations on shipping!
est 4 days ago [-]
I am amazed to see how many STEM books China published.
est 4 days ago [-]
can the graph be rotated vertically? The text is needlessly difficult to read.