Anatoly Vorobey (avva) wrote,

on how I got hold of a book

(this entry is unlikely to interest even programmers)

Below the cut is a copy of what I told a close friend in chat today about my stubbornness and stupidity. It's in English, sloppy, with slang and abbreviations (because it's to a close friend), and with a few details that only programmers will understand, though there aren't many of those.


>i was reading some contrarian guy
>physicist turned financier called scott locklin

>he's too ranty for my taste e.g. https://scottlocklin.wordpress.com/2020/06/19/on-cultures-that-build/
>but b/c i'm anal i opened a bunch of tabs to his blog posts and then also looked at his HN comments, which are sometimes amusing rants
>anyway in one comment some time ago he said there're great books that were popular but fell out of favor for essentially random or unknown reasons and now aren't even in print anymore
>and offers as an example jan de hartog's novel "the captain", i haven't heard of either author or title so was curious

>a dutch mid-century author who went to live in the u.s. and converted to writing in english. the novel is about a 1940s wartime convoy from the allies to russia, over the northern sea, and follows the hero who's a captain of one of the vessels

>an unlikely theme to be a best- or even better-seller but apparently sold 1M copies around 1967 when it was 'lished and was v popular for some time
>and as promised now isn't even in print, there's no kindle edition, the usual pirate libraries don't have it

>i know you're bored already but this story gets even more hilarious if possible, just wait for it
>so at this point i'm with child to get my hands on this book and i hope you appreciate my use of this tudor-era idiom

>i look everywhere online and nada, but then i look at archive.org as a last resort b/c they're a dumping ground for a lot of ppl's uploaded books, copyrighted stuff [which this novel inevitably is] does get removed but not always
>and i find it on archive.org but in a weird, "borrow this book for 1 hr" form
>apparently this is a new thing, archive.org had this less weird "borrow this book for 2 weeks" feature where you could press a button, DL a book in a regrettably DRMed PDF, or read it online. if they have just one copy, only one person will be allowed to "borrow". this gets around copyright dictats, i guess.

>now we need to tie in coronavirus into this sprawling story. in april they announced a temporary change due to the pandemic where they'd remove this one-borrow thing and allow multiple ppl to borrow same book at once.
>now about 3 weeks ago they were sued by a bunch of publishers for this. i guess the publishers didn't fear the covid-19 optics. this is a serious thing actually b/c it threatens the viability of archive.org as a whole, which is incredibly important if you ask me which no one ever does

>so i think the change to 1 hour is in response to that suit, a unilateral tightening of the reins or smth like that
>and more damningly, these 1hr borrows which i think are most borrows there now, don't let you DL a DRMed PDF, which if they did, i would've proceeded to remove the DRM via unlawful means
>so i'm saying damn, but i'm not gonna read the entire book in my browser, even though i can continue to re-borrow it once 1hr passes. my pride won't take that. i hold against reading drmed books i can't liberate yadayada

>so i'm looking at the chrome network tab to track the requests it makes to display the book and it's straightforward, there's a php they xhr call each time they want a page, and params in the url say stuff like rotate=0 and scale=8, if you remove both you just get the page in large resolution
>i test it in a separate tab, it works, i get a binary DL which file(1) tells me is a jp2 file, which in case you don't know (i didn't remember) is JPEG2000. i can DL all the pages like this and glue them into a pdf. obv, not manually
>damn this story is long :(
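The URL tweak described above (dropping the `rotate` and `scale` params from the page request) can be sketched roughly like this. This is only an illustration, not what was actually typed in the browser: the helper name `drop_params` and the example URL are made up, and only the two param names come from the story.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_params(url, names=("rotate", "scale")):
    """Return url with the named query parameters removed.

    All other parameters are kept in their original order.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical page URL; the real host, path and other params are not shown in the chat.
example = "https://example.org/BookReader.php?id=abc&rotate=0&scale=8&page=n5"
print(drop_params(example))
```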

>i neglect to do it 2days ago, which was when this happened, i just leave the tab on my chrome and tell myself i'll pick it up tmrrow
>so yday i sit down to bash-automate it with wget, and find out to my utter distaste that it stops working. the webserver gives me 404 if i'm not doing the req in context of the right chrome tab. i re-check and re-borrow the book and re-check again and it's not any mistake i made. stupendously, it worked 2days ago but not yday.
>i'm 1/4-thinking they saw me DL out of band in their logs and added a sec mechanism overnight, which is really stupid thinking, but i allow my 1/4 to be stupid sometimes, it lets the other 3/4 be so damned smart y?

>more probably the sec mechanism was off for some reason and they switched it on, or just rolled it out coincidentally, whatev. i dunno wtf it is, cookies, req headers heuristics, whatev
>so i muck around more in exasperation and nothing is working, then i look into chrome extensions that allow you to DL all the page's resources, and i find one that works neatly, it records XHR reqs too as they whizz by, and then "DLs" everything by mostly taking it from the browser cache and packages everything into a zip

>anyway then i sit like a dork, turn on this ext and click the -> button on the page about 400 times, to just page through the book. that felt rly stupid. i guess could find a thingy to automate that too, but whatev
>we haven't plumbed the depths of my stupidity yet, but the best is around the corner now

>it gives me a large zip with ~460 jp2 images tucked in a specific subdir in it. unfortunately, the images have filenames like BookReader-12345678.php. the 12345678 is probably just the hash of some sort
>i don't know which file is which page in the book

>i hope they're consecutive as i was paging
>so i order them by mod time and look and now, they're roughly consecutive but jump around like 3, 75, 23, 43, whatev, mostly within the first 100, then mostly within the second, etc.
>i guess the ext was "DLing" in multiple threads maybe

>i briefly ponder upon the option of installing some OCR, figuring out how to use it, running it on the 456 images, scripting finding the page number in the text output (on the bottom in chapter-starting pages, on the top in others) then renaming the files
>it sounds sooo exhausting and not nearly enough stupid for the likes of me

>so instead i bash-rename all pages by modtime to smth like o0001.jp2 etc. (o for orig), copy them all to a single windows folder (bash-renaming was done in linux-in-windows)
>and then i open two windows on desktop, one a preview of each page that i can cycle pressing right-right-right arrow, one file manager where i can press F2 (rename), quickly type smth like 0023 instead of o0001 press enter.
>then i spent oh i dunno 1-1.5hr yesterday night and today doing that repeatedly to 456 pages
>i cycle through a few previews and remember 3-4 numbers in short memory. then i f2-rename them in the file manager, then 3-4 more.
>btw my short memory sucks donkeys
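The first step above, renaming everything by modtime to o0001.jp2 and so on, was done in bash; a rough Python equivalent might look like the sketch below. The function name `rename_by_mtime`, its defaults and the folder name in the usage line are assumptions for illustration. Only the `BookReader-12345678.php` filename shape and the `o0001.jp2` target shape come from the story.

```python
from pathlib import Path


def rename_by_mtime(folder, pattern="BookReader-*.php", prefix="o", ext=".jp2"):
    """Rename files matching pattern to prefix + 4-digit counter + ext.

    Files are numbered in order of modification time (oldest first);
    ties are broken by filename so the result is deterministic.
    Returns the list of new filenames in that order.
    """
    files = sorted(
        Path(folder).glob(pattern),
        key=lambda p: (p.stat().st_mtime, p.name),
    )
    renamed = []
    for index, path in enumerate(files, start=1):
        target = path.with_name(f"{prefix}{index:04d}{ext}")
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Usage (hypothetical folder name):
# rename_by_mtime("unzipped_pages")
```

This only fixes the order the files were saved in, not the real page order, so the manual renumbering pass is still needed afterwards.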

>ok that was it, that was how stupid i'm being for being a stubborn ass that wanted to get this book for no good reason in the 1st place (probably it sucks) then wouldn't settle for reading it in browser
>not done yet btw, smth like 60 pages left
>it better be a fucking masterpiece which i really doubt
Tags: English, internet