Project The Schlog Archives | Most recent version: 22 October 2024


8,620 Threads, 3.5 Gigabytes. Soygoy presents...


Soggy's Scrapbook: soyjak.blog

A full archive of every post and thread on soyjak.blog leading up to thread number 8620.
This is a sister project to Soygoy's Expeditions: soyjak.blog

Download (Google Drive)

Overview

Whilst writing and preparing my book / memoir I ran into an archival problem: archive.is, archive.ph and archive.today are all extremely slow, so I decided to make my own scraper. 12 cups of tea and 24 hours later, here it is. Surprise! My 10k posts project is 500 posts early! I've been hinting at it all day, practically begging to get this out there for you all to enjoy. The entirety of soyjak.blog condensed down into 3.5 gigabytes! You can now own a piece of Schlog history, forever! [wholesome]

How did you do it?

I used Python, curl and pageres. I wanted to make an overengineered Go script but gave up trying to make it handle WebPs, because I'm an absolute noob and have no fucking clue what I'm doing with Go. Python sucks dick, and the script was designed around the server I was running it on overnight, so I felt like not sharing it. Not to worry though, I plan on putting a toolset for scraping and archiving on my GitHub later. The images are in WebP format to save on file space. I know JPG is similar, but in testing, WebP had a smaller file size. I had to keep the size down as much as possible, and thankfully I got it to about 3.5 gigabytes, very small compared to what could have been something like 50 gigabytes.
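
For the curious, here's a rough sketch of the kind of loop I mean, not the actual script (the thread URL pattern and output names here are assumptions, and you'd need pageres-cli and cwebp installed):

import subprocess

BASE = "https://soyjak.blog/threads"  # assumed XenForo-style thread URL

def capture(thread_id: int, page: int) -> None:
    url = f"{BASE}/{thread_id}/page-{page}"
    name = f"thread-{thread_id}-page-{page}"
    # pageres grabs the full scrolled page as a PNG by default
    subprocess.run(["pageres", url, "1366x768", f"--filename={name}"], check=True)
    # re-encode to WebP to keep the archive small
    subprocess.run(["cwebp", "-q", "75", f"{name}.png", "-o", f"{name}.webp"], check=True)

for tid in range(1, 8621):
    capture(tid, 1)  # the real run also has to walk page 2, 3, ... of each thread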

It's also important to note that my tools can work on other XenForo-based websites, but maybe not Kiwifarms: that's 150k threads or something, and if 8,620 threads took 24 hours of labour, 150k threads works out to roughly 420 hours and 60 gigabytes. Though such a small file size for the entirety of Kiwifarms... hmm...
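
Back-of-the-envelope, if you want to check the scaling (Python, since that's what I used):

# scale my 8,620-thread / 24-hour / 3.5 GB run up to ~150k threads
scale = 150_000 / 8_620            # ~17.4x as many threads
print(f"{24 * scale:.0f} hours")   # ~418 hours
print(f"{3.5 * scale:.0f} GB")     # ~61 GB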

Why not just use text or something?

Curling / wgetting web pages just doesn't work: it doesn't preserve CSS or images, so I opted to take full-page screenshots instead. I could make text-only transcripts, but that'd require writing a program to cut a lot of text out of an HTML file, and taking an image of a web page is just a lot faster imo. Sucks if you need the text though; you might have to use OCR for that, or I might make a text-only version in the future so you can search for occurrences of specific words across the entire website.
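
If someone needs text before I get around to it, here's a rough sketch of what a transcript scraper could look like (it needs the third-party requests and beautifulsoup4 packages, and the CSS selector is my guess at XenForo 2's markup, so treat it as an assumption):

import requests
from bs4 import BeautifulSoup

def thread_text(url: str) -> str:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # XenForo 2 usually wraps post bodies in .message-body .bbWrapper (assumed here)
    posts = soup.select(".message-body .bbWrapper")
    return "\n\n---\n\n".join(p.get_text("\n", strip=True) for p in posts)

print(thread_text("https://soyjak.blog/threads/2026/"))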

How can I use your archive?

Download the zip file, extract it and open thread-names.txt. Use ctrl + F to find the words you're looking for, get the thread number, then open thread-number-page-1. For example, the thread "add a sports board" is thread number 2026, so you'd go to thread-2026-page-1.webp and then just increment the number in page-1. It's a bit iffy, which is why I wanted to make custom software that'd let you input a URL and get the thread, with controls to magically sort through everything, but that's a pain right now due to my utter lack of knowledge. I could make a Python script for it, but then you guys would have to go through all the hassle of downloading dependencies, and that's just urrgghh.
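
That said, the lookup part is easy to script with nothing but the standard library, so no dependency hassle. A sketch, assuming thread-names.txt keeps each thread's number and title on one line:

# find_thread.py: print every line of thread-names.txt matching a keyword
import sys

needle = sys.argv[1].lower()
with open("thread-names.txt", encoding="utf-8") as f:
    for line in f:
        if needle in line.lower():
            print(line.rstrip())

Run it like python find_thread.py sports and it should spit out the line for thread 2026.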

Preferably, you should use a web browser so you can edit the URL to access pages more easily, or just search for the thread in your file explorer. On Linux you can create a slideshow by doing something like

gwenview thread-10-page-*.webp

and that'll open a slideshow of that specific thread and all of its pages. (The page-* part matters; a bare "thread-10" match would also pull in thread-100 and friends.) I would make some shell scripts, but that would just leave Windows users in the dark. I'll do so anyway though, for myself.
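
A dependency-free route that doesn't leave Windows users in the dark: generate a throwaway HTML page and let the browser be the slideshow, since browsers render WebP natively. A sketch, assuming the thread-number-page-number.webp names from above:

# stitch every page of one thread into a local HTML file and open it
import glob, re, webbrowser
from pathlib import Path

tid = 2026  # thread number to view
pages = sorted(glob.glob(f"thread-{tid}-page-*.webp"),
               key=lambda p: int(re.search(r"page-(\d+)", p).group(1)))
out = Path(f"thread-{tid}.html")
out.write_text("\n".join(f'<img src="{p}" style="width:100%">' for p in pages),
               encoding="utf-8")
webbrowser.open(out.resolve().as_uri())
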
If you use Ark like me, you can also search inside the archive and preview images directly without ever having to extract it!

Why not host a webserver?

I could; in fact I probably have enough space to do it even on my main website. My primary concern is just cost and maintenance. If it turns out to be feasible, I will put a Schlog archival service on my own website or something. But releasing a zip is much better, because you can look through everything yourself; it's only 3 to 4 gigabytes, so everyone can keep 99% of the Schlog on a USB drive or something lol. Pretty efficient if you ask me. I'd urge you to download it, as there is no reason not to: the more people who have this archive, the less likely history will be lost.
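
In the meantime, anyone who wants the website feel can serve the extracted folder locally with Python's built-in server and browse it at http://localhost:8000/:

python -m http.server 8000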

And remember, you will always be remembered.

[wholesome]
 
is there anything on the schlog really worth archiving?
Sparkles' diary, I guess? I mean, idk, but it's only 3.5 gigabytes, so why the fuck not, right?
Save it for something to tell your grandkids about or something. Then again, archiving is not about why; it's because we can.
 
were profile posts archived?
 
https://soyjak.blog/conversations/ is a URL you can use to access conversations; if you have permissions for a convo, it'll show. When I create a new conversation it gets a URL number, and my most recent one was like 1029 or something, so I'm betting the total number of conversations is under 2000-ish.
OH thanks, geg reminds me of this
I wonder what conversation 1000 was
 