Project The Schlog Archives | Most recent version: 22 October 2024

8,620 Threads, 3.5 Gigabytes. Soygoy presents...

Soggy's Scrapbook: soyjak.blog

A full archive of every post and thread on soyjak.blog leading up to thread number 8620.
This is a sister project to Soygoy's Expeditions: soyjak.blog

Download (Google Drive)

Overview

Whilst writing and preparing my book / memoir I ran into a problem with archival: archive.is, archive.ph and archive.today are all extremely slow, so I decided to make my own scraper. 12 cups of tea and 24 hours later, here it is. Surprise! My 10k posts project is 500 posts early! I've been hinting at it all day, practically begging to get this out there for you all to enjoy. The entirety of soyjak.blog condensed down into 3.5 gigabytes! You can now own a piece of Schlog history, forever! [wholesome]

How did you do it?

I used Python, curl and pageres. I wanted to make an overengineered Go script but gave up trying to make it view WebPs because I'm an absolute noob and have no fucking clue what I'm doing with Go. Python sucks dick, and the script was designed for the server I was running it on overnight, so I felt like not sharing it. Not to worry though, I plan on creating a toolset for scraping and archiving on my GitHub later. The images are in WebP format to save on file space. I know JPG is similar, but in testing, WebP had a smaller file size. I had to keep the size down as much as possible, and thankfully I got it to about 3.5 gigabytes, very small in comparison to what could have been something like 50 gigabytes.

It's also important to note that my tools can work on other XenForo-based websites, but maybe not Kiwifarms: that's 150k threads o algo. If 8.6k threads is 24 hours of labour and 3.5 gigabytes, then 150k threads is about 420 hours and 60 gigabytes o algo, though such a small file size for the entirety of Kiwifarms... hmm...
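The back-of-the-envelope estimate above can be written out as a quick sketch. It assumes time and size grow strictly linearly with thread count, which is a simplification (bigger forums tend to have longer threads), and the 150,000 figure is only the rough guess from the post.

```python
# Linear extrapolation from the Schlog scrape to a bigger XenForo forum.
# The Schlog figures come from the post; linear scaling is an assumption.
SCHLOG_THREADS = 8_620
SCHLOG_HOURS = 24
SCHLOG_GIGABYTES = 3.5


def estimate(threads: int) -> tuple[float, float]:
    """Return (hours, gigabytes) assuming cost grows linearly with thread count."""
    scale = threads / SCHLOG_THREADS
    return SCHLOG_HOURS * scale, SCHLOG_GIGABYTES * scale


hours, gigabytes = estimate(150_000)
print(f"{hours:.0f} hours, {gigabytes:.0f} gigabytes")
```

Under those assumptions it comes out to roughly 418 hours and 61 gigabytes.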

Why not just use text o algo?

Curling / wgetting web pages just doesn't work, it doesn't preserve CSS or images, so I opted to take full-page screenshots instead. I could make text-only transcripts, but that'd require making a program to cut a lot of text out of an HTML file, and I just think that taking an image of a web page is a lot faster imo. Sucks if you need the text though: you might have to use OCR for that, or I might make a text-only version in the future so you can search for occurrences of specific words throughout the entire website.
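For anyone curious what the "cut the text out of an HTML file" step might look like, here is a minimal sketch using only Python's standard library. It is not the scraper used for this archive; the class name `TextOnly`, the helper `html_to_text` and the sample markup are all made up for illustration, and a real XenForo page would still need extra filtering to drop navigation and sidebar text.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML document, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, not real forum markup.
sample = (
    "<html><style>p{color:red}</style><body>"
    "<h1>add a sports board</h1><p>yes <b>please</b></p></body></html>"
)
print(html_to_text(sample))
```

Output like this could be dumped into one text file per thread and searched with grep or Ctrl + F.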

How can I use your archive?

Download the zip file, extract it and open thread-names.txt. Use Ctrl + F to find the words you're looking for, get the thread number and then open that thread's first page. For example, the thread "add a sports board" is thread number 2026, so you'd go to thread-2026-page-1.webp and then just increment the number in page-1 to read on. It's a bit iffy, which is why I wanted to make custom software that'd let you just input a URL and get the thread, with controls to magically sort through everything, but it's a bit of a pain right now due to my utter lack of knowledge. I could make a Python script for it, but then you guys would have to go through all the hassle of downloading dependencies and that's just urrgghh.
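The lookup described above can be sketched with nothing but the Python standard library, so there are no dependencies to install. The only things taken from the archive are the thread-names.txt file and the thread-N-page-M.webp naming; the function names and the folder name in the usage comment are made up, and the search returns whole lines because the exact layout of thread-names.txt is not assumed here.

```python
import re
from pathlib import Path

# Matches file names such as thread-2026-page-1.webp.
PAGE_NAME = re.compile(r"thread-(\d+)-page-(\d+)\.webp")


def search_thread_names(names_file: Path, keyword: str) -> list[str]:
    """Return every line of thread-names.txt containing keyword (case-insensitive)."""
    needle = keyword.lower()
    with names_file.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if needle in line.lower()]


def thread_pages(archive_dir: Path, thread_number: int) -> list[Path]:
    """Return the screenshots of one thread, ordered by page number."""
    pages = []
    for path in archive_dir.iterdir():
        match = PAGE_NAME.fullmatch(path.name)
        if match and int(match.group(1)) == thread_number:
            pages.append((int(match.group(2)), path))
    # Sorting on the integer page number puts page-2 before page-10.
    return [path for _, path in sorted(pages)]


# Usage (the folder name is hypothetical):
#   archive = Path("schlog-archive")
#   print(search_thread_names(archive / "thread-names.txt", "sports"))
#   for page in thread_pages(archive, 2026):
#       print(page)
```

Matching on the full thread number means thread 2026 never picks up pages from thread 20260.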

Preferably, you should use a web browser so you can edit the URL to access pages more easily, or just search for the thread in your file explorer. On Linux you can create a slideshow by doing something like

gwenview $(ls -v | grep "thread-10-page-")

and that'll make a slideshow of that specific thread and all of its pages. I would make some shell scripts but that would just leave Windows users in the dark. I'll do so anyway though, for myself.
If you use ark like me you can also search in the archive and directly preview images without ever having to extract it!
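If you don't have ark, the same "look inside without extracting" trick can be sketched with Python's built-in zipfile module. This is only an illustration: the function names are made up, and the only thing assumed about the zip is that it contains files named like thread-10-page-1.webp, possibly inside a folder.

```python
import zipfile


def pages_in_zip(zip_path: str, thread_number: int) -> list[str]:
    """List the archive members for one thread without extracting anything."""
    marker = f"thread-{thread_number}-page-"
    with zipfile.ZipFile(zip_path) as archive:
        # Compare against the file name only, ignoring any folder prefix.
        return [name for name in archive.namelist()
                if name.rsplit("/", 1)[-1].startswith(marker)]


def read_page(zip_path: str, member: str) -> bytes:
    """Return the raw bytes of a single screenshot straight from the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member)
```

The bytes from `read_page` can be written to a temporary file or handed to any image viewer that understands WebP.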

Why not host a webserver?

I could. In fact, I probably have enough space to do so even on my main website; my primary concern is just cost and maintenance. If it is possible, I will make a Schlog archival service on my own website o algo. But releasing a zip is much better because you can just look through everything yourself. It's only 3 to 4 gigabytes and everyone can just keep 99% of the Schlog stored on a USB drive o algo lol. Pretty efficient if you ask me. I'd urge you to download it, as there is no reason not to: the more people who have this archive, the less likely history will be lost.

And remember, you will always be remembered.

[wholesome]
URLs for any fellow scrapers:
https://soyjak.blog/profile-posts/
https://soyjak.blog/posts/
https://soyjak.blog/threads/
https://soyjak.blog/conversations/
https://soyjak.blog/search/
https://soyjak.blog/members/

etc.
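For fellow scrapers, here is a small sketch of how thread page URLs might be generated from the /threads/ base above. Two things are assumptions and worth checking against the live site first: that a thread can be reached by its numeric ID alone, and that later pages use a page-N suffix. The function names are made up.

```python
BASE = "https://soyjak.blog"


def thread_url(thread_id: int, page: int = 1) -> str:
    """Build a thread page URL; ID-only paths and page-N suffixes are assumptions."""
    url = f"{BASE}/threads/{thread_id}/"
    return url if page == 1 else f"{url}page-{page}"


def thread_urls(last_thread: int):
    """Yield first-page URLs for thread IDs 1..last_thread."""
    for thread_id in range(1, last_thread + 1):
        yield thread_url(thread_id)
```

Deleted or private threads would leave gaps in the ID range, so a scraper should expect some of these URLs to return errors.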
 
This won't stop anyone oversharing.
Which is a good thing, people will overshare and I will archive it.
But what's the point?
I love this website and I want to keep it forever. I've made some memories here, so I'm going to archive it for as long as I can and have it as a little digital scrapbook. You can read your favourite threads offline o algo, put it onto CDs or DVDs or USB drives and then hide them in the ground o algo for historians to find one day geg.
 
Trust me, if you think this archive is useless now, come 5 years' time o algo you won't be thinking it's so useless. When the sharty went down yesterday, that struck the fear of god into me. It could all go at any point in time and you'd have nothing, so I created something, and now we have this archive. And the archive is only 3.5 gigs, very small for such a large site.
 
I don't think I will even remember this site in 5 years. That's a really long time.
 
I think you might, perhaps. It's all about the community, the memories we made, the story. We deserve to be remembered, and I have a feeling that the schlog and the sharty are going to become legendary some day and maybe even overtake 4chan, but who knows. Bald man glasses forum may be silly, but it's not worth getting rid of entirely and forgetting about. To some degree, it's worth remembering.

I also just really like working on technical projects like this, it feels like I shouldn't be able to archive an entire website and yet here I am.
were pfps archived?
All pfps as of 22 October 2024 are archived. You should check for yourself, it's only 4 gigs or so. I can give a sample perhaps.
test2.png
 