Stefan Ram
2024-09-10 15:33:01 UTC
I posted this yesterday in news.software.readers, but I'm not
sure what the best group is for this. I'd love to get some
feedback on whether this is a solid approach.
So, yesterday I was chewing the fat about how to whip up a database
for posts retrieved from newsservers.
I'm picturing some program that pulls newsgroups from newsservers
and dumps them into a database.
In my mind's eye, a post looks something like this, give or take:
Path: A
Message-ID: B
Body: C
. But if you snag the same post from a different server, it might
look like this:
Message-ID: B
Path: D
Body: C
. At first blush, you'd end up with the same body stored multiple
times in the database. Talk about a waste of space!
To trim the fat, we could rejigger these posts so all the variable
stuff (headers that differ from server to server, like Path) is up
front:
Path: A
Message-ID: B
Body: C
and
Path: D
Message-ID: B
Body: C
Now the tail end of both posts is identical, so we can toss that
in a separate table at position 0.
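A rough Python sketch of that reordering step, in case it helps. It assumes a post is a block of header lines, a blank line, and then the body, and that only Path and Xref vary between servers. The VARIABLE_HEADERS set and the split_post name are made up for illustration, and a real reader would probably need a longer list.

```python
# Hypothetical set of headers assumed to differ from server to server.
VARIABLE_HEADERS = {"path", "xref"}

def split_post(raw):
    """Split one raw post into (variable_part, shared_part)."""
    head, sep, body = raw.partition("\n\n")
    variable, shared = [], []
    target = shared
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Folded continuation line: keep it with the previous header.
            target.append(line)
            continue
        name = line.partition(":")[0].strip().lower()
        target = variable if name in VARIABLE_HEADERS else shared
        target.append(line)
    # The shared part is the remaining headers plus the body.
    return "\n".join(variable), "\n".join(shared) + sep + body

post_a = "Path: A\nMessage-ID: B\n\nC"
post_d = "Message-ID: B\nPath: D\n\nC"
```

With these two sample posts, the second element returned by split_post is the same for both, so it only needs to be stored once.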
The posts themselves would then just contain the different parts
and a pointer to the shared bit that's only stored once:
Path: A
Rest: 0
and
Path: D
Rest: 0
and
0:
Message-ID: B
Body: C
. This way, you could store the same post from multiple newsservers
without eating up your hard drive space like it's In-N-Out fries.
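Here is a minimal sketch of the two-table layout using Python's sqlite3 module. The table names (rest, post), the column names, and the idea of finding the shared part by a SHA-256 digest are my own assumptions for illustration, not a settled design. The server names are placeholders.

```python
import hashlib
import sqlite3

def open_store(path=":memory:"):
    """Create the two tables: shared parts once, per-server parts many times."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS rest (
            id      INTEGER PRIMARY KEY,
            digest  TEXT NOT NULL UNIQUE,   -- SHA-256 of the shared part
            content TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS post (
            server   TEXT NOT NULL,
            variable TEXT NOT NULL,         -- e.g. the Path header
            rest_id  INTEGER NOT NULL REFERENCES rest(id),
            UNIQUE (server, rest_id)
        );
    """)
    return db

def store_post(db, server, variable, shared):
    """Store one post; the shared part is inserted only if it is new."""
    digest = hashlib.sha256(shared.encode("utf-8")).hexdigest()
    db.execute(
        "INSERT OR IGNORE INTO rest (digest, content) VALUES (?, ?)",
        (digest, shared),
    )
    (rest_id,) = db.execute(
        "SELECT id FROM rest WHERE digest = ?", (digest,)
    ).fetchone()
    db.execute(
        "INSERT OR IGNORE INTO post (server, variable, rest_id) VALUES (?, ?, ?)",
        (server, variable, rest_id),
    )
    db.commit()
    return rest_id

db = open_store()
first = store_post(db, "server1.example", "Path: A", "Message-ID: B\n\nC")
second = store_post(db, "server2.example", "Path: D", "Message-ID: B\n\nC")
```

Both calls return the same rest id, so the rest table holds one row while the post table holds two. Keying on a digest instead of the full text keeps the unique index small. Keying on the Message-ID alone would be another option, provided you trust servers not to alter the other headers.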