Discussion:
Reducing Redundancy
Stefan Ram
2024-09-10 15:33:01 UTC
I posted this yesterday in news.software.readers, but I'm not
sure what the best group is for this. I'd love to get some
feedback on whether this is a solid approach.

So, yesterday I was chewing the fat about how to whip up a database
for posts retrieved from newsservers.

I'm picturing some program that pulls newsgroups from newsservers
and dumps them into a database.

In my mind's eye, a post looks something like this, give or take:

Path: A
Message-ID: B

Body: C

But if you snag the same post from a different server, it might
look like this:

Message-ID: B
Path: D

Body: C

At first blush, you'd end up with the same body stored multiple
times in the database. Talk about a waste of space!

To trim the fat, we could rejigger these posts so all the variable
stuff is up front:

Path: A
Message-ID: B

Body: C

and

Path: D
Message-ID: B

Body: C

Now the tail end of both posts is identical, so we can toss that
in a separate table at position 0.

The posts themselves would then just contain the different parts
and a pointer to the shared bit that's only stored once:

Path: A
Rest: 0

Path: D
Rest: 0

0:
Message-ID: B

Body: C

This way, you could store the same post from multiple newsservers
without eating up your hard drive space like it's In-N-Out fries.
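
A rough sketch of that layout in Python, in case it helps the discussion. It assumes SQLite, a SHA-256 digest standing in for "position 0", and made-up names (split_post, store, and the tables rest and sighting); it also ignores folded header lines and any other per-server headers, so treat it as an illustration rather than a finished design.

```python
import hashlib
import sqlite3


def split_post(raw):
    """Return (path, rest), where rest is the post without its Path line."""
    head, sep, body = raw.partition("\n\n")
    path = ""
    kept = []
    for line in head.split("\n"):
        if line.lower().startswith("path:"):
            path = line[5:].strip()   # the variable part
        else:
            kept.append(line)         # everything else is shared
    return path, "\n".join(kept) + sep + body


def open_db():
    """Create the two hypothetical tables in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rest (digest TEXT PRIMARY KEY, content TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE sighting ("
        " server TEXT NOT NULL,"
        " path TEXT NOT NULL,"
        " digest TEXT NOT NULL REFERENCES rest(digest))"
    )
    return db


def store(db, server, raw):
    """Store the shared part once and the Path once per server."""
    path, rest = split_post(raw)
    digest = hashlib.sha256(rest.encode("utf-8")).hexdigest()
    # The shared part is written only if this digest is not there yet.
    db.execute(
        "INSERT OR IGNORE INTO rest (digest, content) VALUES (?, ?)",
        (digest, rest),
    )
    db.execute(
        "INSERT INTO sighting (server, path, digest) VALUES (?, ?, ?)",
        (server, path, digest),
    )
```

Storing the two example posts above through store() should leave one row in rest and two rows in sighting. The sharing only works if the remaining headers come out byte-identical from both servers, which is an assumption of this sketch.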
Lawrence D'Oliveiro
2024-09-10 21:04:05 UTC
Post by Stefan Ram
To trim the fat, we could rejigger these posts so all the variable
stuff is up front:
Use the Message-ID field as the primary key.
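
A minimal sketch of what that could look like, assuming SQLite; the table name post and the helper add_post are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Message-ID as the primary key: a second copy of the same post
# cannot create a second row.
db.execute("CREATE TABLE post (message_id TEXT PRIMARY KEY, raw TEXT NOT NULL)")


def add_post(db, message_id, raw):
    """Insert a post; return True if it was new, False if already stored."""
    cur = db.execute(
        "INSERT OR IGNORE INTO post (message_id, raw) VALUES (?, ?)",
        (message_id, raw),
    )
    # rowcount is 0 when the insert was ignored because the key exists.
    return cur.rowcount == 1
```

With this, the copy fetched from the second server is simply dropped, so only the first Path seen is kept.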
Stefan+ (Stefan Froehlich)
2024-09-11 13:41:35 UTC
Post by Stefan Ram
I'm picturing some program that pulls newsgroups from newsservers
and dumps them into a database.
"There is another theory which states that this has already
happened"
Post by Stefan Ram
Path: A
Message-ID: B
Body: C
But if you snag the same post from a different server, it might
look like this:
Message-ID: B
Path: D
Body: C
At first blush, you'd end up with the same body stored multiple
times in the database. Talk about a waste of space!
It's not only a waste of space: you definitely want to avoid
duplication in a database. These two incarnations are the same
posting, so its contents should be stored only once.
Post by Stefan Ram
To trim the fat, we could rejigger these posts so all the variable
stuff is up front:
Path: A
Message-ID: B
Body: C
and
Path: D
Message-ID: B
Body: C
Now the tail end of both posts is identical, so we can toss that
in a separate table at position 0.
You'd really want to archive more than one incarnation of a posting,
just because you pulled it from different servers? Why?

Parse the incoming postings, extract the headers, store at least the
Message-ID in a separate column of your table (and, wisely, a few
more) and put a unique key on that field.
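
Roughly like this, as a sketch only: it assumes Python's standard email parser and SQLite, and the table posting, its extra columns, and the helpers open_archive and archive are made-up names, not anything prescribed above.

```python
import sqlite3
from email import message_from_string


def open_archive():
    """Create a hypothetical posting table with a unique Message-ID."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE posting ("
        " message_id TEXT NOT NULL UNIQUE,"
        " subject TEXT,"
        " sender TEXT,"
        " newsgroups TEXT,"
        " raw TEXT NOT NULL)"
    )
    return db


def archive(db, raw):
    """Parse one posting and store it; return False for a duplicate."""
    msg = message_from_string(raw)
    message_id = msg["Message-ID"]   # header lookup is case-insensitive
    if message_id is None:
        raise ValueError("posting has no Message-ID header")
    try:
        db.execute(
            "INSERT INTO posting (message_id, subject, sender, newsgroups, raw)"
            " VALUES (?, ?, ?, ?, ?)",
            (message_id.strip(), msg["Subject"], msg["From"],
             msg["Newsgroups"], raw),
        )
    except sqlite3.IntegrityError:
        # The unique key rejected a second incarnation of the same posting.
        return False
    return True
```

Which additional headers deserve their own column depends on what you want to query later; Subject, From and Newsgroups are just plausible picks.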
Post by Stefan Ram
This way, you could store the same post from multiple newsservers
without eating up your hard drive space like it's In-N-Out fries.
Still, the question remains: Why?

The only reason I can see is to build a database of distribution
paths leading to your archive. But I can't see any benefit in that.

Bye,
Stefan
--
http://kontaktinser.at/ - the free contact exchange for Austria
Official First Visitor(TM) of mmeike

The hasty citizen wants Stefan. Surely there must be a reason for that?
(Sloganizer)