Citizens of the Imperium

Citizens of the Imperium (http://www.travellerrpg.com/CotI/Discuss/index.php)
-   Software Solutions (http://www.travellerrpg.com/CotI/Discuss/forumdisplay.php?f=63)
-   -   Compressing the UWP for 8-bit machines (http://www.travellerrpg.com/CotI/Discuss/showthread.php?t=39249)

robject July 20th, 2018 08:54 AM

Compressing the UWP for 8-bit machines
 
I'm writing a trade sim for the Commodore 64, because (1) I like Traveller and (2) I like the C64 (nostalgia and a bit of a programming challenge).

The UWP has grown in the past decade. Most of this data is not important for a basic trading game. Here it is:

1910 Regina A788899-C Ri Pa Ph An Cp (Amindii)2 Varg0 Asla0 Sa { 4 } (D7E+5) [9C6D] BcCeF NS - 703 8 ImDd F7 V BD M3 V

I'm open to bright ideas about ways to compress this record, which are not too taxing on an 8-bit program. But first, some things to think about.

DISK ACCESS IS SLOW

The C64 disk drive is slow. Yes, emulation mode eases this some, but time is time. I typically lump worlds in subsector-sized files and name them with an x,y subsector index for easy access.

DISK SPACE IS PRECIOUS

How precious?

If I'm hardcore about it, I have 174k on one diskette. If I'm slightly less hardcore, I can use an 800k diskette. But for the moment I'm going to use just one diskette for the program and the data.

WHAT ABOUT TAPE?

I haven't tried it, but storing the data on a tape image might be better. One potential bottleneck is its linear nature. Anyway, even if tape works out, I'll still compress the UWP a bit because... well because these systems are slow, even when emulated.

MEMORY IS PRECIOUS

I've got 38k RAM, into which I shove subsets of my program and data at various times.

So nothing too fancy.

Next post I'll talk about whittling down the UWP first.

robject July 20th, 2018 08:55 AM

I. A BYTE SAVED IS A BYTE EARNED

The easiest place to start is throwing out un-needed data.

1910 Regina A788899-C Ri Pa Ph An Cp (Amindii)2 Varg0 Asla0 Sa { 4 } (D7E+5) [9C6D] BcCeF NS - 703 8 ImDd F7 V BD M3 V

That's Regina. What do we care about?

(a) I need the UWP
(b) I need the world name
- But let's reduce it to 15 characters or so?
(c) I don't think I need all of the UPP (!)
- I want the starport code and TL.
(d) I don't need all of the remarks.
- I want the 12 used in the trade rules.
- A couple of the others would be nice (Sa).
- Say we can have a maximum of 5.
(e) Importance might be nice, might not matter.
(f) I don't think I need the Extension codes.
(g) I don't think I need the Nobility codes.
(h) I need the base codes
(i) I want the travel zone
(j) I want the belts and GGs count, but I don't think I need the pop mult.
(k) I'd like to have the number of worlds.
(l) I'd like to have the two-character allegience code.
(m) I'd like to have the stellar data

That cuts the UWP down a bit.

Code:

1910 Regina        A788899-C Ri Cp Sa      { 4 } NS - 703 8 Im F7 V BD M3 V
Remove unneeded spaces and braces where they don't matter.

Code:

1910Regina        A788899-C RiCpSa  4NS-7038ImF7VBD M3V

robject July 20th, 2018 09:14 AM

II. ABBREVIATE

Lots of ways to compress, and it's always a tradeoff of speed and efficiency. Here, we have to go for speed, so these are quick and dirty.

1. Compress the Hex.

If we assume a subsector file, whose name contains the subsector's own coordinate offset into the galaxy, then hex is 0101 thru 0810. We can compress that down to two characters (or even one if we really had to):

1-9 is 01 - 09
A is 10.

Code:

1ARegina        A788899-C RiCpSa  4NS-7038ImF7VBD M3V
2. World name

If I assume it's okay to have one record per line, versus a fixed-size record, I can put the world name at the end. This drops the average record length while also allowing very long names in rare cases. This is a form of "free" Huffman coding.

Code:

1AA788899-C RiCpSa  4NS-7038ImF7VBD M3VRegina
3. Let's drop the star numbers. Three more characters saved.

Code:

1AA788899-C RiCpSa  4NS-7038ImFVBDMVRegina
4. If we've only got a dozen or so remark codes, let's shorten them to one character each and manage collisions gracefully. Five characters saved.

Code:

1AA788899-C RCS  4NS-7038ImFVBDMVRegina

AnotherDilbert July 20th, 2018 09:19 AM

You don't need the trade codes since you can calculate them from the UPP?

Save the UPP as a number, perhaps with 4 bytes per hex-digit, that way it fits into a 32-bit number.

The sector coordinates fits into less than 16-bits (11 bits?) as a number.

Do you really need the stellar info?

aramis July 20th, 2018 01:50 PM

Quote:

Originally Posted by robject (Post 589692)
I. A BYTE SAVED IS A BYTE EARNED

The easiest place to start is throwing out un-needed data.

1910 Regina A788899-C Ri Pa Ph An Cp (Amindii)2 Varg0 Asla0 Sa { 4 } (D7E+5) [9C6D] BcCeF NS - 703 8 ImDd F7 V BD M3 V

That's Regina. What do we care about?

(a) I need the UWP
(b) I need the world name
- But let's reduce it to 15 characters or so?
(c) I don't think I need all of the UPP (!)
- I want the starport code and TL.
(d) I don't need all of the remarks.
- I want the 12 used in the trade rules.
- A couple of the others would be nice (Sa).
- Say we can have a maximum of 5.
(e) Importance might be nice, might not matter.
(f) I don't think I need the Extension codes.
(g) I don't think I need the Nobility codes.
(h) I need the base codes
(i) I want the travel zone
(j) I want the belts and GGs count, but I don't think I need the pop mult.
(k) I'd like to have the number of worlds.
(l) I'd like to have the two-character allegience code.
(m) I'd like to have the stellar data

That cuts the UWP down a bit.

Code:

1910 Regina        A788899-C Ri Cp Sa      { 4 } NS - 703 8 Im F7 V BD M3 V
Remove unneeded spaces and braces where they don't matter.

Code:

1910Regina        A788899-C RiCpSa  4NS-7038ImF7VBD M3V

You really don't need the stellar data unless you're enforcing the 100 diameter limit on the star, in which case you need to know the orbit number, but orbit number is not stored in even the current UWP.

Likewise, the trade codes which are UWP derived can be deleted and generated on the fly, sacrificing speed for compactness of storage.
Or, not stored on disk, and recalculated on load. (noting the 1541 buffers, and the bottleneck is getting between the 1541 and the C64; both of them are faster than the serial connection between them).

aramis July 20th, 2018 02:11 PM

Quote:

Originally Posted by AnotherDilbert (Post 589694)
You don't need the trade codes since you can calculate them from the UPP?

Save the UPP as a number, perhaps with 4 bytes per hex-digit, that way it fits into a 32-bit number.

only needs 1 byte per code as is.
Also, C64 BASIC (I somehow doubt Rob's using ASM) doesn't handle Int32; it limits to sInt16, with no provision for uInt16. Strings are 1 byte length and the content bytes in PETSCII. Floats are a 1 byte exponent and 4 byte mantissa
Floats are very memory lossy, especially given 38 KB functional memory; one is better off using an int with 0-32 + (0-32*32) (12/16 bits) or (0-16)+(0-16*16)+(0-16*256) (15/16 bits), both fitting into a 2 byte memory word, as opposed to a 5 byte float; the 5 byte float is 1 byte longer than 2 ints, for the same data in the 5 bit wordspace.
Strings are somewhat better, but typecasting is absent, and bitwise operations are a pain, so...

whartung July 20th, 2018 03:34 PM

Barring extensions, if you stick to LBBs, the base stats of the UPP are only 4 bits.

Hex: 2 byte, row and column.
Name: Actual length + \0 or prefixed with length byte.
UWP: 4 bytes. Each attribute is 4 bits. Sorry, no TL 16+ for you.
Trade codes are either calculated on the fly, or a 2 byte bit-set, allowing 16 different codes.
PBG: 2 bytes, if you don't want P, then 1 byte. With 2 bytes, you can get P, B, G and number of worlds.
Base Codes, Allegiance and Travel Zone can be coded in to 2 bytes. Allegiance is a reference to a table of codes, not the codes themselves (0 == Imperial, 1 == Zhodani, whatever). Use a limited list, not "for the imperium", but "for the sector".

Note, for a sector, the Hex needs only 11 bits, so you can probably readily squeeze Allegiance or Travel Zone in to the remaining 5 (3 bits for Allegiance -- 8 of them, 2 for zone).

So, excluding the name, each system is: 2 (Hex) + 4 (UWP) + 2 (Trade Codes) + 2 (PBGW) + 2 (Base/Alleg/TZ) = 12. If you're frugal, you can squeeze in a reference to the subsector name with 4 bits. Otherwise, add a byte and get some more bits to play with. We'll call it 13 for now.

Spinward Marches has 439 systems. 5707 bytes.

SM names are 2791 bytes + 439 for length, 3230. But it's not a fixed length structure, so you'd need to crawl through it to search it easily.

If you don't use an entire sector as your playground, but, say, a quadrant, then you'll be under 256 systems, which will help with system references.

With 256 systems, you could create an index by system name, with a single byte to act as a reference to the actual system. Really kind of depends on how often you need to search by system name. Arguably, you never do, but even if you did, having a sorted list in memory will facilitate that readily. Even though the names aren't fixed length, you can still do a binary search on a sorted list. You just need to look back to find the start of each row when you jump around. No big deal.

SM names are 2791. 15 character names * 439 = 6585. Lots of wasted space.

I'm assuming you don't need to search for system names because you always know what system you're in, you always know what systems are nearby. So, you don't need to search for names that often, you can just show them off of lists and such.

So, with that in mind.

Your system record will have 13 bytes for the main data, add 2 more for a pointer to the system name. This is a fixed length array, so each system is identified by its 1 byte system number. The systems are stored in Hex order.

The names are store with a leading length, with the hi bit set, and the name is followed by the system number.

So, Regina, is system #61. Regina is 6 long. So its length is 128 (hi bit) + 6 = 134.

So, the bytes for Regina are: 134 R e g i n a 61 (sorry I didn't want to suss out the ascii for the letters).

Here's why the length has the hi bit set. If you decide to search for a name, and decide to use a binary search, then when you jump "to the middle" of a segment, you'll likely just land in the middle of a name. So you need a marker to delimit the records. When you land, you scan backwards until you hit a number that's > 128. If you land ON a number that's great than 128, simply check the NEXT number. If it is also > 128, then you have a system number (and you should scan back). Otherwise you're at at the beginning of a row, and don't need to scan at all. Bullseye.

If you don't need to search for names, you can eliminate the system # and the hi bit length shenanigans.

If you use 2 bytes for the system number, then for your system number, don't set the hi bit for the second byte. Move that bit in to the high byte. So, for system 400, 400 / 128 = 3. 400 mod 128 = 16. High byte is 3, low byte is 16. System # is hi byte * 128 + low byte.

So, Spinward Marches.

439 systems.

15 byte/system = 6585
2791 for the names + 439 for length byte + 878 for system byte = 4108

Total 10693. That's for the entire sector, thats for the Spinward Marches. Different for different sectors. You could do a quadrant in 3K, probably.

If you want to go for an on disk format, it would cost a more memory overall, but you wouldn't need it all in at the same time. I'd use fixed length formats, and waste the space for the name in that case, but you'd lose that white space when you load it.

Disk is slow, but that can be mitigated if you're willing to dedicate some RAM to page buffers and a simple LRU caching mechanism. It's also important to take the time to access the disk intelligently. So, if you wanted to load the data for 5 different systems, it's worth your time to sort the requests to the disk in the order that the system appear on the disk vs just randomly accessing the drive.

That should be enough ideas for now.

Oh, one more thing.

Also remember.

A C-64 is slow. Really slow. Watching paint dry and grass grow slow. However slow it was in the day, it's 10000% more slow now because we're using to machines that are 10000% faster. Now only are machines 10000% faster today, we humans are REALLY jaded. I personally find it very hard to use vintage machines for anything more than sculpture today.

So, keep that in mind when you're staring at the disk light waiting for 10K of data to load, while timing it with a sand filled egg timer. The C-64 disk drives stream about 2K of data per second, not counting seek time.

Adam Dray July 20th, 2018 04:07 PM

Bitwise operations can be done pretty easily with math using *, +, -, and mod operators. I assume Rob doesn't want to manually pack data into 8-bit spaces?

Can you further pack system names? Don't use ASCII or PETSCII or whatever, but use your own 5-bit-per-character mapping? A-Z plus space only? (27 possibilities, fits in 5 bits per character with room for a few extra characters in the space if you want to specially map them).

00000 NUL (string terminator)
00001 A
00010 B
00011 C
..
11010 Z
11011 SP (space)
11100 ' (apostrophe)
11101 - (hyphen)
11110 * (asterisk)
11111 ! (exclamation point)

You'll have to write a shift and add packing routine, but you can compress all of the Spinward Marches system name data (2854 characters) into 920 16-bit ints (including null-terminators) and store that in a single binary file.

You can keep that packed in memory, if you need to. Then cache an array of bit offsets into that packed data (array of 439 ints with a max offset less than the 14709 bit size of the packed data), or do a sequential search counting 00000 null sequences each time you need a system name (might be too slow).

AnotherDilbert July 20th, 2018 04:26 PM

Quote:

Originally Posted by aramis (Post 589701)
Also, C64 BASIC (I somehow doubt Rob's using ASM) doesn't handle Int32; it limits to sInt16, with no provision for uInt16. Strings are 1 byte length and the content bytes in PETSCII.

Can't we store the numbers in the string and store and access with ASC and CHR?

Picking out a nibble would only be question of dividing 16?

aramis July 20th, 2018 07:56 PM

Quote:

Originally Posted by AnotherDilbert (Post 589710)
Can't we store the numbers in the string and store and access with ASC and CHR?

Picking out a nibble would only be question of dividing 16?

The problem is there are several 5 bit values, and several 4 bit values.
You can safely pack 3 5-bit words into a 16 bit sInt16. It's keeping track of which and how far they go, and which editions you want compatibility with.

Since Rob's using the T5 format, we need to allow sizes to 15 (still 4-bit), and TL's to 20 (5 bit).
Now, Stellar types can be stored in two bytes -

Color:none O B A F G K M BD : 4 bits...
Size: Ia Ib II III IV V VI D : 8 values. If we count up from zero, and assume
color Decimal: 0 1 2 3 4 5 6 7 8 9 : 4 bits

Packing for storage, it's easy to use the bitwise in a string, but using ints in arrays is comparable, and saves code space.

So, making the easiest CCCCCSSSDDDD

Given Star 1 as S(1)%
To extract the Color.
C% = (S(1)% AND 6944) / 128

To extract the size
S%= (S(1)% AND 224) / 16.

To extract the decimal
D% = S(1)% AND 15

For workability, either you're extracting these to integers and using them... or extracting them to a string.

It's easiest just to use the numbers
A working array can be had for the UWP without packing, and leaving off the name,
U%(0,n) Hexnumber
U%(1,n) Starport
U%(2,n) Size
U%(3,n) Atm
U%(4,n) Hydrographics
U%(5,n) Population
U%(6,n) Government
U%(7,n) Law Level
U%(8,n) Tech Level
U%(9,n) Gas Giants
U%(10,n) Belts
U%(11,n) Worlds
U%(12,n) Bases (easy enough to extract in this case at each use)
U%(13,n) Allgience
U%(14,n) Tradecodes (the 13 standard trade influencers, bitwise)
u%(15,n) Star 1
u%(16,n) Star 2
u%(17,n) Star 3
u%(18,n) Star 4


So about 40 bytes per world, counting overhead...bit-banging can compress that to about 17, but at a significant loss of speed. Adding names adds 3+length Bytes per world... essentially instantly doubling the size of the database.

so 80 B per world x a subsector (80) 1600 B.... just shy of 1.6 KB semicompressed,
25600 B, or 25 KB. And half of that's the names.
And the names can be compressed for storage easily - but that's not a great idea. 58 significant characters; we can fit the needed characters into a 6-bit by subtraction/addition of 32. but it only compresses 4:3 (75%) over actual strings... better packig loses lower case and saves 1 bit. Which gets 3:2 (66%).

It's 3 operations per value packed other than the LSW. Which makes it slow things way down.

It's a balancing act between speed and size.


All times are GMT -4. The time now is 01:15 AM.

Powered by vBulletin® Version 3.8.4
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.
Copyright (c) 2010-2013, Far Future Enterprises. All Rights Reserved.