
Data cleaning

Hemdian

(Re-posted from the TML)

As a spinoff project from writing Traveller Universe (software) I wanted to "clean" the data. (Many of the electronic sector files contain typos, errors, or are from mismatching eras.) The idea was to provide a dataset free of errors for each sector and for different milieus in a variety of formats (H&E, Galactic, TrTools, etc). In other words there should be, for example, a different H&E sector file for the Classic era Spinward Marches, Rebellion era Spinward Marches, and New Era Spinward Marches. And if someone picks the Classic era Spinward Marches then there should be a Classic era Deneb beside it (and not a Rebellion era Deneb).

But to date all I've had time to do is the first subsector of the Spinward Marches. See here.

Some people have suggested I should set up a separate mailing list for this task and share out the work. Initially I was resistant to the idea, but it's just been suggested again, so I thought I'd see what the reaction would be from the wider Traveller community: would people be interested in contributing to such a project?

Regards PLST
 
I’d be interested in helping out.

What data do I need and what format does the ‘cleaned’ data need to be in?
 
Originally posted by Nellkyn:
I’d be interested in helping out.

What data do I need and what format does the ‘cleaned’ data need to be in?
Thanks. What I had in mind was ...

First, the data would be checked against canon sources. This would identify (1) transcription errors (e.g. "Dentus/Regina" becoming "Tentus/Regina"), and (2) cross-canon typos (e.g. Zeycude/Chronor having an M9V star in Spinward Marches Campaign and Domain Of Deneb but a K9V star in Regency Sourcebook). Obviously, some characteristics could legitimately change over time, so a decision would have to be made on a case-by-case basis.

Second, the physical characteristics of a world (size, atmos, hydro, and stars) must adhere strictly to the CT rules ... but allow *some* leeway on social characteristics (sometimes additional DMs not mentioned in the rules were applied by GDW et al ... IIRC the Solomani Rim has higher populations and TLs than the Spinward Marches).

Third, where data exists for some but not all milieus, a commonly agreed and accepted extrapolation could be created for the missing milieus. (Some input into these extrapolations could come from non-canon sources such as landgrab entries and RICE papers.)

And possibly, if we got that far, it would be nice to bring in additional data about specific systems where it exists. This would include canon publications, landgrab entries, RICE papers, and perhaps even the extra info in the Galactic datasets.

Of course, datasets made from intermediate steps could also be published ... so if someone *wanted* the Supplement 3 version of the Spinward Marches they could have that too.

This would be either a lot of work for a few people or a bit of work for lots of people. Right now I'm just trying to get an idea of how many people are willing to contribute. Things like who needs what and data format, etc, can be worked out later (but with tools like Universe I can convert a variety of formats as needed).

Regards PLST
 
Sounds good...I like Universe, even though it doesn't (yet) have system and world mapping functions like Galactic...
The problem lies in the amount of OOP material to wade through. I can think of MAYBE 5 or 6 people on Earth who have enough material on hand to take on such a project, and I'm sure they all have other things to do. Just from personal experience: I've been working on a copy of Far Frontiers for the past few months, but have hit a wall concerning Traveller Chronicle #3 & 4. I need the data to finish the sectors, but can't find the materials. You mention the Spinward Marches Campaign, a very rare product that has gone on eBay for $80... How many people own that, plus the CT supp, Behind the Claw, MT, the Challenge magazines (what... 80 of them?)... all the little blips and blurbs of printed material on a sector over the past 25 years... How many people own A copy of Tiffany Star?
Yes, I admire the project, and would gladly help myself, but I don't think that there are more than a couple of die-hard Travellers in the world who COULD help....

-MADDog
 
Dog, hate to disagree but....

This is an 80/20 rule situation. 80% of people will want 20% of the data. So maybe we can't get every sector cleaned up and formatted decently... immediately. But maybe we just start where we can, and go from there.

I think data format (at least in terms of what to include, once you go beyond the 7 UWP digits.... do you do World Builder's Handbook, Book 6, World Tamer's Handbook, GT First In & Far Trader, etc.... how do you include refs to library data, articles, etc) is gonna be a big bugaboo. But perhaps task one is getting what I will call 'level zero data' - ie the cleaned UWPs. That is less contentious and will be of some great use.

So, let me suggest for this project that we have a basic format which includes, at the file scope, an indication of:
- source(s) of data in the file
(this will help in the case where, for instance, there are the Judges Guild sectors in Gateway, and now newer pave-over versions from QLI)
- the identity (or handle) of those who've checked the file
- the era the file data covers (first survey, second survey, TNE, MT, etc)
- the name of the sector covered
- the names of the subsectors covered
- the offset of the sector from reference's (0,0) sector (Core is it?)

For each system in the file, we should have
- planet name
- hex location in sector (not subsector) relative coordinates (0101 to 3240)
- 7 digit UWP
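
As a rough illustration of the two lists above, here is a minimal Python sketch of the file-scope fields and the per-system record. All class and field names are hypothetical (nothing here is an agreed format), the UWP is treated as a plain 7-character string exactly as listed, and the sample values in the usage note are made up.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SystemRecord:
    """One system: the three per-system fields proposed in the post."""
    name: str
    hex_location: str  # sector-relative, "0101" to "3240"
    uwp: str           # 7 characters, as listed in the post

    def problems(self) -> List[str]:
        """Return a list of format problems (empty if none were found)."""
        found = []
        if not re.fullmatch(r"[0-9]{4}", self.hex_location):
            found.append("hex location must be four digits")
        else:
            column = int(self.hex_location[:2])
            row = int(self.hex_location[2:])
            if not (1 <= column <= 32 and 1 <= row <= 40):
                found.append("hex location outside 0101-3240")
        if len(self.uwp) != 7 or not self.uwp.isalnum():
            found.append("UWP must be 7 alphanumeric characters")
        return found


@dataclass
class SectorFile:
    """File-scope information followed by the systems in the sector."""
    sector_name: str
    era: str                               # e.g. "first survey", "TNE"
    sources: List[str]                     # where the data came from
    checked_by: List[str]                  # identities or handles
    subsector_names: List[str]
    offset_from_reference: Tuple[int, int]  # relative to the (0,0) sector
    systems: List[SystemRecord] = field(default_factory=list)
```

For example, `SystemRecord("Example", "0101", "A788899").problems()` returns an empty list, while a record with hex "3341" is flagged as outside the sector.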

I will volunteer, once formats are in place, to enter and verify at least the data from the Glimmerdrift Reaches and Ley Sector from Judges Guild, plus various other odd chunks I have (from the various magazines). But I need a data format.

Here's another idea:

If we settle on a standard format, we could even whip up a little VB app or something to make for 'nice' data entry. Heck, utilities could take existing files (or corrected ones) and identify 'rules violators'.
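
To give an idea of what a 'rules violator' utility might look like, here is a small Python sketch. It assumes one commonly cited CT generation rule, atmosphere = 2D-7 + size (minimum 0, and 0 for size 0 worlds); that rule is my assumption for the example and should be checked against the rulebook before anyone relies on it. The function names are hypothetical, and a flagged world may still be legitimate if the publisher applied extra DMs.

```python
from typing import List, Tuple


def uwp_digit(character: str) -> int:
    """Convert one UWP character in the range 0-9 or A-F to a number."""
    return int(character, 16)


def atmosphere_violation(uwp: str) -> bool:
    """True if atmosphere cannot result from 2D-7 + size (assumed rule)."""
    size = uwp_digit(uwp[1])        # second character: world size
    atmosphere = uwp_digit(uwp[2])  # third character: atmosphere
    if size == 0:
        return atmosphere != 0
    lowest = max(0, size - 5)   # 2D-7 is at least -5
    highest = size + 5          # 2D-7 is at most +5
    return not (lowest <= atmosphere <= highest)


def find_rule_violators(systems: List[Tuple[str, str]]) -> List[str]:
    """Return the names of systems whose UWP fails the atmosphere check."""
    return [name for name, uwp in systems if atmosphere_violation(uwp)]
```

Fed a list of (name, UWP) pairs, `find_rule_violators` returns only the names that need a second look; further checks for other characteristics could be added in the same style.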

There are lots of ways this could go. But it definitely is a good Traveller Community Project. And eventually perhaps a common interchange format could become present between Universe, H&E, Galactic, etc.
 
Originally posted by kaladorn:
And eventually perhaps a common interchange format could become present between Universe, H&E, Galactic, etc.
I thought the .sec format already was a standard data format between all of these programs? Unless you're needing more information in the file than that provides... in which case I'm not sure how much Galactic is being maintained these days to update it.

That and I *think* it would be possible to put the extra information in comment lines before the UWP data proper (the "header" section). I am not sure though how fixed the header information is in .sec files so this may not be possible.

Casey
 
It's possible you are correct Casey. I'm not saying how a common interchange format should be created, but it would be nice. There is also the possibility that no-longer supported code with no open-source repository would have to be removed from such consideration. I just think if a format was evolved and the data files created, easy to use, and complete, that various programmers would gladly import them.

Another thought occurred to me after reviewing the Spinward Marches revisions that were linked earlier in this thread. It is obvious that the data source needs to be recorded per system, since apparently multiple data sources may pertain.
 
Hello.
What's wrong with using Gal24 with a different galaxy name for each era?
1 = Classic.
2 = MegaTraveller.
3 = TNE.
4 = Whatever.
If you want to attach information to a system in Gal, you just mark the system and type the info in.
The only problem I can see with Gal is that it doesn't have a position for star type, but you could put that in with the system info and the system map, though most classic sectors don't have star types either.
Bye.
 
In the long run, it would be handy in any truly portable format to have
- a format easily importable to spreadsheets, databases, and other tools
- a format easily viewable with a textual browser (be that HTML, plaintext, XML, whatever)
- a format which accommodates *all* of the components of system design, as we actually have canon on some of the systems

I'm not sure Galactic actually covers these. Nor, admittedly, am I sure it does not.

Does Galactic run under Windows XP? I'm not suggesting abandoning people using an older toolset necessarily, but the people doing the work will want to be able to view the results.

One thing a new format offers that it seems to me all of the existing formats lack is the possibility of a Data Description Document or whatever you wish to call it.

As of now, a new person who picks up any of the older formats has to puzzle away a bit to figure out how things are laid out. And some of the mechanisms for layout using plaintext files mean ugly manipulation, which is annoying from a programming PoV.

Whatever the long term format this data is to reside in, it should support an open standard format and that format should be *well documented* and *complete*. Bandaiding things into older formats in kludgy ways is sure to cause you grief down the road, or so I'd worry.
 
Hello Kaladorn.
Yes, Virginia, there is a Santa Claus: ALL files in Galactic can be opened and used as Notepad docs.
Open the gal directory, open the specific gal (classic), open the sector files (should be five, may be less), right-click on the file and tell it to open in Notepad, voilà.
To save changes, press Control-S.
One small problem: if you import a Notepad doc you must change the file extension to match what should be there (mnu).
It takes 30 seconds to import a gal file to Excel. Yes, Gal24 works under XP/Pro.
I only had one problem with it: I was trying to run too many sectors (510), and it doesn't like more than about 500. You hit 500 when you load the core and rime routes as well as the 80-odd central sectors.
If someone is going to clean sectors, what are you going to do with sectors like Marananth/Alkahest, you know, non-canon sectors that were canon-ish?
Bye.
 
Well, by actually sorting the data by period AND source, you could easily have

1105
1105-JG
1116
1120

etc.

for any given sector.
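
A minimal Python sketch of that idea: split each label into a (year, source) pair so the datasets for a sector sort by period first and source second. The `DatasetKey` and `parse_label` names are hypothetical, and an empty source string is my assumption for labels with no source suffix.

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    """Period and source of one dataset for a sector."""
    year: int
    source: str  # "" when the label carries no source suffix (assumption)


def parse_label(label: str) -> DatasetKey:
    """Split a label such as "1105-JG" into (1105, "JG")."""
    year_text, _, source = label.partition("-")
    return DatasetKey(int(year_text), source)


labels = ["1120", "1105-JG", "1116", "1105"]
ordered = sorted(labels, key=parse_label)
print(ordered)  # ['1105', '1105-JG', '1116', '1120']
```

Keeping the source in the key means a JG version and another version of the same year can sit side by side without one overwriting the other.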

I myself have a lot of JG stuff and like it and a lot of PP stuff which is also good. Those datasets should be preserved.
 
I think the idea of collating all of this Traveller data is fantastic. Here's an idea:

Make the database a web-based tool. People could come along, look for CT data for eg District 268, see that it's not already in the database, and then go to a form and start punching in data for each system (or, alternatively, upload a file with the data in a predetermined format).

Then the next person to come along to look at District 268 data would see that it's been entered, but hasn't been double-checked. So they pull out their copy of the indicated source material and check it.

Finally, a third person comes along looking for District 268 data for the CT era, finds that it is in the database and has been checked, and then downloads a copy of it munged up in .gal format (or .xml format, or .xls or what have you).

An ambitious project, I suppose, but it certainly lends itself to incremental progress on both the development side and the data side. That is, the interface for entering, storing, and crudely querying a subset of the data we want could be quickly put together. Later, the facility for double checking could be added, along with some process for making edits. Later, someone could add .xls output so data could be sucked up into Excel; or someone could do the .gal export stuff. The incremental nature of collecting the data could truly harness the free time of thousands of Traveller geeks across the globe. Heck, people could even enter their own wacky house versions of District 268 or whatever, so that other people could share and enjoy the fruits of their labor.

Only trouble is finding someone to host the bugger.

 
I'm starting to put up PHP on my other hobby website. We allegedly have access to a MySQL instance too. So it is possible (though I'm not volunteering yet... I'm a ways from knowing exactly how to hook into the DB from PHP... just started learning PHP) that I may be able to make something like this happen eventually.

I certainly agree that such a groupware project is good. But you couldn't just have everyone editing everyone else's work. There are synchronization issues not to mention issues with people 'correcting' (to the wrong stats) good data.

Now, maybe what you could do is accumulate all users' input (even conflicting entries) into a file for a sector. Once that has happened, you could then have someone else (picked, or a volunteer who was then trusted and appointed) go through and verify/resolve the conflicts, with one or two reviewers assigned.
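
A rough Python sketch of the accumulate-then-resolve idea: every submission is kept, and only hexes where contributors disagree are handed to a reviewer. The tuple layout and the `find_conflicts` name are hypothetical, and the sample data in the usage note is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (contributor handle, sector-relative hex, UWP as entered)
Submission = Tuple[str, str, str]


def find_conflicts(
    submissions: List[Submission],
) -> Dict[str, Dict[str, List[str]]]:
    """Map each disputed hex to the competing UWPs and who entered them."""
    values_by_hex: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for contributor, hex_location, uwp in submissions:
        values_by_hex[hex_location][uwp].append(contributor)
    # A hex is in conflict when more than one distinct UWP was entered.
    return {
        hex_location: dict(entries)
        for hex_location, entries in values_by_hex.items()
        if len(entries) > 1
    }
```

If two users enter different UWPs for hex 0101 and two others agree on hex 0202, only 0101 comes back, together with each competing value and the handles behind it, so nobody's entry is silently overwritten.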

This is where this *should* go in the long run, but it is a non-trivial bit of back end server scripting/coding to make it work right and be solid. As well as still having some conceptual issues to deal with.
 