T5 Index First Draft- needs editing...

Ackehece · Sep 22, 2014

UPDATE!
Initial Document - USE FOR ERRATA FIXES (spelling, capitalization, inconsistent style)

https://www.dropbox.com/s/yrb9p7xskf5y3mw/trav5%20index.pdf?dl=0

Header Index - active edit file (anyone can work on it - just find the marker in the file and work from there; *move the marker when done editing*)

https://www.dropbox.com/s/jj33ielbc7qxc8d/T5%20Index%20of%20topics%20%281%29.docx?dl=0

Word Index - active edit file (Daddicus update version) http://www.travellerrpg.com/CotI/Discuss/showpost.php?p=490933&postcount=15

https://www.dropbox.com/s/1m1d1y2a962e68l/Index.txt?dl=0

marker
++++++++++++++++++++++++++++++++++++++++++++++++++
**************************************************
******************** Finished to here ************
++++++++++++++++++++++++++++++++++++++++++++++++++

**************************original post*****************************
https://www.dropbox.com/s/yrb9p7xskf5y3mw/trav5 index.pdf?dl=0

This is an automated word-for-word extract from the Trav 5 PDF. There are issues, and it definitely shows spelling mistakes. It does need editing... major edits! But it is worth having. Take a look, and I will keep updating and editing as I go.

whulorigan · Sep 22, 2014

Ackehece said:
this can be used for errata and editing purposes as well... it shows a large number of inconsequential spelling issues and capitalization inconsistencies.

Now THAT is impressive.
Make sure DonM has a copy (and/or post it up on the T5 Errata thread) as well.

Ackehece · Sep 22, 2014

completely automatic extraction - the PDF is editable and completely public (anyone can download the link)

as per my signature!

Magnus von Thornwood · Sep 22, 2014

It's one of the Errata-men to the rescue.

Just a couple of things that leapt out at me, so I shot them.

Page numbering. Not cool, well, okay kinda cool since I am sure to learn new numbers in the Latin I missed in high school, but otherwise it is a big pain. Suggest using the standard system.

Typography errors. Wow is this nice for those folks who get to clear up the typographical errors like DinSHA versus Dinsha and DiPloMAT versus Diplomat. Yep glad I don't have that job. Also it looks like some entries are for two words that bled together, such as "hisDuty".

The Obvious. There are a lot of words that can probably be struck, like the one that just caught my eye, page xcvi dishes 590, 618, though oddly hating on the dishes I do support keeping Dishwashers 570, 618 as they can be used as a trade good and media murder bomb.

Do and does look like candidates, and I bet "the" could get cut too, but I am only up to D at this point. Okay, now in I and found it, its, it's. Yikes. I am in R now and I say keep the roll variants. We gamers do a lot of things that involve rolls. Keep table too. "The" and "that" need killing.

page xi. Long strings of nonsense letters that start with A. May want to check the page numbers and see if these are a bunch of UPPs or UWPs that are bleeding together into one long string.

page lxxxiv. Same issue as above.

page lxxxv, cv, cxiv, cxxvi, ccxxvi, ccxxvii, ccxxx, ccxli, cclxiv, all of cclxv, cclxvi, ccxcvii, ccxcviii, ccxcix, ccc, cccxx, ccclxxiii, cclxxvii, cclxxviii. More instances of the above nonsense string that begins with the current index letter.

I am a gearhead, I find myself surprised that there is only the one entry for PGMP at 258.

And at page ccclxxxii or as I read it page 382, I am done with the first skim, very cool. Has Hemdian seen this yet? I know he too has a T5 Index project he is working on. He, Don, Marc and Rob should probably see it too.

Ackehece · Sep 22, 2014

sorry - posted the wrong one.. I have one that uses normal page numbers :rofl: *UPDATED - now in a better format, with actual non-academic page numbers!

yes there are some serious... formatting issues

Ulsyus · Sep 22, 2014

This is another great start, with a little work needed as pointed out. OTPS, I know which pages to go to to find the word Already!

Ackehece · Sep 22, 2014

Magnus von Thornwood said:
Just a couple of things that leapt out at me, so I shot them.

Page numbering. Not cool, well, okay kinda cool since I am sure to learn new numbers in the Latin I missed in high school, but otherwise it is a big pain. Suggest using the standard system.

sorry that was a mistake - it's a tool I use for the university that does academic paper annotation and roman numerals are used for that. fixed now

Magnus von Thornwood said:
Typography errors. Wow is this nice for those folks who get to clear up the typographical errors like DinSHA versus Dinsha and DiPloMAT versus Diplomat. Yep glad I don't have that job. Also it looks like some entries are for two words that bled together, such as "hisDuty".

Yes I did notice this - but since we want a good edit I have not collated these yet into one heading. - figure fix 'em and then edit it out!

Magnus von Thornwood said:
The Obvious. There are a lot of words that can probably be struck, like the one that just caught my eye, page xcvi dishes 590, 618, though oddly hating on the dishes I do support keeping Dishwashers 570, 618 as they can be used as a trade good and media murder bomb. Do and does look like candidates, and I bet "the" could get cut too, but I am only up to D at this point. Okay, now in I and found it, its, it's. Yikes. I am in R now and I say keep the roll variants. We gamers do a lot of things that involve rolls. Keep table too. "The" and "that" need killing.

yeah the index is pretty crazy - somewhere in the range of 27k different words! (many of which are not words.....)

Magnus von Thornwood said:
page xi. Long strings of nonsense letters that start with A. May want to check the page numbers and see if these are a bunch of UPPs or UWPs that are bleeding together into one long string.

page lxxxiv. Same issue as above.

page lxxxv, cv, cxiv, cxxvi, ccxxvi, ccxxvii, ccxxx, ccxli, cclxiv, all of cclxv, cclxvi, ccxcvii, ccxcviii, ccxcix, ccc, cccxx, ccclxxiii, cclxxvii, cclxxviii. More instances of the above nonsense string that begins with the current index letter.

genetic codes and UWP etc all create insane names/codes - these can be dropped in the final index.

Magnus von Thornwood said:
I am a gearhead, I find myself surprised that there is only the one entry for PGMP at 258.

as am I...
also see unrefined - be amazed it does not affect jump (this should be errata)

Magnus von Thornwood said:
And at page ccclxxxii or as I read it page 382, I am done with the first skim, very cool. Has Hemdian seen this yet? I know he too has a T5 Index project he is working on. He, Don, Marc and Rob should probably see it too.

posted this in the errata thread as well. but figured all should get a look at this.

Ackehece · Sep 22, 2014

Ulsyus said:
This is another great start, with a little work needed as pointed out. OTPS, I know which pages to go to to find the word Already!

ALSO creating an index that references just section titles. :rofl:

Magnus von Thornwood · Sep 22, 2014

You are cool and I eat crow

Ackehece said:
ALSO creating an index that references just section titles. :rofl:

Yeah, I eat some crow considering how I ragged on the T5 Has No Index crowd. It is a nice start to what is sounding like a useful tool, and I am not so foolish as to turn up my nose at it because of stubborn pride. Mostly. :devil:

Thanks again, this is looking good and seems like an excellent addition once it is finalized.

Ackehece · Sep 22, 2014

Ackehece said:
ALSO creating an index that references just section titles. :rofl:

this is very early... but I think it should be understandable
https://www.dropbox.com/s/jj33ielbc7qxc8d/T5 Index of topics (1).docx?dl=0

Ulsyus · Sep 23, 2014

That looks a bit more like an index - can you flip it around and sort by word?

cym0k · Sep 23, 2014

Ackehece said:
this is very early... but I think it should be understandable
https://www.dropbox.com/s/jj33ielbc7qxc8d/T5 Index of topics (1).docx?dl=0

Plain text version (oh the huge manatee!)?

It would allow playing around with column settings, font size etc without having to fight LibreOffice or your WP of choice.

Ackehece · Sep 23, 2014

cym0k said:
Plain text version (oh the huge manatee!)?

It would allow playing around with column settings, font size etc without having to fight LibreOffice or your WP of choice.

always doable... like I said anything is possible

Ackehece · Sep 23, 2014

Ulsyus said:
That looks a bit more like an index - can you flip it around and sort by word?

The word index is already sorted alphabetically....

Technically, an index is either a section heading index (the second one, kept as a Word doc while it is actively being worked on) or a word index (a concordance, which can be extracted digitally). Both are supremely useful.

and finally ctrl-f is your friend in most cases :devil:

Daddicus · Sep 23, 2014

Here's a copy with some tweaks. It's plain text. If you find it useful, I can maintain it.

https://www.dropbox.com/s/1m1d1y2a962e68l/Index.txt?dl=0

I took out all the lines that contain only one of these:

The title "Index" on each page.
The page numbers of the PDF file itself.
The letters of the alphabet (if they're the only thing on the line).

I also converted pages of the form pageX-pageY to be individual page numbers. So, 25-30 became 25, 26, 27, 28, 29, 30.
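
The range step can be sketched in Python roughly as follows. This is only an illustration, not the Excel macro mentioned in this post (its source is not shown in the thread); the function name `expand_pages` and the regular expression are assumptions made for the example.

```python
import re

# Assumed shape of a page range such as "25-30"; the real macro's rules
# are not shown in the thread.
RANGE = re.compile(r"^(\d+)\s*-\s*(\d+)$")


def expand_pages(refs):
    """Turn 'X-Y' items into individual page numbers.

    Plain numbers are kept as integers; anything that is neither a
    number nor a range is skipped.
    """
    pages = []
    for ref in refs:
        ref = ref.strip()
        match = RANGE.match(ref)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            pages.extend(range(first, last + 1))
        elif ref.isdigit():
            pages.append(int(ref))
    return pages


print(expand_pages(["25-30"]))
# [25, 26, 27, 28, 29, 30]
```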

I merged every instance of repeated words I could find, but only where the word is the whole word. While doing this step, I removed any duplicated page numbers.

For example,

abandon, 16, 102, 129
Abandoned, 92
abandoned, 105, 237, 478
Abandonment, 639
abandons, 212

became

abandon, 16, 92, 102, 105, 129, 212, 237, 478, 639
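
A minimal Python sketch of that merge, using the example above as input. The thread does not say how related word forms were matched, so the `HEADWORDS` table here is hand-written for this one example and is purely an assumption; `merge_entries` is likewise a made-up name.

```python
from collections import defaultdict

# Hand-written for this example only: maps a lowercased variant to the
# headword it should be filed under.
HEADWORDS = {
    "abandoned": "abandon",
    "abandonment": "abandon",
    "abandons": "abandon",
}


def merge_entries(lines, headwords=HEADWORDS):
    """Merge 'word, page, page, ...' lines that share a headword.

    Case is ignored, duplicate page numbers are dropped, and the pages
    of each merged entry are sorted.
    """
    merged = defaultdict(set)
    for line in lines:
        word, *pages = [part.strip() for part in line.split(",")]
        key = word.lower()
        key = headwords.get(key, key)
        merged[key].update(int(p) for p in pages if p.isdigit())
    return [
        f"{word}, {', '.join(str(p) for p in sorted(pages))}"
        for word, pages in sorted(merged.items())
    ]


raw = [
    "abandon, 16, 102, 129",
    "Abandoned, 92",
    "abandoned, 105, 237, 478",
    "Abandonment, 639",
    "abandons, 212",
]
print(merge_entries(raw))
# ['abandon, 16, 92, 102, 105, 129, 212, 237, 478, 639']
```

Using a set per headword is what removes the duplicated page numbers; sorting happens only once, when the merged lines are written out.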

I did NOT remove simple words like the, that, there, etc. I can do that very easily, but I would like a list of simple words that should be removed before I tackle that.

NOTES:

While the file, when viewed in Dropbox, appears to have words that take up more than one line, they don't actually. If you download the plain text file, each word has its own line.

If desired, I can produce this as an Excel file (.xlsx or .csv).

It only takes 1-2 minutes to run, so feel free to ask for enhancements.

If anybody wants to see the Excel macro source code, I can post it.

Finally, the word "Raw" in the first line is just a placeholder/title. I plan on removing that in a later iteration.

cym0k · Sep 23, 2014

Daddicus said:
Here's a copy with some tweaks. It's plain text. If you find it useful, I can maintain it.

https://www.dropbox.com/s/1m1d1y2a962e68l/Index.txt?dl=0

.......

Finally, the word "Raw" in the first line is just a placeholder/title. I plan on removing that in a later iteration.

Brightest star in the galaxy, right there! Cheers!

EDIT: 9pt fonts, 3 columns, 239 pages later! Oh my.... Comprehensive you are.

Ulsyus · Sep 23, 2014

Daddicus said:
Here's a copy with some tweaks. It's plain text. If you find it useful, I can maintain it.

https://www.dropbox.com/s/1m1d1y2a962e68l/Index.txt?dl=0

Nice evolution of this thing, don't stop.

Do you need help with the list of deletable common words?

Ackehece · Sep 23, 2014

it is humming along - do need to retain the first one for editing / errata purposes

As for the section title index rather than the word index - it has been updated to where this block shows

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
*****************************************************************
******************** Finished to here ******************************
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Daddicus · Sep 24, 2014

Ackehece, I made this new list from your latest. I didn't update Index.txt, but I can easily enough. Please let me know if you made any changes. I use macros to generate my lists from yours.
--------------

I've added a file (WordsByCount.txt) with the words in order by most occurrences to least. https://www.dropbox.com/s/s150wm0wflqm49q/WordsByCount.txt?dl=0

The first ~500 lines of the file are words that appear ~50 times or more. It seems to me that almost all of these are either useless to an index (the, than, then, there, etc.), or belong in a glossary rather than an index (referee, roll, etc.).

NOTE: 500 and 50 are arbitrary. I picked them just to show the scope (there are 14000+ total words).
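
For anyone who wants to reproduce a by-count ordering without Excel, here is a rough Python sketch. It assumes "occurrences" means the number of page references listed for a word, which the thread does not confirm; the function name `words_by_count` and the sample lines (apart from PGMP, 258) are invented for illustration.

```python
def words_by_count(lines):
    """Return (word, count) pairs, most page references first.

    Assumption: the count is the number of page references on the line.
    Ties are broken alphabetically, ignoring case.
    """
    counts = []
    for line in lines:
        word, *pages = [part.strip() for part in line.split(",")]
        counts.append((word, len(pages)))
    return sorted(counts, key=lambda item: (-item[1], item[0].lower()))


# Invented sample lines, just to show the ordering.
sample = [
    "roll, 12, 40, 41, 97",
    "the, 1, 2, 3, 4, 5",
    "PGMP, 258",
]
ranked = words_by_count(sample)
print(ranked)
# [('the', 5), ('roll', 4), ('PGMP', 1)]

# The cut-off is arbitrary, as noted above; 4 is used here only because
# the sample is tiny.
print([word for word, count in ranked if count >= 4])
# ['the', 'roll']
```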

I'm going to need help deciding which ones should be in the glossary and which should be killed. If everybody has Excel 2007 or later, I can post the workbook; that way, you could see the line number at which you're looking.

Ackehece · Sep 24, 2014

Daddicus said:
Ackehece, I made this new list from your latest. I didn't update Index.txt, but I can easily enough. Please let me know if you made any changes. I use macros to generate my lists from yours.
--------------

I've added a file (WordsByCount.txt) with the words in order by most occurrences to least. https://www.dropbox.com/s/s150wm0wflqm49q/WordsByCount.txt?dl=0

The first ~500 lines of the file are words that appear ~50 times or more. It seems to me that almost all of these are either useless to an index (the, than, then, there, etc.), or belong in a glossary rather than an index (referee, roll, etc.).

NOTE: 500 and 50 are arbitrary. I picked them just to show the scope (there are 14000+ total words).

I'm going to need help deciding which ones should be in the glossary and which should be killed. If everybody has Excel 2007 or later, I can post the workbook; that way, you could see the line number at which you're looking.

oh I know.... pain...

but I think we can create a few algorithms to reduce the number

1) No adverbs, adjectives, pronouns, prepositions, conjunctions, interjections or articles in the list.

2) UWP/Genetic codestrings can be deleted

3) consolidate forms of the word into one (you've done much of this already!)

4) Unfortunately spelling mistakes may have created some short words that should still be in the index..... so we need to check first when an entry doesn't appear to make sense.
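
A first pass at rules 1, 2 and 4 could look something like the Python below. Everything in it is an assumption for illustration: the stop list is a tiny hand-picked sample rather than an agreed list, proper part-of-speech filtering would need a real language tool, and the code-string test is only a guess (letters mixed with digits, or a very long unbroken token). The string "A788899-C" is a made-up UWP-style code.

```python
# Tiny illustrative stop list (rule 1); the real list still has to be agreed.
STOP_WORDS = {"the", "that", "there", "than", "then",
              "do", "does", "it", "its", "it's"}


def looks_like_code(word, max_len=20):
    """Guess whether a token is a UWP/genetic code or run-together text.

    Assumed heuristic (rule 2): letters mixed with digits, or a token
    longer than max_len characters.
    """
    has_digit = any(ch.isdigit() for ch in word)
    has_alpha = any(ch.isalpha() for ch in word)
    return (has_digit and has_alpha) or len(word) > max_len


def triage(words):
    """Split words into (keep, drop, review) lists.

    Suspected code strings go to review instead of being deleted, so a
    person can look first (rule 4).
    """
    keep, drop, review = [], [], []
    for word in words:
        if word.lower() in STOP_WORDS:
            drop.append(word)
        elif looks_like_code(word):
            review.append(word)
        else:
            keep.append(word)
    return keep, drop, review


keep, drop, review = triage(["the", "Dishwashers", "A788899-C", "PGMP", "hisDuty"])
print(keep)    # ['Dishwashers', 'PGMP', 'hisDuty']
print(drop)    # ['the']
print(review)  # ['A788899-C']
```

Note that run-together words like "hisDuty" slip through this heuristic, so they would still need the manual pass.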
