
Thread: script to clean Gutenberg texts

  1. #1
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts

    Question script to clean Gutenberg texts

    As there seems to be interest (according to my poll) in a clean text corpus, I want to ask here for a script to clean Gutenberg texts.
    When I was adding the Gutenberg set, I was thinking that it could be interesting to use texts in many languages rather than just
    choosing a single language, but the testset also includes JPEG, PNG and a small AVI file that might hurt compression and bias results.
    Secondly, the texts seem to have different character set encodings, which might also hurt compression.

    So I would remake the corpus and include nothing but clean texts. I am not familiar with Perl scripts - are there any other text cleaners?

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Firstly, those non-textual files like JPGs take up a lot of space in the Gutenberg testset. I think those JPGs account for well over half of the compressed size for each compressor. That's the biggest flaw in the Gutenberg set right now.

    You've mentioned different languages and different encodings - it would be good to sort the files by language and encoding before compression. Also, each file contains a Project Gutenberg-specific prologue and epilogue; those should be removed before any further cleaning is performed.

    For now the easiest and most rewarding action is to remove the non-textual files like JPGs and such. I have an interest in a clean textset for testing my upcoming CM codec, so I'll probably write a cleaning script (if nobody does it before I need it), but that codec isn't anywhere near a working state yet.

  3. #3
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    The plan is to redesign the Gutenberg testset. Sorting by encoding and language will take many, many hours. The prologues can of course be removed,
    and all non-textual files will be removed.
    Last edited by Stephan Busch; 3rd February 2013 at 02:04.

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I didn't mean fully manual work. I don't know exactly how many texts have the language written in the header, but quite a few do. The same goes for encoding, so it could be automated to some degree. But as I've said, I'm currently absorbed in my new compressor.
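
    For illustration, here is a minimal sketch of what such automation could look like. It assumes the header carries a line like "Language: English" (which not every file has); the object name and regex are illustrative only, not part of any existing tool:
    Code:
    import scala.io.Source

    object LanguageGuess {
      // Matches header lines such as "Language: English" (case-insensitive).
      private val LanguagePattern = """(?i)^\s*language:\s*(.+?)\s*$""".r

      /** Returns the language declared in the first 200 lines, if any. */
      def detect(path: String): Option[String] = {
        val source = Source.fromFile(path, "ISO-8859-1")
        try source.getLines().take(200).collectFirst {
          case LanguagePattern(lang) => lang
        } finally source.close()
      }
    }
    A file with no recognizable "Language:" line would simply yield None and could be routed to an "unsorted" bucket.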

  5. #5
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    I will hold off on the new Gutenberg testset until we have an automated solution (other than a Perl script) for cleaning.
    What will be done this month: the 3D Game and D.N.A. testsets will be replaced by Camera Raw and uncompressed audio testsets.

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Hi,

    Sorry for stalling the undertaking. I've actually started working on it, though, and I already have a usable version. Right now it just copies the text files to another folder and (heuristically) strips the headers and footers.

    The project is located here: https://github.com/tarsa/text-cleaner

    It can be somewhat cumbersome to use. I think the easiest way for you to use it would be to:
    - edit the Launcher.scala file. There are two paths: one for the existing Project Gutenberg data directory and one for storing the results. The first directory must contain the file master_list.csv and the extext?? directories
    - run ./activator or activator.bat (depending on your OS)
    - type "run" and hit Enter
    - select the Launcher class and wait

    Sample session:
    Code:
    piotrek@piotrek-desktop:/tmp/text-cleaner$ ./activator
    [info] Loading project definition from /tmp/text-cleaner/project
    [warn] Multiple resolvers having different access mechanism configured with same name 'typesafe-ivy-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
    [info] Set current project to text-cleaner (in build file:/tmp/text-cleaner/)
    > run
    [info] Compiling 1 Scala source to /tmp/text-cleaner/target/scala-2.10/classes...
    
    Multiple main classes detected, select one to run:
    
     [1] com.github.tarsa.squeezechart.textcleaner.Launcher
     [2] com.github.tarsa.squeezechart.textcleaner.TextCleaner
    
    Enter number: 1
    
    [info] Running com.github.tarsa.squeezechart.textcleaner.Launcher 
    [success] Total time: 15 s, completed 2013-12-17 19:27:06
    > exit
    piotrek@piotrek-desktop:/tmp/text-cleaner$
    Right now Ubuntu reports the source directory with the Project Gutenberg data at 709.5 MB and the resulting folder at 409.9 MB.



    How the algorithm works:
    First, there is List("project gutenberg", "etext", "etexts", "ebook", "ebooks", "small print") - a list of marker sequences.
    The algorithm finds those sequences in the files (using the method usually called 'whole words' in find dialogs).
    It ignores marks that are more than 50 lines away from other marks or from the beginning or end of the file.
    It scans from the beginning of the file for marks and stops when there are more than 50 lines between the current mark and the next one (or if the first mark is more than 50 lines from the beginning).
    The same is done when scanning from the end.
    Additionally, after finding the bounding marked lines, the algorithm scans further to exclude the whole paragraphs containing marked lines.
    After that, it outputs the content that wasn't excluded.
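
    For clarity, here is a rough Scala sketch of that marker-scanning idea. It is a simplification, not code from the text-cleaner repository: the marker list and the 50-line threshold follow the description above, the paragraph-widening step is omitted, and all names are illustrative.
    Code:
    import java.util.regex.Pattern

    object StripMarkers {
      private val markers = List("project gutenberg", "etext", "etexts",
        "ebook", "ebooks", "small print")
      private val maxGap = 50 // maximum distance in lines between chained marks

      // True if the line contains any marker phrase as a whole word, ignoring case.
      private def isMarked(line: String): Boolean = {
        val lower = line.toLowerCase
        markers.exists { m =>
          ("\\b" + Pattern.quote(m) + "\\b").r.findFirstIn(lower).isDefined
        }
      }

      // Follows marks away from `anchor`, accepting each mark that lies within
      // `maxGap` lines of the previous one; returns the last mark reached, if any.
      private def chain(marks: List[Int], anchor: Int): Option[Int] = marks match {
        case pos :: rest if pos - anchor <= maxGap =>
          chain(rest, pos).orElse(Some(pos))
        case _ => None
      }

      /** Strips the heuristically detected Gutenberg prologue and epilogue. */
      def strip(lines: Vector[String]): Vector[String] = {
        val marked = lines.indices.filter(i => isMarked(lines(i))).toList
        val last = lines.length - 1

        // Header: chain marked lines starting from the beginning of the file.
        val headerEnd = chain(marked, anchor = 0)
        // Footer: the same scan, with distances measured from the end of the file.
        val footerStart = chain(marked.reverse.map(last - _), anchor = 0).map(last - _)

        // Keep everything strictly between the header and footer regions.
        lines.slice(headerEnd.map(_ + 1).getOrElse(0),
                    footerStart.getOrElse(lines.length))
      }
    }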



    I hope you'll be able to use the program without much trouble. If you run into any, write here and I'll try to help.

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I've pushed a new version which does some preliminary cleaning and also sorts by language. It turns out that non-English .txt files constitute less than 1% of all .txt files in the Project Gutenberg testset.

    I'm not sure what to do with non-ASCII letters though (i.e. those outside the 7-bit range). I'll probably skip non-English texts altogether and forget about the problem.
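
    Whether a given .txt file stays within 7-bit ASCII is at least easy to check automatically; a tiny sketch of such a check (not from text-cleaner; names are illustrative):
    Code:
    import java.nio.file.{Files, Paths}

    object AsciiCheck {
      // True if every byte of the file has its high bit clear (pure 7-bit ASCII).
      def isSevenBit(path: String): Boolean =
        Files.readAllBytes(Paths.get(path)).forall(b => (b & 0x80) == 0)
    }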

  8. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    Stephan Busch (19th December 2013)

  9. #8
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    I am still interested in this program. It could generate clean text files. In my opinion it should also clean non-English texts, but I am not sure how to solve the non-ASCII-letters issue.

  10. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Stephan Busch View Post
    I am still interested in this program. It could generate clean text files. In my opinion it should also clean non-English texts, but I am not sure how to solve the non-ASCII-letters issue.
    I'm not sure what the issue is, but you can convert between encodings with the command-line utility that comes with libiconv: http://www.gnu.org/software/libiconv/ If you have Linux or OS X, it may already be installed.
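
    If you'd rather stay inside the thread's Scala tooling instead of shelling out to iconv, a minimal transcoding sketch using the standard java.nio.charset API could look like this (file and object names are placeholders):
    Code:
    import java.nio.charset.{Charset, StandardCharsets}
    import java.nio.file.{Files, Paths}

    object Transcode {
      /** Reads `in` using the given source encoding and writes it out as UTF-8. */
      def toUtf8(in: String, out: String, sourceEncoding: String): Unit = {
        val bytes = Files.readAllBytes(Paths.get(in))
        val text  = new String(bytes, Charset.forName(sourceEncoding))
        Files.write(Paths.get(out), text.getBytes(StandardCharsets.UTF_8))
      }
    }
    For example, Transcode.toUtf8("book.txt", "book-utf8.txt", "ISO-8859-1") would convert a Latin-1 file; files whose encoding isn't declared anywhere would still need guessing or manual inspection.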

    What file format are the Gutenberg texts in, that they need to be stripped of images?

  11. #10
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    The objective is, in general, to provide a testset of uniform, unstructured data. That removes the need for special logic in compressors for handling structures like formatted text, markup, timestamps, etc., and it makes it easier to compare core compression algorithms rather than filter + compression combinations, even when a compressor has built-in filters and other logic for formatted data that cannot be disabled. The exception is dictionary preprocessing, but that can often be disabled. Or we could confuse the preprocessors by applying rot-13 encoding.
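
    As an aside, rot-13 maps each ASCII letter to the letter 13 positions further along the alphabet (wrapping around), so byte-level statistics are merely permuted while an English-word dictionary no longer matches anything. A throwaway Scala sketch, purely for illustration:
    Code:
    object Rot13 {
      // Rotates A-Z and a-z by 13 positions; leaves every other character untouched.
      def rotateChar(c: Char): Char =
        if (c >= 'a' && c <= 'z') ('a' + (c - 'a' + 13) % 26).toChar
        else if (c >= 'A' && c <= 'Z') ('A' + (c - 'A' + 13) % 26).toChar
        else c

      def rotate(s: String): String = s.map(rotateChar)
    }

    // Rot13.rotate("Project Gutenberg") == "Cebwrpg Thgraoret"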

    I haven't looked deeply into the non-English texts, so I'm not sure whether any multi-byte character encodings are present. If so, I think it would make more sense to remove such files completely. Multi-byte character encodings are structures in themselves, since most compressors work at the single-byte level. For example, PPM compressors have entries for every byte position, LZ coders have offsets and lengths denominated in bytes, the n-gram models in CM coders model n-grams starting and ending on byte boundaries, and so on.

    From what I can estimate, non-English texts constitute less than 2% of all of Project Gutenberg's text (on SqueezeChart), so IMO it's not worth bothering with them. After all, there's a multilingual Bible compression test on SqueezeChart with plenty of multi-byte encodings, and one can draw conclusions from that benchmark.

    Edit:
    There's at least one file with UTF-8 encoding: clprm10u.txt, which contains some really old Icelandic text.
    Last edited by Piotr Tarsa; 19th December 2013 at 14:35.

  12. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Have you looked at http://pizzachili.dcc.uchile.cl/texts.html ?
    They have a 2.2 GB text corpus from the Gutenberg project with headers deleted. Plus a few other interesting data sets.

  13. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Piotr Tarsa (19th December 2013)

  14. #12
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    The objective is, in general, to provide a testset of uniform, unstructured data. That removes the need for special logic in compressors for handling structures like formatted text, markup, timestamps, etc., and it makes it easier to compare core compression algorithms rather than filter + compression combinations, even when a compressor has built-in filters and other logic for formatted data that cannot be disabled. The exception is dictionary preprocessing, but that can often be disabled. Or we could confuse the preprocessors by applying rot-13 encoding.
    Or you could confuse them by feeding them Icelandic. The `file` utility on Unix will try to report the encoding of a text file.

