Sorry for stalling the undertaking. I have actually started working on it, though, and already have a usable version. Right now it just copies text files to another folder and (heuristically) strips the headers and footers.
The project is located here: https://github.com/tarsa/text-cleaner
It can be somewhat cumbersome to use. I think the easiest way for you to use it would be to:
- edit the Launcher.scala file. There are two paths: one for the existing Project Gutenberg data directory and one for storing the results. The first directory must contain the file master_list.csv and the extext?? directories
- run ./activator or activator.bat (depending on your OS)
- type "run" and hit Enter
- select the Launcher class and wait
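For orientation, the part of Launcher.scala you need to edit is just two path values. The names below are hypothetical (check the actual file; only the idea of two configurable paths comes from the description above):

```scala
// Hypothetical sketch of the two paths to edit in Launcher.scala;
// the actual value names in the repository may differ.
val gutenbergDir = "/path/to/gutenberg" // must contain master_list.csv and the extext?? dirs
val outputDir    = "/path/to/output"    // cleaned files are written here
```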
Right now Ubuntu reports the source directory with Project Gutenberg data at 709.5 MB and the resulting folder at 409.9 MB. Example output from a run:
[info] Loading project definition from /tmp/text-cleaner/project
[warn] Multiple resolvers having different access mechanism configured with same name 'typesafe-ivy-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Set current project to text-cleaner (in build file:/tmp/text-cleaner/)
[info] Compiling 1 Scala source to /tmp/text-cleaner/target/scala-2.10/classes...
Multiple main classes detected, select one to run:
Enter number: 1
[info] Running com.github.tarsa.squeezechart.textcleaner.Launcher
[success] Total time: 15 s, completed 2013-12-17 19:27:06
How the algorithm works:
First, there is a list of marker phrases: List("project gutenberg", "etext", "etexts", "ebook", "ebooks", "small print").
The algorithm finds these phrases in the files, matching whole words only (like the 'whole words' option in find dialogs).
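The marker search can be sketched like this (a minimal sketch using regex word boundaries, not the project's actual code; the object and method names are made up):

```scala
object MarkerScan {
  // Marker phrases from the post; matching is case-insensitive and
  // "whole words" only, like the option in a typical find dialog.
  val markers = List("project gutenberg", "etext", "etexts",
    "ebook", "ebooks", "small print")

  // One regex alternation with word boundaries around each phrase.
  private val markerRegex =
    markers.map(m => "\\b" + java.util.regex.Pattern.quote(m) + "\\b")
      .mkString("|").r

  // 0-based indices of lines containing at least one marker phrase.
  def markedLines(lines: Seq[String]): Seq[Int] =
    lines.zipWithIndex.collect {
      case (line, idx) if markerRegex.findFirstIn(line.toLowerCase).isDefined => idx
    }
}
```

The word boundaries ensure that, for example, "rebooks" does not match the "ebooks" marker.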
The algorithm ignores marks that are more than 50 lines away from every other mark and from the beginning or end of the file.
Concretely, it scans the marks from the beginning of the file and stops when there are more than 50 lines between the current mark and the next one (or if the first mark is more than 50 lines from the beginning).
The scan from the end works analogously.
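The gap rule, as I read it, can be sketched as follows (names and the exact boundary handling are my assumptions, not the project's code):

```scala
object MarkBounds {
  // Maximum allowed gap, in lines, between consecutive marks (and
  // between the first/last mark and the file boundary).
  val maxGap = 50

  // Walk from the start of the file: a mark belongs to the header run
  // while it is within maxGap lines of the previously kept mark (the
  // start of the file counts as the initial reference point).
  // `marks` are sorted 0-based line indices.
  def headerMarks(marks: List[Int]): List[Int] = {
    def loop(prev: Int, rest: List[Int], acc: List[Int]): List[Int] = rest match {
      case m :: tail if m - prev <= maxGap => loop(m, tail, m :: acc)
      case _ => acc.reverse
    }
    loop(-1, marks, Nil)
  }

  // The same scan performed from the end of the file.
  def footerMarks(marks: List[Int], totalLines: Int): List[Int] = {
    def loop(prev: Int, rest: List[Int], acc: List[Int]): List[Int] = rest match {
      case m :: tail if prev - m <= maxGap => loop(m, tail, m :: acc)
      case _ => acc
    }
    loop(totalLines, marks.reverse, Nil)
  }
}
```

For example, with marks on lines 0, 10, 200 and 500 in a 520-line file, the header run is lines 0 and 10 (the jump to 200 exceeds 50 lines) and the footer run is line 500.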
Additionally, after finding the bounding marked lines, the algorithm scans the content further to exclude the whole paragraphs containing those marked lines.
Finally, the algorithm outputs the content that was not excluded.
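The paragraph-exclusion step might look like this (a sketch assuming paragraphs are separated by blank lines, which is my assumption, not necessarily the project's rule; all names are made up):

```scala
object ParagraphTrim {
  // Extend forward from `line` to the last line of its paragraph,
  // where paragraphs are assumed to be separated by blank lines.
  def paragraphEnd(lines: IndexedSeq[String], line: Int): Int = {
    var i = line
    while (i + 1 < lines.length && lines(i + 1).trim.nonEmpty) i += 1
    i
  }

  // Extend backward from `line` to the first line of its paragraph.
  def paragraphStart(lines: IndexedSeq[String], line: Int): Int = {
    var i = line
    while (i > 0 && lines(i - 1).trim.nonEmpty) i -= 1
    i
  }

  // Keep only the content strictly between the end of the paragraph
  // holding the last header mark and the start of the paragraph
  // holding the first footer mark.
  def strip(lines: IndexedSeq[String],
            lastHeaderMark: Int, firstFooterMark: Int): IndexedSeq[String] =
    lines.slice(paragraphEnd(lines, lastHeaderMark) + 1,
                paragraphStart(lines, firstFooterMark))
}
```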
I hope you will be able to use the program without much trouble. If you run into any problems, write here and I'll try to help.