# Thread: Hello. For any interested in assisting to write an algorithm

1. ## Hello. For any interested in assisting to write an algorithm

Hello.
I am not a programmer with some limited skills in that field, but have had interest in compression for some time.
I had some interested in the notion of a step-by-step efficient operating system on specific custom hardware. I have been with computers for a number of years and have been there to witness their evolving and increase of power/speed/architecture.

I decided to start looking at data compression to have an integrated tool with this OS mentioned and its programming, allowing for data to be lossless and compressed very small where applicable such as executables designed to use many steps of the same few opcodes.

After looking at multiple approaches to compression and data structure on paper as its bits, and as a principle of 0/1 being 2 possibilities in a generally unlimited sequence, I looked at multiple things. I ended up figuring out the notion of hashing, where minor data applied on a string can be used to reasonably accept a string for its validity. This can even be like every last bit of every 64 bits for a stretch and be like 24B worth of this. The idea is that there are a number of possible files that can possibly fit that exact same hash throughout the spectrum, and where some minor data is altered and most is the same the hash can reflect this, and where the number of the 'sequence' of files possible is used with that hashing data you result in data equivalent to the original file in length.

I looked at a number of methods, working on smaller data, larger data. I considered in general the makeup of the data as its bits and positioning/possibilities, the data as a decimal number for its 'sequence' providing there is accounting for the number of 0s where they are present in the start of the file and have a math formula to work on the decimal numbers, and others.

After time I began to consider the notion of looking specifically at repetitive strings, which is what is done in modern compression tools. I figured modern tools should be able to do a decent job. Where one can find repetition of a string throughout, one can represent this in less data and write this represented (coded) form using less data and make up for the additional identification information for this coded data by the amount of data repetition of the strings takes.
Repetition in a row in general is the largest form of compressing a file, since where this is prevalent will be the most bits reduced from one entry of a string and one entry for a number to represent its repetition.
I figured there are methods similar to huffman where one can use a variable prefix code tree with prefixes of different length that will not be mistaken when reading from left to right to represent data, also allowing for compression, as well as something like LZ which can make a dictionary type data with some information based on repetition found within a specific chunk/frame.

I figured that in order to accurately capture the repetition in a file, there must be a means to get all repetitive strings available, as well as be able to account for repetition in a row specifically as a number. I then figured it is more appropriate to target specific areas of the file and process those specific areas individually and piece the results together than work on the file as a whole, for the purpose of accounting for data distribution. This means a specific area in a file which 'will not compress' besides one minor stretch of data should be able to be detected and acknowledged, and where there are two of these that are different, have a coded representation to have these represented in their smallest form overall with their differences and common data.

Many compression techniques use frames/chunks in order to be able to consistently process a file in 'one go' per se without a lot of temporary storage/calculating and end up being relatively cheap in terms of the compression done for space/time. They also use loose string detection such as 'frequency of every 8 bits' for example.

I was after something which in principle should be as 'thorough as possible' on a string where it can detect compression in the spectrum of data, being as dynamic as can be. This means, looking for the exact ideal strings to suit a variable code tree that will have the least entries and will ultimately have the smallest result possible. This means, if the most ideal string to give to the least bit prefix is exactly a 5 bit string, that is what it will be, and if the next is exactly an 11 bit string, that is exactly what it will be. I sought after having this approach done dynamically on the data to suit the exact areas of the file for distribution as well as determining raw/compressible areas. I sought after ensuring repetition in a row was also acknowledged specifically when met as a number for its repetition as this is the highest form of compression.

All in all I did the research and developed a general algorithm (on paper) which should be able to extremely comprehensively capture repetition in a string in an automated and systematic approach where any data you give it should be able to get with enough advance calculation and temporary storage the compression possible. This is the concept behind this thorough compression approach.

A game like kkreiger is 96KB for this one level demo.

This claims to be using a kind of hand tailored procedural compression for its content, which extracts to RAM on startup for few mins to a few hundred MB.
Considering the palette of this game is very minimal and has mostly repetition of very few colours, this is very doable with the compression algorithm I have written as well as make it smaller. This algorithm should make this kkreiger potentially smaller, doing a process like compressing the raw repetitive and few colour extracted assets heavily and compensating for any flaws in the structure of their raw 'shell' data.
This 96KB basically is the equivalent of a few minutes loading for this one level than larger data which will take less time to load.

An example for the confirmation of general compression algorithms was a handmade BMP of 8 colours, 1024x768, 2.25MB, which were 4 lines each and that cluster of 32 repeated 24 times. PNG did this as 5KB, and with and without the 56B header RAR was 2KB and 7z was 1KB on their highest form. I was discussing this compression on a forum where a member used DEFLATE on the data without the header and the result was under 100B. The member further shrank this info for its patterns fitting in 5 bits and header into 50B.
It shows how for majority repetition the above compression algorithms were far from comprehensive. This compression algorithm I wrote will also be able to internally repeat a process to capture more repetition once data has shuffled itself, as expected from the BMP and in general occurs in compressed data where the DEFLATE result can shrink further. The compression algorithm all in all should be able to get the data into less than the 50B of the DEFLATE+POST SHRINK within the exact same file, and have areas in the compressed file of 'compressed' + 'layered' + 'raw' in one file for having any kind of data which can be encapsulated within one file and have that specific portion temporarily decode and be appended than direct left-to-right stream.

In principle, this algorithm should be able to squeeze and drain a file to its general potential max, as well as has options to scale and be cheaper/looser for space/time and resemble other compression algorithms. It is meant to be very comprehensive, where you can adjust its options for a custom setting and have the option to specifically squeeze a file.
The only thing in general which is not part of this algorithm which would make it more 'complete' for that thorough option is having data be moved in advance to suit more compression where the information to represent moving the data will be less stored than the result of the compression achieved and make up. The calculated max possible expectation for this is extremely little compression and a lot of calculating, which for the time being I have not included.

I am not a programmer, but have this algorithm written for all its functions and what is should do, and can show that it will do a thorough job. Including adjusting its settings and allowing for it to be done in chunks to be more friendly on space, it is meant to be very comprehensive.
It is ideal to have an ASM version of this tool for the speed required for the more intense settings, as well as having no image/sound etc and work in a boot time environment (full control of hardware without os concurrency).

With the omission of moving / shuffling data in advance for the minute possible compression for applicable files in the spectrum, this should be able to squeeze a string where possible to its bone while being comprehensive and adjustable to be versatile for space/time and user preference. It is meant to be a specific tool that one can use and adjust to personal preference for the CPU/Storage balance of time/space for their use, as well as having a means for archiving data in a very (its most in general without moving data in advance for its very little compression) compact manner.

For something like the hutter wiki data, in principle this approach is meant to just about get data to its bone, and it does rely on advance calculation to confirm compression is present in order to only compress 'what can be' and simply reject any data including a whole file where none is accepted for confirmation.
Only thing missing which will be a fair amount of calculating and result in very little compression is recognizing moving data / shuffling data in advance to compress very slightly further.

I'm personally interested in a final product, and anyone that is interested in developing this tool and even just testing it on the wiki is welcome.
I'm not a programmer, and I am some way from figuring asm and having this written. I am looking at the overview of computer expectations/performance in general, without being very specific to OS/hardware configurations for the programmer.
I have put the research on the net, and can be looked at and tried/read by others.

The mediafire link is the same as the one in the second post above.
Anyone interested in having this done with all its bells and whistles in asm and have a result for the hutter information (as a side thing, with consideration that it is missing the very minute possible data compressed with moving data in advance) would be nice.
All details including immediate decoding where an advance calculation provides minimal bits to determine offset and its algorithm state is there.

Thanks.

EDIT:
Decided to upload the research info and algorithm here too as an attachment

EDIT:
One note on the files which will not compress. These are generally expected to be the result of compressed files in the file spectrum for what is compressible, making the possibility 50/50.
When the file has a scan to determine compressible areas, it can be adjusted to its strength/degree for results to offset and individually process for a scan which considers distribution.
This in general will only occur if there is 2 possibilities in a string - i.e. there is both a 0 and 1 in the string. Depending on strength, data in a stretch can be scanned for all iterations of 2 bit combinations and together where available.

Where 00, 01, 10, or 11 is found in a stretch, it is guaranteed to be compressible because it is 1 of 4 possible options and will always fit in 1 bit. Where two of them are tested for having presence in a stretch, it is 2 of 4 possible options and will always fit in 1 bit. These stretches have guaranteed compression. The specific stretches of 1 or 2 combinations will always have 100% compression where long enough to include the database info and representation.

Where three of them are tested for having presence in a stretch, it is 3/4 possible options and will require 2 bits for this. However, as it only uses 3 of 4 bit combinations, depending on the general length of the stretch / some prevalent data it has leniency for compression. It will generally 75% compress providing the stretch is long enough to compensate and there is some minimal (and minor) prevalence between them for having a variable representation.

Where all 4 are used, which in general is a default for the file or for remaining areas not specifically associated to above possibilities, providing that absolutely none of the scans above found anything or from the remainder, the possibility for compression at this point is from determined repetition. There is generally 25% compression based on combination content / length. It requires specific repetition of data to account for the database info and the representation. Where no repetition is found in this scenario of above scans not finding info and all 2 bit combinations are generally in balance of the 00/01/10/11 throughout without some generally major prevalence, there isn't necessarily data to be used where one is more than another and can be compensated by giving it a smaller representation and the other data a larger representation.

This data is generally considered 'uncompressible', and is generally targetted for transformation of a file where it can be compressed, and in the least bits. It comes from iterations of 3 / 4 combinations, with a ratio more related to something like 25% / 75% respectively.

Where 'random data' is referred, it is possible to compress providing it contains stretches which when scanned pick up the possible iterations of combinations/lengths, and where not there is repetition of strings and the 2 bit 00/01/10/11 are not all in balance and has some compensation for a ratio to encode (with a variable prefix).

Discussed in the description is mirrored/inverse strings.
It is possible to fake something like each odd/even bit to be swapped intentionally to encourage more repetitive areas as a form of preprocess, or something to this effect (mirror/inverse can do something along this). Moving / shuffling data can also assist (minor).

The writeup is more along the lines of doing the data without moving/changing data in advance.

2. There was one oversight with the pdf of the algorithm for the general expected compression range of data, which is to allow for both a variable length prefix code tree and a static length prefix code tree, and their co-existence (entries in each to switch between them, similar to entries to acknowledge raw gaps).
This is one which includes this addendum (mention in general and expectation).

There is also this to give some more detail on the expected range of compressible data.

From the 50% compressible data spectrum in principle where data has the prevalence to record this, the approach for this tool is to be dynamic/systematic (and work automatically in determining) to be able to accommodate and compress the possible 50% compressible data range (hence allowing for the smallest possible compressed files) when specifically acknowledging repetition/prevalence.
It is not targeting data types, but rather different arrangements of data as its algorithm functions/makeup. Something like repetition in a row as a number allows for the highest form of compression and is encouraged to capture that specific form of data arrangement (allowing for many gbs of 0 bits for example to be in a file totaling a few b where other tools can be kb+). Something like looking to be specific to data patterns in advance and targeting only these areas to compress them as well as being specific to the data distribution from common and different strings within compressible areas.
The idea is to have basic functions which target specific arrangements of 0/1 data rather than specific to a file type, which in general would be accommodated (and accurately) by targeting algorithms suiting specific data arrangements.

Besides moving data in advance, and possibly looking into a notion of 'fake decompressing uncompressible data as though it is compressed using poorer constraints and then compressing the result with a stronger scheme', the essential components are meant to be targeting data arrangements and their co-existence.

The idea is to be able to get the complete 50% possible compressible spectrum based on prevalence/repetition on the highest setting (and allow for the smallest files), then be scalable to specific user standards for space/time in a custom way.
This with its current setup and algorithm components should accommodate a number very close to the full 50% possible on its highest capacity (including the addendum of variable/static prefixes and their co-existence. This was an oversight).

3. ## The Following User Says Thank You to SoraK05 For This Useful Post:

Nania Francesco (19th July 2014)

4. The approach to this is to look at strings and ways to capture repetition that can be detected.
This has multiple approaches to be very specific in detecting data and repetition, and ways to record it their most efficient manner.

One thing I noticed is something similar to repetition in a row which can be the highest form of repetition where something similar can be done for data which is generally dispersed in a predictable and set manner, such as every 64 bits is the exact same string, and even multiple iterations like every 64 + 32, and this can be recorded as is in the header entirely without any presence in the written prefix/raw part, as well as general offsetting. This can be considered part of its own section in the header to set this up as part of the algorithm functions, and would be determined against the weight of doing this or leaving them by interference with another aspect.
When I get around I can figure a decent algorithm implementation to systematically detect presence of fixed predictable string+location info to be the equivalent of repetition in a row for static data placed around a file.

While this algorithm may not be entirely complete toward a 50% ratio of files in the spectrum (without moving data and being streamable left-to-right <even with decoding layers temporarily>), considering its approach at getting minimal strings and recording things appropriately like repetition in a row as a number, variable/static prefixes, being specific to areas and distribution with advance processing and such, this should be generally very close for a string detecting method.

Any other things that can be done in general to compliment this approach is welcome.

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•