
Thread: LZ compression of audio texts

  1. #1
    Member

    LZ compression of audio texts

    Just wondering if LZ-style compression has ever been attempted on audio texts, and if so, what the results were?

    http://williamedwardscoder.tumblr.co...o-books-better

  2. #2
    Programmer schnaader
    You usually can't apply LZ algorithms to audio directly. Apart from the noise, which you could filter out more or less, a given word can be spoken in many different ways, with different emphasis and so on. You'd need at least some very good algorithms to be able to apply those subtle differences to a "base audio word".
    http://schnaader.info
    Damn kids. They're all alike.

  3. #3
    Member
    Sorry, I didn't mean it so literally.

    I mean, if you can send the transcript of the text in just a few hundred KB, then the decoder can know that the speaker is about to say the word "the", and it already has a fair few samples of that voice saying "the", and so on.

    Can you imagine a system that encodes the speech-speed variations and any slight pronunciation variations using far fewer bits than a conventional streaming audio compressor?

    I wonder if it's useful? These days, for audio texts and offline podcasts, a text transcript of what the person is saying is often available.
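    A minimal sketch of the idea, with all names and structure my own assumptions rather than any real codec: the coder keeps earlier utterances of each transcript word, predicts the upcoming word's waveform from the closest stored example, and only transmits the residual, which should entropy-code much more cheaply than raw samples when the prediction is good.

    ```python
    import numpy as np

    def encode_word(samples, history):
        """Encode one spoken word given earlier utterances of the same word.

        samples: 1-D int16 waveform of the word about to be spoken.
        history: earlier waveforms of the same word, taken from already-decoded
                 audio, so encoder and decoder share it exactly.
        Returns (predictor_index, residual); only the residual (plus the tiny
        index) would actually need to be entropy-coded and sent.
        """
        if not history:
            return None, samples                # no prediction possible: send raw
        def mse(a, b):                          # compare over the overlapping part
            n = min(len(a), len(b))
            d = a[:n].astype(np.int64) - b[:n].astype(np.int64)
            return float(np.mean(d * d))
        best = min(range(len(history)), key=lambda i: mse(samples, history[i]))
        pred = history[best]
        n = min(len(samples), len(pred))
        residual = samples.copy()
        residual[:n] -= pred[:n]                # small values if prediction is close
        return best, residual

    def decode_word(predictor_index, residual, history):
        """Invert encode_word using the identical shared history."""
        if predictor_index is None:
            return residual
        pred = history[predictor_index]
        out = residual.copy()
        n = min(len(out), len(pred))
        out[:n] += pred[:n]
        return out
    ```

    In practice time-warping and pitch alignment between utterances would matter far more than this naive fixed-offset MSE match; this only illustrates the transcript-driven prediction loop.
    
    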

  4. #4
    Programmer Bulat Ziganshin
    it would yield even more economy applied to video files. for example, we could encode "Angelina Jolie smiling" in just a few bytes instead of passing a multi-megabyte video

  5. #5
    Member
    Quote Originally Posted by Bulat Ziganshin View Post
    it would yield even more economy applied to video files. for example, we could encode "Angelina Jolie smiling" in just a few bytes instead of passing a multi-megabyte video
    Doubtless that was meant as a subtle dismissal.

    It's certainly true that camera trackers can often infer very accurate 3D models for most films, and even more so for 3D filming.

    Of course, video compression formats try to minimize the window needed to decode, and all the ones I know of are block-based, using at most a handful of adjacent frames.

    Compressing audio where a transcript exists, on the other hand, is a niche but interesting use case: the playback devices are already software-programmable and have sufficient RAM to keep the decoded history around to refer to. Even problems like seeking can be overcome, simply because audio, despite being 1000x bigger than text, is minuscule compared to video.

  6. #6
    Expert Matt Mahoney
    Actually that is a good point. In theory you could transcribe speech to text at about 10 characters per second and compress it to about 10 bits per second. Even more surprisingly, you could do this with video, if you consider the rate at which the human brain stores information. http://csjarchive.cogsci.rpi.edu/198...p0493/MAIN.PDF

    So basically, that is correct. If you used the same words to describe two videos, chances are that if you watched one after the other, you wouldn't notice any differences. You actually remember a lot less than you think you do.

    Of course, it is a hard AI problem to convert the words "Angelina Jolie smiling" and maybe a few more details into a video. I don't expect it will be solved anytime soon, but it is something a human artist could do, so it is not theoretically impossible.

  7. #7
    Member
    On the video front, it's routine to chop bits of buildings and objects out for editing; that's all part of movie "post-production", and the tools are quite effective.

    If you can swap out the face of a building, you can just as well send the face once, plus its coordinates in each frame (or its coordinates in 3D space and the deduced path of the camera), and have the video player put it back.

    However, this works against the way videos are played back.
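    A toy illustration of the "send the face once plus coordinates" idea, with every name hypothetical: the encoder transmits the patch pixels a single time plus an (x, y) position per frame, and blanks the patch out of each frame so a conventional codec can compress the remainder; the decoder pastes the patch back.

    ```python
    import numpy as np

    def encode(frames, patch, positions):
        """frames: list of H x W arrays; patch appears at positions[i] in frame i.
        Returns the patch (sent once), the per-frame positions, and frames with
        the patch region blanked out (cheap for a conventional codec)."""
        blanked = []
        for f, (y, x) in zip(frames, positions):
            g = f.copy()
            g[y:y + patch.shape[0], x:x + patch.shape[1]] = 0
            blanked.append(g)
        return patch, positions, blanked

    def decode(patch, positions, blanked):
        """Paste the patch back at its recorded position in every frame."""
        out = []
        for g, (y, x) in zip(blanked, positions):
            f = g.copy()
            f[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
            out.append(f)
        return out
    ```

    Real post-production trackers deal with perspective, occlusion, and lighting changes, none of which this axis-aligned cut-and-paste touches; it only shows the bandwidth argument.
    
    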

  8. #8
    Expert Matt Mahoney
    Yes, there are lots of tricks you could do. For example, track eye movements and only send fine detail where you are likely to look. Predicting eye movements is hard. We typically look at edges, corners, and movement, but also things that interest us such as faces and text.
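    A crude sketch of that gaze-dependent allocation, purely illustrative: quantize each pixel more coarsely the farther it sits from the predicted fixation point, so fine detail survives only where the viewer is likely to look.

    ```python
    import numpy as np

    def foveated_quantize(image, gaze, base_step=2.0, falloff=0.05):
        """Quantize more coarsely with distance from the predicted gaze point.

        image: 2-D float array; gaze: (row, col) of the predicted fixation.
        Near the gaze the step stays ~base_step (detail kept); far away the
        step grows, so distant regions collapse to few levels and compress well.
        """
        h, w = image.shape
        rows, cols = np.mgrid[0:h, 0:w]
        dist = np.hypot(rows - gaze[0], cols - gaze[1])
        step = base_step * (1.0 + falloff * dist)   # coarser with distance
        return np.round(image / step) * step        # lossy away from the gaze
    ```

    The hard part Matt points out is upstream of this: predicting where the gaze will land; the quantizer itself is trivial once you have that prediction.
    
    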

  9. #9
    Programmer Bulat Ziganshin
    ok, the best compression is to convince someone to not watch the film at all

  10. #10
    Member
    Which is steering away a bit from the idea that having the transcript of what someone is *about* to say next can help audio compression of the person saying it. Where's the flaw in that thinking?

  11. #11
    Member
    You'd probably also want to track inflections, speed and other nuances of human speech, but in principle it sounds like it would work. The hard thing is to know how much of this is already taken into account in the existing codecs (whether deliberately or accidentally).

  12. #12
    Administrator Shelwien
    1. If there's a specific app which was used to generate a given .wav from text, then it's likely possible to reverse-engineer it. In DOS times, when tracker music was much more popular, there were even attempts to generate tracker files from wavs, which is probably more complicated than reverse-engineering a specific speech app (all of them failed, though).

    But this is only possible for 100% synthesized speech wavs - even simple background music and/or additional processing would make it much less efficient. We'd still be able to recover and subtract most of the text, but there'd still be a lot of extra data remaining, so the overall compression ratio would be much worse.

    2. Actually, a few years ago I did try to implement something like LZ matching of audio - not for compression though, but for speech recognition. It linked the patterns by normalized/quantized/hashed ids and then selected the best match by MSE of the actual samples corresponding to the same context id. It actually worked pretty well for vowels (I didn't solve the length issue though, so it could output "aaa" for "a"), but I couldn't find any such patterns for consonants - and no wonder, as some of them are transitions between patterns, others are shaped noise, etc.

    3. Anyway, speech recognition is quite possible (I've seen some pretty impressive results on YouTube recently, with auto-generated subtitles), and then we can compress the text and re-generate the sound from it with speech synthesis software. But lossless compression would be much more troublesome, and the main problem is that most of the volume of audio data is noise anyway. Sure, pattern matching has some potential for lossless audio compression too, but likely not as a transformation (audio -> text + xxx). Instead, it would probably be more helpful as a kind of match model in a CM compressor.
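    A rough reconstruction of the scheme described in point 2 - the frame size, the quantizer, and all names are my assumptions, not the original code: frames are reduced to coarse ids by normalizing and quantizing, candidate matches are looked up by that id, and the winner is chosen by MSE over the raw samples.

    ```python
    import numpy as np

    FRAME = 256        # samples per analysis frame (assumed)
    LEVELS = 8         # quantization levels for the coarse context id (assumed)

    def frame_id(frame):
        """Amplitude-normalize a frame and quantize its energy into a small id."""
        f = frame.astype(np.float64)
        peak = np.max(np.abs(f)) or 1.0
        f /= peak                                   # remove volume differences
        energy = np.mean(f * f)                     # in [0, 1] after normalization
        return int(min(LEVELS - 1, energy * LEVELS))

    def best_match(signal, pos, index):
        """Among earlier positions sharing this frame's id, pick the lowest-MSE one."""
        frame = signal[pos:pos + FRAME]
        cands = index.get(frame_id(frame), [])
        def mse(p):
            d = signal[p:p + FRAME].astype(np.float64) - frame
            return float(np.mean(d * d))
        return min(cands, key=mse) if cands else None

    def scan(signal):
        """Walk the signal frame by frame, reporting (pos, matched_pos) pairs."""
        index, matches = {}, []
        for pos in range(0, len(signal) - FRAME + 1, FRAME):
            matches.append((pos, best_match(signal, pos, index)))
            index.setdefault(frame_id(signal[pos:pos + FRAME]), []).append(pos)
        return matches
    ```

    A single scalar energy id is far coarser than whatever the original used, but it shows the two-stage structure: cheap hashed lookup first, exact MSE ranking second - which is also roughly what a match model inside a CM compressor would do.
    
    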

  13. #13
    Expert Matt Mahoney
    Quote Originally Posted by willvarfar View Post
    Which is steering away a bit from the idea that having the transcript of what someone is *about* to say next can help audio compression of the person saying it. Where's the flaw in that thinking?
    Yes, that's the idea, but there are two problems. First, for lossless compression, the speech signal you want (10 b/s of text) is overwhelmed by incompressible noise. So you throw that away, save just the text, and use a speech synthesizer. The second problem is that it is an AI problem, and it needs a lot of computing power. We sort of know that the patterns we want to detect are phonemes, but we want the algorithm to learn those features without us having to label or encode them.

    Google did a study on learning important image features using neural networks: http://arxiv.org/pdf/1112.6209v3.pdf

    This is a lossy compression problem. The algorithm learns to detect high level features like human faces, human bodies, or cats, given nothing more than a training set of lots of images (in this case, 10^7 200x200 images, each a single frame from a different Youtube video). To compress an image, you would transmit its high level features (person holding a cat). To decompress, you would search for a pixel array that activates the same features.

    The problem is that it requires a lot of computing power. In this case, 1000 16-core CPUs for 3 days to train a 9 layer neural network with 10^9 connections. And that is tiny compared to what the human brain does, so the results are still pretty poor compared to human vision.

