
Remove UTF-8 characters from string

April 13, 2016 — BarryK
For the very first time, I have hit a problem with UTF-8 characters in an Ubuntu package database file, 'Packages.xz'.

Yesterday, I ran '0setup' in woofQ, which downloads the Ubuntu db files and converts them to "Puppy standard format". Except this time, the script crashed.

In the 'main' repo, I found that the 'Description' field for 'firefox-locale-nb' has UTF-8 characters.

Well, 0setup sets "LANG=C" for speed, with LANG restored to the original for certain operations. I think that removing the LANG=C would fix it; instead, however, I opted to filter out the UTF-8 characters.
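
In sketch form, the pattern looks something like this (ORIGLANG and the grep call are hypothetical, just to illustrate the idea):

#run most of the script under the C locale, for speed
ORIGLANG="$LANG"
export LANG=C
#...fast byte-oriented processing...
#restore the original locale just for a multi-byte-aware command:
LANG="$ORIGLANG" grep 'Description:' Packages.pre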

In the 0setup script, for the Ubuntu/Debian case, where 'Packages.xz' gets expanded, I made this change:

#[ $RETSTAT -eq 0 ] && mv -f $xDLFILE ${PKGLISTFILE}pre

#160411 filter out utf-8 chars...
#-c tells iconv to silently drop any character with no ASCII equivalent
if [ $RETSTAT -eq 0 ];then
 iconv -c -f utf-8 -t ascii $xDLFILE > ${PKGLISTFILE}pre
 rm -f $xDLFILE
fi

The '-c' option tells iconv to discard characters that cannot be converted, so the UTF-8 characters are simply deleted, not converted to an ASCII approximation.
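
To see the difference, here is a quick hypothetical demo (the Norwegian sample word just stands in for text like the 'firefox-locale-nb' Description):

echo "Bokmål" | iconv -c -f utf-8 -t ascii
#prints "Bokml", the multi-byte 'å' is silently dropped
echo "Bokmål" | iconv -f utf-8 -t ascii//TRANSLIT
#with glibc, depending on locale, prints "Bokmal" or a '?' placeholder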

This problem has cropped up with the latest db update from Ubuntu Xenial Xerus.
The woofCE developer guys will also hit this problem, but they might prefer a proper fix that allows UTF-8 chars.

Here is a reference on filtering out UTF-8:
http://stackoverflow.com/questions/8562354/remove-unicode-characters-from-textfiles-sed-other-bash-shell-methods
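
For comparison, the same stripping can be done at the byte level with GNU tr; the filenames here are just illustrative:

#-c complements the set, -d deletes: remove every byte above octal 177
tr -cd '\0-\177' < Packages.pre > Packages.ascii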

EDIT:
Ha ha, technically, ASCII characters are valid UTF-8 too. It would be more correct to say that I am filtering out multi-byte UTF-8 characters.

Tags: linux