Recursive-strip C comments

Forum: DSL Tips and Tricks
started by: stupid_idiot

Posted by stupid_idiot on Dec. 07 2007,12:41

Big Fat Warning:
Hi all,
None of the methods described here actually work.
I haven't found any workable solution yet.
Everything I've posted is just a preliminary attempt.
Please do not use ANY of these examples in their unmodified form on any source code!!!
Problems:
-- No reliable way to preserve directives ("#define", "#ifdef", "#endif", etc).
-- "Missing semicolon" compilation errors after deleting comment blocks (the comment blocks were acting as semicolons by signifying the end of a function).
-- Most probably, there are many, many more problems which I haven't run into yet.

I am very new to sed and regular expressions, and:
I am not even sure sed is the proper tool for this purpose (stripping source code).
I am also a total Perl newbie (only just started reading the 'Llama Book' AKA < 'Learning Perl, Third Edition' > from O'Reilly).
(I wholeheartedly recommend the 'Llama Book' to anyone who is new to Perl -- It is very well written!)
Speaking as a total Perl newbie:
IMHO, I think someone who is versed in Perl may be able to produce a more ideal solution in Perl.

Code Sample

find ./ -name "*.h" | \
while read i; do \
cpp -P -fpreprocessed "$i" > "$i.tmp" \
&& mv "$i.tmp" "$i"; done
find ./ -name "*.h" | \
while read i; do \
sed -i '/^ *$/d' "$i"; done

This will recursively strip all comments and empty lines from any C headers in the current directory. This would help reduce the size of any '-dev' extensions.
Any improvements are most welcome!
Thanks.

Explanation:

Code Sample

cpp -P -fpreprocessed "$i" > "$i.tmp" && mv "$i.tmp" "$i"

1. Process all '.h' files with 'cpp' and overwrite original files with processed files.

Code Sample

sed -i '/^ *$/d' "$i"

2. '/^ *$/d' -- Remove lines that consist of zero or more spaces and nothing else (i.e. empty lines and lines that contain only spaces). Lines that contain a TAB are not matched.
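
Because the pattern above only knows about spaces, a line holding a single TAB survives it. Here is a minimal sketch of a variant that uses the POSIX `[[:space:]]` class instead. The name `strip_blank_lines` is made up for illustration, and it filters stdin rather than editing files in place:

```shell
# Hypothetical helper: delete lines that are empty or contain only
# whitespace (spaces, TABs, and similar). Reads stdin, writes stdout.
strip_blank_lines() {
    sed '/^[[:space:]]*$/d'
}

printf 'int a;\n\t\n   \n\nint b;\n' | strip_blank_lines
# prints:
# int a;
# int b;
```

Filtering stdin keeps the sketch harmless to try; adding `-i` and a file name would bring back the in-place behaviour of the original.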

Possibly useful also:
1. 'whitespace_stripper.sh'

Code Sample

for i in "$@";
do sed -i '/^ *$/d' "$i"; done

e.g. 'whitespace_stripper.sh file1 file2 file3 file4 [...]'
2. 'stripsh' (strip shell scripts)

Code Sample

for i in "$@";
do sed -i -e 's/\t*#.*//g' \
-e 's/\ *#.*//g' \
-e '/^#/d' \
-e '/^\t*#/d' \
-e '/^ *#/d' \
-e '/^ *$/d' \
"$i"; done

e.g. 'stripsh file1 file2 file3 file4 [...]'
NOTE: This script has a problem -- It deletes the first line of any script; for example:

Code Sample

#!/bin/sh

You have to put the line back again after running the script.
Question for everyone: How do we make 'sed' ignore lines that begin with "#!"?
Thank you very much!

Explanation of above script:
's/\t*#.*//g' -- (Substitution) Pattern: Begins with any number of TABs ("\t"), followed by a "#", followed by any number of any character (".*"). Replacement: Null (i.e. deletes the relevant part of matching lines -- does NOT mean deleting the entire line).

'/^\t*#/d' -- (Deletion) Delete lines that: begin with any number of TABs, followed by a "#" (i.e. bash and perl comments).

'/^ *#/d' -- (Deletion) Delete lines that: begin with any number of SPACEs, followed by a "#" (i.e. bash and perl comments).

'/^ *$/d' -- (Deletion) Delete lines that: begin with any number of SPACEs, followed by an end-of-line ("$") -- i.e. delete empty lines.
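
To make the substitution concrete, here is a small sketch that runs only the first expression on two sample lines. The sample lines and the name `inline_strip_demo` are invented for illustration. The output shows that the pattern also fires on a "#" that is not a comment at all, which is one reason not to trust the script on real files:

```shell
# Run only the first substitution from 'stripsh' on stdin.
# Everything from the first "#" on each line is removed, even when
# that "#" is part of "$#" or sits inside a quoted string.
inline_strip_demo() {
    sed -e 's/\t*#.*//g'
}

printf '%s\n' 'count=$# # number of arguments' 'echo "item#1"' | inline_strip_demo
# prints:
# count=$
# echo "item
```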

Posted by curaga on Dec. 07 2007,18:49

Well, GNU find uses the current dir as a default, so the first command could be just

find -name "*.h" |

Nice anyway

How about adding a help function, and then calling this as remove-comments.sh?

There's also this issue: if the file has bash-style comments, aka lines starting with #, cpp will bail out and the .tmp file will be left there..

Here's my go:

Quote

#!/bin/sh
ext=h
case $1 in
-h* | --h*) cat << EOF
Use $0 in the top directory of your sources to strip all headers of comments.

$0 -ext sh operates on .sh files instead of .h
EOF
exit 0;;
-ext) ext=$2;; esac

find -name "*.$ext" | \
while read i; do \
cpp -P -fpreprocessed "$i" > "$i.tmp"
mv "$i.tmp" "$i"; done
find -name "*.$ext" | \
while read i; do \
sed -i -e '/^ *$/d' -e '/^# /d' "$i"; done
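
One way to avoid the leftover .tmp file mentioned above is to replace the original only when the filter exits successfully, and to clean up otherwise. This is a rough sketch, not a tested replacement for the script: `rewrite_in_place` is a hypothetical helper name, and it assumes the filter takes the file name as its last argument and writes its result to stdout.

```shell
# Hypothetical helper: run a filter command on FILE and replace FILE
# only if the filter succeeds; otherwise remove the temporary file.
# Usage: rewrite_in_place FILE COMMAND [ARGS...]
rewrite_in_place() {
    file=$1
    shift
    tmp="$file.tmp"
    if "$@" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        echo "skipped: $file" >&2
        return 1
    fi
}

# Possible use with the loop from this thread (not run here):
#   find . -name "*.h" | while IFS= read -r f; do
#       rewrite_in_place "$f" cpp -P -fpreprocessed
#   done
```

A file that the filter cannot handle is then left exactly as it was, and its name is reported on stderr.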

Posted by mikshaw on Dec. 08 2007,01:54

I don't really see a whole lot of use for removing comments from headers, unless you see headers as being nothing more than dependencies for compiling. If you are a software developer, those comments can be very useful.

The final product can easily be reduced with the strip command.

Posted by jpeters on Dec. 08 2007,04:00

One difference, stupid's version doesn't remove lines beginning with '#!' (maintaining the space after # request).

Edit: I guess it needs line 14 joined to line 15 "\ && ", then it works. (as long as everyone places spaces in their comments after the '#')

Interesting code (although I'd agree with mikshaw in favor of comments and spaces)

Posted by stupid_idiot on Dec. 08 2007,06:56

Edited original post:
Changed 'cpp' command to

Code Sample

cpp -P -fpreprocessed

cpp manual: "-P -- Inhibit generation of linemarkers in the output from the preprocessor."

Posted by stupid_idiot on Dec. 08 2007,07:30

Quote (mikshaw @ Dec. 08 2007,04:54)

The final product can easily be reduced with the strip command.

Yes, but:
Actually, I was thinking about extensions like 'gcc1-with-libs.unc' or any '-dev.{dsl|unc}' extension. Basically, any extension that contains a large amount of header files can be drastically reduced in size.
For example, 'libwxgtk1-dev.uci' currently in 'mydsl/testing' is 804K. By stripping comments from all headers and scripts, the size is 368K.

Posted by curaga on Dec. 08 2007,08:38

Firefox and other programs which use XML can also benefit greatly

Posted by mikshaw on Dec. 08 2007,13:51

Quote

Actually, I was thinking about extensions like 'gcc1-with-libs.unc' or any '-dev.{dsl|unc}' extension. Basically, any extension that contains a large amount of header files can be drastically reduced in size.

Yes, I know that's exactly what you meant, as did I. Those headers are used to build software beyond just "configure, make, make install", and often those comments are the *only* documentation available on how to use them. I guess I exaggerated that particular issue, though. A person wanting to know what those comments are could just as easily find a full copy of a particular header online. The savings in filesize are well worth the effort.

HOWEVER,
You will need to make sure that your script doesn't strip out copyright notices and license texts if you plan to distribute a stripped header package. That in itself may be your biggest challenge, considering that a notice typically looks like any other comment and is often found in individual headers rather than just in a README file.

Posted by john.martzouco on Dec. 08 2007,16:43

Is this only to save space for the compressed tarballs?... or to pack more in the RamDisk when running the LiveCD version?

Is there such a thing as a compressed folder (a la NTFS) that could be used for savings in RamDisk?

If it's for the tarball sizes (uci sizes), how much difference does it make after compression? An unstripped xyz is how much bigger than a stripped xyz after it's compressed?

Posted by stupid_idiot on Dec. 08 2007,17:07

Quote (mikshaw @ Dec. 08 2007,16:51)

HOWEVER,
You will need to make sure that your script doesn't strip out copyright notices and license texts if you plan to distribute a stripped header package. That in itself may be your biggest challenge, considering that a notice typically looks like any other comment and is often found in individual headers rather than just in a README file.

Yes, I agree. That does seem to be a requirement of distributing any source code.
To quote section 4 of the < GNU General Public License (GPL) >, version 3:

Quote

4. Conveying Verbatim Copies.

You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.

However, hypothetical question: Can the term "copy" above refer to, in our case, the MyDSL extension as a whole, or does it refer to each and every file in the source code?
To answer my own question: "keep intact all notices" seems to mean that all copyright headings must be kept as they are.
I have a wishful thought: Can we reduce this requirement

Quote

keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty

to a single notice (e.g. files called 'LICENSE' and 'WARRANTY') and remove the notices from the source code?
Yes, I know -- that would contradict "keep intact all", so I cannot do that. Just a wishful thought, though.

Posted by curaga on Dec. 08 2007,17:21

Quote

Is this only to save space for the compressed tarballs?... or to pack more in the RamDisk when running the LiveCD version?

Both..

Quote

Is there such a thing as a compressed folder (a la NTFS) that could be used for savings in RamDisk?

Nope, not with vanilla ext2.. There is a gzip compression patch, but that would use too much computing power to have in DSL.

Quote

If it's for the tarball sizes (uci sizes), how much difference does it make after compression? An unstripped xyz is how much bigger than a stripped xyz after it's compressed?

Unstripped apps are generally 2-5 times larger than stripped (for example: bash 3.2 not stripped 1.5M, stripped 480kb)
Of course the difference isn't as big compressed..

Posted by stupid_idiot on Dec. 08 2007,17:41

Quote (john.martzouco @ Dec. 08 2007,19:43)

If it's for the tarball sizes (uci sizes), how much difference does it make after compression? An unstripped xyz is how much bigger than a stripped xyz after it's compressed?

For example (based on what I am working on):
(1)
gtk2-dev.unc (unstripped) -- Size: 516K
gtk2-dev.unc (stripped) -- Size: 240K
Contents and configuration:
atk-1.9.1: < Config > [pastebin.com]
cairo-1.4.12: < Config > [pastebin.com]
glib-2.12.13: < Config > [pastebin.com]
gtk+-2.10.14: < Config > [pastebin.com]
pango-1.12.4: < Config > [pastebin.com]
Source code:
< atk-1.9.1.tar.gz > [ftp.gnome.org]
< cairo-1.4.12.tar.gz > [cairographics.org]
< glib-2.12.13.tar.gz > [ftp.gnome.org]
< gtk+-2.10.14.tar.gz > [ftp.gnome.org]
< pango-1.12.4.tar.gz > [ftp.gnome.org]
What was stripped:
- C headers ('.h'). [comments]
- m4 macros used by aclocal ('.m4') [comments]
- libtool library files ('.la') [comments]
- bash and perl scripts [comments]
- pkgconfig files ('.pc') [empty lines]

(2)
libwxgtk2-dev.unc (unstripped) -- Size: 780K
libwxgtk2-dev.unc (stripped) -- Size: 368K
Contents and configuration:
wxGTK-2.6.4: < Config > [pastebin.com]
Source code:
< wxGTK-2.6.4.tar.gz > [downloads.sourceforge.net]
What was stripped:
- C++ headers ('.h') [comments]
- m4 macros used by aclocal ('.m4') [comments]
- bash scripts [comments]
- XML files ('/usr/local/share/bakefile/presets/*.bkl') [XML-style comments]

Posted by john.martzouco on Dec. 08 2007,17:53

Quote (stupid_idiot @ Dec. 08 2007,12:41)

Quote (john.martzouco @ Dec. 08 2007,19:43)

If it's for the tarball sizes (uci sizes), how much difference does it make after compression? An unstripped xyz is how much bigger than a stripped xyz after it's compressed?

For example (based on what I am working on):
(1)
gtk2-dev.unc (unstripped) -- Size: 516K
gtk2-dev.unc (stripped) -- Size: 240K

These are all uncompressed numbers, yes?

Posted by stupid_idiot on Dec. 08 2007,18:01

Sorry John, I neglected to mention those are COMPRESSED numbers.
Uncompressed sizes:
(1)
(directory) gtk2-dev/ (unstripped) -- Size: 3612K / 3.6M
(directory) gtk2-dev/ (stripped) -- Size: 2548K / 2.5M
(2)
(directory) libwxgtk2-dev/ (unstripped) -- Size: 4428K / 4.4M
(directory) libwxgtk2-dev/ (stripped) -- Size: 3060K / 3.0M

This command was used to make the '.unc' extensions:

Code Sample

mkisofs -R -hide-rr-moved -cache-inodes -pad <directory>/ | create_compressed_fs -m -B65536 -L9 -v - <foo>.unc

(BTW, I use 'miso' as a bash alias for 'mkisofs -R -hide-rr-moved -cache-inodes -pad'.)

Posted by mikshaw on Dec. 09 2007,14:37

Quote

An unstripped xyz is how much bigger than a stripped xyz after it's compressed?

There's really no way to give you even an approximate reduction amount for "xyz", since it depends mostly on the amount of commenting the author did. All that can be done is give examples of what the reduction is for a specific package, as stupid_idiot did with gtk
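
For anyone who wants a number for their own package, a rough sketch is to pack the stripped and the unstripped tree the same way and compare byte counts. The figures in this thread came from cloop images made with create_compressed_fs; the sketch below uses tar and gzip as a stand-in, so the absolute sizes will differ. `compressed_size` is a hypothetical helper name.

```shell
# Hypothetical helper: print the gzip-compressed size, in bytes, of a
# directory tree, as a rough stand-in for the size of a packed extension.
compressed_size() {
    tar -cf - -C "$1" . | gzip -9 | wc -c | tr -d ' '
}

# Possible use (directory names are placeholders):
#   compressed_size gtk2-dev-unstripped
#   compressed_size gtk2-dev-stripped
```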

Posted by john.martzouco on Dec. 09 2007,14:52

My experience with zip compression (on Windows) has been that it is incredibly good with ASCII text files. I'll pull together some numbers when I get the chance, but we're talking high magnitudes. Of course, compression on binaries is almost ineffective because the binaries don't have many repeated patterns in them, so the algorithm cannot replace them with shorter placeholders.

A huge ascii file with repeated patterns in it should compress very highly, and I don't think that adding comments should have a high impact on that. What is the best compression tool that can be used on Linux?

Posted by curaga on Dec. 09 2007,17:08

lzma.. Then there's bzip2, then gzip, then zip, then compress.

Posted by stupid_idiot on Dec. 11 2007,15:36

To everybody:
This method is presently unusable, because 'cpp' strips all lines that begin with directives -- for example: '#define' and '#undef'.
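
Since 'cpp' drops the directives, one alternative is to leave the preprocessor out and use a small state machine that only knows about comments, string literals and character literals. The sketch below is a different technique from the 'cpp' approach in this thread, not a fix for it. The name `strip_c_comments` is hypothetical, and the assumptions are listed in the code. It does nothing about keeping copyright notices.

```shell
# Hypothetical helper: strip /* ... */ and // comments from C source on
# stdin while leaving preprocessor directives and string literals alone.
# Assumptions: no trigraphs, no raw strings, no backslash-continued
# // comments; string and character literals do not span lines.
strip_c_comments() {
    awk '
    BEGIN { state = "code" }
    {
        out = ""
        n = length($0)
        i = 1
        while (i <= n) {
            c = substr($0, i, 1)
            d = substr($0, i, 2)
            if (state == "block") {             # inside a block comment
                if (d == "*/") { state = "code"; out = out " "; i += 2 } else { i++ }
            } else if (state == "code") {
                if (d == "/*") { state = "block"; i += 2; continue }
                if (d == "//") { break }        # rest of line is a comment
                if (c == "\"") { state = "str" } else if (c == "\047") { state = "chr" }
                out = out c; i++
            } else {                            # inside a string or char literal
                if (c == "\\") { out = out d; i += 2; continue }
                if ((state == "str" && c == "\"") || (state == "chr" && c == "\047")) {
                    state = "code"
                }
                out = out c; i++
            }
        }
        if (state != "block") { state = "code" }  # literals end with the line
        sub(/[ \t]+$/, "", out)                   # trim trailing whitespace
        if (out != "") { print out }              # also drops blank lines
    }'
}
```

A line such as `#define X 1 /* one */` comes out as `#define X 1`, because the directive itself is never interpreted.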

Posted by stupid_idiot on Dec. 12 2007,00:30

-- work in progress --

Posted by WDef on Dec. 18 2007,15:13

Hi s_i,

I could be missing something but I don't quite follow why you want to do this sort of thing to uci/unc files at all. They don't use ramdisk. In particular, running scripts to prune header files seems risky to me. You only have to inadvertently bork one character to break the header for some build.

Personally I think it's better to leave all files in these two extension types alone. I don't prune them at all, and like to be able to find the readmes etc in these extensions and often refer to them. It only means a bigger download. Leaving the files in place also can provide dependency headers and libs ready to use for compiling an upgrade, and might provide useful evidence about the source of problems with an extension.

And I'm not sure I trust stripped binaries unless the build does it for you anyway, but maybe that's not entirely rational ...? I suppose a stripped binary may have a smaller footprint once loaded into memory.

Pruning is a good idea for .dsl extensions, so you can apply all of these techniques to that extension type.

I can see there is a type of aesthetic pleasure in getting a package size down for its own sake though?

Posted by chaostic on Jan. 08 2008,06:55

A good way to skip #! statements in sed is to not skip them. In fact, change them twice. Add:

Code Sample

-e /'#!'/s//"STRING THAT WON'T MATCH OTHER SED STATEMENTS"/

before any other sed statements/commands, then add:

Code Sample

-e /"STRING THAT WON'T MATCH OTHER SED STATEMENTS"/s//'#!'/

The sed "b" or Branch command would work too, but that can only be used in sed scripts/commandlists (I think).
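
As far as I can tell, the branch command also works from the command line when it is given as its own `-e` expression. A `b` with no label jumps to the end of the sed script, so the later delete commands never see the "#!" line. A minimal sketch, with the hypothetical name `strip_sh_keep_shebang`; it only removes whole-line comments and blank lines, and it keeps any line starting with "#!", not just the first one:

```shell
# Hypothetical helper: remove whole-line "#" comments and blank lines
# from a shell script on stdin, but leave "#!" lines untouched.
strip_sh_keep_shebang() {
    sed -e '/^#!/b' \
        -e '/^[[:space:]]*#/d' \
        -e '/^[[:space:]]*$/d'
}

printf '#!/bin/sh\n# a comment\n\n  # indented comment\necho hi\n' | strip_sh_keep_shebang
# prints:
# #!/bin/sh
# echo hi
```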

Posted by stupid_idiot on Jan. 08 2008,12:24

Hi chaostic:
Thanks a lot for the very helpful information.

Posted by stupid_idiot on Jan. 08 2008,13:20

Quote (WDef @ Dec. 18 2007,18:13)

Hi WDef:
My apologies -- I totally missed your post and didn't read it until just now.

Firstly, thanks for the very well-thought-out reply.
I agree with everything you said, especially what you said concerning README files. I mean, most of the time, no one reads them, except in those situations where you really can't figure out how to use the software, which is when they can really save you a lot of time. Also, I just thought of an important factor: People who are on dial-up are not so keen to hunt for docs online.
So, I think I will try to put README files in extensions.

Also: Yes, I agree that there is an aesthetic pleasure solely in reducing the package size. But the real Big Idea is to have the whole distro working altogether, that is to say, all the little pieces working together.

Posted by WDef on Jan. 08 2008,19:35

Go for it stupid_idiot! (I still feel a little strange typing your nick). As it happens I've just stripped gnupg2 in a version update to gnupg2.uci, and halved its size. But I'm still leaving the dependencies and headers etc in there intact; these made it so very easy to compile and build the update (5 mins work max). And also because gnupg2 is still somewhat of a mystery ...