Recursive-strip C comments
Forum: DSL Tips and Tricks
Topic: Recursive-strip C comments
started by: stupid_idiot
Posted by stupid_idiot on Dec. 07 2007,12:41
Big Fat Warning:
Hi all,
None of the methods described here actually work. I haven't found any workable solution yet. Everything I've posted is just a preliminary attempt. Please do not use ANY of these examples in their unmodified form on any source code!!!
Problems:
-- No reliable way to preserve directives ("#define", "#ifdef", "#endif", etc).
-- "Missing semicolon" compilation errors after deleting comment blocks (the comment blocks were playing the role of semicolons by signifying the end of a function).
-- Most probably, there are many, many more problems which I haven't run into yet.
I am very new to sed and regular expressions, and I am not even sure sed is the proper tool for this purpose (stripping source code). I am also a total Perl newbie (only just started reading the 'Llama Book', AKA 'Learning Perl, Third Edition' from O'Reilly). (I wholeheartedly recommend the 'Llama Book' to anyone who is new to Perl -- it is very well written!) Speaking as a total Perl newbie: IMHO, someone who is versed in Perl may be able to produce a better solution in Perl.
Any improvements are most welcome! Thanks. Explanation:
Possibly useful also: 1. 'whitespace_stripper.sh'
2. 'stripsh' (strip shell scripts)
NOTE: This script has a problem -- It deletes the first line of any script; for example:
Question for everyone: How do we make 'sed' ignore lines that begin with "#!"? Thank you very much!
Explanation of above script:
's/\t*#.*//g' -- (Substitution) Pattern: any number of TABs ("\t"), followed by a "#", followed by any number of any character (".*"). Replacement: null (i.e. deletes the matching part of the line -- does NOT mean deleting the entire line).
'/^\t*#/d' -- (Deletion) Delete lines that begin with any number of TABs, followed by a "#" (i.e. bash and perl comments).
'/^ *#/d' -- (Deletion) Delete lines that begin with any number of SPACEs, followed by a "#" (i.e. bash and perl comments).
'/^ *$/d' -- (Deletion) Delete lines that contain nothing but SPACEs before the end-of-line ("$") -- i.e. delete empty lines.
Posted by curaga on Dec. 07 2007,18:49
Well, find uses the current dir as a default, so the first command could be just: find -name "*.h" |
Nice anyway. How about adding a help function, and then calling this remove-comments.sh? There's also this issue: if the file has bash-style comments, i.e. lines starting with #, cpp will bail out and the .tmp file will be left behind. Here's my go:
Posted by mikshaw on Dec. 08 2007,01:54
I don't really see a whole lot of use for removing comments from headers, unless you see headers as being nothing more than dependencies for compiling. If you are a software developer, those comments can be very useful. The final product can easily be reduced with the strip command.
Posted by jpeters on Dec. 08 2007,04:00
One difference: stupid's version doesn't remove lines beginning with '#!' (maintaining the space after # request).
Edit: I guess it needs line 14 joined to line 15 ("\ && "), then it works (as long as everyone places spaces in their comments after the '#').
Interesting code (although I'd agree with mikshaw in favor of comments and spaces).
Posted by stupid_idiot on Dec. 08 2007,06:56
Edited original post: Changed 'cpp' command to
Posted by stupid_idiot on Dec. 08 2007,07:30
Actually, I was thinking about extensions like 'gcc1-with-libs.unc' or any '-dev.{dsl|unc}' extension. Basically, any extension that contains a large amount of header files can be drastically reduced in size. For example, 'libwxgtk1-dev.uci' currently in 'mydsl/testing' is 804K. By stripping comments from all headers and scripts, the size is 368K.
Posted by curaga on Dec. 08 2007,08:38
Firefox and other programs which use XML can also benefit greatly
Posted by mikshaw on Dec. 08 2007,13:51
HOWEVER, you will need to make sure that your script doesn't strip out copyright notices and license texts if you plan to distribute a stripped header package. That in itself may be your biggest challenge, considering a notice typically looks like any other comment and is often found in individual headers rather than just a README file.
Posted by john.martzouco on Dec. 08 2007,16:43
Is this only to save space for the compressed tarballs?... or to pack more in the RamDisk when running the LiveCD version?
Is there such a thing as a compressed folder (a la NTFS) that could be used for savings in RamDisk?
If it's for the tarball sizes (uci sizes), how much difference does it make after compression? An unstripped xyz is how much bigger than a stripped xyz after it's compressed?
Posted by stupid_idiot on Dec. 08 2007,17:07
To quote a section of the GNU General Public License (GPL):
To answer my own question: "keep intact all notices" seems to mean that all copyright headings must be kept as they are. I have a wishful thought: Can we reduce this requirement
Yes, I know -- that would contradict "keep intact all", so I cannot do that. Just a wishful thought, though.
Posted by curaga on Dec. 08 2007,17:21
Of course the difference isn't as big compressed.
Posted by stupid_idiot on Dec. 08 2007,17:41
(1)
gtk2-dev.unc (unstripped) -- Size: 516K
gtk2-dev.unc (stripped) -- Size: 240K
Contents and configuration:
- atk-1.9.1: Config [pastebin.com]
- cairo-1.4.12: Config [pastebin.com]
- glib-2.12.13: Config [pastebin.com]
- gtk+-2.10.14: Config [pastebin.com]
- pango-1.12.4: Config [pastebin.com]
Source code:
- atk-1.9.1.tar.gz [ftp.gnome.org]
- cairo-1.4.12.tar.gz [cairographics.org]
- glib-2.12.13.tar.gz [ftp.gnome.org]
- gtk+-2.10.14.tar.gz [ftp.gnome.org]
- pango-1.12.4.tar.gz [ftp.gnome.org]
What was stripped:
- C headers ('.h') [comments]
- m4 macros used by aclocal ('.m4') [comments]
- libtool library files ('.la') [comments]
- bash and perl scripts [comments]
- pkgconfig files ('.pc') [empty lines]
(2)
libwxgtk2-dev.unc (unstripped) -- Size: 780K
libwxgtk2-dev.unc (stripped) -- Size: 368K
Contents and configuration:
- wxGTK-2.6.4: Config [pastebin.com]
Source code:
- wxGTK-2.6.4.tar.gz [downloads.sourceforge.net]
What was stripped:
- C++ headers ('.h') [comments]
- m4 macros used by aclocal ('.m4') [comments]
- bash scripts [comments]
- XML files ('/usr/local/share/bakefile/presets/*.bkl') [XML-style comments]
Posted by john.martzouco on Dec. 08 2007,17:53
These are all uncompressed numbers, yes?
Posted by stupid_idiot on Dec. 08 2007,18:01
Sorry John, I neglected to mention those are COMPRESSED numbers.
Uncompressed sizes:
(1)
(directory) gtk2-dev/ (unstripped) -- Size: 3612K / 3.6M
(directory) gtk2-dev/ (stripped) -- Size: 2548K / 2.5M
(2)
(directory) libwxgtk2-dev/ (unstripped) -- Size: 4428K / 4.4M
(directory) libwxgtk2-dev/ (stripped) -- Size: 3060K / 3.0M
This command was used to make the '.unc' extensions:
Posted by mikshaw on Dec. 09 2007,14:37
Posted by john.martzouco on Dec. 09 2007,14:52
My experience with zip compression (on Windows) has been that it is incredibly good with ASCII text files. I'll pull together some numbers when I get the chance, but we're talking high magnitudes. Of course, compression on binaries is almost ineffective, because binaries don't have many repeated patterns in them, so there is little for the algorithm to replace with shorter placeholders.
A huge ASCII file with repeated patterns in it should compress very highly, and I don't think that adding comments should have a high impact on that.
What is the best compression tool that can be used on Linux?
Posted by curaga on Dec. 09 2007,17:08
lzma.. Then there's bzip2, then gzip, then zip, then compress.
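To put numbers on the compressed-versus-uncompressed question for a particular file, something along these lines could be used. This is only a minimal sketch: the function name `compressed_size` is made up for illustration, and it assumes gzip is installed (lzma or bzip2 could be swapped in the same way).

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the raw size and the
# gzip-compressed size of a file, in bytes, as "RAW GZIP".
# Assumption: gzip is available on the PATH.
compressed_size() {
    raw=$(wc -c < "$1")
    gz=$(gzip -c "$1" | wc -c)
    # Arithmetic expansion strips the padding some wc implementations add.
    echo "$((raw)) $((gz))"
}
```

Running it on both the unstripped and the stripped copy of the same header gives the two pairs of numbers to compare.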
Posted by stupid_idiot on Dec. 11 2007,15:36
To everybody:
This method is presently unusable, because 'cpp' strips all lines that begin with directives -- for example: '#define' and '#undef'.
Posted by stupid_idiot on Dec. 12 2007,00:30
-- work in progress --
Posted by WDef on Dec. 18 2007,15:13
Hi s_i,
I could be missing something but I don't quite follow why you want to do this sort of thing to uci/unc files at all. They don't use ramdisk. In particular, running scripts to prune header files seems risky to me. You only have to inadvertently bork one character to break the header for some build.
Personally I think it's better to leave all files in these two extension types alone. I don't prune them at all, and like to be able to find the readmes etc in these extensions and often refer to them. It only means a bigger download. Leaving the files in place also can provide dependency headers and libs ready to use for compiling an upgrade, and might provide useful evidence about the source of problems with an extension.
And I'm not sure I trust stripped binaries unless the build does it for you anyway, but maybe that's not entirely rational ...? I suppose a stripped binary may have a smaller footprint once loaded into memory.
Pruning is a good idea for .dsl extensions, so you can apply all of these techniques to that extension type. I can see there is a type of aesthetic pleasure in getting a package size down for its own sake though?
Posted by chaostic on Jan. 08 2008,06:55
A good way to skip #! statements in sed is to not skip them. In fact, change them twice. Add:
before any other sed statements/commands, then add:
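A minimal sketch of this "change them twice" idea, for illustration only. It is not the code originally posted here: the placeholder token `@@SHEBANG@@` and the function name `strip_sh_comments` are made up, and the comment-stripping expressions are a simplified stand-in that uses POSIX `[[:space:]]` instead of "\t". Like everything else in this thread, it can still mangle a "#" that sits inside a quoted string, so treat it as a starting point rather than a safe tool.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): strip '#' comments from a shell
# script read on stdin while keeping the '#!' interpreter line.
strip_sh_comments() {
    # 1. Hide the '#!' on line 1 behind a placeholder (assumed token).
    # 2. Delete whole-line comments, with or without leading whitespace.
    # 3. Cut trailing comments that are preceded by whitespace.
    # 4. Delete lines that are now empty or whitespace-only.
    # 5. Turn the placeholder back into '#!'.
    sed -e '1s/^#!/@@SHEBANG@@/' \
        -e '/^[[:space:]]*#/d' \
        -e 's/[[:space:]]\{1,\}#.*//' \
        -e '/^[[:space:]]*$/d' \
        -e '1s/^@@SHEBANG@@/#!/'
}
```

It would be used as a filter, for example `strip_sh_comments < script.sh > script.stripped.sh`. Restricting both substitutions to line 1 keeps a "#!" that appears later in the file from being treated as an interpreter line.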
The sed "b" or Branch command would work too, but that can only be used in sed scripts/commandlists (I think).
Posted by stupid_idiot on Jan. 08 2008,12:24
Hi chaostic:
Thanks a lot for the very helpful information.
Posted by stupid_idiot on Jan. 08 2008,13:20
Hi WDef:
My apologies -- I totally missed your post and didn't read it until just now. Firstly, thanks for the very well-thought-out reply. I agree with everything you said, especially what you said concerning README files. I mean, most of the time, no one reads them, except in those situations where you really can't figure out how to use the software, which is when they can really save you a lot of time. Also, I just thought of an important factor: People who are on dial-up are not so keen to hunt for docs online. So, I think I will try to put README files in extensions.
Also: Yes, I agree that there is an aesthetic pleasure solely in reducing the package size. But the real Big Idea is to have the whole distro working as one, that is to say, all the little pieces working together.
Posted by WDef on Jan. 08 2008,19:35
Go for it stupid_idiot! (I still feel a little strange typing your nick). As it happens I've just stripped gnupg2 in a version update to gnupg2.uci, and halved its size. But I'm still leaving the dependencies and headers etc in there intact; these made it so very easy to compile and build the update (5 mins work max). And also because gnupg2 is still somewhat of a mystery ...