Warren Falk

Tip: make checkinstall

July 19th, 2009 by Warren Falk

I’ve been creating several python scripts lately for various things, and I’m installing them all right in /usr. (This is ubuntu, btw).

To keep my system clean, I’ve been creating debian packages with checkinstall and installing those. To make that easier, I’ve come up with a bit of a system. I create a make file with an install target, and then add a “checkinstall” target which runs the checkinstall command for me. This works pretty well

In addition to this, I’ve added the ability to get the subversion working copy revision and use it for the pkg version.

I put something like the following at the beginning of Makefile:

REVISION=`svn info | python -c "import re, sys;\
print [m.group(1) for m in [re.match('^Revision: ([0-9]+)', line) for line in sys.__stdin__] if m][0]"`

and then use $(REVISION) inside Makefile wherever I need the subversion revision number.

So I have something like the following, later on:

checkinstall:
    checkinstall --pkgversion r$(REVISION)\
     --pkgname trivial-backup\
     --maintainer warren@warrenfalk.com

buy adobe acrobat 9 pro extended
order microsoft office access 2007
OEM Microsoft Office Outlook 2007
Download Adobe Photoshop CS4 Extended
Buy Windows XP Professional sp3
Download Microsoft Office 2010 Professional
Download Windows 7 Professional
Top Soft

Photo sorter

July 15th, 2009 by Warren Falk

I finished an EXIF-based image sorting script (in python). It’s rather trivial. It takes two folders as parameters. It reads all images from the first folder and moves them to the second folder, sorted into folders by year and then month, and then renamed all according to the date the picture was taken (from the EXIF data).

It handles things like discarding exact duplicate images (by comparing checksums) and handling images with the same time stamped on them, but that are different pictures.

It’s really nothing special, but I’m including it here for anyone looking for an example of anything like this.

Lynn downloads our pictures from the camera to our image server, then runs a script (via a shortcut on her desktop that runs an ssh command) which does the sorting.

Read the rest of this entry »

PicsFS: Fuse Filesystem

July 9th, 2009 by Warren Falk

I keep our family pictures on a linux server and make them available via samba share.  I have a script that automatically sorts the pictures by date taken, via exif.  This system works quite well.  Lynn manages the family photos, and blogging, mostly, and to protect her from herself, once the pictures are sorted, permissions prevent her from deleting them by accident.  This has proved to be a good move on a couple of occasions.  Unfortunately, it also means that no Thumbs.db file is created (she uses Windows, a habit she isn’t willing to break).  This means that she has to wait for thumbnail creation over the network every visit to a folder.

There are better ways of solving this trivial problem, probably, and versions of Windows after XP don’t use the retarded thumbs.db method, so this problem will solve itself eventually when she’s upgraded (or when I break her of her windows habit).

I chose to create a user mode filesystem, because this seemed like a good way to cut my teeth on the process.  I used python-fuse, and created my “picfs” file system which simply lets you mirror another filesystem but with the added ability to specify a “thumbnails” folder.  Any file named “Thumbs.db” is stored in that folder, instead of the shared folder.  So I bind a copy of our picture share in her home directory, and specify a separate folder for her thumbnails.  Permissions prevent her from modifying the pictures or the folders they’re in, but picfs makes it look like the folder is writable.  (This is necessary because Windows won’t even attempt it if it thinks it’s read-only).  The only write that succeeds is the Thumbs.db file, and it is stored in a separate Lynn-only folder.

The source can be downloaded here: http://www.warrenfalk.com/anonrepos/picfs/trunk.  Seeing how the code works may be more useful than the code.  I’m posting it here for that reason.

(by the way, for those who want Thumbs.db on a samba share, I found the following smb.conf settings are ideal for this sort of thing:

 nt acl support = no
 writable = yes
 hide files = /.*/Desktop.ini/desktop.ini/RECYCLER/thumbs.db/Thumbs.db/
 hide unreadable = yes

)

Idea: Flow Control Functions

May 13th, 2009 by Warren Falk

Background: Flow Control Syntax

Your average conventional programming language can be logically divided into two parts, (1) the metadata part and (2) the instruction part.  These are usually mixed together in one unifying syntax, but it is easy to logically separate the two in your mind.  The instruction part is the body of the functions.  The metadata is everything else (or, well, usually.  C#, for example, lets you put expressions into initializers which is technically instructional syntax, albeit crippled syntactically – for the purposes of this post, this is irrelevant so we’ll ignore it).  We don’t care about the metadata for this post, just the instructional code.

You can then further logically divide the instructional syntax into the (1) flow control and (2) statements.  The statements are the expressions (including math like “a + b * 3” and function calls like “foo()“) and assignments (such as “variable = 0“).  This leaves the flow control statements, for, foreach, do, while, if/else, try/catch, etc…  Not every language names them identically, and not every language has every construct, but every language has flow control syntax.  Flow control is what this post is about.

Traditionally, these statements are built into the language at the language level and are available even if the standard libraries are unavailable.  There are some exceptions, such as in more functional languages like Lisp, and even Ruby.

This is precisely the basis for this post.  I want to explore what it would take to build all of these into the standard library, while simultaneously making it possible for developers to create their own.  What are the pros, and what are the cons?

“Functionizing” Control Flow

Ruby allows something like this.  What you might do with a for loop in one of the traditional languages, in Ruby, you do with a function that takes another function as a callback.  So instead of:

for (i = 0; i < 4; i++) { do(stuff); }

in ruby, you use the “times” method of integers:

4.times { do stuff }

There is a price to be paid in performance for this ability, in Ruby.  That block of code is an anonymous method in Ruby and requires a managed object which has to be allocated on the heap, then later collected for every usage.  The for loop, above is all inline.  The block of code that makes up the body of the loop is not used anywhere else, so there’s no point in setting up an object on the heap etc.  This makes the for loop above much faster than the Ruby alternative.

A smart enough compiler, on the other hand, could give you the convenience of the function-based control flow with the native speed of the implementations in the conventional procedural languages.

Additionally, I think that with a slick enough language, it should be possible to implement most, if not all, of the control flow models listed before.  (i.e. not just for, and while, but also if/else, and try/catch).  This way, things like while would be implemented as special “control flow” methods included in the standard library.  This detail would be unnoticeable to most developers at first, but it opens up a whole new world in terms of language extensibility.

The while implementation is straightforward (what the syntax should look like, I am not yet sure), but basically it defines a way to test a condition which would act like an input parameter, and then based on that condition, run a block of code, after which it would test the condition again and so on.  It’s simple.  A foreach would work similarly in that it would take a block of code, but instead of a condition would take an enumerable object.  It may also, for example, initialize a variable before every loop to represent the current object.

Those are straightforward ideas, but the best part about this is that it should allow you to invent your own control flows, such as a switch-like statement which worked by hashtable, or something like a custom synchronization lock which allows you to do something within a block of code after which the lock is released (like C#’s “lock” keyword, but customizable).  You could also create a with-like statement that I’ve described in my page about reinventing the development environment.

There are probably other uses for this I haven’t yet thought of.  These are the pros.  There are potential cons, however.  The one I worry about the most is that, as a general rule, if you design a tool for good, bad developers will find a way to use it for evil in ways it was never designed.  (Sometimes they seem to do these things for no other reason than to accomplish tasks in the most obscure way possible).  Extra precaution should be taken to curb that where possible, and also make it intuitive for developers to still be able to read, and navigate someone else’s code.

ANTLR

May 13th, 2009 by Warren Falk

I’m trying to write a parser to recognize Rich Text Format (RTF) and parse it into a meaningful syntax tree.  I’ve decided to use ANTLR which is a tool which is used to write a lexer/parser/tree parser combination.  It is supposed to make this task easier.  So far it has not been.  Either this is because I don’t know how to use the tool, or the tool is just not very useful.  I am still presuming the former, at this point.

ANTLR makes a distinction between “lexing” and “parsing”.  One uses a lexer to break the character stream of a file into “tokens” which are groups of one (actually zero) or more characters.  A parser, then, should be able to take the new stream of tokens and create a tree out of them.  This tree is known as the “abstract syntax tree” or AST.  Now the hard part here, if you’re new to this, as I am, is trying to figure out just what the difference is between a token and a node in the AST, because you have to create “lexer” rules to tease tokens out of the character stream, and then later use “parser” rules to further group/assemble those tokens into nodes in an AST.

The distinction is not obvious, and probably partially subjective, which makes figuring out how to accomplish this quite difficult using the standard try/fail/adjust/retry loop.

For example, here’s a very simple rtf document:

{\rtf\asc\lang1024 {\par This is a test}{\par of the \{emergency\} broadcast system}}

…which would be rendered, roughly, as follows:

This is a test
of the {emergency} broadcast system

Now the “\rtf” and “\lang1024″ are called “controls”.  The “This is a test” and “of the {emergency} broadcast system” are the regular text, and the whol document is in a hierarchy of “groups” delimited by curly braces.  This is a bit of an over simplification, but not by much.  The question is now, when the lexing is complete, where should the lines be drawn between tokens?

Eventually I began to think the tokens should be separated into something like the following:

{, \, rtf, \, asc, \, lang, 1024 , {, \, par, This is a test, }, {, \, par, of the \{emergency\} broadcast system, }

But then I finally got it to work in a reasonably useful manner by breaking it up like this:

{, \rtf, \asc, \lang1024 , {, \par, This is a test, }, {, \par, of the , \{, emergency, \}, broadcast system, }

But this was not obvious, and I’m not even sure whether this is the best way to break this up into tokens, but it ended up working the best.  Some interesting things arose here.  Such as that I expected to break up the “controls” into their constituent parts (the ‘\’ the name and the optional number) but it turns out that the whole control construct works out, nicely, as a single token.  Unfortunately, I’m not quite sure why.  Also, I tried to get the text to be one big token, but had to break it up into strings of text without control sequences and strings of text with control sequences.  I’m not sure why, in this case, either.

Language Support for Concurrency

May 12th, 2009 by Warren Falk

I posted yesterday about an idea for language support for automated thread pool usage.  (I also have published an entire page about reinventing the whole development environment which remains a draft, but is published on the site nevertheless and viewable here.  It doesn’t yet contain my half-hatched thread pool idea, yet).

On the same day I published my post about the thread pool idea, I read on Slashdot that Microsoft have an “incubator” project called, “Axum“, which is their concurrency-friendly language.  I took a quick glance – much too quick to form an opinion.  Their idea is not the same as my thread pool idea, but there are some conceptual overlaps.  Interestingly, the new language does automatically take advantage of Microsoft’s APM (Asynchronous Programming Model) functions, also known as “Begin/End”.  Perhaps worth checking out.

Idea: Automatic Thread Pooling

May 11th, 2009 by Warren Falk

I had an idea for integrating automatic thread pool usage into a programming language.

First, if you don’t understand thread pools and how they are best used, and how they can be superior to manual thread management, then you should read up on how and why thread pools work – but basically keep in mind that thread pools, if used optimally, can get you closer to linear scalability than any other model.

Also, it’s worth reminding yourself that your application spends almost no time on the processor.  Unless you’re doing 3D rendering, or video encoding, the processor probably sees brief flashes of your application separated by what are relatively large stretches (a virtual eternity from a processor’s perspective) where every thread in your application is in a wait state.  Most importantly, this is usually true even during the heaviest use of your application, because the processor is simply not the bottleneck these days (except in the rare cases already noted).

If you aren’t using a thread pool, or you are using a thread pool but doing so suboptimally, your application may be in a wait state when work actually needs to be done.  This is what destroys scalability.  In a thread pooled environment, this happens when all the threads in your pool are stuck in a wait state, waiting for one of the things threads wait for (disk access, network access, synchronization lock, etc).  More advanced pool management can detect this situation and increase the number of available threads so that work can get done while those other threads sleep.  This helps with scalability, but comes with its own expense thereby making the scalability nonlinear (i.e. the processing power you were able to tap didn’t go as far as the processing power you already had available).

This situation is avoidable.  You can avoid it by breaking your application up into segments of code that never ever go into a kernel wait state.  You do that by using the async requests available for those requests.  Instead of calling Read() on a stream, you call BeginRead() at the end of one segment, and then EndRead() at the beginning of another.  When your segment calls BeginRead() and quits, the thread in the thread pool can then move on to the next work item without delay, and when the read completes, the next segment can be queued to handle it.

The problem is that coding like this is difficult.  It’s difficult because breaking up your code into linear functions is natural, but dividing a function into multiple functions along every blocking call is just not intuitive, and it becomes quite difficult to do things like loop.  How do you start a for loop in one function that terminates in another?  You can achieve this result by passing a state variable between calls, but this becomes extremely complicated.

With version 2.0 of C#, we’ve seen the compiler rig up some interesting code for us to support automatic enumerators.  We can create a function that returns IEnumerable<string> and then yield return strings from our procedure as many times as we want.  This is much easier than the alternative which is to create a new class that implements IEnumerable<string>, and another class that implements IEnumerator<string> which has our member variables, and responding to Next() requests instead of just calling yield return.  That’s trouble enough, but try doing a for loop in the middle of that and you’ll see just what the compiler is doing for you automatically when you use yield return.

If the compiler can break up the developer’s enumerating function into a couple of class definitions and multiple functions, then the compiler should be able to do the same thing for our function’s blocking calls.

Here is an example function which takes an array of HTTP URLs, fetches them, and writes them all out to a file, “output.txt”.

void MyFunction(string[] requests)
{
    using (Stream output = File.OpenWrite("output.txt"))
    {
        byte[] buffer = new byte[1024];
        foreach (string request in requests)
        {
            HttpWebRequest hreq = (HttpWebRequest)WebRequest.Create(request);
            HttpWebResponse hresp = (HttpWebResponse)hreq.GetResponse();
            using (Stream input = hresp.GetResponseStream())
            {
                int len;
                while (0 != (len = input.Read(buffer, 0, buffer.Length)))
                {
                    output.Write(buffer, 0, len);
                }
            }
        }
    }
}

I’ve highlighted in red the functions which are blockers.  Calls to these functions will put the thread to sleep for what could be a very long time in processor terms.  For this reason, these functions have asynchronous, Begin/End versions which can be used instead.  However, try to do that with the above code and you are left with the following abomination:

class MyFunctionState
{
    internal Stream output;
    internal Stream input;
    internal byte[] buffer;
    internal string[] requests;
    internal IEnumerator request_enumerator;
    internal HttpWebRequest hreq;
    internal HttpWebResponse hresp;
    internal string request;
    internal int len;
}

void MyFunction(string[] requests)
{
    MyFunctionState state = new MyFunctionState();
    state.requests = requests;
    state.output = File.OpenWrite("output.txt");
    state.buffer = new byte[1024];
    state.request_enumerator = requests.GetEnumerator(); // for loop
    MyFunction_Segment1(state);
}

void MyFunction_Segment1(MyFunctionState state)
{
    if (!state.request_enumerator.MoveNext()) // for loop
    {
        // after for loop
        state.output.Dispose();
        return;
    }
    state.request = (string)state.request_enumerator.Current;
    state.hreq = (HttpWebRequest)WebRequest.Create(state.request);
    state.hreq.BeginGetResponse(new System.AsyncCallback(MyFunction_Segment2), state);
}

void MyFunction_Segment2(IAsyncResult result)
{
    MyFunctionState state = (MyFunctionState)result.AsyncState;
    state.hresp = (HttpWebResponse)state.hreq.EndGetResponse(result);
    state.input = state.hresp.GetResponseStream();
    MyFunction_Segment3(state);
}

void MyFunction_Segment3(MyFunctionState state)
{
    state.input.BeginRead(state.buffer, 0, state.buffer.Length, new AsyncCallback(MyFunction_Segment4), state);
}

void MyFunction_Segment4(IAsyncResult result)
{
    MyFunctionState state = (MyFunctionState)result.AsyncState;
    state.len = state.input.EndRead(result);
    if (state.len == 0) // while condition
    {
        // after while loop
        state.input.Dispose();
        MyFunction_Segment3(state);
        return;
    }
    state.output.BeginWrite(state.buffer, 0, state.len, new AsyncCallback(MyFunction_Segment5), state);
}

void MyFunction_Segment5(IAsyncResult result)
{
    MyFunctionState state = (MyFunctionState)result.AsyncState;
    state.output.EndWrite(result);
    MyFunction_Segment3(state); // top of while
}

If this made you cringe, then you’re not alone.  There are several things to notice here:

  1. The result is extremely ugly and hard to follow, but does achieve scalability.
  2. You have an extra state class and several extra functions which have no purpose except to implement what would conceptually be part of the original function.
  3. The using, while, and foreach all had to be manually implemented, and the using was actually not implemented as a try/catch as it should be (it probably could, but that would have been even more confusing).

I propose that the language provide a keyword by which you could have all this done for you; much like yield return does so for you with automatic enumerators.  Just what the keyword should look like, and how this should be implemented, I’m not 100% sure yet, but I wanted to get this idea down before I forgot it.

In this example, by the way, the MyFunction() method returns immediately before it is complete.  This may not be ideally how this would work (i.e. you might even want to return a value… I’ve left off this detail for now.  Adding this should be fairly trivial).

Singletons by delegate

February 2nd, 2008 by Warren Falk

Singletons are a common programming model, wherein a program ever only uses a single instance of a certain class, and that instance is shared program-wide.

The typical implementation of this is an if-null/lock/if-null setup that looks something like this:

class OnlyOne
{
  private OnlyOne _instance;
  public OnlyOne Instance
  {
    get
    {
      if (_instance == null)
      {
        lock (typeof(OnlyOne))
        {
          if (_instance == null)
          {
            _instance = new OnlyOne();
          }
        }
      }
      return _instance;
    }
  }
}

The reason for the if-null/lock/if-null is one for a different discussion, but it is basically the fastest portable way to allow thread-safe access to the single instance. However, it requires that the first if-null be executed every time a request for the instance is made. If your instance never changes, this is needless overhead.

I developed a quick and dirty solution using delegates. The basic idea is that when you create the first instance and you know it can no longer be null, you change the “GetInstance” routine to point to a routine that does not check for null. I created a generic for this. Here’s the result:

public class Singleton<T> where T: new()
{
  public delegate T InstanceGetter();

  private static T _instance;
  public static InstanceGetter GetInstance = new InstanceGetter(_GetNewInstance);
  private static T _GetNewInstance()
  {
    if (_instance == null)
    {
      lock (typeof(T))
      {
        if (_instance == null)
        {
          _instance = new T();
          GetInstance = new InstanceGetter(_GetInstance);
        }
      }
    }
    return _instance;
  }
  private static T _GetInstance() { return _instance; }
}

And then reusing it is quite easy:

class OnlyOne : Singleton<OnlyOne>
{
}

Now only the first call to OnlyOne.GetInstance() will actually check for null in a single threaded environment. And in the worst-case multithreaded environment, only the first simultaneous calls will check, the rest will just return the instance immediately.

My tests showed about a 25% performance gain. This is actually of questionable usefulness because you should not be calling the GetInstance method in a tight loop, you should call it once and assign it to a local. But I thought I’d share the basic idea because using delegates in this fashion might be useful in other scenarios.

Monodevelop

December 11th, 2007 by Warren Falk

I’m still relatively new to Mono, which is a cross platform, open-source implementation of .Net. The implementation has proved surprisingly gap free for the most part.

But, what Mono suffers from is its lack of Visual Studio. Much of what makes .Net development so effortless is what is a really good implementation of an IDE (refactoring, a really good debugger, smart tags, and several other nice things). An IDE is beyond the scope of the Mono project, and another project, “Monodevelop” is filling in that gap.

Unfortunately, there are still gaping holes which makes it frustrating to anyone who has been spoiled by Visual Studio. I miss the refactoring, the smart tags which prompt you to effortlessly add a “Using” if necessary, but most of all, a debugger.

It is not hard to write complex code, it isn’t hard to write code quickly, but if you need to do both, you really need a debugger to get it right. And this is where Monodevelop (and also Mono) are way short. It’s relatively young still and free open source; and most importantly the developers are planning to implement all these things. I may see about getting in on that myself. But for now, Monodevelop is slightly refreshing but mostly frustrating.

Latest mono on Ubuntu

December 5th, 2007 by Warren Falk

The current version of mono on Ubuntu 7.10 is 1.2.4.2. There are fixes in 1.2.5.2 which I need. Getting 1.2.5.2 built is no problem, but getting it installed without causing all kinds of problems appears to be. This is something I miss about Gentoo.

I’ve tried building my own package, but the official packages for mono are split up into many smaller packages which made that impossible.

I think what I’ll have to do in order to upgrade is to get the source package using apt-get source, unpack it all, and see if I can just replace the 1.4.2.2 code with the 1.2.5.2 code.

I have already pulled the source package down successfully and unpacked it. It looks like the Ubuntu guys applied many patches to it. I don’t know if those patches will work with the new version of code, or if the new version of code contains those patches. It’s probably some combination of the two. This will take some time. I’ll post a follow up if I ever get this working.

.
Entries (RSS) and Comments (RSS).