Theory D versus Theory P

July 6th, 2007

When I read Which theory fits the evidence?, a small tear wet my keyboard. It´s all about deterministic and probabilistic schools of software  development. It´s worth a  read.

Squeak Smalltalk Recipes

July 5th, 2007

Ok, I found my Squeak Cookbook .

Did you ever had the sensation that you always are one step back? Ok, jokes apart, I will contribute to this project, after all, is a Wiki. Spread the word!
Back to the blueprints, searching the next crazy project…

Squeak Smalltalk tutorial

July 3rd, 2007

It´s difficult to find good squeak smalltalk tutorials. There´s a lot of good tutorials about the Smalltalk language, but the tutorials about Squeak specifically are majority outdated. But this one is wonderfull: A Development Example for Squeak 3.9. Thanks Stephan Wessels for this great work.
But I miss a tutorial like Java Cookbook yet. Something showing examples of how to do some common tasks like opening files, multithreading, sockets, etc. I would like to start this type of work, but right now, time is a exparse resource. Someday, someday…

Java Cookbook, Second Edition

The most impostants acronyms

June 18th, 2007

I like to keep some usefull acronyms pasted in my cubicle´s walls. They help me to remind what is good software and how to do good software (or to impress the uninitiated…;-)). Some of them:

What´s your list?

Smalltalk passion

May 13th, 2007

I’m very rational programmer, with a academic bias. I know C, C++, Java, Perl, PHP, VBScript, Object Pascal and Prolog languages. But my language of choice is beyond the rational. Smalltalk.

Smalltalk is much more than a language, is a concept, a vision. Much of what we use today was inspired in the Smalltalk ideas and environment. Steve Jobs only saw the mouse and the window system, if he saw the Smalltalk language too, our lifes would be better now.

“But if Smalltalk is so good, why isn’t a mainstream language?” you can ask me. I guess it was a matter of investments, a big company behind, the resistance against the paradigm shift… Java is the main object oriented language today, and cleverly the language designer choose a syntax similar with C++. Smalltalk always was and is today a big lab beyond the common place.

If you don’t know Smalltalk, I invite you to take some time to know a free implementation, Squeak. I will try to present some features here, soon.

Approximate strings joins in a database - Part 2

May 13th, 2007

Continuing with my search to algorithms to do approximate string match in databases, I found some interesting relationship between the edit-distance algorithm and the n-gram concept.

A very useful concept in statistics, natural language processing and genetic sequence analysis is n-grams. An n-gram is a sub-sequence of n items from a given sequence, where the sequence can be anything you like.

A word is a sequence of symbols from a alphabet. Given a word s with characters in a alphabet A, its positional n-grams are obtained by “sliding” a window of length n over the characters of s. Since some n-grams at the beginning and the end of the string can have fewer than n characters from s, we introduce new characters “#” and “$” not in A, and extend s by prefixing it with n - 1 occurrences of “#” and suffixing it with n - 1 occurrences of “$”. Then, all the n-grams of s have exactly n characters, some of these may not be from the alphabet A. Example:

Word s: foobar
Alphabet A: [a-zA-Z]+ (regular expression)

if searching for all 2-grams, prefix and suffix the word with 2 - 1 = 1 characters not in A:
s extended: $foobar#

2-gram set of $foobar#: {$f, fo, oo, ob, ba, ar, r#}

A positional n-gram of a word s is a pair (i, s[i, i + n - 1]), where s[i, i + n - 1] is the n-gram of s starting at position i on the extended string. The set Gs is the set with all |s| + n - 1 pairs of positional n-grams of the word s. Example:

G2 of foobar: {(1, $f), (2, fo), (3, oo), (4, ob), (5, ba), (6, ar), (7, r#)}

The intuition behind the use of n-grams as a foundation for approximate string match is that when two strings s1 and s2 are within a small edit distance of each other, they share a large number of n-grams in common. Example:

The positional n-grams of length n=2 for string “world” are {(1, $w), (2, wo), (3, or), (4, rl), (5, ld), (6, d#)}. Similarly, the positional n-grams of length n = 2 for the string “word”, which is at an edit distance of one from “world”, are {(1, $w), (2, wo), (3, or), (4, rd), (5, d#)}. If we ignore the positional information, the two n-gram sets have 4 n-grams in common. Interestingly, only the first 3 positional n-grams of the first string are also positional n-grams of the second string. However, an additional two positional n-gram in the two strings differ in their position by just one position. This show us that the use of positional n-grams will involve comparing positions of “matching” n-grams within a certain “band”.

In the next step, I will show some n-gram properties and its use for approximate string match in database.

See the part 1

Software Architecture Pearls

February 2nd, 2007

I found some software architecture pearls while reading the Art of Unix Programming by Eric Steven Raymond.

Although it was writed to the Unix community, these simple “rules” are so universal that I put here to always remember and apply in my projects. Here they are:

  1. Rule of Modularity: Write simple parts connected by clean interfaces.
  2. Rule of Clarity: Clarity is better than cleverness.
  3. Rule of Composition: Design programs to be connected to other programs.
  4. Rule of Separation: Separate policy from mechanism; separate interfaces from engines.
  5. Rule of Simplicity: Design for simplicity; add complexity only where you must.
  6. Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
  7. Rule of Transparency: Design for visibility to make inspection and debugging easier.
  8. Rule of Robustness: Robustness is the child of transparency and simplicity.
  9. Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
  10. Rule of Least Surprise: In interface design, always do the least surprising thing.
  11. Rule of Silence: When a program has nothing surprising to say, it should say nothing.
  12. Rule of Repair: When you must fail, fail noisily and as soon as possible.
  13. Rule of Economy: Programmer time is expensive; conserve it in preference to machine time.
  14. Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
  15. Rule of Optimization: Prototype before polishing. Get it working before you optimize it.
  16. Rule of Diversity: Distrust all claims for “one true way”.
  17. Rule of Extensibility: Design for the future, because it will be here sooner than you think.

My “Wow!” moment learning Ruby

January 12th, 2007

I´m a absolutely beginner in Ruby, but recently, after I changed to my new web host, I was tempted to make a try.

My background is mainly in Java language, with some SmallTalk just because since I left my graduation, I never saw a so beatifull syntax. And until now, I write sometimes somethings in SmallTalk, when I need to think about the problem in hand without worry about the environment or language little tricks and details. After all, I think compiled and typed languages are doomed to become history.

My new web host gave me a lot of space and bandwidth, but only 2 practical choices of programming languages for dynamic web sites: PHP or Ruby. For personal reasons I hate PHP (please, I don´t want to start a flame war about what is the best language, this is just my personal opinion). My first job was about programming in PHP, and I don´t have good memories…

I´m studying right now, and for the people which come here looking for Java tips or tricks, I want to say I had a ‘Wow!’ moment right now. Look for this little piece of code:

1.class Product < ActiveRecord::Base
2.
3.	validates_presence_of :title, :description, :image_url
4.	validates_numericality_of :price, :only_integer => true
5.
6.	protected
7.	def validate
8.		errors.add(:price, “should be positive”) if price.nil? || price <= 0
9.	end
10.end

Don´t worry about what this code do. But the syntax is very cool, isn´t it? For me, some constructions doesn´t seem so strange, because they look like Smalltalk, but for a JavaMan… You can see the ‘if’ block? Or the method call in a null reference?

I see you in my next “Wow!” moment…

A SCJP question about Java enumerations

January 11th, 2007

Given the following:

1.public enum Wallpaper {
2.  BROWN, BLUE, YELLOW;
3.}

Which of the following are legal?

  1. enum PatternedWallpaper extends Wallpaper {
    STRIPES, DOTS, PLAIN;
    }
  2. Wallpaper wp = Wallpaper.BLUE;
  3. Wallpaper wp = new Wallpaper(Wallpaper.BLUE);
  4. void aMethod(Wallpaper wp) {
    System.out.println(wp);
    }
  5. int hcode = Wallpaper.BLUE.hashCode();

Answer: only items 2, 4 and 6 are corrects. We can´t extend or instantiate an enumeration. The item 2 is the correct way to get a enumeration reference, the item 4 shows a method receiving a enumeration reference, and finally item 6 shows a call for the hashCode() method that all enums inherit from Object.

Approximate strings joins in a database - Part 1

January 10th, 2007

Strings are ubiquitous and ambiguous. When a communication channel is established between two people, inevitable noise and misunderstanding can introduce many errors when transferring textual data.

In any enterprise system, there are many places where this type of error can occur. Client names, addresses, company names, etc. This kind of error can become impracticable the exact match against a query for a registry.

Think about the poor life of the mega-action star Arnold Schwarzenegger, how many times he needs to repeat his name to the operator?

To solve this type of problem, a number of phonetic algorithms was developed. A phonetic algorithm use rules to transform substrings into phonemes, trying to unify two strings written as spoken. Some algorithms:

But all these algorithms suffer from the same problems. The rules are written specifically for one target language, and the most common target language is English. This isn’t a matter if your native spoken language is English, or if you don’t need to do internationalized applications…

There are another solutions? Sure. Searching for alternatives, I found the approximate string search algorithms, mainly the Levenshtein distance (or Edit-Distance) algorithm. More robust and reliable, can be used without changes. With a little, but very important, advantage: we can use to sort the result candidates by your distance to the searched string. This little advantage can be the difference between a poor result, with pages and pages of useless results, or a well ordered list, with the most relevant results first.

I expect to show how to use this great algorithm to construct a full text search engine on databases, soon.

References: