Archive for the ‘Code’ Category

Fast Factoring for 64-bit Integers

Sunday, October 19th, 2008

Some of the Project Euler problems involve factoring numbers which are either large or small depending on your perspective. Problems are generally limited to 64-bit integers (about 18 digits) which are big numbers for most of us, but in the field of factorization those numbers are terribly small compared to the 100+ digit numbers security protocols deal with. Most advanced methods deal with optimizing the factoring of those huge numbers and don’t mind significant amount of overhead, but I want to know what’s fastest for 64-bit integers.

To find out, I ran some tests on some variations on three basic, low-overhead methods: Trial Division, Fermat’s method, and Pollard’s Rho method. All of these take a long time if the number being factored is actually prime, so it’s worthwhile to add in a fourth component which is a Miller-Rabin primality check. Here are my timing results for 400,000 random 64-bit integers. Actually, only the first test uses 400,000 numbers since I limited each test to 1 hour and extrapolated beyond that.

Seconds Method
811 Rho + Trial Division + MR
6359 Fermat + Trial Division + MR
6393 Trial Division + MR after each factor found
29397 Trial Division +MR at start
71195 Trial Division without MR

I was really surprised at how well the Rho method worked in practice. It’s a probabilistic method that’s basically like trial division except it chooses numbers at random instead of sequentially. However, the “random” generator uses a polynomial such that lots of the values can be tested at once using some fancy number theory.

Fermat’s Method works best when there are two divisors near √n, which apparently doesn’t happen very often. Here is my Rho code, which is adapted from some pseudocode in a gamedev forum thread.

   long rhoFactor(long n, int c, int max) {
        int check = 10;
        long x1 = 2;
        long x2 = 4 + c;
        long range = 1;
        long product = 1;
        int terms = 0;

        for (int i = 0; i < max; i++) {
            for (int j = 0; j < range; j++) {
                x2 = (squareMod(x2, n) + c) % n;
                long next = multiplyMod(product, Math.abs(x1 - x2), n);
                if (next == 0) {
                    return Math.max(gcd(product, n), gcd(Math.abs(x1 - x2), n));
                }
                else if (++terms == check || j == range - 1) {
                    long g = gcd(next, n);
                    if (g > 1)
                        return g;
                    product = 1;
                    terms = 0;
                }
                else
                    product = next;
            }

            x1 = x2;
            range *= 2;
            for (int j = 0; j < range; j++) {
                x2 = (squareMod(x2, n) + c) % n;
            }
        }
        return 1;
    }

For the parameters, I used small odd numbers for c, the polynomial constant term, and 16 – 20 for max which limits the generated values at around 2^max. If the factorization fails, I increase c by 2 and try again. For max = 16, it failed to find a factor about once for every 10,000 numbers and never failed twice in my tests. And those numbers had already had any small factors (less than about 50,000) removed with trial division.

Xcode Non-Distributed Builds

Friday, March 21st, 2008

One of my hobbies since buying an 8-core Mac Pro is to actually make use of all of the processing power. So far I haven’t had much success. And, of course, the web and disk are not any faster, so the extra cores don’t do much good in general.

The only app I use that’s really built for that kind of thing is Xcode, which will spawn off separate compiler threads in parallel. However, I wasn’t seeing more the two cores busy when compiling a large project, though Xcode claimed to be working on 8 files at a time. I was thinking that disk or memory constraints were the problem, but it turned out to be a configuration issue.

CPU Monitor of Mac ProAt some point in the past, I had innocently turned on the Distributed Builds option, which is meant for distributing tasks to other machines on a local network. However, a side effect is that the compiling is throttled, presumably with thread priorities, so the build is more of a background activity. Turning off Distributed Builds gets all 8 local cores working in parallel.

Now I just need to get my own programs to use those cores…

XML Schema Type Tables and Substitution Groups

Sunday, February 10th, 2008

The XML Schema 1.1 was already running behind when I left the Working Group in 2004, and it’s still a work in progress. Though I no longer write XML tools, I try to keep up with the group’s activities and provide hopefully useful comments to public working drafts. However, knowing the WG is so far behind schedule, I’m hesitant to make too many official comments since each comment must be addressed by the group, adding to the delay.

Many of my comments have been resolved recently in a relative flurry of activity. (The comment archive shows more activity last month than any previous three month period.) When a comment is resolved, the original poster can (silently) accept the resolution or appeal to the W3C director. I disagreed with the resolution of my comment on type tables and substitution groups, but just registered my dissent and closed it anyway rather than appeal — I trust the working group’s expertise over my casual interest.

Substitution groups have always been questionable in my mind. I’d prefer typed wildcards or, at least, an opt-in mechanism rather than opt-out for substitution groups to limit their unintended use.

Type tables is a big feature added late in the game and doesn’t seem to interact well with substitution groups. Type tables allow alternative types to apply to an element based on its context, such as an attribute value. I thought such context-based constraints should be in a separate layer, as is done with Schematron, but it seems like half the schema-dev questions are about how to impose such constraints within XML Schema, so I can understand why the Working Group would want to add it.

The problem, as I see it, is that type alternatives live as element declaration properties rather than within the type hierarchy. Substitution group members must have types in proper derivation relationships, but that only applies to the declared types, not the alternatives types. So combining type tables with substitution groups can break the spirit of the derivation hierarchy, if not the letter of it.

Updated Matrix Multiply Times

Monday, October 1st, 2007

I finally got around to running my matrix multiply microbenchmarks on Windows as well as Mac. Here’s the summary of times on Mac OS X and Windows XP on the same Mac Book Pro.

200×200 Matrix Multiply times in milliseconds
Mac OS X Windows XP
Java 1.5 C Java 1.6 C
ms ms ms ms
22 11.2 12.2 12.7

(Though not a big table, I’m using it to test out support for HTML 4 (ca. 1999) table elements like THEAD.)

Microbenchmark on Java and C Matrix Multiplication

Thursday, June 21st, 2007

My Java bug entry on faster numerics has been summarily closed as unreproducible. There is a short evaluation showing Java to be faster that C at matrix multiplication, but the comparison misses the mark because:

  • The C code (not shown) appears to be using nested arrays for matrix storage, which I have never seen in practice for C matrix libraries.
  • The Java code is run under the server version of Java, which is fine for a small benchmark but usually not practical for a large (client) app, such as many existing C++ numerics apps are.

Nonetheless, the activity prompted me to rerun my matrix benchmarks. Actually, I had to recreate them since I couldn’t find my Java 1.3 era versions that showed a 15x disadvantage for Java. This time Java fared much better. The tests (C++ Code Java Code) just multiply the same 200×200 matrices over and over with some add operations mixed in for good measure. I tried to make the code fast without working too hard. For instance, I didn’t get GCC to do vectorization, and I didn’t control my environment too well.

I tried two variations of the Java matrix representation and two variations of the Java multiplication code for four total Java variations, of which I took the fastest. The representation variation was whether to use a single array and multiplied indexing or to use nested arrays. The multiply variation was whether to transpose the B matrix or not — transposing it lets the inner loop traverse a one sub-array each from A and B. Arrays of arrays with transposing worked best on my Intel Mac, but flat arrays without transposing was best on the PowerPC Mac. I think the Intel JVM is better about the range check elimination optimization, so it favors the extra arrays with simpler indexing.


2GHz PowerPC Mac

Java 1.5 56.7ms
C++/GCC 26.2 ms


2.33GHz Intel MacBook

Java 1.5/client 32.5 ms
Java 1.5/server 12.6 ms
C++/GCC 11.0 ms

On the PowerPC Mac, there was no difference between client and server Java.

So client Java 1.5 is only 3x of C++ and server Java 1.5 is almost the same as C++. Not bad, but not quite what Sun’s evaluation showed. Still need to try Java 6 and turning on GCC vector processing.

Prime Benchmarks Updated

Tuesday, April 3rd, 2007

After getting a new MacBook Pro and installing Boot Camp on it so I could boot into Windows or Mac OS X, I’ve rerun some of my prime number mini-benchmarks. I only ran the tests in C++ and latest Java on each OS, skipping C# this time. The charts below summarize the results. Times are in seconds and shorter bars are better. (CSV Data)

Java vs. C++ Prime Finding Benchmarks

The first chart puts the C++ and Java times side by side for each host. The Linked List allocates tons of small objects (one for each prime number found), and the C++ allocator seemed to have significantly improved to be twice as fast as Java and C++ on G5, but otherwise there is not too much difference.

G5 vs Intel Prime Finding Benchmarks

The second chart shows the same data as the first but grouped for comparison of OS/CPU combinations. The Intel CPU (Core 2 Duo @ 2.33 GHz) is noticeably faster than the G5 (2 GHz), but there isn’t a real difference between OSs.

I used latest public versions of Java, which were Java 5 for Mac OS X and Java 6 for Windows XP.

Java 7 Wish List Item: Static Analysis Optimizations

Monday, January 8th, 2007

Shortly after Java 6 came out last year, there were a number of Java 7 wish lists floating around (example). Most wishes are for syntax improvements. I don’t use Java as much as I used to, and I only have one wish list item: static analysis optimizations. That is, optimizations made possible with the help of static (not run-time) optimizations.

I’ve written before on the promising range check elimination optimization which strives to eliminate array bounds checking when provably unnecessary. But that optimization doesn’t quite live up to its potential because any nontrivial array usage is beyond the scope of a run-time proof. So the idea of static analysis optimization is for a compile-time (or post-compile-time) process to do an in-depth analysis of array usage and other potential optimizations and insert the suggested optimization and finished proof into the byte-code. Then the JVM only needs to verify the proof at run-time rather that construct a proof from scratch. And for that matter, the JVM could never generate a proof, since it could assume that if there is no proof provided already, there’s no use looking for one at run-time.

My motivation is scientific computing (especially linear algebra), which still seems to be unnecessarily handicapped in Java.