Stack Overflow: Down Votes vs. Up Votes vs. Reputation

Where do the ‘personalities’ and ‘forces’ of Stack Overflow reside in a three-dimensional plot of Down Votes, Up Votes and Reputation scores?

My fascination with up and down voting patterns persists.  Below are some statistics and a graph derived from asking the data, “How do the top 1000 Stack Overflow users relate to each other with regard to up votes and down votes?”  Some results:

  • Random statistics on all users (not top 1000)
    • Only 278 SOpedians have voted down more than 100 times
    • A puny 121 have down voted more than up voted
    • A mere 11 have down voted more than 500 times
  • Mr. Down Voter: Rich B (0.52 up votes for every down vote)
    • 1796 Down
    • 932 Up
  • Mr. Up Voter: JB King (298 up votes for every down vote)
    • 15 Down
    • 4474 Up
  • Mr. Even Handed: (There are two)
    • First Place: Neil Butterworth (1.00!)
      • 838 Down
      • 837 Up
    • Second place: Andrew Grant (1.00!)
      • 498 Down
      • 500 Up
  • Other Personalities
    • Jeff Atwood: 4.09 Up versus Down
    • Joel Spolsky: 4.84
    • Jon Skeet: 21.0
    • Stu Thompson (me!): 2.08

(Larger Image)

Notes on the above graph:

  • The x-axis is Down Votes,
  • the y-axis is Up Votes, and
  • the z-axis (bubble size) is Reputation Score
  • The x-axis and y-axis are not proportional, meaning if one were to draw a line from the origin at a 45° angle, that line would not represent a 1:1 x:y relationship

Impressions and Analysis:

  • By and large, the Top 10 SOpedians are more likely to vote up than the rest of the top 1000
  • By and large, everybody in the top 1000 is more likely to vote up than down
  • Joel Spolsky doesn’t vote as much as Jeff Atwood does, even though they are both the public faces of Stack Overflow

Stack Overflow: Voting Patterns in Detail

Up, Down, all around. Offensive? Close? Spam! Inform Moderator…

Continuing to investigate user voting patterns on Stack Overflow has become a hobby (obsession?) of mine.  Thanks in part to my curiosity and in part to nobody_ (now known, mysteriously, as ‘Kyle Cronin‘, the administrator of the Unofficial Stack Overflow Meta Discussion Forum) egging me on, I quickly whipped up a graph showing “Up Vote Propensity versus Reputation”.

The graph shows the up-vote percentage (% = up / (up + down)) for users with at least one up vote, at least one down vote, and a reputation of at least 100 (the point at which one is allowed to down vote). The x-axis represents five reputation tiers:
the first three contain roughly 5,000 users each, the fourth roughly 1,500 and the fifth roughly 125.

It is very clear that users with higher reputations are more likely to down vote.  But this led to other questions, such as:

  • Do the users with older accounts, especially beta users, make up the negative voting club?
  • Did the share of down votes shift downwards because of new features introduced to Stack Overflow, specifically the new voting options ‘Spam’, ‘Offensive’, ‘Inform Moderator’ and ‘Close’?

To try to answer the first question, I queried the database, tinkered with the results in Excel, and produced the graph below.  The blue line is “Average % Up Votes of All Votes by User Join Date”.  The red series is the “Average Reputation by Join Date”.

(Larger Image)

Notes on the above graph:

  • The percentage is only for users with
    • at least one up vote
    • at least one down vote
    • a reputation of at least 100
  • The yellow data point on the Average Reputation series represents the day Stack Overflow sign-ups were open to the general public.  Our own little Eternal September, if you will.  (Not that bad, actually.)
  • The purple spike in the %-Up Votes is caused by Neil Butterworth, who has both many votes (~1700 in the data dump) and a 50/50 Up Vote versus Down Vote ratio.
  • The leveling-off of the average reputation curve a few weeks after Stack Overflow went public (end of September/early October) is interesting.  It seems, to no surprise, that the beta users and the initial public users are much more into SO than the users who followed.
  • The far left data point represents seven users who got accounts on 31 July 2008 and are the movers and shakers of Stack Overflow (Jeff Atwood, Jarrod Dixon, Joel Spolsky, and Jon Galloway among them), who understandably have very high reputation scores.

To try to answer the second question above (“Did down votes shift to other types of voting options as they came online?”), I queried and graphed the data again to produce a view of “Vote Type as a Percentage of All Votes Cast”. Example: where in the past SOpedians would down vote a question or answer they considered spam, later on they could instead mark the post as spam or offensive. While the two options are not mutually exclusive, the down vote costs the user a reputation point. Since roughly 90% of votes are up votes, the graph zooms in on the top 10%.

(Larger Image)

It seems that the new voting options do not significantly impact up/down voting patterns.  Note the sudden growth of ‘close’ votes in the trailing week or two of the graph.  I suspect this is a change in the raw data rather than a sudden burst of close votes, but I am not sure because I myself did not gain the ability to vote to close a question until around that time.   Also, ‘close’ votes apply only to questions and not answers, unlike up and down votes.

The last few weeks of the data dump look interesting, so I zoomed in there and produced the below graph.

(Larger Image)

The burst and subsequent tapering off of Spam, Offensive and Inform Moderator votes seems very suspicious to me.  Was this actual activity?  Or was there a data collection issue?  Or was that when these voting features were created?   I’m guessing it was a data collection issue.  Future data dumps will show this to be true or not, I hope.

Whatever the cause, the number of votes here is still too small to affect the percentage of up votes over time to any meaningful degree.  My conclusions are that

  1. SOpedians with high reputations are more likely to vote down questions and answers
  2. As Stack Overflow gains more and more users with lower reputations, these users are less likely to vote down, bringing up the overall percentage of up votes (against all votes) over time.

Stack Overflow: Up and Down Voting Pattern Analysis

SOpedians are getting nicer as time goes on, except for the occasional flare-up

The kids at stackoverflow.com, most prominently Jeff Atwood, recently released the Creative Commons licensed data behind Stack Overflow via BitTorrent, and I eagerly downloaded the database dump and imported it into MySQL for some analysis of voting patterns.

Since the beta, I have always been a fan of the down vote. Many SOpedians find them hostile and mean, to the point of getting their knickers all bunched up. My belief is that they have a cleansing effect on the questions and answers. Down votes are in the spirit of (what I have interpreted the founders’ goals to be for) Stack Overflow.

After whipping up an embarrassingly crude Python script to import the voting data into MySQL, I ran a simple query that gave me the up and down vote totals for each day. Then I graphed it all in Excel and added three new series: up-to-down ratio, 9-day up-to-down ratio average, and an up-to-down trend line.

(larger image)
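
For the curious, the daily-totals query amounts to something like the sketch below. It is shown here as a small JDBC program purely for illustration (my actual tooling was the crude Python import script plus Excel), and the table and column names (votes, vote_type, creation_date) are placeholders rather than my real schema.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class DailyVoteTotals {
      public static void main(String[] args) throws Exception {
          // Placeholder schema: votes(vote_type ENUM('up','down'), creation_date DATETIME)
          String sql = "SELECT DATE(creation_date) AS vote_day, "
                     + "  SUM(vote_type = 'up')   AS ups, "
                     + "  SUM(vote_type = 'down') AS downs "
                     + "FROM votes GROUP BY vote_day ORDER BY vote_day";

          // Placeholder connection details
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://localhost/so_dump", "user", "password");
          try {
              Statement stmt = conn.createStatement();
              ResultSet rs = stmt.executeQuery(sql);
              while (rs.next()) {
                  long ups = rs.getLong("ups");
                  long downs = rs.getLong("downs");
                  double ratio = (downs == 0) ? 0.0 : (double) ups / downs;
                  System.out.printf("%s  up=%d  down=%d  up:down=%.2f%n",
                          rs.getDate("vote_day"), ups, downs, ratio);
              }
          } finally {
              conn.close();
          }
      }
  }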

My interpretations:

  • Stack Overflow went live in mid-September, hence the huge jump in votes then. No surprise there.
  • The humps are weekdays, the troughs are weekends, and the winter holidays are clearly visible.
  • For every down vote there are 10 to 12 up votes
  • The up vote to down vote ratio is increasing over time. My gut tells me that this is related to the introduction and expansion of post closing, deleting and moderator warning functionality. Or maybe “Down Voting Fatigue” sets in with many users? The ideas that maybe SOpedians are posting less junk or that they are just being nicer as time goes on are ridiculous!
  • There is a huge spike in down votes and/or a corresponding drop in up votes on 21 February. There does not seem to be any one post that sparked this. Interesting.
  • The single most down voted post is in response to What is the most spectacular way to shoot yourself in the foot with C++? with (as of 6 June 2009) 39 down votes! (This does not show in the graph…just an ad hoc ‘I wonder…’ query.)

Interesting stuff. It will be entertaining to comb over the Stack Overflow data in more detail in the future.

Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part II

Initial DRBD Sync – 9TB in 4.5 days

After much blood, sweat and tears getting xtendx up and running in our new data center with Green.ch this summer, I finally had time to complete a major “high availability” aspect of our new file server platform. Our basic architecture is fairly simple: a couple of application servers backed by the network attached storage.

The network attached storage (NAS) consists of a pair of nearly identical self-built file servers (Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part I). The next, more complex step was to configure a Protocol C (Primary + Secondary) DRBD cluster. In an effort to mitigate some risk and spread the workload over the course of the summer, I staged the entire installation:

  1. May: Build NAS 0 (thanks to my wife, Robyn, too!)
  2. Early June: Install NAS 0 into data center, copy over production data
  3. Mid-June: Put NAS 0 into production as a simple ext4 + NFSv4 file server
  4. July: Build NAS 1 (Again, a thanks to my wife)
  5. July + August: Install NAS 1 into data center, configure as an HA NAS: DRBD + ext4 + NFSv4.
  6. September: Copy data to NAS 1, take NAS 1 into production
  7. Late-September: configure NAS 0 as the secondary node in the HA NAS: DRBD + ext4 + NFSv4.
  8. Early-October: Initialize DRBD synchronization.
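
For context, the DRBD pairing itself is described in a small resource file present on both nodes. A minimal sketch of such a file is below; the hostnames, replication addresses and backing device are placeholders rather than our actual values.

resource r1 {
    on nas0 {
        device    /dev/drbd1;
        disk      /dev/md0;            # placeholder for the backing RAID10 device
        address   192.168.0.10:7789;   # placeholder replication address
        meta-disk internal;
    }
    on nas1 {
        device    /dev/drbd1;
        disk      /dev/md0;
        address   192.168.0.11:7789;
        meta-disk internal;
    }
}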

Kicking off the block-level disk synchronization was a big deal. These servers are a running production system, and it was paramount that existing service delivery not be impacted. At first I left the DRBD sync rate unaltered, which I believe effectively means ‘as fast as you can’. This quickly resulted in poor read times for the production applications, so I kicked it down dramatically while the sync was in progress with the drbdsetup command:

 sudo drbdsetup /dev/drbd1 syncer -r 24M

After fooling around with various rates, I settled on 24M. It’s much lower than the system could theoretically synchronize at, but that is not the goal. This production value is also now configured in /etc/drbd.d/global_common.conf:

common {
    #snip
    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        rate 24M; 
    }
}

Interestingly, the synchronization rate does not match disk IO as measured by Cacti via SNMP; in fact it is roughly half. (I have no idea why.) As you can see below, it took a solid 4.5 days for the synchronization of a single 9TB device to complete:

Primary DRBD Node Read and Write IO (MBps)

Secondary DRBD Node Read and Write IO (MBps)

A bonus to doing this work on a production system is that the DRBD sync served as a real-world production load test. I now know with a respectable degree of certainty what our NAS is capable of, and when it is approaching its limits.

Arduino-Based Intervalometer v0.1

The World’s Most Simple Homemade Intervalometer Project and a Sample First Project.

A while back I stumbled across the incredibly cool Arduino project. After some thinking and research, I decided on a modest first project: an intervalometer for my Canon 350D. Timm Suess’s Intervaluino blog post was a big inspiration and a great starting point.

First Step: Buy some parts! A Bern company, dshop.ch, had Arduino Duemilanove (USB) boards in stock for SFr 49 each, and a local shop, Pusterla, had the other little bits required (power switch, battery connector, solid wire, wiring board, male 3.5mm jack). The remaining bits were items I had around the house, namely an old business card box as a housing and an old hairband, donated by my wife, to hold everything together.

Second Step: Assembly. I had three goals in mind with the construction step:

  • The intervalometer must be small. It should fit readily into my camera bag.
  • The intervalometer should be somewhat robust. It needs to take a licking and keep on clicking.
  • The intervalometer must be easy to create and modify. Me being me, it is going to be modified at some point.

My initial attempt at wiring everything up relied upon a schematic from Timm Suess, annotated by a friend, Lincoln Stoll. The problem was that the relay was just not relaying for me. So I ripped it out and wired the trigger directly to the Duemilanove. Reckless, maybe. But it worked.

(Larger Image)
And here it is all boxed up. The plastic is hard enough to resist rough treatment, yet soft enough that I can easily cut and drill into it. The hairband means I can tinker with the insides with no tools. And the clear case means I can see the lights. Green is power, and the orange LED blinks when the trigger should be pulled.

(Larger Image)
Third Step: Program the Intervalometer. The Arduino project home page has a simple IDE that can be downloaded for free. Windows, OSX and Linux are supported. The programming itself, for a limited fixed interval, is ridiculously easy.

  // Stu's Magic Intervalometer code
  int shutter_on   = 300;  //time to press shutter, set between 100 and 300
  int shutter_wait = 2500; //interval between button up & button down

  int outpin = 12;         //output for shutter trigger on pin 12
  int ledPin =  13;        // LED connected to digital pin 13

  void setup() {
     pinMode(outpin, OUTPUT); //shutter trigger pin is an output
     pinMode(ledPin, OUTPUT); //LED pin must also be set as an output
  }

  void loop(){
     digitalWrite(outpin, HIGH); //press the shutter
     digitalWrite(ledPin, HIGH); //turn ON orange LED
  
     delay(shutter_on);          //wait the shutter release time
    
     digitalWrite(outpin, LOW);  //release shutter
     digitalWrite(ledPin, LOW);  //turn OFF orange LED

     delay(shutter_wait);        //wait for next round
  }

Once this is written, and after saving, it is quick and easy to connect the Arduino to my Mac with a USB cable and upload the program.

Fourth Step: Capture the photos that will be used as individual frames in a movie.  Our flat overlooks a park with a playground, which seemed like a good subject.

  • Set up the Canon EOS 350D with f1.8 lens on a tripod looking out the window
  • Set image quality to low (1728 × 1152)
  • Manually focus the lens; turn off auto-focus
  • Turn off post-shoot image view
  • Connect Intervalometer
  • Flip the switch!

After about an hour there were ~1000 shots.

Fifth Step: Process the photos into a movie.  This is not as hard as it sounds, but the approach I took does require some familiarity with the command line.

  • Import photos from camera to MacBook Pro
  • Resize the photos to 1920×1080 (native wide screen HD video)
  • Rename the photos into a sequentially numbered series of files
  • Compile the photos into a movie

Moving photos from a camera to a computer is trivial.  Resizing 1,000+ photos is a bit trickier, as is renaming the files.  The original names come from the camera, e.g. IMG_9350.JPG.  To convert them into a movie, FFmpeg needs a sequentially numbered series starting at (or near) zero, e.g. 0001.jpg.  To do both jobs I wrote a small bash script:

#!/bin/bash
set -x

# Resize every photo in ./orig to 1920x1080 and write it out as a
# sequentially numbered file (0001.jpg, 0002.jpg, ...) for FFmpeg.
id=0
for photo in ./orig/*
do
    id=$(($id+1))
    newPhoto="$(printf "%04d" $id)"
    sips -z 1080 1920 "$photo" --out "$newPhoto.jpg"
done
Once the above script has run, the last thing to do is turn all the photos into a video with FFmpeg. FFmpeg is a big, complicated program with a zillion options and features. Basically, the command below instructs FFmpeg to compile the photos into a video at 25 frames per second with a bit rate of 1024 kbps using the H.264 codec.

ffmpeg -i %04d.jpg -vcodec libx264 -f mp4 -b 1024k -r 25 -y playground.mp4 

What’s next?  I’ve been thinking about this a while and plan the following:

  • Securing the wiring to the Arduino board.  At the moment, when changing the battery or reprogramming the device, the wires can come out of the circuit board.
  • Add some big flashing lights.  I like teh blinky lights!
  • Find a more interesting subject to photograph
  • Use a shallow depth of field for interesting effects.  “The Sandpit”, also known as “A Day in the Life of New York City in Miniature”, by Sam O’Hare is kind of in the direction I’d like to go.
  • Add some sound or music

Oh noooooes! javax.el.ELException: [class] is not a valid Java identifier on Apache Tomcat 7

org.apache.el.parser.SKIP_IDENTIFIER_CHECK default changed from true to false.

I recently moved my Pebble 2.3.1 blog from one machine to another. In the process of doing so a couple of things broke, like comment validation. Trying to review and approve comments resulted in a mostly blank screen.

A check of the logs gave me my first clue:

Mar 22, 2011 2:24:37 PM org.apache.catalina.core.ApplicationDispatcher invoke
SEVERE: Servlet.service() for servlet jsp threw exception
org.apache.jasper.JasperException: /WEB-INF/jsp/viewResponses.jsp(65,0) "${response.class.name == 'pebble.blog.Comment'}" 
contains invalid expression(s): javax.el.ELException: [class] is not a valid Java identifier
	at org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
	at org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
	at org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
 {snip}

My Google searches quickly led me to believe that other people, not all of them Pebble users, were having a similar issue, and folks were blaming the more recent releases of Apache Tomcat. As it turns out, the Tomcat developers changed the default value of org.apache.el.parser.SKIP_IDENTIFIER_CHECK from true to false. When set to true, this property tells the EL parser not to check that identifiers conform to the Java Language Specification. The default was changed to comply more closely with the JLS. Unfortunately, it has led to many JSP-based webapps breaking.

With Pebble, the problem was the use of a JLS reserved keyword as an identifier. In the snippet below from WEB-INF/jsp/viewResponses.jsp, Pebble uses the reserved word “class”.

  <c:if test="${response.class.name == 'pebble.blog.Comment'}">

If you are a developer of a webapp that is experiencing this issue, it is best if you just work through the problems and change your app. If you are a plain old user like me, try upgrading…or add this system property to your Tomcat startup script (e.g. {tomcat-home}/bin/startup.sh) and restart:

JAVA_OPTS="$JAVA_OPTS -Dorg.apache.el.parser.SKIP_IDENTIFIER_CHECK=true" 
export JAVA_OPTS

And presto! Problem solved. Hopefully. For now. Until it breaks again. Because this is a workaround, not a solution.

Optimal Buffer and Destination Byte Array Size for java.io.BufferedInputStream Reads (for a fast disk)

Micro-Benchmark Results on a MacBook Pro with an Intel X25-M SSD

When implementing file-reading Java code with Java IO’s BufferedInputStream, what buffer size should one choose? Should we just not specify it and go with the default? And what destination byte array size is best?

These questions pop up from time to time when I have the opportunity to write such code. And I’ve seen folks ask similar questions on Stack Overflow and The Java Ranch. So, with my trusty new SSD drive, and a bit of spare time this holiday season, I set out to answer those questions.

The methodology for the micro-benchmark results below was simple: graph the maximum speed of a simple algorithm with various BufferedInputStream buffer and destination byte array payload sizes. The algorithm was just as simple: add up all the bytes in a file. This is both computationally lightweight and serves as a simple checksum (to ensure the consistency of my algorithm across the various parameters). The code: OptimalBufferSizeSquentialReader.java

The target file read in these metrics was a 31MB video clip. To prevent the OS file cache from mucking up the results, there is a unique file for each buffer size + destination byte array size combination, for a total of 7GB of test data.
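
For reference, a stripped-down sketch of the read-and-sum core is below. It is not the exact benchmark class linked above (which also sweeps the buffer and payload sizes and handles the timing); it just illustrates the measured loop.

  import java.io.BufferedInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  public class ReadAndSumSketch {

      // Sums every byte in the file; the total doubles as a cheap checksum to
      // confirm that each buffer/payload combination read identical data.
      static long sumBytes(String path, int buffSize, int payloadSize) throws IOException {
          InputStream is = new BufferedInputStream(new FileInputStream(path), buffSize);
          long sum = 0;
          try {
              byte[] payload = new byte[payloadSize];
              int readIn;
              while ((readIn = is.read(payload)) != -1) {
                  for (int i = 0; i < readIn; i++) {
                      sum += payload[i] & 0xFF; // mask so bytes are treated as unsigned
                  }
              }
          } finally {
              is.close();
          }
          return sum;
      }

      public static void main(String[] args) throws IOException {
          long start = System.nanoTime();
          long checksum = sumBytes(args[0], 8 * 1024, 1024);
          double seconds = (System.nanoTime() - start) / 1e9;
          System.out.printf("checksum=%d in %.3f s%n", checksum, seconds);
      }
  }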

Destination Byte Array Size vs. KBps (for 16 different Buffer Sizes + the default)


(Larger Image)
Detailed description of the graph:

  • x-axis: The destination byte array size used in individual read method calls as defined by payloadSize in this snippet:
        byte[] payload = new byte[payloadSize];
        int readIn = is.read(payload);
  • y-axis: The speed in Kilobytes per second for a complete build up, file opening and reading, and the algorithm’s computation.
  • series (individual lines): Individual BufferedInputStream buffer sizes, as set with buffSize during class initialization.
        FileInputStream fis = new FileInputStream(file);
        InputStream is = new BufferedInputStream(fis, buffSize);

An interesting graph. My conclusions:

  • One generally cannot go wrong with a destination byte array size of 512 or 1024 bytes, regardless of what the BufferedInputStream’s buffer size has been initialized to.
  • BufferedInputStream’s default buffer size is pretty well tuned, as long as one does not use a destination byte array size smaller than 8 or 16 bytes
  • There is no point in having BufferedInputStream’s default buffer size initialized to anything larger than 2KB. In fact, if an application is going to have many concurrent threads running this code (like in a web application) then large values will only wastefully consume memory, limiting overall scalability.
  • The lower destination byte array sizes seem to follow some sort of sigmoid curve.
  • My new SSD is freaky fast. ~130,000KBps is ~125MBps! Yeah, baby!

One thing I don’t get is that the default buffer size is 8KB, but the 8KB series does not match the default series. Humph.

Update! I was curious about two other aspects of the buffer size and destination byte array size: CPU load and the impact of the speed of the disk. To that end, I’ve followed up with a second, similar set of metrics in Optimal Buffer and Destination Byte Array Size for java.io.BufferedInputStream Reads (for a slow disk)

Real world performance metrics: java.io vs. java.nio (The Sequel)

About 275% faster for my particular use case

The first set of results (September 2008) measuring the performance improvement gained by the switch to java.nio for FLV indexing were not particularly scientific.  Each data point was from a different file, of dramatically different sizes, with dramatically different key-frame spacing.  The improvements are visible, but fuzzy.

From the ever powerful yet flawed Wikipedia, there is a concept to help bring these metrics into focus:

Cēterīs paribus is a Latin phrase, literally translated as “with other things the same.” It is commonly rendered in English as “all other things being equal.” A prediction, or a statement about causal or logical connections between two states of affairs, is qualified by ceteris paribus in order to acknowledge, and to rule out, the possibility of other factors which could override the relationship between the antecedent and the consequent.

The ceteris paribus assumption is often fundamental to the predictive purpose of scientific inquiry. In order to formulate scientific laws, it is usually necessary to rule out factors which interfere with examining a specific causal relationship. Experimentally, the ceteris paribus assumption is realized when a scientist controls for all of the independent variables other than the one under study, so that the effect of a single independent variable on the dependent variable can be isolated. By holding all the other relevant factors constant, a scientist is able to focus on the unique effects of a given factor in a complex causal situation.

Blah, blah, blah.  OK, back to gathering more data with this in mind.

With a single set of 8 files from production webcasts, more results were captured in two series.  The second series was measured immediately after the first, on physically separate copies of the files.  Measurements were taken one fine evening last November on the production server described in the first post; activity was not too heavy at the time, maybe 10% of capacity.

(Drum roll please…) The results:

 
Speeds are in KBps; each version was measured in two series (runs).

File Size (KB)   v2.2 (java.io) run 1   v2.2 (java.io) run 2   v2.4 (java.nio) run 1   v2.4 (java.nio) run 2
61,793           67,965                 72,647                 200,239                 202,806
70,079           31,418                 30,798                 73,225                  91,648
82,645           30,529                 31,286                 84,291                  91,887
82,951           53,388                 50,290                 144,458                 154,158
88,086           29,106                 29,134                 71,360                  69,012
101,500          28,491                 28,935                 75,644                  80,758
122,606          30,839                 31,954                 84,035                  92,383
289,423          42,374                 41,479                 112,543                 112,773


Interpretations:

  1. The results are much more consistent, although in hindsight I should have encoded a single source video at various qualities.  This would have made the number of index points consistent across the files.
  2. On average, java.nio performed 273% faster than java.io.
  3. The median performance increase was 277%
  4. The minimum was 241%
  5. The maximum improvement was 288%

Real world performance metrics: java.io vs. java.nio

A before and after comparison of my application’s FLV indexing performance between java.io streams and java.nio file channels.

In my application at xtendx AG, there is some code that indexes uploaded FLV (Flash video) files to determine the byte mark of each key frame, which are spaced out every second or so by our encoding process.  The index allows client players to request a Flash video at an arbitrary second position within the FLV. This indexing needs to be done, for the sake of simplicity, once and only once before a file is streamed to an end user.  Therefore, the first attempt to view a file incurs a non-trivial delay as the server indexes the file.  It needs to be fast!  OK, that is the “why”.

The indexing process is very straight forward:

  1. Open the file, validate the header, read in meta data
  2. Read in key frame meta data (type, temporal position, frame size)
  3. Take note of temporal position and absolute position of start byte of key frame
  4. Skip ahead to start of next key frame
  5. Jump to step 2 until end of file
  6. Save temporal position / byte position map

As one can guess, the file access is not exactly serial nor is it exactly random in nature.  Because of the relatively large distance between key frames (up to thousands of bytes), I am thinking that the file access would be more appropriately categorized as random despite always moving forward.

Before changing the code, I whipped up some quick just-read-every-byte-in-the-file micro-benchmarks on my old MacBook to see what I could expect and to understand how to use FileChannels correctly.  Everything looked good.

In the latest release of Simplex Media Server, v2.3, the indexing code was refactored to use java.nio’s FileChannels.  A little bit of strategic logging in both the new and old versions captured some performance metrics from one of our production systems.  Real world numbers rock!
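
A stripped-down sketch of the sort of read path this implies is below. This is not the actual Simplex Media Server code: header validation, the FLV meta data handling and all error handling are omitted. It simply walks the FLV tag headers with positional FileChannel reads and records the byte offset of each video key frame, matching the step-by-step process listed above.

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class FlvKeyFrameIndexSketch {

      private static final int FLV_HEADER_SIZE = 9;   // "FLV", version, flags, header length
      private static final int PREV_TAG_SIZE = 4;     // back-pointer before every tag
      private static final int TAG_HEADER_SIZE = 11;  // type, data size, timestamp, stream id

      // Maps timestamp in milliseconds -> absolute byte offset of the key-frame tag.
      public static Map<Integer, Long> index(String path) throws IOException {
          Map<Integer, Long> keyFrames = new LinkedHashMap<Integer, Long>();
          RandomAccessFile raf = new RandomAccessFile(path, "r");
          FileChannel ch = raf.getChannel();
          try {
              ByteBuffer buf = ByteBuffer.allocate(TAG_HEADER_SIZE + 1); // tag header + first data byte
              long pos = FLV_HEADER_SIZE + PREV_TAG_SIZE;                // offset of the first tag
              long size = ch.size();
              while (pos + TAG_HEADER_SIZE + 1 <= size) {
                  buf.clear();
                  ch.read(buf, pos);            // positional read: no explicit seek required
                  buf.flip();
                  int tagType = buf.get() & 0xFF;
                  int dataSize = ((buf.get() & 0xFF) << 16)
                               | ((buf.get() & 0xFF) << 8)
                               |  (buf.get() & 0xFF);
                  int timestamp = ((buf.get() & 0xFF) << 16)
                                | ((buf.get() & 0xFF) << 8)
                                |  (buf.get() & 0xFF);
                  timestamp |= (buf.get() & 0xFF) << 24;  // extended timestamp byte
                  buf.position(buf.position() + 3);       // skip the stream id
                  int firstDataByte = buf.get() & 0xFF;
                  // Video tag (type 9) whose frame type nibble is 1 => key frame
                  if (tagType == 9 && (firstDataByte >> 4) == 1) {
                      keyFrames.put(timestamp, pos);
                  }
                  pos += TAG_HEADER_SIZE + dataSize + PREV_TAG_SIZE; // skip ahead to the next tag
              }
          } finally {
              ch.close();
              raf.close();
          }
          return keyFrames;
      }
  }

The positional read(ByteBuffer, long) calls are what make the skip-ahead access pattern pleasant here compared to stream-based skipping.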

It is probably worth documenting the relevant hardware and software involved:

Make & Model HP DL380 G5
CPU (1) 1.86GHz Xeon Quad-core
Memory  4GB
Operating System  RHEL4 (64-bit)
File System  SAS RAID-10 w/ (4) 10K RPM HDD
Java Version   Sun JDK 1.5.12 (64-bit)
Relevant JVM Options -Xms768m -Xmx768m -Xincgc

 

Raw Data
java.io results
 File Size (MB)  Speed (MBps)
4.8 31.9
6.9 61.5
8.6 11.8
9.2 15.6
10.8 102.3
11.5 40.1
11.5 36.9
13.7 10.6
15.5 17.4
25.9 31.3
30.0 15.2
50.0 13.9
101.2 38.8
107.3 40.2
113.5 38.8
114.1 32.9
116.6 38.4
134.8 37.1
148.6 41.0
161.3 38.5
222.5 40.9
java.nio results
 File Size (MB)  Speed (MBps)
5.9 90.8
6.8 110.7
6.9 42.2
8.5 111.2
9.4 93.4
10.6 79.6
10.9 70.9
11.6 87.6
12.5 91.7
15.7 63.9
15.8 71.1
18.5 97.6
19.0 132.6
26.4 89.8
68.4 69.5
80.7 77.5
81.0 117.0
86.0 44.6
99.1 65.0
119.7 72.6
282.6 96.7

 

x-axis = file size in MB; y-axis = indexing rate in MBps

Interpretations:

  1. I had graphed file size vs. index rate with the expectation of seeing slower rates for smaller file sizes due to the overhead of setting up the I/O.  That cost is probably only apparent with files much smaller than the ones we are working with here.
  2. As a consequence of the above, both trend lines are relatively flat, with java.io performing ~37MBps and java.nio at ~80MBps.  My refactoring has doubled the speed!
  3. It would be nice to have a larger data set, especially for file sizes larger than 50MB.  That would make me comfortable about the general accuracy of y-axis rate values
  4. The minimum/maximum range for the indexing rate is larger than I expected, even for a production system under load.   I suspect it was a mistake to ignore the video characteristics of the media files themselves, namely the bit rate.  (Higher bit rates mean larger gaps between key frames.)  A follow-up post comparing the above values with the number of index points on a z-axis would be interesting.

Jumping on the ‘Klout is Stupid’ train

I first learned of Klout a few months ago. “Might be interesting”, I thought…and signed up. Fast forward to today, and there is a lot of noise about Klout, their recent algorithm changes, and all the snake-oil peddlers with their knickers in a twist.

In mid-August I noticed someone in my “Influenced by” list that I did not know. That’s strange. So I posed a question on the Klout community forums:

How can I be influenced by someone I am not following and have never heard of? On my “Influenced by” list is one person I don’t know and don’t follow. How is this possible?

Megan, a Klout employee, answers with Robert McNamara’s ol’ Fog of War approach: never answer the question that is asked of you; answer the question that you wish had been asked of you.

You can remove anyone from your influenced by list by hitting the “x” next to the person you want to remove in your influencers tab.

Question closed. Today that person is gone from my list. Whether it was an explicit action by Megan or an algorithm change, I don’t know. But why were they ever there? Maybe it’s the leader of the Illuminati who is influencing me and I didn’t even know it. (Sarcasm.)

More entertaining are the topics people are allegedly influential on.

  • My wife is reportedly influential on only one topic: Germany. Really? She’s been there once, for a weekend getaway to see old friends of mine in Kaiserslautern. Maybe Klout is grouping Switzerland in with Germany because it is sort of the same to many I-don’t-get-out-much thinking folks. (We live in Zurich.)
  • Lincoln Stoll, a code wrangling buddy of mine, is influential about teeth. Wha??? He has no idea why.

  • Clive Thompson (no relation), a technology author I follow on Twitter, is, to his surprise, influential about anthropology and the Taliban.

“Klout measures influence online” says the website. Ummm…maybe not so much. I’m officially jumping on the “Klout is stupid” train.