
Lies, Damned Lies and Cat Statistics

spopepro writes "While un-captioned cats might be of limited interest to the /. community, I found this column on how a fabricated statistic takes on a life of its own interesting. Starting with the Humane Society of the United States' (HSUS) claim that the unsterilized offspring of a cat will '...result in 420,000 cats in 5 years,' the author looks at other erroneous numbers, where they came from and why they won't go away."
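To see how a headline figure like "420,000 cats in 5 years" can materialize, here is a purely illustrative compounding model. Every parameter (litters per year, kittens per litter, the assumption that half of each litter is female and that all offspring survive and breed) is a hypothetical assumption for the sketch, not a number from the HSUS or the column:

```python
def naive_cat_projection(years=5, litters_per_year=2, kittens_per_litter=4):
    """Naive unchecked-growth model: one unsterilized female and her
    descendants, with no mortality and no limits. All parameters are
    illustrative assumptions, not published figures."""
    females = 1  # start with one unsterilized female
    total = 1
    for _ in range(years):
        newborns = females * litters_per_year * kittens_per_litter
        total += newborns
        females += newborns // 2  # assume half of each litter is female
    return total
```

With these toy parameters the five-year total comes out around 6,000, nowhere near 420,000 — which is exactly the point: the headline number is driven entirely by unstated assumptions about litter size, breeding rate, and survival, and small changes to those inputs swing the result by orders of magnitude.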

Comment Re:Another vote for NoSQL and some experience (Score 1)

I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access [...] I think the general idea of a key-value store that lets you keep your data in the original structure would work well.

A file system *is* a key-value store.

I suspect those 100,000,000 files were in fact tiny pieces of data which didn't make sense to access using normal tools (from ls to MS Word). That the conversion worked out for *you* doesn't mean that it would be useful to convert *every* set of files into a BerkeleyDB. Especially not sets of (say) 500 files, 10GB each.

I completely agree. If you have a lot of small datasets that break ls and such (as was the case in my situation), BerkeleyDB provides a great solution. If you have a smaller set of very large files, a different approach is needed (perhaps just the file system with some kind of automated indexing).

Comment Another vote for NoSQL and some experience (Score 2, Informative)

I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in the original structure would work well.
