i7-2600 (4 cores / 8 threads)
16 GB of RAM
1 TB HD
plus a beautiful new 27″ monitor.
Building out some VMs and other stuff now… having fun.
Used SketchFlow to build some mock-ups for a meeting today. The tool rocks; I wish I could have spent more time using it so I could have built something more than a throwaway. Really wish it were part of MSDN. Here is an example of one of the screens I did last night:
A couple of links:
Neural Networks and machine learning:
Latest useful links…
Was considering using hierarchyid for my table, but was getting confused by its usage. My plan had always been to have all leaf nodes of the tree in one (or many) tables. The one table now is docs, but that could easily get expanded to links and other items that could be leaf nodes. So every doc would store its direct parent and its root node (a little more storage here to hold the root node, but it will improve certain queries).
The root and intermediary nodes would be stored in the ‘folders’ table. Folders with a NULL parent would be the root nodes, and are of a specific type (root node).
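To make the shape concrete, here is a minimal sketch of that parent/child schema, done in SQLite via Python just for illustration (the table and column names are my own working assumptions, not anything final):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Root and intermediary nodes; a NULL parent_id marks a root folder.
    CREATE TABLE folders (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES folders(id),  -- NULL => root node
        name      TEXT NOT NULL
    );
    -- Leaf nodes; each doc stores its direct parent AND its root folder,
    -- denormalizing the root to speed up 'everything in this case' queries.
    CREATE TABLE docs (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES folders(id),
        root_id   INTEGER NOT NULL REFERENCES folders(id),
        title     TEXT NOT NULL
    );
""")
con.executemany("INSERT INTO folders (id, parent_id, name) VALUES (?, ?, ?)",
                [(1, None, "Case A"), (2, 1, "Pleadings"), (3, 2, "Drafts")])
con.executemany("INSERT INTO docs (id, parent_id, root_id, title) VALUES (?, ?, ?, ?)",
                [(10, 2, 1, "Complaint"), (11, 3, 1, "Draft v1")])

# 'All documents in the case' becomes a single indexed lookup on root_id,
# with no tree walk at all -- that's the payoff for the extra column.
rows = con.execute("SELECT title FROM docs WHERE root_id = 1 ORDER BY id").fetchall()
print([r[0] for r in rows])  # ['Complaint', 'Draft v1']
```

The same DDL translates straightforwardly to Sql Server; the denormalized root_id is the design choice being tested here.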
Microsoft throws a bit of guidance out there in the MSDN article:
Comparing Parent/Child and hierarchyid for Common Operations
- Subtree queries are significantly faster with hierarchyid.
- Direct descendant queries are slightly slower with hierarchyid.
- Moving non-leaf nodes is slower with hierarchyid. Inserting non-leaf nodes and inserting or moving leaf nodes has the same complexity with hierarchyid.
Parent/Child might be superior when the following conditions exist:
- The size of the key is very critical. For the same number of nodes, a hierarchyid value is equal to or larger than an integer-family (smallint, int, bigint) value. This is only a reason to use Parent/Child in rare cases, because hierarchyid has significantly better locality of I/O and CPU complexity than the common table expressions required when you are using a Parent/Child structure.
- Queries rarely query across sections of the hierarchy. In other words, queries usually address only a single point in the hierarchy. In these cases co-location is not important. For example, Parent/Child is superior if the organization table is only used for running payroll for individual employees.
- Non-leaf subtrees move frequently and performance is very important. In a parent/child representation, changing the location of a row in a hierarchy affects a single row. Changing the location of a row in a hierarchyid usage affects n rows, where n is the number of nodes in the sub-tree being moved.
If non-leaf subtrees move frequently and performance is very important, but most of the moves are at a well-defined level of the hierarchy, consider splitting the higher and lower levels into two hierarchies. This makes all moves into leaf-levels of the higher hierarchy. For instance, consider a hierarchy of Web sites hosted by a service. Sites contain many pages arranged in a hierarchical manner. Hosted sites might be moved to other locations in the site hierarchy, but the subordinate pages are rarely re-arranged.
99% of the time I am just asking ‘what are the descendants of this object’, or ‘what are all of the documents in the case’, or ‘what are all the documents from this parent’. It may make sense to use hierarchyid with the folders, to allow for a better algorithm to answer a question like ‘give me all descendants of this folder’ (which could be 10 deep). But I’ll typically be asking questions like: give me all root nodes, give me all child nodes of a parent node (root or otherwise), give me all leaf nodes for a given root node. I think using a simple parent/child hierarchy will better serve my purposes… with the caveat that it looks easy to change schemas to support hierarchyid after having gone parent/child.
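Even the ‘all descendants, 10 deep’ case is answerable over a plain parent/child table with a recursive CTE. A sketch in SQLite via Python (sample folder data is my own invention):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE folders (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
con.executemany("INSERT INTO folders VALUES (?, ?, ?)",
                [(1, None, "root"), (2, 1, "a"), (3, 1, "b"),
                 (4, 2, "a1"), (5, 4, "a1x")])

# Walk the subtree under folder 2 ('a'), however deep it goes.
rows = con.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM folders WHERE id = ?
        UNION ALL
        SELECT f.id, f.name
        FROM folders f
        JOIN subtree s ON f.parent_id = s.id
    )
    SELECT name FROM subtree ORDER BY id
""", (2,)).fetchall()
print([r[0] for r in rows])  # ['a', 'a1', 'a1x']
```

Sql Server supports the same recursive CTE shape; this is exactly the query hierarchyid would make cheaper, which is the trade-off weighed above.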
A Google-like SQL Server grammar for FTS:
Performance tips on querying Xml in Sql Server:
I really wanted to make a case for version 1 to use a NoSql backend. Alas, that is going to have to wait. I’ve spent more time reading the case for NoSql, and I’m even more sold on it now than before; however, one of the large pieces of this project is the storing of thousands to millions of binary files. Looking at RavenDb, CouchDb and MongoDb, they have the ability to do this, but I’m not sure they will handle the kind of scaling I am looking for when storing binary files. After more research I started leaning more towards MongoDb as a solution, but there are a couple of things still keeping me from pulling the trigger.
I will still make things generic enough that back ends can be swapped out (and compared), but will start out with a Sql Server solution using the filestream built into 2008 (and Denali) for binary storage.
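The swappable back end might look something like this: a small storage interface the rest of the service codes against, with an in-memory stand-in here (the interface and all names are my own assumptions; real implementations would wrap Sql Server filestream, GridFS, etc.):

```python
from abc import ABC, abstractmethod

class BinaryStore(ABC):
    """Minimal contract every candidate back end must satisfy."""
    @abstractmethod
    def put(self, doc_id: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, doc_id: str) -> bytes: ...

class InMemoryStore(BinaryStore):
    """Stand-in back end; a FilestreamStore or GridFsStore would slot in the same way."""
    def __init__(self):
        self._blobs = {}
    def put(self, doc_id, data):
        self._blobs[doc_id] = data
    def get(self, doc_id):
        return self._blobs[doc_id]

store: BinaryStore = InMemoryStore()   # swapping back ends means changing this one line
store.put("doc-1", b"%PDF-1.4 ...")
print(store.get("doc-1")[:4])  # b'%PDF'
```

Keeping the interface this narrow is what makes the compare-the-backends exercise cheap later.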
As an aside… I am pretty sure building a binary store into large concatenated files would also work well for this project, since the majority of the binary files are read-only.
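A sketch of that concatenated-file idea: append each blob to one large file and keep an (offset, length) index, which works precisely because the blobs are write-once/read-only (all names here are mine):

```python
import os, tempfile

class ConcatStore:
    """Append-only blob store: one big data file plus an offset index."""
    def __init__(self, path):
        self.path = path
        self.index = {}          # doc_id -> (offset, length)

    def put(self, doc_id, data: bytes):
        # New blobs only ever go on the end, so writers never rewrite old data.
        offset = os.path.getsize(self.path) if os.path.exists(self.path) else 0
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[doc_id] = (offset, len(data))

    def get(self, doc_id) -> bytes:
        offset, length = self.index[doc_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

path = os.path.join(tempfile.mkdtemp(), "blobs.dat")
cs = ConcatStore(path)
cs.put("a", b"hello")
cs.put("b", b"world!")
print(cs.get("b"))  # b'world!'
```

A real version would persist the index and checksum the blobs, but the read-only assumption is what keeps the design this simple.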
I sort of see an evolution in the backend choices: Sql Server (2008 or Denali?), Sql Server FTS, Lucene (or Sphinx), MySql, then the MongoDb, RavenDb or CouchDb NoSql solutions. I could easily stop at step 2, but for the sake of learning I’d like to step through all of these options.
Next step in this process is now database and service design… off to write some code.
Not surprisingly I’m having trouble making a decision on several key design issues. Do I go Sql or NoSql (RavenDb)? If I go Sql, do I use MsSql (2008 or Denali) or go free with MySql? Do I use Lucene or Sphinx? Do I use a pre-built service or roll my own, and if I roll my own, is it Rest over http, tcp, or something else? Do I do this project as open source or do I keep it private? These questions have been rolling through my head for a while now, which is good since I’m considering all options, but bad since I’m not making any progress on code.
Sql vs. No Sql:
The problem set fits a NoSql solution: documents whose type and metadata set are unknown until they are received.
I have no experience with NoSql solutions, so I really need to play with RavenDb a bit more to decide whether that is the direction I plan to go. The use of Lucene directly in that product was appealing, as I thought I would get all of my indexing done within RavenDb, but the more I look at it, the more it appears that I will need to do my full-text document indexing externally with my own service, as I have done before (Solr is not my preferred way of going, but Sphinx may be a possibility).
So I started looking at Ms Sql Server again, with the possibility of a hybrid solution: some traditional RDBMS structures plus some NoSql-type structures. Several things are appealing about Sql Server:
1) Well established product
2) .Net integration is top notch
3) Sql Filestream
4) My own familiarity with the product.
So I could do NoSql-like storage as in the following series of articles:
But with that type of solution, I feel like I’m losing some of the great scalability of the NoSql sharding models.
The key here is really the index creation: what we create and why. RavenDb handles this very well (at least from what I’ve read in the documentation… unfortunately I’m not writing from real experience). Raven also handles updating the indexes when a change is made. With Sql Server I could use indexes over xml documents, or full-text indexes over a content column (or content stored on the file system). The Sql full-text option is awfully appealing, plus I’ve read that they have made great improvements to the engine in Denali.
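Whatever engine wins, the core job is the same: build and maintain a term-to-document index. A deliberately naive toy version shows the shape of what Raven, Lucene, or Sql full-text would maintain for me (no stemming, ranking, or incremental updates):

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Invented sample documents, keyed by id.
docs = {
    1: "motion to dismiss",
    2: "motion for summary judgment",
    3: "deposition transcript",
}
index = build_index(docs)
print(sorted(index["motion"]))  # [1, 2]
```

The hard parts the real engines add are exactly the ones elided here: tokenization rules, relevance scoring, and keeping the index current as documents change.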
The big negative with Sql Server is cost and scalability (I need to learn more about Denali’s scalability to have a better assessment of this).
The other option that has been running through my head is NoSql with Lucene or Sphinx for indexing, with a Rest http service in the middle handling things like security. Of course the downside of this is that I may lose some of the great MsSql + .Net integration. If I write this correctly, it could potentially be totally database/indexing-engine agnostic. Figuring out how to make this scale in all situations could be increasingly difficult. Still considering, but I may have enough now to at least start playing with services and backend databases.
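That Rest-in-the-middle idea reduces to a thin dispatcher that checks security before touching any back end. A toy sketch of just that layer (the routes, token check, and all names are my invention, not a real design):

```python
def handle(method, path, token):
    """Toy Rest facade: authenticate first, then dispatch to a back-end-agnostic handler."""
    if token != "secret-token":               # stand-in for real auth (AD, OAuth, ...)
        return 401, "unauthorized"
    if method == "GET" and path.startswith("/docs/"):
        doc_id = path[len("/docs/"):]
        return 200, "contents of " + doc_id   # would call the pluggable doc store here
    return 404, "not found"

print(handle("GET", "/docs/42", "secret-token"))  # (200, 'contents of 42')
print(handle("GET", "/docs/42", None))            # (401, 'unauthorized')
```

Because security and routing live entirely in this layer, the database and indexing engine behind it stay swappable, which is the whole appeal of the approach.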