Common Sense for Business Intelligence

Tuesday, July 31, 2012

So after several days of messing around with Cloudera's training VM, I finally got my hands on a real six-node cluster.
Our team has been discussing how we want to handle programming in this environment, because while we all have programming backgrounds, none of us has ever worked with Java. So, in the interest of getting this project up and running as quickly as possible, we decided to pursue the severely under-documented Python streaming method.
So to get started, I reviewed a great blog post by Michael Noll, who goes over the basics: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Now in our environment, Hadoop will be consuming vast amounts of serialized or delimited data, so we're fortunate enough to have some "structure" on the input. This also means we can easily deserialize in the map phase. For my first test, I wanted to take a simple pipe (|) delimited file and get a count of one of the column values, in this case the sixth column.
So I first constructed my super simple mapper.py and reducer.py files:
Mapper.py
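Something along these lines -- a bare-bones Python 2 mapper (assuming Python 2 on the cluster nodes) that splits each row on the pipe and emits the sixth column as the key with a count of 1. Treat it as a sketch rather than the exact file:

#!/usr/bin/env python
# mapper.py - read pipe-delimited rows from stdin, emit "key<TAB>1"
# where the key is the sixth column.
import sys

for line in sys.stdin:
    fields = line.strip().split('|')
    if len(fields) < 6:
        continue  # skip short/malformed rows
    print '%s\t%s' % (fields[5], 1)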
Reducer.py
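And the reducer is roughly this. It leans on Hadoop Streaming (or a local sort) delivering the mapper output sorted by key, so identical keys arrive back to back -- again, a sketch, not the exact file:

#!/usr/bin/env python
# reducer.py - sum the counts for each key coming off sorted mapper output.
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    key, _, count = line.strip().partition('\t')
    try:
        count = int(count)
    except ValueError:
        continue  # ignore anything that isn't "key<TAB>int"
    if key == current_key:
        current_count += count
    else:
        if current_key is not None:
            print '%s\t%d' % (current_key, current_count)
        current_key = key
        current_count = count

# don't forget to emit the final key
if current_key is not None:
    print '%s\t%d' % (current_key, current_count)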
So I loaded my test data up to HDFS (~50 gigs worth oughta do it) and copied my Python scripts to my local directory. I then ran them against a quick test file (same file layout obviously, but far fewer records) to ensure my Python worked.
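Locally that test is just the classic Unix pipeline from Noll's tutorial, with sort standing in for Hadoop's shuffle (the file name here is just a placeholder):

cat test_file.txt | python mapper.py | sort | python reducer.py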
This returned some data, so I was ready to go.
So I ran the following command to kick off a streaming MapReduce job:
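(Roughly, anyway -- the streaming jar location and the HDFS input/output paths below are placeholders for whatever your CDH install and data layout look like:)

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/hadoop/testdata \
    -output /user/hadoop/testdata_out \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py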
And..... FAILURE. Sigh.
Yup... the error output was not very helpful. However, after some digging around, I discovered what the issue was:
dos2unix
/facepalm
Gotta love mixed environments! Anyway, after running a simple "dos2unix mapper.py" and "dos2unix reducer.py", I reran the execution code and:
Success! While the performance wasn't incredibly awesome (~18 minutes to process 458,848,279 rows, compared to 5 minutes in SQL Server), it's a start. I'm going to start messing around with more advanced Python methods and see what we can do.
Friday, July 27, 2012
Data Architect's journey with Hadoop
Posting here has been pretty lame, mostly b/c I've been busy as hell with work. We've had two large releases in the span of six months that, while awesome, have left me with ZERO time to do anything here.
Anyway, it's finally time to have some fun. We're getting ready to roll out our own API platform so each team can call our service to record events they want us to track. Considering the amount of data this service will be consuming (a metric shit-ton), we're looking at Hadoop as the back-end.
When offered the opportunity to POC Hadoop, I jumped at the chance. I love new stuff, and this is totally different from anything I've used before. And it relies on two things I'm exceptionally WEAK at: Linux and Java. What better way to strengthen my weaknesses than to jump head first into something that will test the hell out of both? (It's how I got to where I am today.)
Anyway, enough of that. My Hadoop cluster is currently being set up by our IT department. In the meantime I'm prepping as much as I can, and I figured I would catalog my journey here, because the way the data industry is going, I believe many traditional data warehouse developers will be looking at Hadoop as a solution.
So to get started, I hit the books and the net.
- Brent Ozar gives a really good overview of what Hadoop is, and how, as data warehouse developers, we may interact with it: http://www.brentozar.com/archive/2011/11/hadoop-basics-for-sql-server-dbas/
- Hadoop: The Definitive Guide (Tom White). It's the first edition, so it's a few years old, but it's of course still relevant.
- Prior to getting our POC environment, I want to set up a local dev Hadoop environment. As we're going to POC a Cloudera version, I'm starting here: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+CDH3u3
- I'm also setting up Eclipse as my Java IDE. I'm going to rig something up on Windows using the following guide. When I get this up and running, I'll make sure to post about it: http://www.cloudera.com/blog/2009/04/configuring-eclipse-for-hadoop-development-a-screencast/
So that's it for now. I will update as this project moves forward over the coming weeks. Hopefully I won't suck at updating this blog, and more importantly, hopefully I won't fail more than usual getting this project off the ground.
GLHF
Thursday, February 23, 2012
C# multi-threading and crappy performance: A crash course in server settings
Damn, I've been slacking... oh well.
So I had to create an app that makes several million web request calls. Since I wanted this app to actually finish running in my lifetime, I decided to use a multi-threaded approach. In C#, I used the following code block to queue up the work.
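(The block below is a rough sketch of the idea rather than the exact code -- queue each request onto the thread pool and wait for the whole batch to drain. The URL list and the request itself are stand-ins.)

using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;

class RequestRunner
{
    static void Main()
    {
        // Stand-in for the real list of several million URLs.
        List<string> urls = LoadUrls();

        using (var done = new CountdownEvent(urls.Count))
        {
            foreach (string url in urls)
            {
                // Hand each request off to a thread-pool worker.
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        using (var client = new WebClient())
                        {
                            client.DownloadString((string)state);
                        }
                    }
                    catch (WebException)
                    {
                        // Swallow and move on; one bad call shouldn't kill the run.
                    }
                    finally
                    {
                        done.Signal();
                    }
                }, url);
            }

            // Block until every queued request has completed.
            done.Wait();
        }
    }

    static List<string> LoadUrls()
    {
        // Placeholder -- the real app pulled these from elsewhere.
        return new List<string> { "http://example.com/" };
    }
}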
Now when I ran this on my development machine, it hummed along quite nicely. However, when I deployed it out to a regional VM (overseas, for less latency on the calls), performance absolutely tanked. Heck, I was getting better performance half a world away on my PC!
So I started to dig. CPU... flatlined. Memory... not capped. So WTF? I dug into perfmon. Networking and everything else was all good. Then I finally stumbled on it... I was getting MASSIVE page faults on memory! Remember, this was a VM that was stood up for me by some dude who had no idea what I was using it for.
So I hopped into the advanced server settings and, lo and behold, the page file was being dynamically sized by Windows. I switched it to a base size of several gigs with a max of 16 GB (double the VM's physical memory).
After all of that, performance went from processing 1,000 records in 30 minutes to processing them in 30 seconds... I'd call that an improvement!
Wednesday, August 31, 2011
Migrating data in SQL Server instantly with SWITCH
*Note: This only pertains to the ENTERPRISE edition of SQL Server 2005+.
I was working the other day with a very large table (10 billion+ records). I was tasked with migrating this table to a new partitioned environment. When all was said and done, the new table had 409 partitions (that's a story for another day).
In any case, when all of the data was migrated over (it took 5 days, as some transformation had to be done), I realized I was a total noob and had forgotten to set the identity column on the new table (d'oh!).
So what to do?
Creating a new column via ALTER TABLE would have taken locks that locked out, for at least a day, the various schema-bound functions other developers were running... not the best option. I like my co-workers.
So I came across SWITCH.
In the most basic sense, SWITCH can be used to swap data between two tables:
CREATE TABLE TestTable (
    id int,
    column1 varchar(100),
    column2 varchar(100)
);

CREATE TABLE TestTable2 (
    id int identity(1,1),
    column1 varchar(100),
    column2 varchar(100)
);

ALTER TABLE TestTable SWITCH TO TestTable2;
But remember that my table has 409 partitions? Yeah, that gets you this awesome error:
Msg 4911, Level 16, State 1, Line 2 Cannot specify a partitioned table without partition number in ALTER TABLE SWITCH statement. The table 'TestTable' is partitioned.
To work around this, we need to loop through the partitions. In the end, with the code below, I migrated 10 billion+ rows across 409 partitions in under 30 seconds. Good times.
Also excuse the poor code formatting... I'm new to this blog thing.
DECLARE @i INT = 1
DECLARE @sql VARCHAR(MAX)

WHILE @i < 410
BEGIN
    -- Switch partition @i of the source table into the same partition of the target table
    SET @sql = 'ALTER TABLE PartitionedTable1 SWITCH PARTITION ' + CAST(@i AS varchar(5))
             + ' TO PartitionedTable2 PARTITION ' + CAST(@i AS varchar(5))
    EXEC(@sql)
    SELECT @i = @i + 1
END
Cheers!