[Novalug] hadoop

Chris Rogers toxicpig@gmail.com
Sat Jul 14 01:49:14 EDT 2012


Greg, you and I are in about the same boat right now.  I'm curious as to 
what your motivations are.  I'm looking to set up a good distributed 
filesystem.  Partially for data resiliency and ease of access, but 
mostly for things relevant to what I do for a living.  I'm still into 
virtualization, and centralized shared storage has become the bane of my 
existence.  Looking for a better way of doing storage using a 
decentralized, distributed system.  I started with XtreemFS, and was 
very displeased.  The Java code is an utter pig.  Pegged four of my 
cores per machine.  Just started looking into Hadoop this week, so 
imagine my pleasant surprise when a thread comes up on it.

I'm having a hard time getting my head wrapped around it as well.  The 
information so far in this thread has been great, so thanks for that.  
I'm much clearer now on MapReduce.  I understood that it was the 
brains of the search algorithm and sort of stopped there, since I'm 
less interested in searching the contents of the filesystem than in the 
filesystem itself, so I went further down the GFS route.  I'm still 
digesting the information out there on GFS, and I'm not completely sure 
it will do what I want.  Heh, I'm not completely sure there is ANYTHING 
that will do what I want.  Decoupling the data from a given physical 
disk, but making it quickly accessible from any reasonably connected 
location is what I'm looking for.  Sounds like that may be the same 
motivation for you, Greg.

Is MapReduce needed to make the access of the data more efficient?  Can 
you just get away with the filesystem only (GFS, or HDFS in Hadoop's 
case), and call that good?

Would love to see a talk on Hadoop and/or other distributed 
filesystems.  Amazon plus Dropbox plus local disk is where I'm headed.  
Backups on steroids kind of thing.  I want to get past the idea of 
relying on a single cloud provider while keeping the data access 
seamless.  Yes, it's a tall order.  Cloud vendors are trying to do this 
now, but they only talk to themselves, not each other.  There is no 
standard right now for cross-provider data sharing.  Rsync via SSH is my 
solution now, but it's neither real-time nor very efficient.

Chris


On 07/13/2012 07:28 PM, Jeremy Trimble wrote:
> Though I'm not an expert on MapReduce, I've read several academic 
> papers on it and figured I'd throw a few clarifications into the 
> discussion:
>
> The "map" in "MapReduce" doesn't refer to a "map" data structure, but 
> rather to a function (or more correctly, a certain category of 
> functions) from functional programming languages like ML, LISP, 
> Haskell, etc.  Likewise, the "reduce" also refers to a function 
> originating in functional programming languages.  MapReduce is not a 
> storage mechanism -- it is a "framework" for implementing algorithms 
> in a way that makes it easy to scale to hundreds or thousands of 
> computers executing in parallel.
>
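A minimal sketch of the two functional-programming building blocks described above, using Python's built-in `map` and `functools.reduce` as stand-ins for their ML/LISP/Haskell counterparts. The sample strings are made up for illustration.

```python
from functools import reduce

# "map" applies one function to every element independently,
# so each element could in principle be handled by a different worker.
lengths = list(map(len, ["hadoop", "gfs", "bigtable"]))

# "reduce" folds a sequence into a single value using a two-argument function.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)  # [6, 3, 8]
print(total)    # 17
```

Neither call stores anything; both only describe how to transform data, which is why MapReduce is a processing framework rather than a storage mechanism.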
> Implementing an application (such as web search) in MapReduce amounts 
> to implementing two functions -- a function that fits the "map" 
> function prototype, and a function that fits the "reduce" function 
> prototype.  The input data for your computation is first processed by 
> your "map" function, producing some intermediate results which are 
> then passed to the "reduce" function.  Once you've written your 
> application-specific "map" and "reduce" functions, you turn them over 
> to the MapReduce framework and it does the hard work of delegating 
> computation among nodes, coordinating results, and handling any 
> failures that occur along the way.
>
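The division of labor described above can be sketched with a word count. This is a toy, single-process illustration, not Hadoop's or Google's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the driver only imitates what a real framework does across many machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Application-specific map: emit one (word, 1) pair per word."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Application-specific reduce: combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver standing in for the framework (hypothetical, single process)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # group intermediate results by key
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts["the"])  # 3
```

Only `map_fn` and `reduce_fn` are specific to the application; everything in `run_mapreduce` is the part a real framework would take over, including scheduling and failure handling that this sketch leaves out.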
> The advantage of this framework is that if you can dream up a way to 
> implement your algorithm as a "map" and "reduce" pair, the framework 
> can take your "map" and "reduce" functions and execute them in 
> parallel across hundreds or thousands of computers.  In practice, this 
> requires a way to distribute the input data and intermediate data to 
> the different machines, which, in Google's implementation of 
> MapReduce, is provided by a distributed filesystem called "GFS," which 
> may be more along the lines of what some other folks have been discussing.  
> (Google also has a distributed system for storing structured data 
> called "BigTable" -- while not exactly a relational database, it 
> serves some of the same needs.)
>
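A rough sketch of the parallel part: the input is split into chunks, each chunk is mapped independently, and the intermediate results are merged. Threads in one process stand in for separate machines here, which is an assumption made purely for illustration; `split_input`, `map_task`, and `reduce_task` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_input(lines, num_splits):
    """Partition the input into chunks, one per worker."""
    return [lines[i::num_splits] for i in range(num_splits)]

def map_task(chunk):
    """One worker's map phase: count words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def reduce_task(partials):
    """Merge the intermediate per-worker counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["a b a", "b c", "a c c", "d"]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_task, split_input(lines, 2)))
result = reduce_task(partials)
print(result["a"], result["c"])  # 3 3
```

In a real cluster the chunks and the intermediate counts would live on a distributed filesystem rather than in local memory, which is the role GFS plays in the description above.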
> When executing an algorithm on so many independent nodes, failures are 
> common, so both Google's implementation of MapReduce and Hadoop have 
> some interesting fault-tolerance mechanisms built in as well.
>
> -Jeremy Trimble
>
> On Fri, Jul 13, 2012 at 6:42 PM, cliff@palmercs.com wrote:
>
>     We use MapReduce daily and I will be glad to give you a brief
>     overview tomorrow after the meeting.
>
>     If it's a topic of interest we can talk about doing a presentation
>     at a future meeting.
>
>
>     See you tomorrow
>
>     Cliff Palmer
>
>
>     On July 13, 2012 at 3:25 PM greg pryzby <greg@pryzby.org> wrote:
>
>     > On Fri, Jul 13, 2012 at 2:52 PM, Peter Larsen
>     > <plarsen@famlarsen.homelinux.com> wrote:
>     > > On Fri, 2012-07-13 at 11:44 -0400, greg pryzby wrote:
>     > >> MapReduce is a Java class (maybe classes, I haven't gotten that
>     > >> far) that allows distributed search. So it can run the Java app
>     > >> on any (or multiple) nodes and search in parallel. In the end
>     > >> there is a dataset that contains the information that matches
>     > >> the search criteria. This can be put into an RDBMS or other
>     > >> store for future reference if the results need to last, or
>     > >> further parsed with other advanced tools.
>     > >
>     > > MapReduce is the methodology used by Google to make the internet
>     > > searchable. Granted, it's not the Hadoop way of MapReduce, but
>     > > it's the same principle.
>     > >
>     > > http://en.wikipedia.org/wiki/MapReduce
>     > >
>     > > There's no need for an RDBMS in the traditional way here. The
>     > > point is that maps are a natural component of most languages,
>     > > and the structures returned by MapReduce are simple "map"
>     > > collections. It's native to the code, and hence extremely well
>     > > adapted for processing in the native language.
>     >
>     > my english needs work.
>     >
>     > What I was trying to say was the results COULD be stored in an RDBMS
>     > (saw that in a pic which led me to believe it is common).
>     >
>     >
>     > > Furthermore, it's not like a clustered RDBMS either. Even with
>     > > clustering, you would always have a whole record at the very least
>     >
>     > don't think I said that. It wasn't what I thought at all. I do
>     > find it interesting that it is 64 MB blocks (vs. 4 or 8 KB with
>     > most filesystems). I understand the NameNode and replication to
>     > the DataNodes.
>     >
>     > the only RDBMS was to store the results of mapreduce IF desired.
>     >
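To make the 64 MB block size and the NameNode/DataNode replication mentioned above concrete, here is a back-of-the-envelope sketch. The round-robin placement and the `plan_blocks` helper are hypothetical stand-ins, since the real NameNode uses its own rack-aware placement policy; the replica count of 3 is the usual HDFS default.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size mentioned above
REPLICATION = 3                # usual HDFS default replica count

def plan_blocks(file_size, datanodes):
    """Return {block_index: [datanode, ...]} using simple round-robin placement.

    Hypothetical policy for illustration only; it is not how HDFS chooses nodes.
    """
    num_blocks = math.ceil(file_size / BLOCK_SIZE)
    plan = {}
    for block in range(num_blocks):
        plan[block] = [datanodes[(block + r) % len(datanodes)]
                       for r in range(REPLICATION)]
    return plan

nodes = ["dn1", "dn2", "dn3", "dn4"]          # made-up DataNode names
plan = plan_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file
print(len(plan))  # 4
print(plan[0])    # ['dn1', 'dn2', 'dn3']
```

A 200 MB file needs four blocks (three full and one partial), and the mapping from block to DataNodes is the kind of metadata the NameNode keeps.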
>     >
>     > > located at a given node. MapReduce operates on attribute levels
>     > > and can spread out a record over multiple nodes and hence read
>     > > it concurrently on all nodes, to join (reduce) them together in
>     > > the result as a single record. With RDBMSes, you would only get
>     > > one node to return the whole record. MapReduce allows you to
>     > > locate data optimally depending on how you access it. It's a
>     > > very different approach from traditional relational distributed
>     > > systems, where the DBA estimated locations and once data was
>     > > "tagged" it never moved from its logical position. MapReduce
>     > > does this dynamically, and hence for _some_ datasets it's
>     > > extremely efficient. As Google has proven, it works great for
>     > > its purposes.
>     >
>     >
>     > yep...
>     >
>     >
>     > > I must admit I giggled when I saw your oversimplified grep
>     > > example. It's not even close - doesn't even address the "map"
>     > > side of the equation and certainly, only a very small piece of
>     > > the reduce (query) piece - in particular I wonder how you solve
>     > > the "key" vs. "data" with grep.
>     >
>     >
>     > it doesn't map, but it does reduce, i think. If all weblogs were in
>     > directories this would definitely reduce to the common thread. it
>     > depends on what you are looking for. correct?
>     >
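For the grep question, one common way to phrase a distributed grep as a MapReduce job is for the map function to emit each matching line and for the reduce function to be the identity. The sketch below assumes that framing; the log contents, the pattern, and the names `grep_map` and `grep_reduce` are made up for illustration.

```python
import re
from collections import defaultdict

def grep_map(filename, contents, pattern):
    """Map: emit (filename, line) for every line that matches the pattern."""
    for line in contents.splitlines():
        if re.search(pattern, line):
            yield filename, line

def grep_reduce(filename, lines):
    """Reduce: identity; the matching lines are passed through unchanged."""
    return filename, list(lines)

# Made-up web logs standing in for files spread across nodes.
logs = {
    "web1.log": "GET /index.html 200\nGET /missing 404\n",
    "web2.log": "GET /about 200\nPOST /login 404\nGET /x 500\n",
}

grouped = defaultdict(list)
for name, text in logs.items():
    for key, line in grep_map(name, text, r" 404$"):
        grouped[key].append(line)

matches = dict(grep_reduce(k, v) for k, v in grouped.items())
print(matches)
```

Here the filename serves as the key and the matching line as the data, and the filtering happens in the map step rather than in the reduce step.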
>     >
>     > > That said - your initial thought "this is not new" is quite
>     > > right. Clustered data sources have been around for a long time,
>     > > and the idea of distributing by attribute values isn't new
>     > > either. But the implementation/concept of MapReduce is rather
>     > > new.
>     >
>     >
>     > I think MapReduce and Hadoop are two pieces, correct? We can
>     > discuss that they go together, but hadoop can stand alone.
>     >
>     > hadoop is 'better' nfs (for a number of reasons)
>     > MapReduce is better grid/hpc/MPI for specific data sets
>     >
>     > Would you buy that?
>     >
>     >
>     > > It's not just for Java though - lots of languages have a "map"
>     > > data structure and this works well for them too.
>     >
>     > To date, (and I haven't gotten to MapReduce yet) all mr stuff
>     > referenced says java.
>     >
>     > --
>     > greg pryzby                              greg at pryzby dot org
>     > http://www.linkedin.com/in/gpryzby
>     >
>     > WEB: http://www.MakeRoomForArt.com/
>     > TWTR: gpryzby
>     > _______________________________________________
>     > Novalug mailing list
>     > Novalug@calypso.tux.org
>     > http://calypso.tux.org/mailman/listinfo/novalug