
Getting Started with Hadoop


Reposted.

Original source: http://www.infosci.cornell.edu/hadoop/mac.html

 

 

Cornell University

NOTICE: The Web Lab Hadoop cluster was closed at the end of September 2011

Quick Guide to Developing and Running Hadoop Jobs (Mac OS X 10.6)

This guide helps Cornell students using Mac OS X 10.6 set up a development environment for working with Hadoop and run Hadoop jobs on the Cornell Center for Advanced Computing (CAC) Hadoop cluster. It walks you through compiling and running a simple example Hadoop job. More information is available at the official Hadoop Map-Reduce Tutorial.

The overall process of developing a Hadoop job is as follows:

  1. Install Hadoop on your development machine (personal or lab computer)
  2. Compile the Hadoop job, create a JAR file
  3. Run the Hadoop job JAR file on your development machine, for testing and debugging
  4. Run the Hadoop job JAR file on the CAC Hadoop cluster, for production

1. Installing Hadoop

This section shows you how to download Hadoop and prepare it for use on a Mac machine. Note: Hadoop versions after 0.19.2 require Java version 1.6. The following instructions take this into account.

  1. Obtain the latest stable Hadoop release. The file is named hadoop-version.tar.gz and can be obtained here. Unzip the downloaded file and place the resulting folder on your Desktop (or other location).

  2. To make Hadoop run on a Mac, you will need to edit two files. First, open the file conf/hadoop-env.sh, located within the hadoop folder you just unzipped, in your favorite text editor. Find the following line in the file: 

    # export JAVA_HOME=/usr/lib/j2sdk1.6-sun 

    and change it to: 

    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/ 

    Save the file. Second, open the file bin/hadoop within the hadoop folder in your favorite text editor. Search the file for the following line: 

    JAVA=$JAVA_HOME/bin/java 

    and change it to: 

    JAVA=$JAVA_HOME/Commands/java 

    Save the file and exit the editor. You have now set up Hadoop for development purposes on your computer.
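
Since this setup depends on Java 1.6, it can help to confirm which Java version a given JVM reports. The following is a minimal sketch, not part of Hadoop or of the original guide: the class name JavaVersionCheck and the method majorVersion are hypothetical, and the parsing assumes version strings of the form "1.6.0_65" (older style) or "11.0.2" (newer style).

```java
// Minimal sketch: report whether the running JVM is Java 1.6 or newer.
// JavaVersionCheck and majorVersion are hypothetical names, not part of Hadoop.
final class JavaVersionCheck {

    // Extracts the major version from a "java.version" style string.
    // Old-style strings look like "1.6.0_65" (major = 6); newer ones look like "11.0.2".
    static int majorVersion(String version) {
        String[] parts = version.split("[._-]");
        int first = Integer.parseInt(parts[0]);
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        int major = majorVersion(version);
        System.out.println("java.version = " + version);
        System.out.println(major >= 6
                ? "OK: Java 1.6 or newer"
                : "Too old: this guide assumes Java 1.6");
    }
}
```

Running this class with the same java binary that JAVA_HOME points to shows whether that JVM meets the version requirement mentioned above.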

2. Compiling a Hadoop job into a JAR file

This section guides you through compiling the WordCount example available in the Hadoop Map-Reduce Tutorial. This section assumes you are using the Eclipse IDE. If this is not the case, you should be able to adapt these instructions for your IDE.

  1. Create a new Java Project.
    Launch Eclipse, and from the File Menu select New, then use the Wizard to create a new Java Project. Enter a project name, in this example WordCount. Make sure that the selected JRE is version 1.6.0. Click Finish.

  2. Add the Hadoop library to the project
    In Eclipse, right-click (control-click) on your project, go to Build Path, then Add External Archives. Browse to the hadoop folder on your Desktop, select the file hadoop-version-core.jar, and click Open.

  3. Add source code file
    From the File Menu, select New, then File. Select the parent folder WordCount/src (make sure this is right, or you will encounter trouble when exporting the JAR file below) and name the new file WordCount.java, then click Finish. Copy this code and paste it into the new file and save it. Eclipse will compile the file as soon as you save it.
  4. Export JAR file
    From the File Menu, select Export. From under Java select JAR file, click Next. Select all resources to be exported. In this case, select the entire WordCount project. Make sure the export classes checkbox is checked. Select an export destination for your JAR file - you can use your Desktop, or some other directory. For simplicity, name the file WordCount.jar and export it to your Desktop.
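
To see what the WordCount job computes without involving Hadoop at all, the counting logic can be sketched in plain Java. This is an illustration only, not the tutorial's code: the class LocalWordCount and the method countWords are hypothetical names, the sketch assumes words are separated by whitespace, and it runs in a single thread with no Hadoop API calls.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount logic; it does not use the Hadoop API.
// LocalWordCount and countWords are hypothetical names chosen for illustration.
final class LocalWordCount {

    // "Map" step: split the text on whitespace.
    // "Reduce" step: sum the occurrences of each word.
    // A TreeMap keeps the words in sorted order.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            Integer previous = counts.get(word);
            counts.put(word, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("Hello World Bye World");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Tab-separated word and count, one pair per line.
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the sample text, this prints Bye 1, Hello 1 and World 2, one tab-separated pair per line. A real Hadoop job splits the same two steps across map and reduce tasks.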

3. Running a Hadoop job on your development machine

This section shows you how to run your job on your own machine, for testing purposes. Hadoop will run in "standalone mode", which means that it will run within a single process, not taking advantage of any parallel processing. This will be much slower than running on the cluster, so you may want to reduce the size of the data set used for testing.

  1. Create or obtain test data
    For this example, the input data will be this web page. Copy this entire web page, and using your favorite text editor save it as a plain text file named testing.txt. Place this file within a folder called input on your Desktop.

  2. Run the job
    First, go to the command line. (To access the command line, go to the Finder, then to "Applications", then "Utilities", and finally launch "Terminal".) If you are not familiar with the UNIX command line, here is a basic guide. Change into your hadoop directory, ~/Desktop/hadoop-0.19.2 or similar. Execute the following command:

    ./bin/hadoop jar ~/Desktop/WordCount.jar WordCount ~/Desktop/input ~/Desktop/output

    You may need to alter the paths if any of the files were saved to different places.

  3. Retrieve the results
    The results have been written to a new folder called output on your Desktop. There should be one file, named part-00000, which lists all the words on this web page along with their occurrence counts. Note that before running Hadoop again you will need to delete the entire output folder, since Hadoop will not do this for you.

4. Running a Hadoop job on the CAC cluster

This section shows you how to take the JAR file you created above along with the test data, and run the job on the CAC cluster.

  1. Obtain a CAC account
    If you are taking a course which requires the use of the cluster, the instructor should organize the CAC account for you. If you are using the cluster for research, the Principal Investigator will add you to their CAC project. In either case, you will receive an email to your Cornell email address with your username and password for the CAC.

  2. Use SSH to connect to the job tracker node
    To connect to the cluster and run Hadoop jobs, use SSH in the Macintosh Terminal window, which provides a Bash shell. Run the following from the Bash command line. First, connect to the CAC: 

    ssh netid@wl01.cac.cornell.edu 

    where netid is replaced by your CAC username. Note: the address starts with doubleu-el-zero-one NOT doubleu-zero-one-zero. Enter your CAC password when prompted. The first time that you log in you will be required to change your password to something secure and easy to remember. Once you are logged in you will be placed in your CAC home directory. 

  3. Copy JAR and input files to CAC
    Copy your WordCount.jar file and input folder from your Desktop into your CAC home directory. You can use scp from the Terminal window to copy files, or you can mount the CAC directory on your Macintosh. To mount it, switch to the Macintosh Finder, then select the Connect to Server option from the Go menu. Enter the following path:

    smb://cacfs01.cac.cornell.edu/netid 

    and replace netid with your CAC username. Enter your CAC username and password when prompted. You should then see a new Finder window showing the contents of your CAC home directory. Please note that this directory is only accessible from within the Cornell firewall. If you wish to access it from off-campus, you will first need to VPN into Cornell.

  4. Copy input files into HDFS
    Make a directory in the Hadoop Distributed File System (dfs) for your input files. You can see the list of commands available for working on the dfs by executing the following: 

    hadoop dfs 

    More information about the commands is available here. Note that to execute any hadoop dfs command, you must type hadoop dfs -command, where command is the dfs command to run.

    To copy input data files into dfs from your home directory, do the following: 

    hadoop dfs -copyFromLocal input .

  5. Run your job
    Perform the following: 

    hadoop jar WordCount.jar WordCount input output 

    This will place the result files in a directory called "output" in the dfs. You can then copy these files back to your CAC home directory by executing the following: 

    hadoop dfs -copyToLocal output output

    Now you can retrieve the output files in the same fashion that you copied the input files to your home directory. Note that one output file is produced for each reduce task you run. The WordCount example uses the system-configured limit on the number of reduce tasks, so do not be surprised to see 10-20 output files (the exact number depends on the number of cluster nodes running and their configuration). You can control this limit programmatically via the setNumReduceTasks() method of the JobConf class in the Hadoop API. Refer to the map reduce tutorial for more details on running map reduce jobs. 

    When you are finished with the output files, you should delete the output directory. Hadoop will not automatically do this for you, and a job will fail with an error if its output directory already exists. To delete the directory, execute: 

    hadoop dfs -rmr output
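
The relationship between reduce tasks and output files can be illustrated with a small plain-Java sketch. It assumes hash-based partitioning of keys, which is how keys are commonly spread over reduce tasks. The class ReducePartitionSketch and its methods are hypothetical names, and String.hashCode() stands in for the hash of Hadoop's own key type, so the assignments on a real cluster will differ.

```java
import java.util.Locale;

// Sketch of how keys are spread over reduce tasks, and why each task yields one output file.
// ReducePartitionSketch and its methods are hypothetical names; String.hashCode() stands in
// for the hash of Hadoop's own key type, so real assignments will differ.
final class ReducePartitionSketch {

    // Hash partitioning: a non-negative hash modulo the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Output files are numbered per reduce task: part-00000, part-00001, ...
    static String outputFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed value for illustration
        String[] words = {"Hello", "World", "Bye", "Hadoop"};
        for (String word : words) {
            int partition = partitionFor(word, numReduceTasks);
            System.out.println(word + " -> " + outputFileName(partition));
        }
    }
}
```

With a single reduce task every key lands in part-00000, which matches the single output file seen in standalone mode. With more reduce tasks, each word's count appears in exactly one of the part files.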

Last revised: March 15, 2010 
bjk/wya

 
