ADAM Learning 1: Environment Setup (Including Eclipse Configuration on Windows)

For more code, see: https://github.com/xubo245/SparkLearning


ADAM environment setup (including Eclipse configuration on Windows)

Environment:

Cluster: Ubuntu 14.04 + Spark 1.5.2 + Scala 2.10

Local: Windows 7 64-bit + Eclipse 4.3.2 + Scala 2.10.4


1. Installing ADAM: see 【1】

$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam
$ export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
$ mvn clean package -DskipTests
For further configuration, see 【1】; it is not covered in detail here.

2. For setting up the Spark environment in Eclipse, see 【2】


3. Download the following jars from the target directories of ADAM's adam-apis, adam-cli and adam-core modules:

adam-core_2.10-0.18.3-SNAPSHOT.jar
adam-cli_2.10-0.18.3-SNAPSHOT.jar
adam-apis_2.10-0.18.3-SNAPSHOT.jar
Then add these jars to the build path of a newly created Scala Project (add jar).


4. Example environment:


JDK 1.7 + Scala 2.10.4 + the Spark jar + the three jars from step 3


5. Running on the cluster:

(1)

Run adam-shell to enter the interactive shell.

Code: based on 【1】, but with the input "/data/NA21144.chrom11.ILLUMINA.adam" replaced by "hdfs://Master:9000/xubo/adam/output/small.adam". Replace Master with the IP address of your own cluster.

import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}

val ac = new ADAMContext(sc)
// Load alignments from disk
val reads = ac.loadAlignments(
  "hdfs://Master:9000/xubo/adam/output/small.adam", // originally "/data/NA21144.chrom11.ILLUMINA.adam"; replace Master with your own cluster's IP
  projection = Some(
    Projection(
      AlignmentRecordField.sequence,
      AlignmentRecordField.readMapped,
      AlignmentRecordField.mapq
    )
  )
)

// Generate, count and sort 21-mers
val kmers =reads.flatMap(_.getSequence.sliding(21).map(k => (k, 1L))).reduceByKey(_ + _).map(_.swap).sortByKey(ascending = false)

// Print the top 10 most common 21-mers
kmers.take(10).foreach(println)

Output:

scala> kmers.take(10).foreach(println)
(4,TCTTTCTTTCTTTCTTTCTTT)
(4,TTTCTTTCTTTCTTTCTTTCT)
(3,CTTTCTTTCTTTCTTTCTTTC)
(3,TTCTTTCTTTCTTTCTTTCTT)
(2,TCTTTTTCTTTCTTTCTTTCT)
(2,TTCTTTTTCTTTCTTTCTTTC)
(2,TTTCTTTTTCTTTCTTTCTTT)
(1,ATTGGATATCCTCCCAAATTT)
(1,AGGCATGAGGCACCGCGCCTG)
(1,CTACTGCCCAACAAGTCCCTA)
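
To make the sliding-window logic easier to follow without a cluster, here is a minimal sketch of the same count-and-sort idea on plain Scala collections. The object name KmerSketch, the function countKmers and the two sample reads are hypothetical and only for illustration; this is not ADAM's implementation. One assumption is worth calling out: `sliding(k)` on a string shorter than k yields that short string once, so the sketch filters such fragments out, which the Spark pipeline above does not do.

```scala
object KmerSketch {
  // Count every k-mer across the given sequences and return (count, kmer)
  // pairs, highest count first; ties are broken alphabetically by k-mer.
  def countKmers(seqs: Seq[String], k: Int): Seq[(Long, String)] =
    seqs
      .flatMap(_.sliding(k).filter(_.length == k)) // drop fragments shorter than k
      .groupBy(identity)
      .toSeq // convert before mapping so equal counts do not collapse as Map keys
      .map { case (kmer, hits) => (hits.size.toLong, kmer) }
      .sortBy { case (count, kmer) => (-count, kmer) }

  def main(args: Array[String]): Unit = {
    // Two made-up reads, not taken from small.sam
    val reads = Seq("ACGTACGT", "CGTAC")
    countKmers(reads, 4).foreach(println)
  }
}
```

The (count, kmer) ordering of each pair matches the `map(_.swap)` step in the shell session, so the printed shape is the same as the output above.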

How to obtain hdfs://Master:9000/xubo/adam/output/small.adam:

The source file is small.sam, found in the adam-core/src/test/resources folder under the ADAM installation directory. On my machine it is /home/hadoop/cloud/adam/adam-core/src/test/resources/small.sam


hadoop fs -put small.sam /xubo/adam/dataAdam/
This uploads the file to the cluster.

Then use the transform command of adam-submit to convert it into an ADAM file:

adam-submit transform /xubo/adam/dataAdam/small.sam /xubo/adam/output/small.adam
The resulting small.adam is the small.adam file used in the code.

(2) Verification (incomplete)

Next, use the count_kmers command of adam-submit to count the 21-mers in it:

adam-submit count_kmers /xubo/adam/output/small.adam /xubo/adam/output/kmerSmallK21.adam 21


Then print the result in adam-shell, or inspect kmerSmallK21.adam directly.
adam-shell:

val kmer21=sc.textFile("/xubo/adam/output/kmerSmallK21.adam")
kmer21.foreach(println)
kmer21.count
Result (long, so truncated):

......
(GCCTTGCAGGTTGAGTAGGAT,1)
(CATTATAAATATATTTAACAG,1)
(TTTTGAGCATGAAAGTAATAT,1)
(AAGTCAAAAAGAAAAAAAAGG,1)
(ACGGGGTTTCACCATGTTGGC,1)
(TCACAATGCCAACAGCTAAAT,1)
(CAACAGCTAAATGTACCCAAG,1)
(GCCTTGCAAGAATCTCTACTG,1)
(TCTCACTATGTTGCCTAGGCT,1)
(ATAAATGTTGATTGTCCTATT,1)
(ATTCCCAGGTCTTAGGTGCTG,1)
(CAGCCTTATTCCTATTTATAA,1)
(ACAAGATAGTACTTGAGCTAA,1)
(ACTCTCATTGACTGTTCAATG,1)
(TGTAAATTCAAATTGGATATC,1)
(AAAGTTTGGCTTTCAGTTGTA,1)
(ATAAGAGCAGCCTTATTCCTA,1)
(CAAACTCCTGGGCTCAAGTGA,1)
(CAGTGGGAGGTGGTGGCCATG,1)
(TAAGGTTTTTTTTGTTTGTTT,1)
(CATGAGGCACCGCGCCTGGCC,1)
(TCAAACATCACACTCCACGTT,1)
(CCGCCTCGGCCTCCCAAAGTG,1)

scala> kmer21.count
res12: Long = 1087
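
Because the text output is unsorted, comparing it by eye with the top-10 list from (1) is tedious. The following is a rough sketch of how the lines could be parsed and ranked. It assumes every line has the "(KMER,count)" shape seen in the output above; the names KmerOutputSketch, parseLine and topKmers are hypothetical, and the sample lines in main are made up.

```scala
object KmerOutputSketch {
  // Matches a whole line such as "(GCCTTGCAGGTTGAGTAGGAT,1)"
  private val Line = """\(([ACGTN]+),(\d+)\)""".r

  // Parse one output line; return None for anything that does not match.
  def parseLine(line: String): Option[(String, Long)] =
    line.trim match {
      case Line(kmer, count) => Some((kmer, count.toLong))
      case _                 => None
    }

  // Keep the n most frequent k-mers; ties are broken alphabetically.
  def topKmers(lines: Seq[String], n: Int): Seq[(String, Long)] =
    lines
      .flatMap(l => parseLine(l).toList)
      .sortBy { case (kmer, count) => (-count, kmer) }
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up sample lines in the assumed format
    val sample = Seq("(AAA,1)", "(CCC,3)", "not a k-mer line", "(GGG,3)")
    topKmers(sample, 2).foreach(println)
  }
}
```

In adam-shell, the input could come from `kmer21.collect().toSeq`, which is reasonable here because the file holds only 1087 lines.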


6. Running locally:

Code:

package testAdam

import org.apache.spark._
import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.projections.{AlignmentRecordField, Projection}

object kmer {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test Adam kmer").setMaster("local")
    val sc = new SparkContext(conf)
    val ac = new ADAMContext(sc)

    // Load alignments from disk
    // val reads = ac.loadAlignments("/data/NA21144.chrom11.ILLUMINA.adam",
    // val reads = ac.loadAlignments("/xubo/adam/output/small.adam",
    val reads = ac.loadAlignments(
      "hdfs://Master:9000/xubo/adam/output/small.adam",
      projection = Some(
        Projection(
          AlignmentRecordField.sequence,
          AlignmentRecordField.readMapped,
          AlignmentRecordField.mapq
        )
      )
    )

    // Generate, count and sort 21-mers
    val kmers = reads.flatMap(_.getSequence.sliding(21).map(k => (k, 1L))).reduceByKey(_ + _).map(_.swap).sortByKey(ascending = false)

    // Print the top 10 most common 21-mers
    kmers.take(10).foreach(println)
  }
}
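
The listing above hard-codes the host name Master, which has to be changed for each cluster. One possible way to avoid editing the source is to build the input URI from a parameter, as in the sketch below. The object InputPath, the function hdfsUri and the address 192.168.1.10 are hypothetical; port 9000 and the file path are the values used in this post.

```scala
object InputPath {
  // Build the HDFS URI of the input file; host defaults to the
  // placeholder name "Master" used throughout this post.
  def hdfsUri(
      host: String = "Master",
      port: Int = 9000,
      path: String = "/xubo/adam/output/small.adam"): String =
    s"hdfs://$host:$port$path"

  def main(args: Array[String]): Unit = {
    // Use the first command-line argument as the host if one is given
    val host = args.headOption.getOrElse("Master")
    println(hdfsUri(host))
  }
}
```

The returned string could then be passed to ac.loadAlignments in place of the literal path.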

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/G:/149/jar%e9%87%8d%e8%a6%81/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/adam-cli_2.10-0.18.3-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-03-05 20:38:22 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-03-05 20:38:24 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
2016-03-05 20:38:26 WARN  :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:200:5efe:ca26:54fd%30, but we couldn't find any external IP address!
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
(4,TCTTTCTTTCTTTCTTTCTTT)
(4,TTTCTTTCTTTCTTTCTTTCT)
(3,CTTTCTTTCTTTCTTTCTTTC)
(3,TTCTTTCTTTCTTTCTTTCTT)
(2,TCTTTTTCTTTCTTTCTTTCT)
(2,TTCTTTTTCTTTCTTTCTTTC)
(2,TTTCTTTTTCTTTCTTTCTTT)
(1,ATTGGATATCCTCCCAAATTT)
(1,AGGCATGAGGCACCGCGCCTG)
(1,CTACTGCCCAACAAGTCCCTA)
2016-3-5 20:38:46 INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 1
2016-3-5 20:38:49 WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
2016-3-5 20:38:50 INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 20 records.
2016-3-5 20:38:50 INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
2016-3-5 20:38:50 INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 233 ms. row count = 20






References:

【1】 https://github.com/bigdatagenomics/adam

【2】 http://blog.csdn.net/xubo245/article/details/50789983

【3】 /adam/docs/source/01_intro.md

【4】 https://github.com/ga4gh/gastore
