BMMB 852: Applied Bioinformatics
Week 4, Lecture 8
István Albert
Bioinformatics Consulting Center, Penn State, 2015
You’ll need a “good” text editor
Absolutely essential features:
• Needs to be able to show you whitespace (allow you to distinguish between tabs and spaces)
• Needs to be able to allow you to change line ending formats (Windows/Unix/Mac)
Handy features:
• Syntax highlighting
• Needs to be able to show line numbers
There are many options, one possible choice
The most annoying problems are caused by invisible characters
• Tabs vs spaces (when you copy/paste from the web it turns tabs into spaces!)
• New lines of the wrong type (yes, invisible line endings can have types) → Unix, Mac, Windows
• Always use UNIX line endings!
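A minimal Python sketch of how these invisible characters could be detected and fixed when an editor is not at hand. The function names and the sample bytes are made up for illustration; they are not part of the lecture material.

```python
def inspect_whitespace(data: bytes) -> dict:
    """Count tabs and each kind of line ending in raw file bytes."""
    crlf = data.count(b"\r\n")        # Windows line endings (CR followed by LF)
    lf = data.count(b"\n") - crlf     # Unix line endings (LF not part of a CRLF)
    cr = data.count(b"\r") - crlf     # classic Mac line endings (CR not part of a CRLF)
    return {
        "tabs": data.count(b"\t"),
        "unix_lf": lf,
        "windows_crlf": crlf,
        "mac_cr": cr,
    }


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # CRLF must be replaced first, otherwise each CRLF would become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample: one line of each type, one of them with spaces instead of tabs.
sample = b"chr1\t100\t200\r\nchr2    300    400\nchr3\t500\t600\r"

print(inspect_whitespace(sample))
print(to_unix_line_endings(sample))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before they are counted.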
Short Read Archive
It is (partially) documented and “sort of logical” – but only “sort of”
SRA – Sequence Read Archive naming conventions
NCBI BioProject: PRJN... - the overall description of a single research initiative; a project will typically relate to multiple samples and datasets
NCBI BioSample: SAMN… and/or SRS… in SRA - a description of biological source material; each physically unique specimen should be registered as a single BioSample with a unique set of attributes
SRA Experiment: SRX… - a unique sequencing library for a specific sample
SRA Run: SRR… ERR… - a manifest of data file(s) linked to a given sequencing library (experiment)
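The naming conventions above can be sketched as a small prefix lookup. This is only an illustration: the table holds just the prefixes listed on this slide (the full list is longer), the function name is hypothetical, and the check is a loose one (known prefix, ends in a digit), not an official validation rule.

```python
# Prefixes taken from the slide above; the full list of prefixes is longer.
ACCESSION_KINDS = [
    ("PRJN", "NCBI BioProject"),
    ("SAMN", "NCBI BioSample"),
    ("SRS", "NCBI BioSample (as recorded in SRA)"),
    ("SRX", "SRA Experiment"),
    ("SRR", "SRA Run"),
    ("ERR", "SRA Run"),
]


def classify_accession(accession: str) -> str:
    """Return the kind of record an accession refers to, based on its prefix."""
    acc = accession.strip().upper()
    for prefix, kind in ACCESSION_KINDS:
        # Loose check (an assumption): known prefix and the accession ends in a digit.
        if acc.startswith(prefix) and acc[-1:].isdigit():
            return kind
    return "unknown"


# SRR1553610 appears in the homework; the other identifiers are made-up placeholders.
for acc in ["SRR1553610", "SRX0000001", "SAMN00000001", "XYZ123"]:
    print(acc, "->", classify_accession(acc))
```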
There is a cross linking between SRA and NCBI
Full list of prefixes
Visit the BioProject for the data
Web based download of the data
That’s not ALL – when it comes to biological data distribution, confusion is the rule.
• The Gene Expression Omnibus (GEO) also stores results from functional genomic experiments → but the raw data links back to SRA.
• GEO was originally designed for microarray data, later augmented for high-throughput sequencing
• These organizations appear to be monolithic and it is not clear what entity is responsible for them, who makes what decisions and why.
• This is why groups of scientists want to form their own independently run information repositories.
GEO nomenclature
Words that start with G usually refer to GEO:
• GPL… will be a platform
• GSM… indicates a sample
• GSE… indicates a series
The sequencing data links back to SRA – there are other tools to read GEO data.
Getting data from SRA
• You will need to install a software package called sra-toolkit
• This package can fetch and unpack data from SRA
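As a rough sketch of what fetching and unpacking looks like when scripted, the Python below builds a `fastq-dump` command line and runs it only if the tool is installed. The helper names and defaults are assumptions made for illustration; `--split-files`, `--outdir` and `-X` are `fastq-dump` options, and `-X` is what makes it possible to work with a subset of a run.

```python
import shutil
import subprocess


def fastq_dump_command(run, max_spots=None, outdir="."):
    """Build the fastq-dump argument list for one SRA run accession."""
    cmd = ["fastq-dump", "--split-files", "--outdir", outdir]
    if max_spots is not None:
        # -X restricts the dump to a subset of the run (useful when downloads are slow).
        cmd += ["-X", str(max_spots)]
    cmd.append(run)
    return cmd


def fetch_run(run, max_spots=None, outdir="."):
    """Run fastq-dump for one accession; requires the SRA toolkit and network access."""
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump not found; install the SRA toolkit first")
    subprocess.run(fastq_dump_command(run, max_spots, outdir), check=True)


# Only print the command here; call fetch_run(...) to actually download data.
print(" ".join(fastq_dump_command("SRR1553610", max_spots=10000)))
```

Building the command as a list and passing it to `subprocess.run` without a shell avoids quoting problems, and makes it easy to loop over several run accessions.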
Downloading and accessing FASTQ data
• Work through the SRA toolkit examples
• Become familiar with the terminology, accessing data, identifying runs
Homework 8
• Download and unpack at least five SRR runs (use subsets if it seems too slow).
• Run a fastqc report on each.
• Which run do you like most and why? Show one plot that you think shows good quality data.
• How many sequences are in each run? Check the number for at least one run via SRA website.
• What does the following command do:
fastq-dump -X 10 -Z SRR1553610