
Source: biochem.slu.edu/bchm628/handouts/ExcelTutorials_2017.pdf


Last updated: May, 2017

BCHM 6280 2017 Excel Tutorial Page 1 of 5

Tutorial 1: Using Excel to find unique values in a list

It is not uncommon to have a list of data that contains redundant values. Genes with multiple transcript isoforms are one example. If you are only interested in the genes and not the different transcripts, then you will probably want to filter the list to remove the redundant values. I did a search of the UCSC human genome browser with the query “colon cancer” and got back >500 matches. I created a text file listing the first 500 matches. You can download this data from the Exercise 1 home page by clicking on the link ListofGenesfromUCSC.txt. The file has 2 columns: Gene Name and Chromosome Location. You will filter on Gene Name. Once you’ve downloaded the text file, do the following:

• Open Excel and from within Excel open the text document. If the file you want to open is greyed out, change the drop down menu to Enable: All Readable Documents.
• Double-click the file you want to open and this should bring up the Text Import Wizard.
• It should recognize it as delimited. Click the Next button to define the delimiters.
• By default, Excel assumes a .txt file is tab-delimited.
• Click Next and then Finish to finish the import.

Advanced filter:

• Select the column of gene names.
• Click on the Data menu and select Advanced filter (if you get a warning about being unable to determine which row contains column labels and you have a column header in row 1, just click OK).
• Check the radio button “Copy to another location”. This should move your cursor to the “Copy to” text box.
• Select a column (not Columns A-C).
• Check the box “Unique records only”.
• Click the OK button. This should produce a list of 208 genes from the original 500 genes.
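For readers who want to cross-check the Advanced filter result outside Excel, the same "unique records only" idea can be sketched in Python. This is only an illustration under stated assumptions: the sample rows below are made up and merely mimic the two-column, tab-delimited layout described for ListofGenesfromUCSC.txt, and the function name `unique_gene_names` is hypothetical. The comparison is an exact, case-sensitive string match, which may not behave identically to Excel's filter.

```python
import csv
import io

# Made-up stand-in for ListofGenesfromUCSC.txt: tab-delimited, one header
# row, then one row per match. Gene names and locations are placeholders.
SAMPLE = (
    "Gene Name\tChromosome Location\n"
    "GENE_A\tchr1:100-200\n"
    "GENE_B\tchr2:300-400\n"
    "GENE_A\tchr1:100-250\n"
    "GENE_C\tchr3:500-600\n"
    "GENE_B\tchr2:300-450\n"
)


def unique_gene_names(handle):
    """Return gene names in first-seen order with duplicates dropped."""
    reader = csv.DictReader(handle, delimiter="\t")
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(row["Gene Name"] for row in reader))


genes = unique_gene_names(io.StringIO(SAMPLE))
print(genes)
```

To run this against the real download, one would pass an opened file (for example `open("ListofGenesfromUCSC.txt", newline="")`) instead of the in-memory sample, assuming the header text matches "Gene Name" exactly.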


Tutorial 2: Using Excel to manage text data

An issue common to gene names or gene identifiers is slight variations that can prevent their identification via a database lookup. For example, as gene or transcript records are reviewed by curators, they are often given an appended number such as NM_0012345.1 or NM_0012345.3 indicating which version they are. The base identifier NM_0012345 is the same between them, but if your list has the appended version number, the database lookup or Excel lookup won’t recognize the two as being the same record. In this example, there are two Excel files available from the Exercise 2 home page: ExpressionData.xlsx and GeneInfo.xlsx. The ExpressionData file has two columns. The first has Ensembl Gene IDs with the version number. The second column contains gene expression information in the form of the Log2 ratio of treatment/control. The GeneInfo file has four columns. The first has Ensembl Gene IDs, but as the stable identifier rather than as a version. The remaining columns have the gene symbol, NCBI Gene ID and gene description. You want to be able to bring information from the GeneInfo file into the ExpressionData file, but at the moment they do not share the same identifiers. To correct this, you will use a text-related function called LEFT to change the Gene IDs in the ExpressionData file to match those in the GeneInfo file.
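To make the goal concrete before working through the Excel steps, here is a minimal Python sketch of the truncation. The steps that follow keep the first 15 characters of cell B2, which corresponds to the formula =LEFT(B2,15). The helper names `left` and `strip_version` are hypothetical, and the version suffixes on the sample IDs are illustrative rather than taken from the tutorial files.

```python
def left(text, num_chars):
    """Mimic Excel's LEFT: the first num_chars characters of text."""
    return text[:num_chars]


def strip_version(ensembl_id):
    """Drop a trailing '.N' version suffix if one is present."""
    # partition splits at the first '.'; when there is no '.', the whole
    # string comes back unchanged as the first element.
    stable_id, _, _ = ensembl_id.partition(".")
    return stable_id


# Illustrative IDs; the version numbers after the '.' are made up.
versioned = ["ENSG00000139618.2", "ENSG00000141510.11", "ENSG00000012048"]

print([left(v, 15) for v in versioned])
print([strip_version(v) for v in versioned])
```

Both approaches give the same result here because an Ensembl human gene ID is "ENSG" followed by 11 digits, 15 characters in total. Splitting on the "." is the safer choice if a list mixes identifier types of different lengths.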

1. Insert a column to the left of the Gene ID column in the ExpressionData file.
2. In cell A2, type = and select the LEFT function.
3. Select cell B2 for the text box in the Formula Builder dialog box.
4. Tab to the num_chars box and type in 15.
5. This should return the ENSG## identifier up to the “.”, that is, the original ID without its version suffix.
6. Select the newly generated ID in A2 and extend the selection down to the end of the column, then type Ctrl-D to copy the function down the rest of the column.
7. Then Edit->Copy the newly generated IDs and use Edit->Paste->Special->Values to replace the formula with values.
8. Now you can use the two files in the next section to bring the data from GeneInfo into the ExpressionData file.

Tutorial 3: Using Excel to compare lists of data.

A very common problem in bioinformatics, or in information processing of any kind, is having multiple lists of data that you want to compare to each other. Excel has a function called VLOOKUP that makes this easy to do. It is also useful for transferring data from one worksheet to another. For this part of the tutorial, you will use the GeneInfo file and your modified ExpressionData file from the previous section. You can delete the column from the ExpressionData file that had the Gene IDs with version numbers in them. In this part of the tutorial, you will bring the Gene Name and NCBI Gene ID into the ExpressionData file.


Open both worksheets in Excel.

o In the ExpressionData file, insert a column between columns 1 and 2.
o In the second row of column 2 (cell B2), type an “=” sign. Then go to the drop down menu in the upper left of the worksheet, find the function “VLOOKUP” and select it. If you do not see VLOOKUP on the main menu, scroll down to “more functions”, which opens a dialog box with all of the available Excel functions. Under “lookup and reference” you will find VLOOKUP.
o Once you’ve inserted the function, you must fill out the arguments for the function using the dialog box that opens up. Select cell A2 as the lookup value.
o Then click into the box “Table_array”. Go up to the Window menu and select GeneInfor_ExcelTutorial.xlsx as shown in Figure 2.
o This will activate GeneInfo.xlsx.

Figure 1: Inserting a VLOOKUP function into column 2 of the ExpressionData worksheet.

Figure 2: Selecting the second worksheet as the table_array in the VLOOKUP function.


o Select the first 2 columns of GeneInfo.xlsx.
o Tab or click on the box “Col_index_num.” This tells the function which column of data to bring over to the first worksheet. Type in a 2.
o In the final box, “Range_lookup,” type “false”. If the value in A2 of the ExpressionData worksheet is found in the first column of the selected GeneInfo range, then the value from column 2 of that GeneInfo row will be entered into cell B2 of ExpressionData. If no match is found, it will fill in “#N/A”.
o To fill in the rest of the column, select from cell B2 through the end of the data and under the Edit menu, select Fill Down or use the keyboard shortcut of “Ctrl+D”.
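The lookup these steps set up can be sketched in Python to show what an exact-match VLOOKUP does for each row. This is an illustration under stated assumptions, not Excel's implementation: the function `vlookup`, the constant `NA`, and all sample IDs, symbols and ratios are hypothetical. Matching here is case-sensitive, which may differ from Excel.

```python
NA = "#N/A"  # placeholder for the error Excel shows when nothing matches


def vlookup(lookup_value, table, col_index_num):
    """Exact-match lookup in the spirit of VLOOKUP(..., FALSE).

    table is a list of rows whose first item is the key column.
    col_index_num is 1-based, as in Excel.
    """
    for row in table:
        if row[0] == lookup_value:
            # Return from the first matching row only.
            return row[col_index_num - 1]
    return NA


# Made-up GeneInfo rows: stable ID, gene symbol, NCBI Gene ID.
gene_info = [
    ["ENSG_EXAMPLE_1", "SYMBOL1", "1001"],
    ["ENSG_EXAMPLE_2", "SYMBOL2", "1002"],
]

# Made-up ExpressionData rows: stable ID, log2 ratio.
expression = [
    ["ENSG_EXAMPLE_2", 1.8],
    ["ENSG_EXAMPLE_9", -0.4],
]

# Equivalent of inserting a new column 2 and filling the function down.
annotated = [
    [gene_id, vlookup(gene_id, gene_info, 2), ratio]
    for gene_id, ratio in expression
]
print(annotated)
```

The row-by-row scan keeps the sketch close to how the function is described; for large tables, building a dictionary keyed on the first column once would avoid rescanning for every lookup.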

Figure 3: Filling in the rest of the column with the same function.

When you are done, your ExpressionData worksheet should look like that shown in Figure 4:

Figure 4: Gene expression worksheet after completing VLOOKUP


At this point, the data in column 2 is still linked to the GeneInfo worksheet. You can see this if you click on one of the gene names and look at what is displayed in the text box at the top of the sheet. You do not want to leave your file like that, otherwise every time you open it, it will go through the data lookup function again. To avoid this, select the entire column, copy it, then do an Edit->Paste Special and select “values” in the “Paste special” dialog box. This will replace the function with the value of the function. After you complete that, click on a gene name. You should see just the gene name displayed in the text box at the top.

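To bring in the NCBI Gene ID, just insert another column in the ExpressionData worksheet and repeat the VLOOKUP process, bringing in column 3 data from GeneInfo rather than column 2. This time, select the first 3 columns of GeneInfo as the Table_array and type 3 for Col_index_num.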

Figure 5: Gene expression worksheet after copying and paste special with values