

Exploring Questions Asked about Linux Filesystem APIs in Stack Overflow

Zuhair AlSader

University of Waterloo
Waterloo, Ontario, Canada
[email protected]

ABSTRACT
The virtual filesystem (VFS) is a key component of the Linux kernel, which allows programs to use standard Unix system calls to read and write to various filesystems locally and across the network. However, that interface is difficult for developers to learn from the documentation, due to the lack of examples and the similarity between parts of the API. This leads developers to ask their questions on question-and-answer communities like Stack Overflow. We study the questions asked about the Linux filesystem API on Stack Overflow, and demonstrate that it is a moderately reliable source of information on that API. We also show that the community is still engaged in questions and discussions about Linux filesystem APIs, even about 30 years after their introduction. In addition, we quantitatively study which keywords are the most common in questions asked about the APIs, and examine which filesystem APIs are likely to be asked about together, indicating codependency or confusion.

KEYWORDS
Stack Overflow, GitHub, Questions

ACM Reference Format:
Zuhair AlSader. 2018. Exploring Questions Asked about Linux Filesystem APIs in Stack Overflow. In Proceedings of Empirical Software Evolution (CS846). ACM, New York, NY, USA, Article 4, 7 pages. https://doi.org/10.475/123_4

1 INTRODUCTION
Linux, which was introduced in 1991, is the most commonly deployed operating system in the market [11]. Even though it has many different distributions, they all share the same kernel and interface [3]. A key component of the kernel is the Virtual Filesystem (VFS) [17], which implements the file and filesystem-related interfaces provided to user-space programs. This allows programs to use standard Unix system calls to read and write to various filesystems, even on different media or across the network. It has enabled the creation of many local and distributed filesystems, including Sun Microsystems’ NFS [17] and Google’s GFS [6], as well as many device drivers, such as character device drivers [4, 16].

When software system developers need to learn about the Linux kernel APIs, they have difficulty learning them from the documentation, because the documentation usually lacks sufficient examples and explanations. The widespread nature of this issue [10, 15] has led developers to use other forms of documentation, like wikis, blogs, and online question and answer communities like Stack Overflow, where people ask questions and others answer them. The problem of documentation is especially prevalent for the filesystem APIs, because they contain many functions which seem to share the same goal, as well as many flags and arguments which can completely alter the behavior of a function [11].

CS846, April 2018, Waterloo, ON Canada
2018. ACM ISBN 123-4567-24-567/08/06. . . $0.00
https://doi.org/10.475/123_4

Recently, many studies have shown that the POSIX standard API, which the Linux kernel filesystem APIs follow, has many drawbacks for modern operating systems. Zadok et al. [20] highlight some of the issues in POSIX and propose ideas to improve or replace it. One major issue they mention is that, by design, POSIX interface methods are susceptible to “Time of Check To Time of Use” (TOCTTOU) vulnerabilities [19]. Additionally, Atlidakis et al. [2] explain how common applications have been designed around the POSIX APIs to overcome their shortcomings, and how this practice is not sustainable and can lead to suboptimal performance.

This study highlights how the Linux kernel filesystem APIs, which follow POSIX standards, are asked about on Stack Overflow. It aims to understand the software engineering viewpoint on the filesystem APIs and whether developers discuss the design issues mentioned above on Stack Overflow. More generally, it aims to understand what questions are asked about these APIs on Stack Overflow. We do that by extracting single-word, bigram and trigram frequencies in questions that mention at least one of the filesystem APIs. We also examine which filesystem APIs are likely to be asked about together, indicating codependency or confusion.

The contributions of this study are as follows:
• We demonstrate that Stack Overflow is a somewhat reliable source of information on Linux filesystem APIs.
• We show that the community is still engaged in questions and discussions about Linux filesystem APIs, even about 30 years after their introduction.
• We quantitatively study which keywords are the most common in questions asked about the APIs.
• We examine which filesystem APIs are likely to be asked about together, indicating codependency or confusion.

2 RELATED WORK
Some studies try to use Stack Overflow to give suggestions on usage and code examples inside an IDE [13]. There is even a Python module that loads examples from Stack Overflow as modules in the code¹. Other studies on Stack Overflow have looked qualitatively and quantitatively into unanswered questions [1] to understand why they are still unanswered, and into deleted questions [5] to understand the reasons behind the deletion. Gruetze et al. [7] study the dynamics and topic shifts of Stack Overflow over time. Vasilescu et al. [18] study how the activity of users on GitHub affects their activity on Stack Overflow.

¹ https://github.com/drathier/stack-overflow-import

The closest study to ours is the early investigation into the coverage and dynamics of crowd documentation on Stack Overflow by Parnin et al. [12], who studied several Java APIs and how many questions were asked about their classes on Stack Overflow. Another close study is by Kavaler et al. [9], which relates the usage of an Android API, mined from real applications, to the questions asked about it on Stack Overflow. The first half of our study discusses similar research questions but in a different language (C instead of Java), with different demographics, and a relatively older API.

3 METHODOLOGY
In this section, we explain our research questions, how we collected our data, and how we analyzed it.

3.1 Research Questions
First, it is important to study whether we can rely on the Stack Overflow community as documentation of the Linux kernel filesystem APIs, and whether we can rely on it to study questions related to that API. To achieve that, we first examine the coverage of the filesystem API on Stack Overflow. Coverage in our context refers to the percentage of filesystem functions and structures for which there are at least n questions that mention them on Stack Overflow. We also examine the relationship between the usage of a function in open source projects on GitHub and the number of questions that mention it. Since the API is old, we wanted to see whether it is still discussed in Q&A communities, and where these discussions are going. Additionally, this part studies whether the results and general trends identified in previous studies [12] also apply to our dataset.

Second, we need to understand what kinds of problems Stack Overflow users face when dealing with the filesystem API, which could influence future designers of the APIs. Therefore, we ask about the types of questions asked about the filesystem APIs. Since such a question is broad, we focus on the most common keywords in questions and on how often functions appear together in the same question.

We study the following questions:
• RQ1 Can we rely on the Stack Overflow crowd to discuss all of the Linux kernel filesystem APIs?
  – RQ1.1 Are the API methods widely covered?
  – RQ1.2 Is there a relationship between the number of discussions around a function and its usage in GitHub repositories?
  – RQ1.3 Are there still ongoing discussions about these APIs? How fast are they covered?
• RQ2 What kind of questions are asked about the APIs?
  – RQ2.1 What are the most common keywords that appear in questions?
  – RQ2.2 How often do pairs of components appear together in questions?

3.2 Data Collection
3.2.1 Filesystem API. To extract the filesystem API, we used a script to scrape function and structure names, along with the categories they fall into, from the official Linux kernel v4.16.0 documentation², then we manually verified them. The latest documentation is well-formatted, so it was easy to extract header names using a script written in jQuery. The result was loaded into a database, where each function was given a unique id and assigned to a category. In total, we extracted 310 functions and structures, classified into 14 different categories.

3.2.2 Stack Overflow. Stack Exchange, the company behind Stack Overflow and other Q&A websites, publishes data dumps of the databases of all its websites on archive.org³ under a Creative Commons license once every three months. We downloaded a database dump of Stack Overflow dated December 2017, converted the XML to SQL, and loaded it into a relational database using a tool called SODDI⁴. Due to the huge size of the data (42 GB after loading into the database, for 39 million records), we only considered posts tagged with at least one of the following tags: c, c++, linux, kernel, kernel-module, posix, and driver. Since answers are not tagged, we included all the answers to matched questions. All Stack Overflow questions are required to be tagged, and most tags are accurate since Stack Overflow is heavily moderated. The tags in the data dump are listed in plain text inside the posts table, and not in a separate table, so to improve the performance of the data filtering, the tags column was added to a full-text index [8]. This resulted in an order of magnitude increase in performance over a regular expression search (several minutes instead of several hours) with a slight decrease in recall. The filtered dataset contains 3.1 million records and is 4.5 GB in size.
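The tag filter described above can be sketched as follows. This is an illustration in Python, not the full-text-indexed SQL query used in the study. It assumes that tags are stored as a single string of angle-bracketed names (for example `<c><linux>`), and the dictionary keys `id`, `parent_id` and `tags` are hypothetical names chosen for the example.

```python
import re

# The tags listed in the paper for the filtering step.
TARGET_TAGS = {"c", "c++", "linux", "kernel", "kernel-module", "posix", "driver"}


def parse_tags(raw):
    """Split a tag string such as '<c><linux>' into a set of tag names.

    The angle-bracket layout is an assumption about the dump format.
    """
    return set(re.findall(r"<([^<>]+)>", raw or ""))


def is_relevant(raw_tags, targets=TARGET_TAGS):
    """Return True when the question carries at least one target tag."""
    return not parse_tags(raw_tags).isdisjoint(targets)


def select_posts(posts):
    """Keep matching questions plus every answer attached to them.

    `posts` is a list of dicts with hypothetical keys: 'id', 'parent_id'
    (None for questions) and 'tags' (None for answers, which are untagged).
    """
    question_ids = {
        p["id"] for p in posts
        if p["parent_id"] is None and is_relevant(p["tags"])
    }
    return [
        p for p in posts
        if p["id"] in question_ids or p["parent_id"] in question_ids
    ]
```

Matching whole tag names, rather than substrings, keeps a tag such as `c#` from being mistaken for `c`.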

3.2.3 Linking Methodology. To gather our data, we used a process adapted from Parnin et al. [12]. In our case, a link is a connection between a Stack Overflow question and a filesystem API function or structure. Figure 1 shows a sample Stack Overflow question annotated with the different types of links: word links, which occur in regular text; code links, which occur in <code></code> tags and include a single function or class name; and code block links, which are enclosed within <pre></pre> tags and represent examples or stack traces. We searched the title and body of the questions and related answers for any mentions of filesystem functions and structures using a case-insensitive, word-boundary, full-text-indexed search, and recorded the results in the database. We found a total of 981 matches between posts and functions.
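The three link types can be illustrated with a small Python sketch. It is a simplified stand-in for the database search used in the study: it relies on regular expressions rather than a full-text index, and the function name `find_links` and the labels `word`, `code` and `code_block` are names introduced here for illustration.

```python
import re
from html import unescape

PRE_RE = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.S | re.I)
CODE_RE = re.compile(r"<code\b[^>]*>(.*?)</code>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def find_links(body_html, api_names):
    """Return the set of (api_name, link_type) pairs found in one post body."""
    # Code blocks first, so the <code> nested inside <pre> is not
    # counted a second time as an inline code link.
    blocks = PRE_RE.findall(body_html)
    rest = PRE_RE.sub(" ", body_html)
    inline = CODE_RE.findall(rest)
    rest = CODE_RE.sub(" ", rest)
    text = TAG_RE.sub(" ", rest)  # whatever remains is regular text

    regions = {
        "code_block": unescape(" ".join(blocks)),
        "code": unescape(" ".join(inline)),
        "word": unescape(text),
    }
    links = set()
    for name in api_names:
        # Case-insensitive, word-boundary match, as described above.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.I)
        for link_type, region in regions.items():
            if pattern.search(region):
                links.add((name, link_type))
    return links
```

Because the underscore counts as a word character, the word-boundary match does not report `seq_read` inside a longer identifier such as `my_seq_read`.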

3.2.4 Usage Data. The usage data was gathered by searching for the filesystem API functions with the GitHub code search API⁵. We used a script that issued one request for each function and recorded the total number of results reported by the response. We limited our search to files written in C to prevent similarly named functions from other programming languages from inflating the search results. Between API calls, we waited a random period of 3 to 10 seconds to comply with the rate limits and robot detection imposed by the server. We note that GitHub code search does not include forked repositories unless they have more stars than the original repositories.
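A minimal sketch of such a collection loop is shown below. It is not the script used in the study. It assumes the `/search/code` endpoint with a `language:c` qualifier and the `total_count` field of the JSON response; the helper names, the optional token parameter, and the injectable `fetch` and `sleep` arguments are choices made for this example so that the loop can be exercised without network access.

```python
import json
import random
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/code"


def build_search_url(function_name):
    """Build a code-search URL restricted to files written in C."""
    query = f"{function_name} language:c"
    return SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})


def http_get_json(url, token=None):
    """Fetch a URL and decode the JSON body (token handling is an assumption)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_usage(function_names, fetch=http_get_json, sleep=time.sleep):
    """Issue one search per function and record the reported total."""
    usage = {}
    for index, name in enumerate(function_names):
        if index:
            # Random pause of 3 to 10 seconds between consecutive calls.
            sleep(random.uniform(3, 10))
        usage[name] = fetch(build_search_url(name))["total_count"]
    return usage
```

Passing a stub for `fetch` and `sleep` allows a dry run, for example `collect_usage(["seq_read"], fetch=lambda url: {"total_count": 0}, sleep=lambda s: None)`.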

² https://www.kernel.org/doc/html/v4.16/filesystems/index.html
³ https://archive.org/details/stackexchange
⁴ https://github.com/BrentOzarULTD/soddi
⁵ https://developer.github.com/v3/search/#search-code


Figure 1: A sample Stack Overflow question with different types of links

3.3 Data Analysis
Most research questions (all of RQ1, and RQ2.2) are answered by queries on the cached list of results recorded in the database, joined with the original question data. RQ2.1 uses a natural language processing technique called frequency analysis, described below.

3.3.1 Frequency Analysis. To answer RQ2.1, we performed a basic frequency analysis with single words, bigrams and trigrams on the questions and their answers. First, we sanitized question texts by removing all HTML tags and code blocks, since they are irrelevant to what the question asks. Then, we tokenized the texts by word boundary, and stemmed them using the Porter stemmer [14]. After that, we filtered out words we considered stop words, including the list of default stop words in the English language, combined with other stop words common in Stack Overflow conversations, such as "code", "please", "thanks". Finally, we generated single words, bigrams and trigrams, and fed each of them individually to a frequency counter. We applied this method to question titles and bodies without the answers, and then to the answers only.
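The pipeline above can be sketched in a few lines of Python. Two simplifications are assumptions of this sketch: the stop-word list is a tiny illustrative subset, and the stemming step defaults to an identity function, so a real Porter stemmer would have to be supplied through the hypothetical `stem` parameter to reproduce the stemmed terms.

```python
import re
from collections import Counter

# Illustrative subset only; the study used a full English stop-word
# list plus Stack Overflow specific words such as these last three.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "in", "i", "it",
              "and", "how", "do", "for", "code", "please", "thanks"}


def sanitize(html):
    """Drop code blocks and inline code, then strip the remaining tags."""
    html = re.sub(r"<pre\b[^>]*>.*?</pre>", " ", html, flags=re.S | re.I)
    html = re.sub(r"<code\b[^>]*>.*?</code>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)


def ngram_counts(posts_html, n, stem=lambda word: word):
    """Count n-grams over all posts (tokenize, stem, filter, count)."""
    counts = Counter()
    for html in posts_html:
        tokens = [stem(w) for w in re.findall(r"[a-z]+", sanitize(html).lower())]
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # zip over shifted copies yields consecutive n-token tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts
```

Since stop words are removed before the n-grams are built, a bigram can join two words that were separated by a stop word in the original text, which follows the order of steps described above.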

3.4 Reproducibility
All our data is provided publicly on GitHub. The following repository contains the scripts, queries and result CSV files used in this paper:

https://github.com/zalsader/LinuxFSQuestions

4 RESULTS AND DISCUSSION
4.1 Data Reliability
In this subsection, we examine the results of our first research question: RQ1 Can we rely on the Stack Overflow crowd to discuss all of the Linux kernel filesystem APIs?

4.1.1 Extent of Coverage. RQ1.1 Are the API methods widely covered?

To answer our first question, we considered the coverage of API functions and structures. We define coverage as the percentage of API elements that have at least n questions discussing the function or structure, where n is called the saturation level. We found that less than a third (32.9%) of the functions have at least one thread discussing them on Stack Overflow, and the percentage decreases considerably as n increases. For n = 2, 15.8% of the functions are covered, and for n = 5 and n = 10, the coverage is 8.4% and 4.2% respectively. From this, we conclude that Stack Overflow is reliable to a moderate extent for filesystem APIs, which allows us to continue our analysis. The moderate coverage could be due to the small number of novice programmers interested in these APIs, which generates fewer questions compared to other subjects such as Android.
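The coverage definition can be written as a short Python function. The per-function question counts in the usage example are made up for illustration and are not taken from the dataset.

```python
def coverage(question_counts, n, total=None):
    """Percentage of API elements with at least n linked questions.

    `question_counts` maps each API element to its number of linked
    questions. `total` defaults to the number of keys; pass it
    explicitly when elements with zero questions are not listed.
    """
    total = len(question_counts) if total is None else total
    covered = sum(1 for count in question_counts.values() if count >= n)
    return 100.0 * covered / total


# Hypothetical counts, for illustration only.
example_counts = {
    "seq_read": 12,
    "seq_open": 5,
    "seq_lseek": 2,
    "seq_release": 1,
    "debugfs_remove": 0,
}
for level in (1, 2, 5, 10):
    print(level, coverage(example_counts, level))
```

Applied to the 310 extracted elements, the reported 32.9% at n = 1 corresponds to roughly 102 functions and structures with at least one linked question.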

Additionally, we examined the coverage of API functions per category, as shown in Figure 2. The coverage varied between 0% and 60%, with a median of 44%. The figure shows that there are no big differences in coverage between most categories.

Figure 2: Coverage of Linux filesystem API by category, n = 1. Darker colors indicate higher coverage, and the size of each cell represents the number of functions in that category. [Treemap cells: The proc filesystem; Events based on file descriptors; The Filesystem for Exporting Kernel Objects; The debugfs filesystem; splice API; pipes API; The Filesystem types; The Directory Cache; Inode Handling; Registration and Superblocks; File Locks; Other VFS Functions; Journalling API Data Types; Journalling API Functions]

4.1.2 Usage vs Discussion. RQ1.2 Is there a relationship between the number of discussions around a function and its usage in GitHub repositories?

To explain why many functions have little or no mention on Stack Overflow, we compare the number of references to each function from the previous step to the usage data we collected from GitHub. Results are plotted in Figure 3 on a log-log scale. We find a moderately strong correlation between the log of the usage of a function and the log of the number of questions asked about it, with a slope of 0.51 and a Pearson correlation coefficient of 0.65. This translates into a sublinear dependence of the number of questions (y) on usage (x), meaning that, on average, the demand for function documentation from Stack Overflow grows more slowly than the function usage in code. This is consistent with past studies, and can be attributed to the fact that there is a finite number of questions that can be asked about a function before all of them have been asked on Stack Overflow. The slope in our case is higher than the average slope in related studies (0.31 in [9]), which may be because the average coverage of our API is lower than that of the projects considered by those studies.
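The slope and correlation on a log-log scale can be computed as in the following Python sketch. It uses an ordinary least-squares fit, which is an assumption since the fitting procedure is not specified above, and it drops functions with zero usage or zero questions because they cannot be placed on a logarithmic axis. The function name `loglog_fit` is introduced here for illustration.

```python
import math


def loglog_fit(usage, questions):
    """Fit log10(questions) = slope * log10(usage) + intercept.

    Returns (slope, intercept, pearson_r). Pairs with a non-positive
    value are skipped, since their logarithm is undefined.
    """
    pairs = [(math.log10(u), math.log10(q))
             for u, q in zip(usage, questions) if u > 0 and q > 0]
    count = len(pairs)
    mean_x = sum(x for x, _ in pairs) / count
    mean_y = sum(y for _, y in pairs) / count
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r
```

A slope b on this scale means that questions grow roughly as usage raised to the power b. With the reported slope of 0.51, a tenfold increase in usage corresponds to about 10^0.51, or roughly 3.2 times as many questions.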

4.1.3 Discussion over time. RQ1.3 Are there still ongoing discussions about these APIs? How fast are they covered?

To answer this question, we examine the coverage with different values of n over time. We notice that coverage is increasing slowly but steadily over time for n = 1, while there was a relatively short period of stability for n > 2 between 2016 and 2017. It is interesting to see that the coverage is still growing steadily even around 30 years after the creation of the API. This means that users are still facing new problems and asking new questions that no one has asked before on the website. This also indicates that even the newer versions of the documentation still have not kept up with the questions people ask about the filesystem APIs.

Figure 3: Usage vs number of references in Stack Overflow on a log-log scale. The dashed line represents the trend between the points. [Axes: x, Number of GitHub References (10^4 to 10^7); y, Number of Questions Referenced (10^0 to 10^2)]

Figure 4: The coverage for the API over time. [Axes: x, date (01/2010 to 01/2017); y, coverage (0% to 40%); one curve each for n = 1, n = 2, n = 5 and n = 10]
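The coverage-over-time curves can be derived from the creation dates of the linked questions, as in this Python sketch. The input layout (a mapping from function name to a list of question dates) and the name `coverage_over_time` are assumptions made for the example, and the dates in the usage note are invented.

```python
from datetime import date


def coverage_over_time(link_dates, total, n, checkpoints):
    """Coverage (in percent) at each checkpoint for saturation level n.

    `link_dates` maps an API element to the dates of the questions
    linked to it; `total` is the number of API elements overall.
    """
    series = []
    for cutoff in checkpoints:
        covered = sum(
            1 for dates in link_dates.values()
            if sum(d <= cutoff for d in dates) >= n
        )
        series.append((cutoff, 100.0 * covered / total))
    return series
```

For instance, with `{"seq_read": [date(2010, 5, 1), date(2012, 3, 1)], "seq_open": [date(2013, 1, 1)]}` and `total=4`, the n = 1 curve rises from 25% at the start of 2011 to 50% at the start of 2014. Since questions only accumulate in this model, each curve is non-decreasing.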

4.2 Questions asked about the API
In this subsection we examine and discuss the results from our second research question: RQ2 What kind of questions are asked about the APIs?

4.2.1 Common Keywords. RQ2.1 What are the most common keywords that appear in questions?

To answer this question, we identify the keywords that are most commonly found in questions and answers that contain any of the Linux filesystem API functions and structures, excluding code samples as defined in Subsection 3.2. Since the words are stemmed using the Porter stemmer, many of the terms in Figure 5 appear to be misspelled or incomplete. From the single word results, it seems that most questions ask about how to ‘use’ the functions or how they ‘work’, and some also ask for ‘examples’. Since we are looking at questions about the filesystem API, the words ‘write’, ‘read’ and ‘kernel’ appear very often. In addition, a lot of questions report ‘problems’ or ask for ‘solutions’. The results from mining answers for single word frequency do not seem very different.

The bigram and trigram results yield more interesting terms. After the most common ‘kernel module’ and ‘linux kernel module’, many questions contain ‘device driver’, ‘block device’ and ‘linux device driver’, in reference to questions about VFS functions related to device drivers, like this post⁶, which asks “how sys_open works?” in the context of a character device driver. Additionally, many questions ask for a ‘better way’ or the ‘best way’ to achieve a task. The results from mining bigrams and trigrams of answers are very similar, with the difference that ‘better way’, ‘best way’ and ‘create new’ appear more often than the other terms found in the questions.

In any of these keyword frequencies, we do not find many questions that directly reference the design of the API. This could be because advanced users of the API use different methods of communication when discussing design issues, like mailing lists or other online forums. It could also be that the model we used is too simple to find intricate questions like these, and manual inspection of questions might be needed.

⁶ https://stackoverflow.com/q/11578504

4.2.2 Collocated Functions. RQ2.2 How often do pairs of components appear together in questions?

This question explores how often questions ask about more than one function or API, giving insights about the codependence or confusion between functions. To achieve that, we search the recorded results from Subsection 3.3 for questions that contain two or more API functions. Table 1 shows the functions that appear together the most within the same category. From the results we can see that, within the same category, the sequential file processing methods seq_* are very commonly asked about together. This is because they are commonly used together to build device drivers, and many questions provide examples of how to use them; see this question⁷, which discusses sequential file operations in device drivers. Additionally, debugfs_create_dir and debugfs_create_file are commonly found together, since both of them can be used to create directories. The difference between them is that the former is the recommended function to create directories in debugfs, while the latter can be used to create directories and files alike. Most answers that reference both functions explain this difference, including this one⁸. Debugfs⁹ is a simple-to-use RAM-based file system commonly used for debugging. For that reason, we can find many questions that contain examples using several debugfs_* functions at the same time.
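Counting collocated functions amounts to grouping the recorded links by question and counting unordered pairs. The Python sketch below assumes that links are available as (question id, function name) tuples and that a mapping from function to category exists; both layouts, and the name `cooccurrences`, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(links, category_of):
    """Count same-category function pairs that share a question.

    `links` is an iterable of (question_id, function_name) tuples;
    `category_of` maps each function name to its category.
    """
    by_question = {}
    for question_id, function_name in links:
        # A set ignores repeated mentions within one question.
        by_question.setdefault(question_id, set()).add(function_name)

    counts = Counter()
    for functions in by_question.values():
        # Sorting makes (a, b) and (b, a) the same key.
        for first, second in combinations(sorted(functions), 2):
            if category_of[first] == category_of[second]:
                counts[(first, second)] += 1
    return counts
```

Calling `most_common(10)` on the returned counter yields a ranking in the shape of Table 1.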

5 THREATS TO VALIDITY
There are some threats to the validity and applicability of this study, and we examine them in this section. Generalizability: The first research question generalizes findings of previous work on Java APIs and the applications domain to the Linux filesystem API domain, but it is still not known whether this generalizes to other domains.

Stack Overflow questions are unstructured natural language text, so extracting information from them is naturally a noisy process. Additionally, when we look for mentions of a function inside Stack Overflow questions, finding a mention does not necessarily mean that the question is about that specific function, only that the function is related to the question somehow. There is a chance that the function we find in Stack Overflow is a function with a similar name from a different library, but this is very unlikely because most functions start with a ‘namespace’, like seq_*. Since Stack Overflow questions are asked by humans, there is a possibility that incorrect tags or formatting are used in a question; however, the website is heavily moderated, and any such mistake would be fixed by moderators within a few days.

GitHub code search only includes open source projects, which means that the usage data we derived could be biased towards open source projects. This data is however the best data we can get, and it is considered to be representative of the workloads. Additionally, GitHub allows any user to fork any repository they want, so we might have duplicates in the search results, but this is mitigated by the GitHub code search itself, since it does not include any forks that have fewer stars than the original repositories. There is also the possibility of functions sharing the same name, but the function names we considered are unique enough, as we concluded from manual inspection of the search results.

The word frequency analysis we did in RQ2.1 might be biased towards functions that have more questions related to them. This was not found to be a major concern in our analysis, but we could explore calculating the word frequency per function or category of functions.

⁷ https://stackoverflow.com/q/17663692
⁸ https://stackoverflow.com/q/42235800
⁹ https://www.kernel.org/doc/Documentation/filesystems/debugfs.txt

Figure 5: Word clouds generated by the most common words in all matching questions using single words (a) and bigrams (b)

Table 1: Top 10 functions that appear together from the same category

Function 1           Function 2                Category                Occurrences
seq_lseek            seq_read                  Other VFS Functions     41
seq_open             seq_read                  Other VFS Functions     12
seq_lseek            seq_open                  Other VFS Functions     11
debugfs_create_dir   debugfs_create_file       The debugfs filesystem  11
seq_lseek            seq_release               Other VFS Functions     10
seq_read             seq_release               Other VFS Functions     10
seq_open             seq_release               Other VFS Functions     9
debugfs_create_file  debugfs_remove            The debugfs filesystem  7
debugfs_create_file  debugfs_remove_recursive  The debugfs filesystem  5
debugfs_create_dir   debugfs_remove_recursive  The debugfs filesystem  4

6 CONCLUSION
In this study, we examined the questions asked about the Linux filesystem API on Stack Overflow, and demonstrated that it is a moderately reliable source of information on that API. We also showed that the community is still engaged in questions and discussions about Linux filesystem APIs, even a long time after their introduction. In addition, we quantitatively studied which keywords are the most common in questions asked about the APIs. Most of these keywords were asking for examples, for better ways to do operations, or for explanations about device drivers. We also examined which filesystem APIs are likely to be asked about together, indicating codependency, as in the seq_* functions, or confusion, as in debugfs_create_dir and debugfs_create_file.

In the future, we would like to expand our study with a qualitative examination of the questions we found. Since the coverage of our data was low, we would like to include other sources in the study, like Linux mailing lists and issues. We could also accompany the qualitative study with interviews asking developers of kernel modules to give their thoughts about the filesystem kernel functions and other kernel functions. The same study could be applied to any other module of the kernel, or could be expanded to include all kernel functions.

REFERENCES
[1] Muhammad Asaduzzaman, Ahmed Shah Mashiyat, Chanchal K. Roy, and Kevin A. Schneider. 2013. Answering Questions About Unanswered Questions of Stack Overflow. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR ’13). IEEE Press, Piscataway, NJ, USA, 97–100. http://dl.acm.org/citation.cfm?id=2487085.2487109

[2] Vaggelis Atlidakis, Jeremy Andrus, Roxana Geambasu, Dimitris Mitropoulos, and Jason Nieh. 2016. POSIX Abstractions in Modern Operating Systems: The Old, the New, and the Missing. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys ’16). ACM, New York, NY, USA, Article 19, 17 pages. https://doi.org/10.1145/2901318.2901350

[3] Daniel P. Bovet and Marco Cesati. 2006. Understanding the Linux Kernel (3rd ed.). O’Reilly, Beijing.

[4] Jonathan Corbet, Greg Kroah-Hartman, and Alessandro Rubini. 2005. Linux Device Drivers (3rd ed.). O’Reilly, Sebastopol, CA.

[5] Denzil Correa and Ashish Sureka. 2014. Chaff from the Wheat: Characterization and Modeling of Deleted Questions on Stack Overflow. In Proceedings of the 23rd International Conference on World Wide Web (WWW ’14). ACM, New York, NY, USA, 631–642. https://doi.org/10.1145/2566486.2568036

[6] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP ’03). ACM, 29–43. https://doi.org/10.1145/945445.945450

[7] Toni Gruetze, Ralf Krestel, and Felix Naumann. 2016. Topic Shifts in StackOverflow: Ask it Like Socrates. In Natural Language Processing and Information Systems, Elisabeth Métais, Farid Meziane, Mohamad Saraee, Vijayan Sugumaran, and Sunil Vadera (Eds.). Springer International Publishing, Cham, 213–221.

[8] James R. Hamilton and Tapas K. Nayak. 2001. Microsoft SQL Server Full-Text Search. IEEE Data Eng. Bull. 24, 4 (2001), 7–10. http://sites.computer.org/debull/A01DEC-CD.pdf

[9] David Kavaler, Daryl Posnett, Clint Gibler, Hao Chen, Premkumar Devanbu, and Vladimir Filkov. 2013. Using and Asking: APIs Used in the Android Market and Asked about in StackOverflow. In Social Informatics, Adam Jatowt, Ee-Peng Lim, Ying Ding, Asako Miura, Taro Tezuka, Gaël Dias, Katsumi Tanaka, Andrew Flanagin, and Bing Tian Dai (Eds.). Springer International Publishing, Cham, 405–418.

[10] T. C. Lethbridge, J. Singer, and A. Forward. 2003. How software engineers use documentation: the state of the practice. IEEE Software 20, 6 (Nov 2003), 35–39. https://doi.org/10.1109/MS.2003.1241364

[11] Robert Love. 2010. Linux Kernel Development (3rd ed.). Addison-Wesley, Upper Saddle River, NJ.

[12] Chris Parnin, Christoph Treude, Lars Grammel, and Margaret-Anne Storey. 2012. Crowd Documentation: Exploring the Coverage and the Dynamics of API Discussions on Stack Overflow.

[13] Luca Ponzanelli, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Michele Lanza. 2014. Mining StackOverflow to Turn the IDE into a Self-confident Programming Prompter. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). ACM, New York, NY, USA, 102–111. https://doi.org/10.1145/2597073.2597077

[14] M. F. Porter. 1980. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137. https://doi.org/10.1108/eb046814

[15] M. P. Robillard. 2009. What Makes APIs Hard to Learn? Answers from Developers. IEEE Software 26, 6 (Nov.-Dec. 2009), 27–34.

[16] Peter Jay Salzman. 2009. The Linux Kernel Module Programming Guide. CreateSpace, Paramount, CA.

[17] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. 1988. Innovations in Internetworking. Artech House, Inc., Norwood, MA, USA, Chapter Design and Implementation of the Sun Network Filesystem, 379–390. http://dl.acm.org/citation.cfm?id=59309.59338

[18] B. Vasilescu, V. Filkov, and A. Serebrenik. 2013. StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge. In 2013 International Conference on Social Computing. 188–195. https://doi.org/10.1109/SocialCom.2013.35

[19] Jinpeng Wei and Calton Pu. 2005. TOCTTOU Vulnerabilities in UNIX-style File Systems: An Anatomical Study. In Proceedings of the 4th Conference on USENIX Conference on File and Storage Technologies - Volume 4 (FAST’05). USENIX Association, Berkeley, CA, USA, 12–12. http://dl.acm.org/citation.cfm?id=1251028.1251040

[20] Erez Zadok, Dean Hildebrand, Geoff Kuenning, and Keith A. Smith. 2017. POSIX is Dead! Long Live... errr... What Exactly?. In 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17). USENIX Association, Santa Clara, CA. https://www.usenix.org/conference/hotstorage17/program/presentation/zadok