
An Effective Methodology for Processing and Managing Massive Spacecraft Datasets

Haydar Teymourlouei Department of Computer Science

Bowie State University, Bowie, MD, USA

Abstract - The emergence of enormous and complex datasets has strained existing data processing methods, and datasets continue to grow rapidly. Despite existing interpolation-based techniques for efficient large-scale data searching, locating data expeditiously remains an obstacle. This research offers a more effective method for searching large datasets in a timely manner. The method creates a directory file to catalogue and retrieve data; each file can be located by its time and date. The proposed method avoids searching a file's contents in their entirety and reduces the time required to find the data file.

Keywords: Big data, data processing, data set, interpolation search, raw data, unsorted data

1 Introduction
As technology expands, the rapid growth of complex and diverse data types drives the need for fast search algorithms. The sheer amount of data that must be ingested, analyzed, and managed must be considered when proposing useful tools, as must the speed at which that data is received and processed. The rise of information from new sources has taken a toll on IT; data management is therefore a much more difficult task using only traditional methodologies. Data now arrives as continuous data streams rather than finite stored data sets, posing barriers to users who wish to obtain results at a preferred time. Data arriving in this manner has no bounds or limits, so a delay in retrieval can be expected. With today's overflowing datasets, data management and analysis challenges are on the rise. Large data is a substantial amount of data, accumulated over time, that is difficult to examine and process with existing algorithms. Data analysis is the process of understanding the meaning of the data we have acquired, catalogued, and presented in a form such as a table or a line graph. Working with millions or even billions of datasets has become problematic for researchers. If we can sort, access, allocate, and evaluate the datasets competently, we can alleviate the difficulty of searching for data. Many researchers believe that using various indexing methods to search for data expedites the search mechanism [2].

2 Data Processing
The term data is typically described as information. Data processing is the conversion of raw data into meaningful information: data is first gathered and then processed. Large datasets usually refer to voluminous data beyond the capabilities of current database technology. Data refers to vast amounts of information in a standardized format and can include numbers, letters, equations, images, dates, figures, maps, documents, media files, and much more. Data processing is a distinct step in the information processing cycle. In information processing, "data is acquired, entered, validated and processed, stored and outputted, either in response to queries or in the form of routine reports. Data processing refers to the act of recording or otherwise handling one or more sets of data" [5]. Processed data must be displayed in an understandable and efficient form; the processing levels proceed consecutively from the raw input to a result that is readable to the user. Large datasets are transforming the way research is carried out and call for fast processing algorithms: data has to be processed effectively in order to convert raw data into meaningful information. See Figure 1 below for details.

Figure 1: Raw Data Conversion


Obtaining large volumes of data from a satellite and transforming the raw data into a simple, usable form requires tedious data processing, and performing such procedures requires the dynamic components of satellite operations. As huge amounts of data are generated, it is important to articulate how the data should be analyzed in order to extract interesting trends and patterns. At levels 1 and 2, processing operates on the raw data and the edited data. At levels 3 and 4, the edited data is calibrated and then resampled. At levels 5 and 6, data is obtained as maps, reports, graphics, etc., and ancillary data is used for calibrating or resampling the data sets. At levels 7 and 8, correlative data is used to interpret the space-based data sets, and user-description data is made available so that secondary users can extract information from the data.

Table 1: Data Processing Level [7]

If the volume of raw data is too large, making more than a single pass over the data may not be achievable. Processing levels offer adequate approaches for extracting useful information from huge data sets using only a small number of passes over the data; a great deal of information can be gathered from a single pass, or a few passes, over the data. The extent of processing applied to a data product determines many of the product's significant characteristics, and it also determines whether particular metadata elements or data services are applicable to the product. For example, geographic coordinate-based subsetting can easily be implemented for georectified raster data, but the same type of service is difficult to provide for raw remote sensing images. To facilitate data management and to standardize metadata and data services, data products in EOSDIS are classified into five levels according to the degree of processing: the higher the level, the higher the degree of processing [3].
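To make the single-pass idea concrete, here is a hedged Python sketch (not part of the original paper) that gathers summary statistics from a large data file in exactly one pass; the file name and the one-numeric-value-per-line format are assumptions made for illustration only.

```python
# Single-pass summary over a large data file (illustrative sketch).
# Assumes one numeric value per line; the file name is hypothetical.

def single_pass_summary(path):
    count, total = 0, 0.0
    minimum, maximum = float("inf"), float("-inf")
    with open(path) as f:
        for line in f:                 # the file is read exactly once
            value = float(line)
            count += 1
            total += value
            minimum = min(minimum, value)
            maximum = max(maximum, value)
    mean = total / count if count else 0.0
    return {"count": count, "mean": mean, "min": minimum, "max": maximum}

# Example call (hypothetical file name):
# print(single_pass_summary("raw_telemetry.txt"))
```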

2.1 Unsorted Data
Accessing unsorted data is no longer the issue; the issue is extracting valuable information from it. Before data can become "information," it must be extracted, organized, and at times analyzed and formatted for presentation. Unsorted data is new, original data that has not yet been touched or modified; in other words, it is unprocessed and unorganized. Raw data can be anything from a series of numbers to the way those numbers are sequenced or even spaced, yet it can yield very important information; a computer interprets it in a way that attempts to make sense to the reader [9]. Raw data captured from a spacecraft is not in a presentable form. It must be processed to yield meaningful and relevant information and is also known as source or atomic data. The data is completely unrecognizable and must be processed manually or by machine. Raw data can be hex, binary, or character data; a computer may interpret it and produce a readout that makes sense to the reader. Once raw data is collected, it goes into a database, where it becomes available for further processing and examination. A good example is unsorted hex data. As shown below, unsorted hex data is very difficult to read; to make it readable, the data has to be processed through several levels.

Figure 2: Example of Hex Unsorted Data

Once data has been generated, processed, and stored, it can then be made available in a more useful form for scientists’ research.
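As a hedged illustration of this conversion (our own example, not taken from the paper), the following Python snippet decodes a short raw hex string into readable integer values; the hex payload and the two-byte big-endian field layout are purely hypothetical.

```python
# Decoding a raw hex string into readable values (illustrative only).
# The hex payload and the 2-byte big-endian field layout are assumptions.

raw_hex = "0A1B2C3D4E5F"               # hypothetical raw/unsorted hex data
payload = bytes.fromhex(raw_hex)        # convert hex characters to raw bytes

# Interpret the payload as consecutive 2-byte unsigned integers.
values = [int.from_bytes(payload[i:i + 2], "big")
          for i in range(0, len(payload), 2)]

print(values)   # [2587, 11325, 20063] - now human-readable numbers
```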

2.2 Massive Data Challenges
Because of the massive amount of data being collected, storage, management, and analysis of the data are major challenges. Other problems include the growth in volume, velocity (speed), and variety (data types). To resolve these problems, scalable computing and analysis methods must improve. Massive data is data of a size that common software cannot handle. According to NASA, approximately 1.73 gigabytes of data are gathered from its nearly 100 currently active missions every hour, every day, every year, and the collection rate is growing exponentially. Handling, storing, and managing this data is a massive challenge.


Big data is, very simply, a collection of data sets so large and complex that legacy IT systems cannot handle them. Approaching the big data challenge often necessitates interpolation algorithms, infrastructure, and frameworks [8]. Data gathering draws on a variety of different sources. Typically, there are two major concerns involved in analyzing massive datasets: (i) the time required to search through data files and (ii) how to effectively identify relationships between data. Certainly, this requires knowledge of what the scientist is looking for, for instance, what is considered an anomaly (e.g., calibration problems with instruments). Therefore, the algorithms must be able to search through and efficiently manipulate massive data sets [9]. To remove such uncertainty, data has to be analyzed reasonably quickly so that users can gather data in a short amount of time. NASA faces major challenges in its daily activities. NASA's big data challenge is not just a global challenge; it often appears as an unorthodox one. NASA describes many of its "Big Data" sets as consisting largely of metadata. The term metadata refers to data that describes other data, which can make finding and working with particular data easier and more efficient. When NASA engages in spacecraft missions, it operates two very different types of spacecraft: deep space spacecraft that send back data in the MB/s range, roughly a million bits of information per second, and Earth orbiters that send back data in the GB/s range. NASA typically uses these two types of spacecraft because it regularly engages in missions in which data streams continually from spacecraft on Earth and in space faster than it can be stored, managed, and interpreted. In current missions, data is transferred through radio frequency, which is currently the most inefficient option because of its slow speed. NASA is working on a new technology that uses optical laser communication to increase transfer rates, which would result in roughly a 1000x increase in the volume of data. That type of data transfer is far more than can be handled today, but by preparing for the new technology now, NASA will be ready for the future. NASA is already planning future missions that will stream more than 24 terabytes a day; streaming 24 TB a day is roughly equivalent to transferring 2.4 times the entire Library of Congress every day, for just one mission. Even at those speeds, it is still relatively expensive to transfer a single bit of information from a spacecraft. Once the data reaches the data centers, storing, managing, visualizing, and analyzing it becomes a concern. For example, since everything changes with time, climate change data sources are projected to grow to nearly 350 petabytes by 2030, which is only 15 years away. Even though that may seem like a long time from now, a growth of five petabytes a year is comparable to the number of letters delivered by the US Postal Service in one year. One striking example of the challenge of managing space data is just beginning to be demonstrated by the Australian Square Kilometre Array Pathfinder (ASKAP). The project is an extensive array of 36 antennas, each 12 meters in diameter, spread over more than 4,000 square meters yet operating together as a single instrument to unlock the mysteries of our universe. Furthermore, spacecraft are by no means the only source of data, thanks to an ever-growing supply of mobile phones, low-cost sensors, and online platforms. The scale of the big data challenge for NASA, as for many other organizations, is daunting. As one might expect, growing data volumes are not the only difficulty: as the wealth of data increases, the challenges of indexing, searching, transferring, and so on all increase exponentially as well. Moreover, the increasing complexity of instruments and algorithms, the accelerating pace of technology refresh, and shrinking budgets all play a critical role in the approach.

3 Interpolation Search
Interpolation search is a method of retrieving desired data by key in an ordered file, using the value of the key and the statistical distribution of the keys. It is an algorithm for finding a given key in a sorted array. It predicts where the key would lie in the search space via linear interpolation, reducing the search space to the part before or after the estimated position if the key is not found there. The technique works when computing a difference between key values is meaningful, and it can be viewed as a modification of binary search with reduced complexity. Binary search finds the position of a specified input value (the search key) within an array sorted by key value; the array must be arranged in ascending or descending order. Binary search always selects the middle element for comparison, discarding one half of the search space. Interpolation search, by contrast, efficiently finds a key in an indexed array that has been ordered by the values of the key: in each step it calculates where in the remaining search space the wanted item is likely to be, based on the key values at the bounds of the search space and the value of the wanted key. The key value found at the estimated position is then compared to the key value being searched for, and if they are not equal, the remaining search space is reduced to the part before or after the estimated position. Interpolation search is an alternative to binary search that exploits information about the underlying distribution of the data being searched. By using this extra information, interpolation search can be as fast as O(log(log n)), where n is the size of the array. Interpolation search models how people search a dictionary better than binary search does: a person looking up the word "Yellow," for example, would instinctively flip toward the end of the dictionary to find that word, instead of flipping to the middle.


This is the basic idea of how interpolation search works. Interpolation search is a search algorithm inspired by binary search that, for some data sets, performs asymptotically better. Both binary and interpolation search require the data to be sorted and use that sortedness to rule out sections of the data from consideration. They work by picking an element at some position, comparing it with the element being sought, and then deciding whether to continue the search on the left or the right. The key distinction is that binary search always splits the input range exactly in half, which guarantees a runtime of O(lg n). Interpolation search assumes that the data is distributed uniformly and performs a linear interpolation between the endpoints to guess where the element ought to be. Under the uniform-distribution assumption, it can be shown that interpolation search runs in expected O(lg lg n) time, exponentially faster than binary search.

3.1 Interpolation Search Algorithm
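The paper presents the algorithm itself as a figure. As a hedged substitute, the following Python sketch implements the interpolation search described in Section 3, probing a position estimated by linear interpolation between the bounds of the current search space; the function name and interface are ours.

```python
# Interpolation search over a sorted list (sketch of the method in Section 3).
# Returns the index of `key`, or -1 if it is not present.

def interpolation_search(arr, key):
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= key <= arr[hi]:
        if arr[hi] == arr[lo]:                    # avoid division by zero
            return lo if arr[lo] == key else -1
        # Estimate the probe position by linear interpolation between the bounds.
        pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == key:
            return pos
        elif arr[pos] < key:                      # key lies in the right part
            lo = pos + 1
        else:                                     # key lies in the left part
            hi = pos - 1
    return -1
```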

3.2 Example of Interpolation Search
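The worked example also appears as a figure in the original. As a hedged stand-in, the following usage of the sketch above shows how interpolation search probes a uniformly spaced array; the data values are invented for illustration.

```python
# Hypothetical example: search for 70 in a uniformly spaced sorted array.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# First probe: pos = 0 + (70 - 10) * (9 - 0) // (100 - 10) = 6,
# and data[6] == 70, so the key is found in a single probe,
# whereas binary search would first inspect the middle element (50).
print(interpolation_search(data, 70))   # -> 6
```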

3.3 Interpolation Search Complexity
Complexity is usually divided into two kinds: time complexity and space complexity. Time complexity is the amount of time the computer requires to execute the algorithm; space complexity is the amount of memory space the computer requires. The complexity of an algorithm is a function g(n) that gives an upper bound on the number of operations performed by the algorithm when the input size is n. In the case of searching, this is the process of finding the location of a given data element in the data structure. On uniformly distributed data, the interpolation search algorithm's expected running time is O(lg lg n), which is better than binary search.

The best/average case complexity is O(log log N), where N is the number of keys, provided the keys are uniformly distributed.

The worst case complexity is O(n); for example, searching for 1000 in 1, 2, 3, …, 888, 1000, 10^9.
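As a brief sketch of where the doubly logarithmic bound comes from (a standard argument, not reproduced from the paper): under the uniform-distribution assumption, each probe reduces the search range from n to roughly √n, giving the recurrence

```latex
T(n) \;=\; T\!\left(\sqrt{n}\right) + O(1)
\quad\Longrightarrow\quad
T(n) \;=\; O(\log \log n),
```

since after k probes the range is about n^(1/2^k), which shrinks to a constant once 2^k ≈ log n, i.e. after k = O(log log n) probes.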

The figure below compares the complexity of the interpolation and binary search algorithms. The results show a large time difference between the two searches in finding the desired value: interpolation search, with O(log(log n)) complexity, found the result in about five seconds, while binary search, with O(log n) complexity, found the desired key in about 27 seconds.

Figure 3: Interpolation vs. Binary Search Complexity

4 Methodology
The motivation behind the proposed algorithm is to give the user the ability to search for specific data in a timely fashion. The method is efficient for obtaining only a selected partition of the scientific data; in other words, after the data has been reduced to selected files, the algorithm carries out a search on those files. Suppose one is looking for certain information spread across numerous files, where each item is stored in a different file. After gathering feedback from reviewers, one then wants to examine all occurrences of the specific item across the multiple files. To simplify the searching, a directory file is needed. A directory file can point to the file index and the start and end times of the searched data almost immediately, and it helps catalogue and retrieve data much faster. A file directory is a place where files are stored in a computer; files are retrieved based on the time and date information stored in the directory file. With this method, the data retrieval process is simplified and the time needed to locate the selected file is reduced. Each data file holds a start time and an end time.


Table 2: Sample of Directory File

More specifically, this study proposes the development of an algorithm that generates a directory file that contains the following information:

• File index;
• File name;
• Start and end time for every file;
• Start address and end address (position of the file).

This will provide a tool for rapidly accessing data within these parameters.
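A minimal Python sketch of how such a directory file could be represented and queried follows (an illustration of the proposed idea under our own assumptions, not the authors' implementation); the field names, timestamp format, and the assumption that entries are sorted by start time are ours.

```python
# Directory-file sketch: each entry records a data file's index, name,
# start/end time, and start/end byte address. Entries are assumed to be
# sorted by start time so a time-based lookup can avoid scanning content.

from bisect import bisect_right

directory = [
    # (index, file_name,       start_time,          end_time,            start_addr, end_addr)
    (1, "telemetry_001.dat", "2015-01-01T00:00", "2015-01-01T06:00",     0,      4096),
    (2, "telemetry_002.dat", "2015-01-01T06:00", "2015-01-01T12:00",  4096,      8192),
    (3, "telemetry_003.dat", "2015-01-01T12:00", "2015-01-01T18:00",  8192,     12288),
]

def find_file(directory, query_time):
    """Return the directory entry whose [start_time, end_time) covers query_time."""
    starts = [entry[2] for entry in directory]
    i = bisect_right(starts, query_time) - 1        # last entry starting <= query
    if i >= 0 and query_time < directory[i][3]:
        return directory[i]
    return None                                      # no file covers that time

print(find_file(directory, "2015-01-01T07:30"))      # -> telemetry_002.dat entry
```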

4.1 Experiment (Binary Search, Linear Search, and Interpolation Search)

Search algorithms are used to check for and find an element in a very large list of elements. There are many different search algorithms; here we compare the binary, linear, and interpolation search algorithms.

Binary search: searches a complete sorted list by repeatedly dividing it into two parts. It first compares the input value to the middle element of the array; depending on whether the input value falls to the right or the left of the middle element, only one half of the list needs to be checked. This reduces the portion of the sorted list that has to be searched. The algorithm uses close to the minimum possible number of comparisons, which makes binary search more efficient than linear search.

Linear search: a basic and simple way of searching that finds a given value in a list by checking every one of its elements, one at a time and in order, until the value is found. Each element in the array is read sequentially and compared with the search key; the search is unsuccessful if all the elements have been read and the desired element has not been found.

Interpolation search: also referred to as extrapolation search. This algorithm searches for a given key value in an indexed array that has been ordered by the values of the key. When the values {ai} are distributed relatively uniformly, it achieves a doubly logarithmic time complexity.
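To give a sense of how the iteration comparison in Table 3 could be instrumented (a hedged sketch under our own assumptions, not the experimental code behind the reported results), the following Python snippet counts the loop iterations each of the three searches needs on the same sorted, uniformly spaced data.

```python
# Count how many loop iterations each search needs to find `key`
# in the same sorted array (illustrative instrumentation only).

def linear_iterations(arr, key):
    for i, value in enumerate(arr, start=1):
        if value == key:
            return i
    return len(arr)

def binary_iterations(arr, key):
    lo, hi, steps = 0, len(arr) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if arr[mid] == key:
            return steps
        lo, hi = (mid + 1, hi) if arr[mid] < key else (lo, mid - 1)
    return steps

def interpolation_iterations(arr, key):
    lo, hi, steps = 0, len(arr) - 1, 0
    while lo <= hi and arr[lo] <= key <= arr[hi]:
        steps += 1
        if arr[hi] == arr[lo]:
            break
        pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == key:
            return steps
        lo, hi = (pos + 1, hi) if arr[pos] < key else (lo, pos - 1)
    return steps

data = list(range(0, 1_000_000, 10))     # uniformly spaced keys (assumption)
key = 777_770
print(linear_iterations(data, key),
      binary_iterations(data, key),
      interpolation_iterations(data, key))
```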

Table 3: Comparisons of Search Algorithms Result

5 Results
The proposed algorithm can adequately extract relevant information from a vast quantity of data, and it does so with fewer iterations than existing methods. When examining the linear search algorithm, we considered the number of iterations that occurred and determined ways to decrease them. The greater the number of iterations, the longer the delay in retrieving data; therefore, the number of iterations must be reduced to provide quicker responses. The results obtained suggest that this algorithm is effective in making data available at any time. The table above compares three search algorithms: binary search, linear search, and interpolation search. Both the linear and binary search algorithms generate a massive number of iterations, whereas the proposed interpolation search algorithm successfully reduces the number of iterations.

Table 4: Sample Data File


Table 5: Interpolation vs. Binary Iteration Results

The table above shows that the proposed algorithm requires less time and fewer iterations than binary search, which makes it more appropriate for researchers' purposes. The binary search algorithm is widely used to search and sort data efficiently; here, it took ten iterations to arrive at the selected data, whereas the proposed algorithm took only two iterations to arrive at the desired data.

Figure 4: Interpolation vs. Binary Multiple Data Iteration Results

The figure above shows that the number of iterations in binary search is much higher than in interpolation search. The results produced by the interpolation algorithm show that the selected files were found easily and effectively. The algorithm was instructed to locate the closest points matching the user's query within a vast amount of data. Interpolation search took fewer iterations to find the data file, whereas binary search took many iterations. Researchers want to access results in a timely manner and obtain the most relevant data; the proposed algorithm is therefore well suited to such a case. This algorithm represents a significant advance over the existing binary search algorithm.

Figure 5: Interpolation vs. Binary Multiple Data Average Iteration Results

Figure 5 compares the binary and interpolation search algorithms' average number of iterations. Choosing the right program for data analysis can save time and frustration. Working with the right program not only helps scientists obtain more suitable results and more revealing graphics, but also allows researchers to organize their data effectively. The recommended method helps researchers gain that ability.

6 Conclusion
The proposed algorithm will help researchers develop a range of tools for searching, retrieving, and processing data. Large datasets continue to grow rapidly over time; therefore, through better analysis of large volumes of data, there is potential for faster advances and for improving the profitability and success of many enterprises. Because the proposed algorithm achieves a significant reduction in processing time, researchers can manage and obtain the desired data at a preferred time. The algorithm is not limited to studies conducted by NASA or by scientists in general; it can also be used in data centers and in the medical field. In medicine, for instance, the processing of medical data plays an increasingly important role, e.g., computed tomography, magnetic resonance imaging, and so forth. These data types are produced continually in hospitals and are increasing at a very high rate, so the need for systems that provide efficient retrieval of medical data of particular interest is becoming very high. The suggested algorithm can be used in this case to ease the burden of data retrieval and to assist the relevant data retrieval process. The algorithm can manage data in all of its aspects, including data in ASCII formats, binary codes, compressed data, uncompressed data, and so forth.


7 References

[1] Bustrace Technologies LLC. (n.d.). Retrieved from http://www.bustrace.com/bustrace7/manual/HTML/06_buscapture/3_layout/5_io_details/4_raw_data.htm

[2] Chou, J. (2011). Parallel index and query for large scale data analysis. Manuscript submitted for publication. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=06114446&tag=1

[3] Di, L. (2000). NASA standards for Earth remote sensing data. International Archives of Photogrammetry and Remote Sensing, Amsterdam. Retrieved from http://www.isprs.org/proceedings/&iii/congress/part2/147_XXXIII-part2.pdf

[4] Facts, K. A. (2002). Key Aqua Facts. Delta, 73-93.

[5] Leung, A. W. (2009). Organizing, Indexing, and Searching Large-Scale File Systems.

[6] Member, S., Ortega, A., & Shen, G. (2010). Transform-Based Distributed Data Gathering.

[7] Rossi, R., & Witasse, O. (n.d.). Introduction to PDS and ESA data archives. Informally published manuscript. Retrieved from http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=46886

[8] Skytland, N. (2012, October 4). [Web log message]. Retrieved from http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/

[9] Teymourlouei, H. (2013, March). An effective methodology for processing and analyzing massive datasets. 2nd International Conference on Computational Techniques and Artificial Intelligence, Dubai. Retrieved from http://psrcentre.org/images/extraimages/313057.pdf
