Improved Software Implementation for Montgomery Elliptic Curve Cryptosystem

: The last decade witnessed rapid increase in multimedia and other applications that require transmitting and protecting huge amount of data streams simultaneously. For such applications, a high-performance cryptosystem is compulsory to provide necessary security services. Elliptic curve cryptosystem (ECC) has been introduced as a considerable option. However, the usual sequential implementation of ECC and the standard elliptic curve (EC) form cannot achieve required performance level. Moreover, the widely used Hardware implementation of ECC is costly option and may be not affordable. This research aims to develop a high-performance parallel software implementation for ECC. To achieve this, many experiments were performed to examine several factors affecting ECC performance including the projective coordinates, the scalar multiplication algorithm, the elliptic curve (EC) form, and the parallel implementation. The ECC performance was analyzed using the different factors to tune-up them and select the best choices to increase the speed of the cryptosystem. Experimental results illustrated that parallel Montgomery ECC implementation using homogenous projection achieves the highest performance level, since it scored the shortest time delay for ECC computations. In addition, results showed that NAF algorithm consumes less time to perform encryption and scalar multiplication operations in comparison with Montgomery ladder and binary methods. Java multi-threading technique was adopted to implement ECC computations in parallel. The proposed multithreaded Montgomery ECC implementation significantly improves the performance level compared to previously presented parallel and sequential implementations. has lower computational complexity, and hence requires less field operations. This research studied the performance level of sequential and parallel implementation of Montgomery ECC. The parallel implementation was done using Java multithreading techniques. Three projective coordinates were implemented


Introduction
Elliptic Curve cryptosystem (ECC) is a next-generation approach to public key cryptosystems that uses relatively small keys to provide the same or greater level of security compared to the other public-key cryptosystems [1]. The use of smaller key sizes led to a considerable improvement on the speed of the cryptosystem [2]. Therefore, ECC received increasing interest in last decade, especially for security applications that demand high-performance cryptosystem, such as multimedia and military applications. Recently, the dramatic increase in the amount of data being processed by those applications, and the need to provide higher security levels make it compulsory to improve the performance of ECC to satisfy these requirements.
The performance of ECC is affected by the choices of implementation environment; software and hardware implementations, and the domain parameters such as field representation, Elliptic curve (EC) form, algorithm for scalar multiplication, and the coordinates system. Every one of the aforementioned parameters plays important role in the ECC performance [3].
The current research investigates the ECC performance using the different parameters, and selects the best possible parameters' choices and scalar multiplication algorithms to optimize the cryptosystem's performance. This study considers software implementation of ECC since it is cost-effective and consumes less resources in comparison with hardware implementation. The rest of this article is organized as follows: Section 2 mathematical background, Section 3 related works, Section 4 proposed cryptosystem design, Section 5 results and discussion, and finally the conclusion in Section 6.

Mathematical Background
ECC is a type of public key cryptography that depends on the discrete logarithm problem for EC. ECC has two levels of computations; upper and lower levels of computations. The main operation in the upper level is the scalar multiplication, which consists of two operations; point doubling and point addition (called point operations). The lower level, on the other hand, includes finite field (called Galois field GF) computations, which are modular addition, subtraction, multiplication, and division operations. The latter is the most time consuming operation because it requires finding the multiplicative inverse [4].
The scalar multiplication algorithm performs either one or both of point operations in each iteration. The point addition operation adds two different points G and Q on the EC to obtain another EC point R. The point addition can be calculated via computing the coordinates (x 3 , y 3 ) of the resulting point, as follows: G(x 1 , y 1 ) + Q(x 2 , y 2 ) = R (x 3 , y 3 ), where x 1 = x 2 The slope (M) = y 2 − y 1 Alternatively, the point doubling operation adds the point to itself (Q = G) and can be calculated as follows: G(x 1 , y 1 ) + G(x 1 , y 1 ) = R(x 3 , y 3 ), where x 1 = 0 The slope (M) = 3x 2 1 + A 2y 1 It worth mentioning that the EC form does not affect the point addition calculations, hence it relies only on the two points on the curve. On the contrary, point doubling calculation changes according to the EC form and the use of coordinate systems, since it requires to derive the slope equation from the elliptic curve equation.
There are two common types of finite fields in which ECC computations are applied; the prime field GF(P) and the binary field GF(2 m ). The GF(P) consist of the set of integers F = {0, 1. . ., P-1} where all the operations performed using modular arithmetic (mod P) where P is the prime number.
Several elliptic curve forms over GF(P) were presented previously. Among the most important forms is the Montgomery curve since it has less computational complexity, which makes it appropriate for security applications that need high speed ECC.
The Montgomery curve equation is defined as follows: E: By 2 = x 3 + Ax 2 + x where the value of A and B are predefined, and B.(A 2 − 4) = 0, and x, y, A and B are elements of the GF(P).
To find all points on the Montgomery curve over GF(P), substitute the value of x = 0, 1,. . .P in the Montgomery curve equation then take the square roots to find the value of y. The EC is distinguished from other curves since it is symmetric about the x-axis; every point on the elliptic curve has a mirrored point on the other side of the x-axis. For any non-vertical line drawn through the elliptic curve, it will intersect exactly at three points. The EC has a point called point at infinity [5].
Points on a curve can be represented using different types of coordinates systems. The standard coordinates is affine coordinates P(x, y). Alternatively, the use of projective coordinates introduced to improve the performance of elliptic curve computations due to its ability to avoid the time-consuming inversion operation. In projective coordinates a point (x, y) can represented by triples (x, y, z) with x, y and z are elements in a finite field. There are three Known projective coordinate systems, which are Homogenous, Lopez-Dahab and Jacobian. The relation of homogenous projective to Affine is (x 1 : y 1 : z 1 ) → x z , y z . The relation of Lopez-Dahab projective to Affine is (x 1 : y 1 : z 1 ) → x z , y z 2 . The relation of Jacobin projective to Affine is (x 1 : y 1 : z 1 ) → x z 2 , y z 3 [6]. The main operation in ECC encryption is the scalar multiplication, which is usually used to judge the cryptosystem performance. Three main algorithms were used to perform scalar multiplication, which are the Montgomery ladder, NAF, and Binary left to right (LTR). The scalar multiplication algorithms vary in terms of speed and security levels [7]. Their characteristics will be studied in this research to determine the most efficient algorithm for developing high-speed ECC.

Related Works
There has been an intense amount of researches done in the field of ECC since it was introduced in 1985. The majority of previous studies investigated possible factors to improve ECC performance such as the use of certain EC form, the use of projective coordinates, the parallel and concurrent execution of elliptic curve computations, and the use of efficient scalar multiplication algorithms. Again, all these factors affect the ECC performance and should be considered when designing high-speed cryptosystem. Other research works focused on safeguarding ECC against certain types of attacks such as simple time attack (STA). This section surveys the most important research works that studied the possible improvements on the performance and security of ECC.
EC point operations (point doubling (ECDBL) and point addition (ECADD)) include the time expensive modular Inversion when using affine coordinates. Many researchers proposed using projective coordinates systems to eliminate the modular Inversion [8]. The studies presented in [9,10] proposed several parallel hardware designs for ECC point addition and point doubling respectively using the standard EC over GF(P). Both studies investigated the use of three known projective coordinates to eliminate the modular inversion operation. Researchers also used parallel hardware implementation to utilize the inherent parallelism in ECC computations and hence improve the speed of ECC encryption. Authors of [11] proposed parallel data flows for EC computations over GF(P) using several projective coordinates. The mathematical equations of Montgomery curve point operations were developed and then transformed into the best parallel data flow that exploits the maximum parallelism degree. It has been suggested by this study that Montgomery curve represents efficient choice for developing a high-speed EC crypto-processor. It achieved better performance results compared to the standard EC form.
There is a plenty of studies that examined the hardware implementation of ECC forms. Some of them used sequential implementation [12,13], where others investigated potential enhancements that could be obtained using parallel hardware implementations [14][15][16].
However, the major disadvantage of hardware implementations is that they require a dedicated crypto-processor and hardware components, entailing significant increase in the cost of designing and implementing the cryptosystem. This makes hardware implementation of ECC not affordable and complicated choice for many security applications, especially those applications with limited resources. Software implementations for ECC, on the other hand, have low development costs, easy to upgrade, and more flexible [17]. In addition, the majority of previous studies examined the cryptosystem performance with only one scalar multiplication algorithm. This research uses the software approach instead of hardware implementation, and investigates the cryptosystem performance with three different and efficient scalar multiplication algorithms. Both sequential and parallel software implementations will be studied in this research.
Authors of [18,19] surveyed the scalar multiplication algorithms that are common in EC cryptography field. These algorithms were compared in terms of the execution time, the hamming weight of the scalar k, the number of doubling operations, and the required precomputations. Both studies stated that the addition and subtraction algorithm is more efficient than the binary scalar algorithm because it uses signed binary representation of the scalar called Non Adjacent Form (NAF), which is proven to have less hamming weight. The window method is suitable for less constrained memory since it uses a table of pre-computed points. Another study [20] conducted a performance analysis on five different scalar multiplication algorithms using the Weierstrass curve over the two finite fields. The study stated that the NAF Algorithm was shown to have the least execution time GF(P) compared to other algorithms. Moreover, the results confirmed that the prime field is more efficient for software implementations; the scalar multiplication over the prime field is faster than the binary field.
In [21], researchers conducted a performance analysis on Java implementations of EC operations over both standard finite fields. The study suggested that the java BigInteger class is more efficient for the software implementations of EC operations in the prime field more than the binary field. The research work published in [22] presented a description for the software implementation of ElGamal ECC over GF(P). In addition, the study provided detailed description about the main steps of ElGamal ECC. The author used the general Weierstrass EC equation, and the affine coordinates for point representation, in addition to the NAF algorithm to perform the scalar multiplication. Authors in [23] proposed a new ECC that eliminates the need for the mapping operation that has been used in previous studies, which contributes in increasing the speed of the encryption process. This was achieved by using the ASCII values for each character rather than a dedicated algorithm for implementing the mapping procedure. The study presented in [24] provided implementation details for the proposed software ECC that is able to encrypt/decrypt text and images.
Authors of [25] proposed ECC software implementations using a new scalar multiplication algorithm. In particular, the binary trees were used to carry out the results of point multiplication. The proposed technique was based on adding the point to itself repeatedly and the divide and conquer approach. The research work [26] introduced an EC cryptosystem that runs in a multithreading environment. The proposed parallel cryptosystem runs different mathematical algorithms for point multiplication on the standard Weierstrass curve. The point multiplication algorithms used are Karatsuba and Montgomery algorithms. Again, the results showed that the prime field is faster than the binary field, which makes it appropriate choice for software applications.
The study presented in [27] provided an explanation on how to implement a java EC cryptosystem over the prime and binary fields. The paper implemented the key pairs generation, keys exchange, and ECDSA using SunEC provider. Another study [28] implemented the EC Digital Signature Algorithm using the Java Development Kit (JDK) version 1.2. The developed system was tested using the following key sizes 192,239, and 256 bits. The experimental results showed that key pair generation using 256-bits takes 13.6 ms, and the signing process takes 13.7 ms, and the verifying takes 13.7 ms.
It has been noticed that majority of previous research works did not conduct performance testing on their software implementations of ECC to highlight the efficiency of their proposed cryptosystems. In addition, previous studies focused on the sequential software implementation for ECC computations and considered one scalar multiplication algorithm per each proposed ECC implementation.
Furthermore, the reviewed research works that studied the software implementation of ECC focused on the standard EC form and did not investigate the use of newly introduced EC representations such as Montgomery, and tripling oriented curves. Another issue in previous works is that they are predominately use the affine coordinates system to represent the EC points, which leads to a significant drawback in ECC performance. This is because the use of affine coordinates requires finding the multiplicative inverse to perform the modular division operation. Computing the inverse is the most time consuming operation in EC cryptography.
Although, the projective coordinates were used in some studies, those researches did not benefit from the parallel and concurrent execution of ECC computations with projective coordinates. Instead, the vast majority of reviewed researches used the sequential implementation of ECC, in which only one finite field operation can be performed in each level of computations. Again, this increases the time delay of ECC operations and exposes the cryptosystem to the risk of side channel attacks. The sequential software implementation of ECC in not efficient for security applications that require high-speed cryptosystems such as multimedia and military applications.
This research seeks to develop improved software implementation for ECC. To this end, the different factors affecting the ECC performance will be investigated, which are the use of projective coordinates, the use of EC forms with less computations complexity, the scalar multiplication algorithm, and the use of parallel software implementation to perform ECC computations. In the proposed ECC's implementations, the different factors are tuned up and combined to reduce the time delay. The ECC performance is evaluated and analyzed to find the best combination of these factors that achieves the optimum performance level. This aims to develop a high-performance and efficient ECC. To our knowledge, this combination of factors and optimization of the software implementation of ECC have never been proposed and studied previously.

Proposed Cryptosystem Design
This section presents the equations and methods used to implement ECC computations. The parallel computational schemes, which are required for parallel software implementations, are designed in this section. Eventually, this section introduces the enhanced software implementation of ECC.
This research uses the recently introduced form of EC called Montgomery curve, which was presented in Section 2. The Montgomery EC is selected since it has relatively low computational complexity. EC points are represented using projective coordinates instead of usual affine form to avoid the modular division operation. Three types of projective coordinates (Homogenous, Lopez Dahab, and Jacobian) were tested in this research to find the one that gives the best performance.
The computations of point doubling and addition operations were implemented in parallel manner via utilizing the inherent parallelism in ECC computations. Different parallelization levels were tested to find out the best level that achieves the least time delay.
The major scalar multiplication algorithms were examined in this research. In particular, we evaluated the performance of ECC using the Binary method, the NAF algorithm, and the Montgomery scalar algorithm. Proposed ECC implementations are evaluated and compared in terms of time consumption of the scalar multiplication operation. The Java programming language is used to implement the proposed ECC and perform encryption and decryption operations. The following section presents the underlying computations for Montgomery point operations using the three coordinates systems.

Equations and Methods
In this section, the ECC point computations using projective coordinates are illustrated. The main EC point operations are point doubling and point addition. Each one of them requires performing a number of finite filed computations; multiplication, addition, and subtraction. The projective coordinates use three dimensions (x, y, and z) to represent an EC point instead of two dimensions. The point addition operation adds two different EC points to find a third point, while the point doubling operation duplicates an EC point to find another point. Equations of point doubling and addition are presented in Section 2.

Homogeneous Projection System
The Homogeneous coordinates were used to represent the points on the Montgomery curve. The point doubling operation computes the third point represented by the coordinates X3, Y3, and Z3. This can be done using the following equations:

Lopez-Dahap Projection System
In the Lopez-Dahap coordinates, the EC point is represented using the coordinates system. The equations used to compute the result of Montgomery point doubling using Lopez-Dahap coordinates are as follows:

Jacobean Projection System
The Jacobean coordinates system uses the following equations to find the result of the Montgomery point doubling operation: The point addition computations, on the other hand, are similar for all ECs over GF(P) and are not affected by the change in EC equation. The point addition computations are presented in great details in [3]. The following section introduces the computational scheme designs that are used to implement ECC computations in parallel manner.

Computational Scheme Designs
In this section, the ECC computational schemes to implement finite field arithmetic are designed. All possible design choices to perform ECC points computations were investigated; starting from the sequential design tell reaching the design with maximum parallel operations. The rationale behind parallelizing ECC computations is to reduce the time delay as much as possible and hence improving the cryptosystem performance. This study focused on the parallel ECC design that achieves the shortest time delay. The main disadvantage of sequential implementation of ECC is that it consumes longer time to implement the cryptosystem's computations and wastes the CPU capability by executing only one modular operation per each step.
The current study aims to improve the speed of ECC through utilizing the CPU's capability to allow the concurrent execution of multiple modular operations simultaneously. In this section we present the data flows from inputs to outputs of the Montgomery ECC point Doubling over GF(P). These designs created by mapping the computational operation to a parallel software designs. Figs. 1-3 show the computational schemes for point doubling computations using homogeneous, Lopez-Dahab, and Jacobian projections respectively.
The proposed designs use four parallel multiplications (4-PM) and two parallel additions (2-PA), since it is considered the best parallelization level that yields the highest performance for point doubling. In each level of computations, either addition or multiplication operations are performed. The operations in each level are executed in a parallel manner. This was achieved through using the multi-threading technique in software implementation as will be clarified in Section 4.3.
Note that the modular operations within each sequential level cannot begin unless the operations from the previous level are fully executed.  It can be noticed from Fig. 3 that applying Montgomery ECC computations using Jacobian projection requires the least number of multiplication operations. Thus, it is efficient choice for sequential ECC implementations. On the other hand, applying the cryptosystem computations using Homogenous projection presented in Fig. 3 requires less number of parallel multiplication levels, which yields the shortest time delay for points operations. Therefore, it is considered the most efficient coordinates system when using parallel implementation for Montgomery ECC.

Software Implementation
This section elaborates on the software implementation of proposed parallel and highspeed GF(P) ECC using Montgomery curve. First, we illustrate the implementation of modular arithmetic; multiplication, addition, subtraction, and inversion operations. Then, the parallel multi-threaded implementation of point doubling and addition operations is presented. The software implementation of the scalar multiplication using the Montgomery ladder, the NAF, and the Binary algorithms is shown in this section. Finally, an ECC encryption and decryption experiments are performed to verify the cryptosystem efficiency and evaluate its performance level.

ECC Modular Arithmetic
For Modular arithmetic operations, the java BigInteger class is used to allow very large integer (typically 256-bit) calculations beyond the limits of all primitive data types, which cannot handle values greater than 64 bits. In addition, the java BigInteger class provides a set of methods to support the execution of modular addition, modular subtraction, modular multiplication and modular inversion. All of these operations are implemented using built-in methods in the Big-Integer class. For example, the modular multiplication is implemented by using the multiply and mod methods of BigInteger class. The following code segment shows an example for the modular multiplication operation used in the proposed cryptosystem. System.out.println("The result of A * B mod P is "+C); In this research, we examined all possible design choices for Montgomery ECC using the three main projections. The next section presents the software implementations for the most efficient design choices used to perform ECC points operations.

ECC Point Operations
To facilitate the points operations on an elliptic curve, we implemented Point and Elliptic-Curve Classes. The Point Class has three private members, representing the X and Y and Z coordinates of a point. These members are objects of the BigInteger class.
The EllipticCurve class handles the domain parameters and have public functions pointAddition( ) and PointDoubling ( ) and scalarMultplication( ). The pointAddition( ) and PointDoubling ( ) methods have sequential and multithreaded java implementations for each of the three coordinates systems.
While the sequential implementation of point operations is straightforward and performs one operation per each level, the parallel implementation of point operations reflects the parallel designs presented in Section 4.2 and exploit the inherited parallelism in ECC computations. To achieve this, the multi-threading technique in java was used to allow the parallel execution of ECC computations. The Java threads are independent and save time; multiple operations can be performed simultaneously and separately. Three main methods from the Java Thread class were used in the proposed implementation, as follows: • Run ( ): It is used to do an action for a thread.
• start ( ): It is used to trigger the execution of the finite field operation within the thread. • join ( ): It is used to wait for the results of the finite field operation within a specific thread.
This method can be used for the caller thread to wait for the completion of called thread.
At each level, multiple threads are created; one thread for each finite field operation within the level. Within the bracket of the Run method, any segment of Java code can be placed. In our implementation, a prime finite field operation using BigInteger Class's object and procedures is used. For the purpose of controlling the order of the executions of the levels within the parallel design, the join ( ) method used, to make the main thread waits for the completion of all called threads that perform the field operations within a particular level. The code segment below showcases the creation and start of a Java thread used in the parallel Implementations of ECC. Within the bracket of the Run method, any segment of Java code can be placed. In our implementation, a prime finite field operation using BigInteger Class's object and procedures is used. And for the purpose of controlling the order of the executions of the levels within the parallel design, the join ( ) method used, to make the main thread waits for the completion of all called threads that performs the finite field operations within the same level. For instance, in the code of the parallel implementation for Montgomery Point Doubling using Homogeneous projection, four thread objects were created at the first multiplication level and named thread1, thread2, thread3, and thread4. In this level, each thread responsible for carrying out the result of modular multiplication assigned to it within the design presented in Fig. 3. After each thread creation, the thread immediately started using the start method. The following code segment presents the parallel implementation of the first level of multiplications for Montgomery ECC point doubling.
At the end of a particular level of computations, each thread was created in this level will call the join method to ensure that each thread completes its calculations before starting the next level. Other levels of computations for ECC points are implemented in parallel using similar mechanism. Sixteen java threads were created to implement the Montgomery point doubling using Homogeneous Projective system. thread4.start( ); thread1.join( ); thread2.join( ); thread3.join( ); thread4.join( );

Implementation of Scalar Multiplication Algorithms
The scalar multiplication is the most critical operation in ECC. The current research investigates the three main algorithms used for performing the scalar multiplication, which are the Montgomery Ladder, the NAF, and the Binary algorithms. This aims to determine the fastest algorithm that provides best performance level for ECC encryption/decryption processes. Experimental results showed that the NAF algorithm is faster than the Montgomery ladder and the Binary Left-to-Right method, because it requires less number of point addition operations. The following code segment is used to implement the NAF algorithm after computing the signed binary representation (NAF) of the scalar K through the function (computeNAF(k)).

ECC Encryption and Decryption Experiments
In order to test the proposed cryptosystem, and verify its functionality, a number of encryption and decryption experiments were conducted. Moreover, this contributes in evaluating the ECC performance more precisely, and assists developers to build high-speed and efficient cryptosystems.  The process of implementing the ECC encryption and decryption went through the following phases:

1) Cryptosystem setup:
To set up an ECC, both senders and receivers are required to select and agree upon the same EC domain parameters, i.e., the underlying finite field F p , the EC coefficients 'a' and 'b,' the base point G in the curve (generator), the order of cyclic subgroup n, and the cofactor h = #E(F p )/n. Also, they are required to agree upon the lookup table and the same set of algorithms for representing the EC points and implementing the EC operations.

2) Key generation and exchange:
Each side selects a random integer as its private keys and then computes its public key. The selectPrivatekey, getRandomNumber, and computePublic_key methods were used for the key generation purposes.

3) Message encoding:
In this phase, each char in the message is mapped to its corresponding point in the lookup table that the senders and receivers have agreed upon. The encoding method takes an array of char types and returns an array of Point types. The method loop the input array to map each char element with its EC Point using the ASCII value of the char, and then this value is used as an index for the point representing this char.

4) ECC Encryption
In this phase, the cryptosystem encrypts the point array representing the message after encoding to produce a cipher array. The encryption function is a part of the EllipticCurve class, and takes as an input; the Private Key, the message points, and the other Party's Public Key. This method returns a cipher array of points for each char in the message after encryption. The encryption process is implemented through the following code segment:

5) Decryption
In the decryption phase, the cryptosystem decrypts the ciphertext using a member method of the EllipticCurve Class. The Decryption method takes the cipher array of points as an input and returns the message mapped points back. The decryption process inverses the encryption to restore the plaintext.

6) Message Decoding
In the message decoding phase, each elliptic curve Point in the cipher array is mapped back to its corresponding char using the lookup table that both sender and receiver have agreed upon at the setup phase. The decoding method takes an array of Point types and returns an array of Char types representing the plaintext.
The following section presents the experimental results for the proposed ECC, and discussion about the performance evaluation.

Results and Discussions
Tab. 1 shows the average execution time for the finite field operations. The experimental results confirmed that modular inversion is the most expensive in terms of time delay. The results also showed that modular multiplication and addition operations are the second and third time-consuming operations respectively. To calculate the running time for proposed ECC implementations precisely, the java command System.nanoTime was used to get the current time at the beginning (startTime) and the end (endTime) of the encryption process. Then, the actual time consumption is calculated by subtracting the start time from the end time Tab. 2 presents a comparison in terms of time consumption for ECC point operations when implemented using sequential and parallel designs. The implementation results also highlight the time delay for each projective coordinates system used to apply EC computations. It can be observed from the results presented in Tabs. 1-3 that parallel implementation significantly improves the ECC performance compared to the sequential implementation. The best reported improvement percentage is 253% and was achieved using the Homogenous projection. Moreover, this projection achieved the shortest time delay (191111 ns) for the parallel (multithreaded) implementation of point doubling operation. For sequential ECC implementation, Jacobian coordinates system achieved the shortest time delay. This was expected since Jacobian projection requires less number of modular multiplications. The scalar multiplication is the main operation in ECC encryption, and usually used to evaluate the cryptosystem performance. Tab. 4 presents a performance (time consumption in millisecond (ms)) comparison between the sequential and parallel ECC implementations using three known algorithms; Montgomery ladder, NAF, and Binary (LRT) method. It can be noticed from Tab. 4 that NAF algorithm achieved the least time consumption results (5.335 ms) for Montgomery ECC encryption when applied using the parallel (multi-threaded) design and Homogenous projection. For sequential ECC implementation, NAF algorithm also achieved the shortest time delay (12.005 ms) when used with Jacobian coordinates. The low humming wait of the NAF algorithm might be the main reason for its good performance levels. The Binary (LTR) and Montgomery ladder algorithms achieved the second and third place respectively in respect to the performance level. Fig. 4 shows the results of the running time of all scalar multiplication algorithms on Montgomery Curve using sequential and parallel implementations. The results clearly show that NAF with the parallel implementation and the use of homologous projection takes the least time compared to the other implementations. On the other side, the sequential ECC implementation using Montgomery ladder and Lopez_Dahab projection consumes the longest (worst) time delay.
Tab. 5 presents a comparison with ECC implementations presented in previous research works. It can be noticed that proposed parallel ECC implementation overcomes others in terms of performance for the main EC operations; including Encryption, scalar multiplication (ECSM), point addition (ECADD), and point doubling (ECDBL).

Conclusion
The explosive increase in the amount of data being transmitted over communication networks makes it necessary to develop a high-speed cryptosystem. Elliptic curve cryptosystem (ECC) has been proposed as an efficient public key cryptosystem since it can provide comparable security level to other systems with using shorter key length. However, ECC using its usual sequential implementation and affine coordinates cannot achieve adequate performance level to protect the huge amount data processes by modern applications such as multimedia and military. Moreover, the majority of previous studies investigated the hardware implementation of ECC, which is costly option, and used only one standard curve.
In this article, a high-speed parallel software implementation for ECC is developed. It studied and tuned-up the main factors playing significant role in improving ECC performance. A newly introduced form called Montgomery curve is studied in this research because it has lower computational complexity, and hence requires less field operations. This research studied the performance level of sequential and parallel implementation of Montgomery ECC. The parallel implementation was done using Java multithreading techniques. Three projective coordinates were implemented and examined due to their ability to avoid the long time inversion operation. It has been shown the proposed multi-threaded ECC implementation using Homogenous projection achieved the best performance results. Such implementation is efficient choice for applications that need fast encryption process. For sequential Montgomery ECC, Jacobian projection obtained the shortest time delay. As a future research, other forms of EC and extensions of projective coordinates can be studied, since it may reduce the time complexity. Parallelizing the upper level of computations for ECC could be considered to increase the speed of encryption process as well.