
Abstracts of the Third HiTech Conference

Embracing Creativity and Innovation and Exiting the Bottleneck: the Solution is in Funding and Investment

Estimating the heart rate reference value using an expert fuzzy system and the corrected QT interval in the ECG signal

Inverse Differential Control of a 6-Dof Parallel Manipulator Using Neural Networks

A Review Study on Investing Deep Learning in The Field of Virtual Dressing

INTRODUCTION

For quite some time now, the world has been witnessing a radical shift towards the virtual world in various fields: financial and commercial transactions, among many others, are now carried out entirely through electronic applications, and even most official and government transactions take place via the Internet. The circumstances the world has recently witnessed (the Corona pandemic, wars…) have clearly demonstrated the importance of virtual reality, and especially of artificial intelligence and deep learning, as they imposed the need for many human activities not to depend on the physical presence of people. One of the most important activities affected by this new trend is commerce, where localizing e-commerce technology has become of great importance and necessity. On the other hand, the new fields of artificial intelligence and deep learning provide many advanced tools for localizing this technology. There is no doubt that digital transformation has made great strides in many fields, and many activities are now carried out entirely over the Internet using appropriate applications. It is natural that the expansion of digital transformation’s control over commercial activity depends on the development of tools and algorithms that enable us to create applications suitable for each field. Another important justification for this study is the lack of programs that support remote shopping in an interactive way that simulates the traditional experience of being present in the store. Based on a review of previous studies in the field of virtual clothing [1], it has become clear that most of the proposed models have limitations related to the data on the target clothing and the people wearing it. In this review, we conduct a comparative study on employing deep learning and deep fake algorithms to choose the best three-dimensional digital models of colored clothing and reveal their attributes [2][3][4]. This topic requires addressing several axes, the most important of which are studying the digital models used in deep learning algorithms, studying algorithms for human body detection and for determining reference points that correspond to digital models [5][6], and studying deep fake algorithms to give a realistic impression of clothing on these models [7].

GENERAL METHODOLOGY

According to the literature, virtual fitting rooms adopt AI techniques, deep learning algorithms, and neural networks to implement virtual try-on of clothes. One of the most widely used architectures is the convolutional neural network (CNN) (Figure 1), which is used to prepare 3D meshes of the human body and can generalize to different body shapes and poses.

Figure 1: An example of the main layers of a CNN

The general architecture of CNN-based generative models consists of the following components (a simplified code sketch follows the list):

  • Dataset including multiple meshes of clothed human scans, different types of outfits, and different poses.
  • Encoder
  • Decoder
  • Discriminator
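
As an illustration only, the skeleton below sketches how such an encoder, decoder, and discriminator could be wired together in PyTorch for mesh vertices. The vertex count, layer widths, and the use of fully connected layers in place of graph convolutions are simplifying assumptions made for brevity, not the architecture of any cited model.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_vertices=6890, latent_dim=64):
        super().__init__()
        # flatten (V, 3) clothing offsets and compress them into a latent code
        self.net = nn.Sequential(nn.Linear(n_vertices * 3, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))

    def forward(self, verts):                  # verts: (B, V, 3)
        return self.net(verts.flatten(1))      # (B, latent_dim)

class Decoder(nn.Module):
    def __init__(self, n_vertices=6890, latent_dim=64):
        super().__init__()
        self.n_vertices = n_vertices
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_vertices * 3))

    def forward(self, z):                      # z: (B, latent_dim)
        return self.net(z).view(-1, self.n_vertices, 3)

class Discriminator(nn.Module):
    def __init__(self, n_vertices=6890):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_vertices * 3, 256), nn.ReLU(),
                                 nn.Linear(256, 1))   # real/fake score per mesh

    def forward(self, verts):
        return self.net(verts.flatten(1))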

CNN-based models are compatible with the famous Skinned Multi-Person Linear (SMPL) body model, a realistic 3D model of the human body based on extracting and blending shapes learned from thousands of 3D body scans [8]. SMPL is more accurate than other models and is compatible with existing graphics pipelines. It is a skinned, vertex-based model that accurately represents a wide variety of body shapes in natural human poses. One of the latest network models is the Clothed Auto Person Encoding (CAPE), which provides SMPL mesh-registered 4D scans of people wearing clothes, along with recorded scans of the real body shapes under the clothing [9]. CAPE adds two stages to the general architecture as follows (Figure 2):

Figure 2: CAPE network architecture [9]
  • Condition module: For pose θ, CAPE removes the non-clothing parameters, e.g. head, hands, fingers, feet and toes, leaving 14 valid joints of the body. The pose parameters of each joint are represented by its flattened rotation matrix, resulting in an overall pose parameter R that is fed into a small fully-connected network. The clothing type C refers to the type of “outfit”, i.e. a combination of upper-body and lower-body clothing. As clothing types are discrete by nature, CAPE represents them with a one-hot vector C and feeds it into a linear layer.
  • Conditional residual block (CResBlock): CAPE adopts the residual block (ResBlock) from Kolotouros et al. [11], which includes ensemble normalization [13], a nonlinearity, a graph convolutional layer, and a graph linear layer. At the input to the residual block, CAPE appends the condition vector to each input node along the feature channel, and the graph residual block then operates on the features of each node (a simplified sketch of this conditioning follows the list).
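
To make the conditioning concrete, the fragment below sketches how the 14 flattened joint rotation matrices and a one-hot clothing type could be embedded and concatenated into a single condition vector. The layer widths and the number of clothing types are assumptions for illustration; they are not CAPE's actual values.

import torch
import torch.nn as nn

class ConditionModule(nn.Module):
    """Sketch of a pose/clothing conditioning module (dimensions assumed)."""
    def __init__(self, n_joints=14, n_clothing_types=4, pose_feat=24, cloth_feat=8):
        super().__init__()
        # 14 valid joints, each a flattened 3x3 rotation matrix -> 14 * 9 pose values
        self.pose_net = nn.Sequential(nn.Linear(n_joints * 9, 64), nn.ReLU(),
                                      nn.Linear(64, pose_feat))
        # discrete outfit type as a one-hot vector fed into a linear layer
        self.cloth_net = nn.Linear(n_clothing_types, cloth_feat)

    def forward(self, pose_rotmats, clothing_onehot):
        # pose_rotmats: (B, 14, 3, 3), clothing_onehot: (B, n_clothing_types)
        p = self.pose_net(pose_rotmats.flatten(1))
        c = self.cloth_net(clothing_onehot)
        # the resulting condition vector is what gets appended to each graph node
        return torch.cat([p, c], dim=1)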

RELATED WORK

The related works can be classified into three axes:

Reconstructing 3D humans:

Reconstruction of 3D human bodies from 2D images and videos is a classical computer vision problem. Most approaches [18, 32] output 3D body meshes from images, but not clothing, thereby ignoring image evidence that may be useful. To reconstruct clothed bodies, some methods use volumetric [20, 30] or bi-planar depth representations [12] to model the body and garments as a whole. While these methods deal with arbitrary clothing topology and preserve a high level of detail, the reconstructed clothed body is not parametric, which means that the pose, shape, and clothing of the reconstruction cannot be controlled or animated. Another group of methods is based on SMPL [25, 31]; they represent clothing as an offset layer from the underlying body, as proposed in ClothCap [28] and shown in (Figure 3). These methods can change the pose and shape of the reconstruction using the deformation model of SMPL, which assumes that clothing deforms like an undressed human body, i.e. that clothing shape and wrinkles do not change as a function of pose.

Figure 3: ClothCap approach [28]

Parametric models for 3D bodies and clothes

Statistical 3D human body models learned from 3D body scans [23, 32] capture body shape and pose, and they are an important building block for many applications. Most of the time, however, people are dressed, and these models do not represent clothing. In addition, clothes deform as we move, producing changing wrinkles at multiple spatial scales. While clothing models learned from real data exist, few generalize to new poses. For example, Neophytou and Hilton [34] proposed to learn a layered garment model from dynamic sequences, but generalization to novel poses was not demonstrated. Yang et al. [27] trained a neural network to regress a PCA-based representation of clothing, but generalization was only demonstrated on the same sequence or the same subject. Lahner et al. [29] proposed to learn a garment-specific pose-deformation model by regressing low-frequency Principal Component Analysis (PCA) components and high-frequency normal maps. While the visual quality was good, the model is garment-specific and does not provide a solution for full-body clothing. Similarly, Alldieck et al. [25], as shown in (Figure 4), used displacement maps with a UV parametrization to represent surface geometry, but the result was only static. Wang et al. [24] allowed manipulation of clothing with sketches in a static pose. The Adam model proposed in [23] can be considered clothed, but the shape is very smooth and not pose-dependent. Clothing models have also been learned from physics simulations of clothing [19, 33], but their visual realism was limited by the quality of the simulations.

Figure 4: Displacement maps with a UV parametrization to represent surface geometry [25]
Generative models on 3D meshes

The CAPE model predicts clothing displacements on the graph defined by the SMPL mesh using graph convolutions [10]. However, there is an extensive recent literature on methods and applications of graph convolutions, such as [21, 26]. Most relevant here, Ranjan et al. [26] proposed to learn a convolutional autoencoder using graph convolutions with mesh down- and up-sampling layers [13]. Although it worked well for faces, the mesh sampling layers made it difficult to capture local details, which are key in clothing, whereas CAPE captures local details by extending the PatchGAN [22] architecture to 3D meshes (Figure 2).

COMPARISON AND DISCUSSION

Comparing public 3D clothed-human datasets according to six criteria (Captured, Available Body Shape, Registered, Large Pose Variation, Motion Sequence, High Quality Geometry) leads to the following results:

  • Inria Dataset presented an approach to automatically estimate the human body shape under motion from a 3D input sequence showing a dressed person in possibly loose clothing. It has no registered 3D meshes of clothed human scans, limited variation in pose, and low-quality geometry [14] (Figure 5).
  • BUFF Dataset introduced a method to estimate a detailed body shape under clothing from a sequence of 3D scans; the method exploits the information in a sequence by merging all clothed registrations into a single frame, as shown in (Figure 6). BUFF is similar to Inria but has high-quality geometry [15].

Figure 5: Inria Dataset approach [14]
Figure 6: Qualitative pose estimation results on BUFF dataset [15]. Left to right: scan, Yang et al. [27], BUFF result

  • Adobe Dataset’s key insight is to use skeletal pose estimation for gross deformation, followed by iterative non-rigid shape matching to fit the image data. It does not include human body shapes and has limited variation in pose and low-quality geometry [12].
  • 3D People Dataset proposed a new algorithm to perform spherical parameterizations of elongated body parts and introduced an end-to-end network to estimate human body and clothing shape from single images, without relying on parametric models (Figure 7). It covers all the points but lacks the ability to capture and convert data from different sources [10].
Figure 7: Annotations of the 3D People Dataset [10]
  • CAPE Dataset contains all the points, as shown in (Figure 8). Given a SMPL body shape and pose (a), CAPE adds clothing by randomly sampling from a learned model (b, c), and can generate different clothing types — shorts in (b, c) vs. long-pants in (d). The generated clothed humans can generalize to diverse body shapes (e) and body poses (f).
Figure 8: CAPE model for clothed humans [9]

Characterized by accurate alignment, consistent mesh topology, ground-truth body shape scans, and a large variation of poses, the CAPE dataset is suitable not only for studies of the human body and clothing, but also for the evaluation of various graph CNNs. However, CAPE differs from the other methods in that it learns a parametric model of how clothing deforms with pose. Furthermore, all the parametric-model methods above are regressors that produce single point estimates, whereas CAPE is generative, which allows clothing to be sampled. A conceptually different approach, proposed in [17], infers the parameters of a physical clothing model from 3D scan sequences; it can generalize to novel poses, but the inference problem is difficult and, unlike CAPE, the resulting physics simulator is not differentiable with respect to its parameters. Since the presented results confirm the value of deep fake techniques in the field of virtual dressing, we are moving towards investing in artificial intelligence and neural networks in this field, in parallel with the very rapid development of electronic clothing marketing and in response to the requirements of the local and global market.

CONCLUSION

Although the results of the previous studies are significant, the research remains open due to the limitations of the adopted methods, which can be summarized as follows:

  • The offset representation is limited: clothing such as skirts and open jackets differs from the body topology and cannot be represented by offsets, as shown in (Figure 9). Mittens and shoes can technically be modelled by offsets, but their geometry is sufficiently different from that of fingers and toes to make this impractical.
  • Dynamics issues: the adopted models take a long time to train, and the generated clothing depends on pose but not on dynamics. This is not a severe problem for most slow motions, but it does not generalize to faster motions.

Future work will address models of clothing, but instead of scanning the entire body, we propose to consider a sufficient set of features on the body to estimate its shape, which may be enough for virtual dressing. Therefore, one will not need to photograph the body completely naked or with a minimal amount of clothing; instead, we propose to investigate certain points of the body. Another restriction that can be added to simplify the work is to treat the human body as two parts, upper and lower, based on the International Organization for Standardization (ISO) standards for assigning points to the human body [16].

Figure 9: Qualitative results on fashion images [9]. SMPL [8] results are shown in green, CAPE results in blue

Image Super-Resolution Using Complex-Valued Deep Convolutional Neural Network

INTRODUCTION

Single image super-resolution (SISR) is a fundamental task in computer vision aimed at recovering high-resolution details from low-resolution input images [1]. It plays a crucial role in various applications, including surveillance, medical imaging, and satellite imagery, where obtaining high-quality images is essential [1]. Over the years, numerous methods have been developed to tackle the SISR problem, with the Super-Resolution Convolutional Neural Network (SRCNN) being one of the most prominent approaches [1]. SRCNN revolutionized the field of SISR by leveraging the power of deep neural networks to learn the mapping between low-resolution and high-resolution image patches [1]. By training on a large dataset of paired low-resolution and high-resolution images, SRCNN demonstrated impressive results in terms of reconstructing detailed and perceptually pleasing high-resolution images. However, SRCNN operates using real-valued neural networks, which may not fully capture the complex nature of image data [2]. In recent years, there has been growing interest in complex-valued neural networks as a potential enhancement to traditional real-valued networks. Complex-valued neural networks extend the capabilities of their real-valued counterparts by incorporating complex numbers as part of their computations [2]. This extension allows complex-valued networks to capture and process both magnitude and phase information present in complex data distributions. When applied to SISR, complex-valued neural networks offer several potential advantages. Firstly, they can better model the complex relationships and structures inherent in high-resolution images. By considering both real and imaginary components, complex-valued networks can effectively capture the intricate details and textures that contribute to the high-frequency information in an image. This property is particularly beneficial when handling images with fine textures, edges, and patterns [3]. Secondly, complex-valued neural networks have the potential to improve the preservation of image content during the super-resolution process. The ability to represent both magnitude and phase information enables complex-valued networks to better handle the phase shift problem that often arises in SISR. This issue occurs when the high-frequency components of an image are not accurately aligned during the upscaling process, leading to blurry or distorted results. Complex-valued networks can potentially mitigate this problem by explicitly modeling the phase information and preserving the integrity of the image content [4]. In short, the main contributions in this research paper are as follows:

  1. We propose a transformation of the SRCNN model by incorporating complex-valued operations. This includes defining complex-valued convolutional layers, activation functions, and loss functions.
  2. We exhaustively evaluate the performance of the Complex-valued SRCNN model on a variety of benchmark datasets. Experimental results demonstrate that the Complex-valued SRCNN model outperforms the traditional SRCNN model on all metrics.
  3. Noting that the SRCNN model is a base model for many CNN-based SISR models, this work may be very helpful in developing more advanced and effective SISR models.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of related works in a single image super-resolution and complex-valued neural network. Section 3 presents the methodology, describing our case study SRCNN [1] and complex-valued neural network. In Section 4, we present the experimental setup and evaluate the performance of our method on a benchmark dataset. Finally, Section 5 concludes the paper.

RELATED WORK

In recent years, single-image super-resolution (SISR) techniques have garnered significant attention in the field of computer vision, aiming to enhance the resolution and quality of low-resolution images [1]. Traditional approaches relying on real-valued networks have faced inherent limitations in capturing complex image structures and relationships, prompting researchers to explore the potential benefits of complex-valued networks in SISR and other image-processing tasks [2]. Dong et al. [1] introduced the Super-Resolution Convolutional Neural Network (SRCNN), pioneering the application of deep learning for SISR and showcasing significant improvements in image reconstruction. Building upon SRCNN, subsequent approaches such as the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [3] optimized network architectures for faster processing without compromising reconstruction quality. Additionally, very deep architectures like the Very Deep Super-Resolution (VDSR) network [4], the Enhanced Deep Super-Resolution Network (EDSR) [5], and the Residual Channel Attention Network (RCAN) [6] have achieved remarkable performance by leveraging residual learning and attention mechanisms. Advancements in generative adversarial networks (GANs) have led to the development of the Super-Resolution Generative Adversarial Network (SRGAN) [7], focusing on generating perceptually realistic high-resolution images. More recently, the exploration of complex-valued neural networks has shown promise, with studies by Li et al. [8], Xu et al. [9], and Zhang et al. [10] demonstrating superior performance in capturing complex image structures and relationships. Li et al. [11] investigated the use of complex-valued networks for image denoising, demonstrating their effectiveness in modeling complex noise patterns. Xu et al. [12] proposed a complex-valued neural network architecture for SISR, preserving phase information during the super-resolution process for sharper and more accurate reconstructions. Moreover, Zhang et al. [13] introduced a complex-valued residual network for SISR, facilitating the learning of more expressive representations and achieving improved performance. The potential of complex-valued networks extends beyond SISR, as evidenced by studies in image inpainting [14] and multi-modal image fusion [15]. These findings underscore the promising role of complex-valued networks in overcoming the limitations of traditional real-valued networks and enhancing various image-processing tasks.

METHODOLOGY

The Super-Resolution Convolutional Neural Network (SRCNN) is a deep learning-based technique designed for single image super-resolution (SISR). Originally introduced by Dong et al. in 2016 [1], SRCNN aims to learn the mapping between low-resolution (LR) and high-resolution (HR) image patches using a three-layer convolutional neural network (CNN). The network architecture of SRCNN comprises three primary stages (Fig 1): patch extraction and representation, non-linear mapping, and reconstruction. Each stage is detailed below:

Patch Extraction and Representation

In this initial stage, the low-resolution input image is divided into overlapping patches. These patches are the inputs to the SRCNN model and are represented as high-dimensional feature vectors. Let ILR denote the low-resolution input image, and let the first convolutional layer have filter size f1 × f1 and n1 filters. The output of this layer is:

F1 = σ(W1 ∗ ILR + b1)

where W1 and b1 are the weights and biases of the first layer, respectively, ∗ denotes convolution, and σ is the activation function.

Non-linear Mapping

The high-dimensional feature vectors from the first stage are fed into a second convolutional layer that performs non-linear mapping. This layer uses a set of learnable filters to capture the complex relationships between the low-resolution and high-resolution patches. Let f2 denote the filter size of the second convolutional layer, which has n2 filters. The output is given by:

F2 = σ(W2 ∗ F1 + b2)

where W2 and b2 are the weights and biases of the second layer.

Reconstruction

In the final stage, the feature maps from the non-linear mapping layer are processed by a third convolutional layer that aggregates the information to generate the high-resolution output. Let f3 denote the filter size of the third convolutional layer, which has n3 filters. The high-resolution output image IHR is obtained as:

IHR = W3 ∗ F2 + b3

where W3 and b3 are the weights and biases of the third layer. The final high-resolution image is reconstructed by combining the outputs from all patches in the overlapping regions. By leveraging these three stages, SRCNN effectively transforms low-resolution images into high-resolution counterparts, enhancing image quality and details [1].

Fig 1: SRCNN architecture
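
For reference, the following is a minimal PyTorch sketch of the three-stage SRCNN pipeline described above, using the commonly cited 9-1-5 filter sizes with 64 and 32 feature maps; these hyper-parameter values are assumptions for illustration rather than the exact configuration used in this paper.

import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three stages: patch extraction, non-linear mapping, reconstruction."""
    def __init__(self, channels=1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)   # f1 = 9, n1 = 64
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)                    # f2 = 1, n2 = 32
        self.recon = nn.Conv2d(32, channels, kernel_size=5, padding=2)     # f3 = 5
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                     # x: bicubic-upscaled LR image (B, C, H, W)
        f1 = self.relu(self.extract(x))       # F1 = sigma(W1 * ILR + b1)
        f2 = self.relu(self.mapping(f1))      # F2 = sigma(W2 * F1 + b2)
        return self.recon(f2)                 # IHR = W3 * F2 + b3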

In this research, we propose a novel complex-valued neural network architecture for enhancing single image super-resolution (SISR). This approach builds upon the success of the Super-Resolution Convolutional Neural Network (SRCNN) model, but with a key distinction: we transformed the SRCNN architecture to incorporate complex-valued operations.

Complex-Valued Convolutional Neural Networks (CVNNs)

Traditional CNNs for SISR rely on real-valued numbers for computations. While these models have achieved significant results, recent research explores the potential of Complex-Valued Neural Networks (CVNNs) in this domain. CVNNs leverage complex numbers, which hold both magnitude and phase information, potentially offering advantages over real-valued approaches [16].

Preserving Phase Information

Natural images contain crucial phase information alongside magnitude. Standard CNNs primarily focus on magnitude, potentially losing details during super-resolution. CVNNs, by incorporating complex numbers, can explicitly handle both aspects, leading to potentially sharper and more accurate reconstructions [16].

Mitigating Phase Shift Problems

Traditional SISR methods often suffer from phase shifts, introducing artifacts and distortions [17]. CVNNs, by explicitly dealing with phase information, can address this issue and generate more realistic super-resolved images.

Our Proposed Complex-Valued SISR Network

The SRCNN model serves as a strong foundation for SISR tasks due to its effectiveness. However, to enable complex-valued computations within the network, we introduce several modifications and transformations to the original SRCNN architecture. These modifications are detailed below:

  • Complex Inputs: The first step is to transform the input images into complex-valued representations. This can be achieved by augmenting the real-valued image with zeros in the imaginary component. Mathematically, if ILR is the low-resolution real-valued image, the complex-valued input can be represented as ILR + j·0, where j is the imaginary unit [16].

  • Complex-valued Convolutional Layers: The first modification involves replacing the real-valued convolutional layers in SRCNN with complex-valued convolutional layers. Complex-valued convolutional layers operate on complex numbers, allowing the network to capture both magnitude and phase information. These layers consist of complex-valued filters that convolve with the input image patches, producing complex-valued feature maps. The operation can be expressed as:

F1 = σ(W1 ∗ ILR + b1)

where W1 and b1 are now the complex-valued weights and biases of the first layer, ∗ denotes convolution, and σ is the activation function applied to complex numbers [16, 17] (a minimal code sketch of these complex-valued building blocks follows this list).

  • Activation Functions: In our complex-valued network, we employ activation functions that can handle complex-valued inputs and outputs. One commonly used activation function for complex-valued networks is the Complex Rectified Linear Unit (CReLU), which operates on the real and imaginary components of complex numbers separately. CReLU helps introduce non-linearity to the network and facilitates the modeling of complex relationships between features [18]. Mathematically, CReLU is defined as:

CReLU(d) = ReLU(Re{d}) + j · ReLU(Im{d})

where d represents a complex-valued input, Re{d} and Im{d} denote the real and imaginary parts of d, respectively, j is the imaginary unit, and ReLU is the standard Rectified Linear Unit activation function. CReLU’s simplicity and effectiveness make it a prevalent choice for complex-valued neural networks.

  • Complex-valued Upsampling: In the super-resolution process, we need to upscale the low-resolution input image to the desired high-resolution output. To achieve this, we utilize complex-valued upsampling techniques, such as Complex Bilinear Interpolation or Complex Convolutional Upsampling. These methods allow the network to generate complex-valued feature maps at a higher resolution by preserving the phase information and effectively capturing fine details [18].

Nearest Neighbor Upsampling: For each new pixel location in the upsampled output, this method simply replicates the value of the nearest neighboring pixel in the original complex-valued input.

Bilinear Interpolation for Complex Data:

Bilinear interpolation for complex-valued data builds upon the concept of standard bilinear interpolation used for real-valued images. Here’s a breakdown of the general approach:

  • Separate Real and Imaginary Parts: The complex-valued input (represented as a single complex number per pixel) is divided into real and imaginary parts (two separate matrices).
  • Upsample Each Part Independently: Bilinear interpolation is applied to the real and imaginary parts individually. Bilinear interpolation considers the values of four neighboring pixels in the original low-resolution image and their distances to calculate a weighted average for the new pixel location in the higher-resolution image.
  • Combine Upsampled Parts: After upsampling both the real and imaginary parts, they are recombined to form a new complex number representing the upsampled pixel in the higher-resolution complex-valued feature map.
  • Loss Function: Training complex-valued neural networks requires a loss function that considers both the magnitude and phase information of complex numbers. Complex Mean Squared Error (CMSE) is a popular choice for this purpose [14]. The CMSE loss function is represented as:

L = (1/N) Σ (Δx² + Δy²)

where L represents the CMSE loss value, and Δx and Δy represent the differences between the real parts (x) and imaginary parts (y) of the predicted (zpredicted) and ground-truth (zgt) complex numbers, respectively. By incorporating these modifications and transformations, our complex-valued neural network should be able to effectively capture the complex structures and relationships present in high-resolution images. This architecture enables the network to leverage the benefits of complex numbers and provide enhanced super-resolution capabilities compared to traditional real-valued networks.
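
The fragment below is a minimal sketch of these complex-valued building blocks (a complex convolution assembled from two real convolutions, the CReLU activation, complex bilinear upsampling, and the CMSE loss). It assumes PyTorch's complex tensor support and illustrative layer sizes; it is not this paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexConv2d(nn.Module):
    """(Wr + jWi) * (xr + jxi) = (Wr*xr - Wi*xi) + j(Wr*xi + Wi*xr)."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, z):                            # z: complex tensor (B, C, H, W)
        real = self.conv_r(z.real) - self.conv_i(z.imag)
        imag = self.conv_r(z.imag) + self.conv_i(z.real)
        return torch.complex(real, imag)

def crelu(z):
    """CReLU(d) = ReLU(Re{d}) + j*ReLU(Im{d})."""
    return torch.complex(F.relu(z.real), F.relu(z.imag))

def complex_bilinear_upsample(z, scale):
    """Upsample real and imaginary parts separately, then recombine."""
    r = F.interpolate(z.real, scale_factor=scale, mode="bilinear", align_corners=False)
    i = F.interpolate(z.imag, scale_factor=scale, mode="bilinear", align_corners=False)
    return torch.complex(r, i)

def cmse_loss(z_pred, z_gt):
    """Mean of squared differences of the real and imaginary parts."""
    dx = z_pred.real - z_gt.real
    dy = z_pred.imag - z_gt.imag
    return (dx ** 2 + dy ** 2).mean()

# A real LR image can be lifted to a complex input with a zero imaginary part:
# i_lr = torch.rand(1, 1, 32, 32); i_c = torch.complex(i_lr, torch.zeros_like(i_lr))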

Experimental Results

Dataset and Evaluation Metrics

To evaluate the performance of our complex-valued network (C-SRCNN) and compare it with the traditional SRCNN, we utilize standard benchmark datasets commonly used in single image super-resolution (SR) tasks. These datasets consist of diverse, high-resolution (HR) images paired with their corresponding low-resolution (LR) counterparts. The datasets are typically split into training, validation, and testing sets to ensure robust model evaluation.

We specifically chose a set of benchmark datasets encompassing various image types, including natural scenes, objects, and textures. This selection considers the potential increase in parameter size for our C-SRCNN due to its complex-valued nature compared to the real-valued SRCNN. This diversity allows us to assess the trade-off between achieving high reconstruction quality and model complexity. Additionally, it enables us to evaluate the generalization capability of C-SRCNN for handling different image content compared to the traditional SRCNN.

Here’s a detailed breakdown of the chosen benchmark datasets:

Set5: Contains 5 pairs of LR and HR images with a resolution of 256×256 pixels.

Set14: Contains 14 pairs of LR and HR images with a resolution of 512×512 pixels.

BSD100: Contains 100 HR images with a resolution of 512×512 pixels. Commonly used for SR tasks and other image processing applications.

Urban100: Contains 100 HR images with a resolution of 512×512 pixels captured from urban scenes.

To quantify the performance of our models, we employ standard metrics used in SR tasks: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Higher PSNR values indicate better image quality by measuring the ratio between the reconstructed image’s signal and noise. SSIM goes beyond just intensity differences and considers structural similarities between the reconstructed image and the ground truth, providing a more comprehensive evaluation.
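
As an illustration, both metrics can be computed with recent versions of scikit-image as in the short sketch below; the assumption of 8-bit images (data_range=255) and the helper name are ours.

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(hr, sr):
    """hr, sr: uint8 arrays of identical shape, (H, W) or (H, W, 3)."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, data_range=255,
                                 channel_axis=-1 if hr.ndim == 3 else None)
    return psnr, ssim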

Training and Evaluation

We used the T91 training set, which consists of 91 images, to train C-SRCNN. To ensure the generalizability of our C-SRCNN, we evaluated it on multiple standard benchmark datasets commonly used in single image super-resolution (SR): Set5, Set14, BSD100, and Urban100. We compared the performance of our C-SRCNN against its real-valued counterpart, R-SRCNN. Training was conducted on a computer running Linux Ubuntu 18 with an Nvidia GTX 1060 GPU, for 200,000 steps (219 epochs) with a patch size of 128. The quantitative results are presented in Table 1 and Table 2. Our proposed method achieved superior PSNR and SSIM scores on all four datasets compared to SRCNN-915; the PSNR results indicate an improvement of 0.435 dB for CSRCNN-915.

In addition, the SRCNN network was trained using different filter sizes in each layer. The number of parameters for each network was compared, and the results showed that the C-SRCNN network outperformed its real-valued counterpart despite having fewer parameters. This highlights the effectiveness of the complex-valued network architecture in extracting image features, leading to superior performance. In conclusion, the results in Table 3 strongly support the efficacy of the C-SRCNN architecture for enhancing single-image super-resolution and demonstrate the potential of complex-valued networks to achieve high-quality results with more efficient parameter usage.

Subjective visual assessment demonstrates the effectiveness of the C-SRCNN algorithm in enhancing the quality of single-image super-resolution (SISR) images. This algorithm excels in preserving detail sharpness, color accuracy, and noise reduction compared to the Real-SRCNN algorithm.

By utilizing these evaluation metrics, we can objectively assess the performance of our complex-valued network in enhancing single-image super-resolution. We compare the results obtained from our complex-valued network with those achieved by the traditional SRCNN. This comparison allows us to determine the effectiveness of the complex-valued network architecture in capturing complex image structures and improving the overall quality of super-resolved images.

RESULTS AND ANALYSIS

The transformation of SRCNN with complex-valued neural networks offers several benefits. Firstly, it allows for better preservation of fine details and textures during the super-resolution process. The complex-valued convolutions enable the network to capture subtle variations in color and texture, resulting in more visually appealing and realistic high-resolution images. Additionally, the complex-valued network can handle complex-valued input data, making it suitable for applications involving complex image representations. However, there are also challenges associated with complex-valued neural networks, including increased computational complexity and interpretability that is more difficult than for real-valued networks. In the future, researchers can focus on developing more efficient training algorithms and exploring novel architectures that leverage the power of CVNNs for SISR. Furthermore, the creation of comprehensive complex-valued datasets can facilitate the training and evaluation of CVNN-based SISR models.

CONCLUSIONS

Complex-valued CNNs present a promising avenue for advancing the field of image super-resolution. By incorporating complex-valued representations and making the necessary modifications, we can enhance the capabilities of CNNs for various image-processing tasks. The inclusion of magnitude and phase information, along with rotation and shift invariance, empowers the CV-CNNs to produce more accurate and visually appealing results. While challenges exist, further research and exploration of complex-valued CNNs hold great potential for improving image analysis and processing techniques.

 

Arabic Sentiment Analysis Using Mixup Data Augmentation

INTRODUCTION

In recent years, deep learning models have exhibited remarkable performance in numerous Natural Language Processing (NLP) tasks, such as parsing [1], text classification [2], [3] and machine translation [4]. These models are typically characterized by their substantial parameter count, often reaching millions, necessitating extensive data for training to prevent overfitting and enhance generalization capabilities. However, collecting a sizable annotated dataset proves to be a laborious and costly endeavour. To mitigate the data-hungry nature of deep learning models, an approach known as automatic data augmentation has emerged. This technique involves generating synthetic data samples to augment the training dataset, effectively serving as regularization for the learning models. Data augmentation has been actively and successfully employed in computer vision [5], [6], [7] and speech recognition tasks [8], [9]. In these domains, methods frequently rely on human knowledge to apply label-invariant data transformations, such as image scaling, flipping, and rotation. However, natural language processing presents a different challenge, as there are no straightforward rules for label-invariant transformations in textual data. Even slight changes in a word within a sentence can drastically alter its meaning. Consequently, popular data augmentation techniques in NLP focus on transforming text through word replacements, either using synonyms from manually curated ontologies, such as WordNet [10], or leveraging word similarity measures [11], [12]. Nonetheless, this synonym-based approach can only be applied to a limited portion of the vocabulary since finding words with precisely or nearly identical meanings is rare. Furthermore, certain NLP data augmentation methods are specifically designed for particular domains, rendering them less adaptable to other domains [13]. As a result, developing more versatile and effective data augmentation techniques remains a significant research challenge in the field of NLP. In recent research, a straightforward yet highly impactful data augmentation technique called Mixup [7] has been introduced, demonstrating remarkable effectiveness in improving the accuracy of image classification models. This method operates by linearly interpolating the pixels of randomly paired images along with their corresponding training targets, thereby generating synthetic examples for the training process. The application of Mixup as a training strategy has proven to be highly effective in regularizing image classification networks, leading to notable performance improvements. Mixup methodologies can be classified into input-level Mixup [14], [15], [16] and hidden-level Mixup [17] depending on where the mix operation occurs. In the context of natural language processing (NLP), applying Mixup poses greater challenges compared to computer vision due to the discrete nature of text data and the variability in sequence lengths. As a result, prior efforts in Mixup for textual data [18], [19] have put forth two strategies for its application in text classification: one involves performing interpolation on word embeddings, while the other applies it to sentence embeddings. This incentive drives us to explore Mixup text techniques for low-resource languages, specifically concentrating on Arabic sentiment classification. Our study involves a comparative analysis of basic LSTM classification models, both with and without the incorporation of Mixup techniques.
Furthermore, we conduct experiments on diverse datasets, spanning sample sizes varying from hundreds to thousands per class. Additionally, we perform an ablation study to investigate the effects of different Mixup parameter values. To the best of our knowledge, this represents the pioneering research utilizing Mixup in the context of Arabic text classification.

RELATED WORKS

Data augmentation is a methodology employed to enhance the diversity and quality of data without the need to collect additional samples directly. The concept of data augmentation was initially applied in computer vision [20], where the manipulation of individual pixels in an image allows for modifications such as adding noise, cropping, padding, or flipping without compromising the underlying information. In various domains, such as image classification [21], [22], [23] and sound classification [24], augmenting datasets with perturbed replicas of their samples has proven to be highly effective. However, the application of data augmentation techniques in Natural Language Processing remains relatively unexplored. Unlike image-related techniques that can generate new images with preserved semantic information through flipping and rotation, such approaches cannot be directly applied to text due to potential disruptions in syntax, grammar, and changes in the original sentence’s meaning. Moreover, while noise injection is commonly used to enhance audio signal data [8], [25], [26], its direct suitability for text is limited by the categorical nature of word and character tokens. Text data augmentation can be categorized into two main approaches: the feature space, including Mixup [7], and the data space [27]. In the data space, augmentation involves transforming raw data into readable textual data. The data space, as described in [27], is further divided into four categories: character level, word level, phrase/sentence level, and document level. In the feature space, Mixup opts for virtual embeddings rather than generating augmented samples in natural language form. This methodology leverages existing data to sample points in the virtual vector space, potentially resulting in sampled data with labels distinct from those of the original dataset. The inception of Mixup traces back to the domain of image processing, originating from the work of [7]. Expanding upon this foundation, two variations of Mixup tailored specifically for sentence classification were introduced by [18]. Mixup has witnessed widespread adoption in recent Natural Language Processing (NLP) research. In the field of neural machine translation, [28] pioneered the construction of adversarial samples, utilizing the methodology introduced by [29]. They subsequently implemented two Mixup strategies, namely Padv and Paut. Padv involves interpolation between adversarial samples, while Paut interpolates between the two corresponding original samples. Concurrently, [30] incorporated Mixup into Named Entity Recognition (NER), presenting both Intra-LADA and InterLADA approaches. The authors of [31] introduced Mixup-Transformer, a novel approach integrating Mixup with a transformer-based pre-trained architecture, and evaluated its efficacy across various text classification datasets. The method proposed in [32], Saliency-Based Span Mixup (SSMix), distinguishes itself by performing operations directly on the input text rather than on hidden vectors, as in previous approaches. From the available literature, it appears that only a limited number of recent Arabic research studies have primarily focused on the data space. For instance, Duwairi et al. [33] employed a set of rules to modify or swap branches of parse trees according to Arabic syntax, generating new sentences with the same labels. Similarly, in Named Entity Recognition, Sabty et al. [34] explored various data augmentation techniques, such as random word insertion, swapping, and deletion within sentences, sentence back-translation, and word embedding substitution, which have also been utilized in other research, like [35], for spam detection.

MATERIALS AND METHODS

Mixup Concept

The concept of Mixup involves creating a synthetic sample through linear interpolation of a pair of training samples and their corresponding model targets. To elaborate, let us consider a pair of samples denoted as (xi, yi) and (xj, yj), where x represents the input data, and y is the one-hot encoding of the respective class label for each sample. The synthetic sample is generated as:

x̃ = λ xi + (1 − λ) xj,   ỹ = λ yi + (1 − λ) yj

where λ is either a fixed value in [0, 1] or is sampled from a Beta distribution Beta(α, α) with hyper-parameter α. The synthetic data generated using this approach are subsequently introduced into the model during training, aiming to minimize the loss function, such as the cross-entropy function typically employed in supervised classification tasks. To achieve computational efficiency, the mixing process involves randomly selecting one sample and pairing it with another sample drawn from the same mini-batch.
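
A minimal sketch of this batch-level mixing, assuming PyTorch tensors and one-hot encoded targets (the function name and shapes are illustrative):

import torch

def mixup_batch(x, y_onehot, alpha=1.0):
    """Pair each sample with a random sample from the same mini-batch and interpolate."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item() if alpha > 0 else 1.0
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix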

Mixup for text classification

In contrast to images that comprise pixels, sentences are composed of sequences of words. Consequently, constructing a meaningful sentence representation involves aggregating information from this word sequence. In typical CNN or LSTM models, a sentence is initially represented as a sequence of word embedding and then processed through a sentence encoder. Commonly used sentence encoders include CNN and LSTM architectures. The resulting sentence embedding, generated by either CNN or LSTM, is subsequently passed through a softmax layer to generate the predictive distribution encompassing the possible target classes for making predictions. In [18], Guo introduced two variations of Mixup tailored for sentence classification. The first variant, referred to as wordMixup, employs sample interpolation within the word embedding space. The second variant, known as senMixup, performs interpolation on the final hidden layer of the network just before it is fed into the standard softmax layer to generate the predictive distribution across classes. Specifically, in the wordMixup technique, all sentences are first zero-padded to a uniform length. Subsequently, interpolation is performed for each dimension of every word within a sentence. Let us consider a given text, such as a sentence consisting of N words, which can be represented as a matrix B in an N × d form. Here, each row t of the matrix corresponds to an individual word, denoted as Bt, and is represented by a d-dimensional vector obtained either from a pre-trained word embedding table or randomly generated. To formalize the process, let (Bi, yi) and (Bj, yj) be a pair of samples, where Bi and Bj represent the embedding vectors of the input sentence pairs, and yi and yj correspond to their respective class labels, represented in a one-hot format. For a specific word at the t-th position in the sentence, the interpolation procedure is applied. The process can be formulated as:

B̃ij,t = λ Bi,t + (1 − λ) Bj,t,   ỹij = λ yi + (1 − λ) yj

The obtained novel sample (B̃ij, ỹij) is subsequently employed for training purposes. As for senMixup, the hidden embeddings for both sentences, which have identical dimensions, are first generated using an encoder such as a CNN or LSTM. Following this, the pair of sentence embeddings, f(Bi) and f(Bj), is linearly interpolated. In more detail, let f represent the sentence encoder; thus, the sentences Bi and Bj are first encoded into their corresponding sentence embeddings, f(Bi) and f(Bj), respectively. In this scenario, the mixing process is applied to each k-th dimension of the sentence embedding as follows:

f̃(Bij)[k] = λ f(Bi)[k] + (1 − λ) f(Bj)[k]

senMixup applies Mixup directly before the softmax, while we experimented with an additional Mixup variant that works on the output of the hidden layers, similar to [17], applying Mixup before the final linear layer. The proposed models’ structures are represented in Fig. 1.
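
To illustrate where the interpolation can take place in a text classifier, the sketch below mixes either the padded word embeddings (wordMixup) or the final sentence embedding before the softmax layer (senMixup); the LSTM encoder, dimensions, and flag names are placeholders rather than the exact models of this study.

import torch
import torch.nn as nn

class MixupLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def encode(self, ids):
        _, (h, _) = self.lstm(self.emb(ids))
        return h[-1]                                     # sentence embedding

    def forward(self, ids_i, ids_j=None, lam=1.0, mix_level="none"):
        if mix_level == "word" and ids_j is not None:    # wordMixup: mix padded embeddings
            e = lam * self.emb(ids_i) + (1 - lam) * self.emb(ids_j)
            _, (h, _) = self.lstm(e)
            s = h[-1]
        elif mix_level == "sen" and ids_j is not None:   # senMixup: mix sentence embeddings
            s = lam * self.encode(ids_i) + (1 - lam) * self.encode(ids_j)
        else:
            s = self.encode(ids_i)
        return self.fc(s)                                # logits fed to softmax/cross-entropy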

Datasets

We performed experiments using 8 Arabic sentiment classification benchmark datasets: ArSarcasm v1 [36] & v2 [37], SemEval2017 [38], ArSenTD-LEV [39], AJGT [40], ASTD-3C [41], and MOV [42]. The training sets differ in size (from 1,524 to 12,548 samples) and in the number of labels (2 to 5). The datasets used are summarized in Table 1.

Preprocessing

The effectiveness of sentiment analysis models greatly depends on the quality of data preprocessing, which is as critical as the model’s architectural design. Preprocessing involves cleaning and preparing the text data for the classification process. Textual data, particularly when sourced from the internet, tends to be unstructured, necessitating additional processing steps for proper classification. The noise removal step during text data preprocessing involves eliminating several elements to enhance data quality. These elements encompass punctuation marks, numbers, non-Arabic text, URL links, hashtags, emojis, extra characters, diacritics, and elongated letters. Regular expressions serve as the primary technique for noise removal, effectively filtering out unwanted text.
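
A rough example of such regex-based cleaning is shown below; the exact patterns and character ranges are illustrative assumptions, not the authors' preprocessing script.

import re

def clean_arabic_text(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)               # URL links
    text = re.sub(r"[@#]\w+", " ", text)                        # mentions and hashtags
    text = re.sub(r"[\u064B-\u0652\u0640]", "", text)           # diacritics and tatweel (elongation)
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)             # drop non-Arabic characters
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                  # shorten elongated letters
    return re.sub(r"\s+", " ", text).strip()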

Experimental environment and hardware

The experiments were developed using Python 3.9.7. The experiments, including their development, implementation, execution, and analysis, were conducted on an ASUS ROG G531GT Notebook. This machine runs Windows 11 and is equipped with a 9th generation Intel Core i7 processor, 32GB of RAM, a 512GB NVMe SSD, and an NVIDIA GeForce GTX 1650 4GB graphics card. The software libraries used in this study include PyTorch, Scikit-learn, Pandas, Gensim, and NumPy.

Model

We conducted an evaluation of wordMixup and senMixup using both CNN and LSTM architectures for sentence classification. In our setup, we employed filter sizes of 3, 4, and 5, each configured with 100 feature maps, and a dropout rate of 0.5 for the baseline CNN. For the LSTM model, we utilized 3 hidden layers, each comprising 100 hidden state dimensions, with the activation function set to tanh. Additionally, the mixing policy parameter was set to the default value of one. In cases where datasets lacked a standard test set, we adopted cross-validation with a k-fold value of 5 and reported the average performance metrics. Our training process utilized the Adam optimizer [36] with mini-batches of size 32, 30 epochs, and a learning rate of 1e-3. For word embeddings, we employed 100-dimensional AraVec embeddings.
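
For concreteness, a baseline CNN matching the stated configuration (filter sizes 3, 4, and 5 with 100 feature maps each and a dropout rate of 0.5) could be assembled as in the sketch below; this is our own reading of the described hyper-parameters, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, 100, k) for k in (3, 4, 5)])    # 100 feature maps per filter size
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(3 * 100, n_classes)

    def forward(self, ids):                        # ids: (B, T) token indices
        e = self.emb(ids).transpose(1, 2)          # (B, emb_dim, T)
        feats = [F.relu(conv(e)).max(dim=2).values for conv in self.convs]
        return self.fc(self.drop(torch.cat(feats, dim=1)))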

RESULT

The four variations of models evaluated are None (without Mixup), Mix-embed (Mixup at the word embedding), Mix-encoder (Mixup at the encoder level), and Mix-output (Mixup at the output level). Table 2 and Table 3 present the results of the different experiments using the LSTM and CNN models, respectively. We observe a general improvement in the performance of the different Mixup models across the various datasets.

 

DISCUSSION

Across the datasets, it is evident that applying Mixup techniques generally leads to slight improvements in accuracy compared to the baseline None model. However, the effectiveness of Mixup varies depending on the dataset. For instance, on the AJGT dataset, all Mixup variants consistently outperform the None model, with Mix-encoder and Mix-output achieving the highest accuracy of 85.2%. On the other hand, for the SemEval2017 and ArSenTD-LEV datasets, Mixup provides only marginal gains, suggesting that the impact of Mixup might be more prominent in certain scenarios. Additionally, while Mixup seems to be beneficial in some cases, it does not necessarily lead to performance improvements across all datasets. For instance, on the MOV dataset, the Mixup variants show comparable or slightly worse results compared to the None model. Furthermore, it is worth noting that the Mix-encoder and Mix-output models tend to perform better than the Mix-embed model in most cases. This could be attributed to the advantage of applying Mixup at the higher levels of the model architecture, which allows the model to capture more abstract and meaningful patterns. Mixup augments data by interpolating sequences, which can create new variations that capture a broader range of sequential patterns. LSTMs, with their capability to understand and generalize sequences over long contexts, can leverage these variations more effectively than CNNs, which focus more on local patterns and may not fully utilize the sequential nature of the augmented data. Overall, these results demonstrate that Mixup techniques can be advantageous to sentiment analysis tasks, but their effectiveness is influenced by the dataset characteristics and the specific Mixup strategy used.

CONCLUSIONS AND RECOMMENDATIONS

Taking inspiration from the promising results of Mixup, a data augmentation technique based on sample interpolation used in image recognition and text classification, we conducted an investigation into three variations of Mixup for the Arabic sentiment classification task, which is the first study on Mixup in Arabic to our knowledge. Our experiments demonstrate that the application of Mixup leads to improvements in accuracy and Macro F1 scores for both CNN and LSTM text classification models. Notably, our findings highlight the effectiveness of interpolation strategies as a domain-independent regularizer, effectively mitigating the risk of overfitting in sentence classification. These results underscore the potential of Mixup as a valuable tool in the field of NLP for enhancing model generalization and performance across various sentence classification tasks. In our future research endeavors, we have outlined our intentions to explore and examine further proposed variations of Mixup. Among these variants are AutoMix [44], a method that adaptively learns a sample mixing policy by leveraging discriminative features, SaliencyMix [32], which synthesizes sentences while maintaining the contextual structure of the original texts through span-based mixing, and EMTCNN [45], an Enhanced Mixup that leverages transfer learning to address challenges in Twitter sentiment analysis. We are also interested in questions related to the visual appearance of mixed sentences and the underlying mechanisms responsible for the efficacy of interpolation in sentence classification. These inquiries will provide valuable insights into the potential applications and benefits of various Mixup techniques, contributing to the advancement of NLP tasks, particularly those focused on sentence classification.

Syrian Expatriate Research Conference (SERC): An Opportunity to Help!

For five consecutive years, the Higher Commission for Scientific Research (HCSR) has held the Syrian Expatriate Researcher Conference (SERC), starting in 2019, under the slogan “Toward a Syrian knowledge-based economy”. This event is where Syrian researchers from all over the world meet physically or virtually to discuss science and propose different ways to develop the productive and service sectors in Syria, contributing to its rebuilding. This year, the sixth edition of SERC will be held in Damascus between 29 and 31 July 2024. Since 2019, SERC has turned into an annual scientific forum where close to 100 expatriate Syrian researchers from over 25 countries meet and discuss science with hundreds of their counterparts who work at Syrian universities and research centers, exchanging ideas and expertise in multidisciplinary sessions and topics, and shedding light on novel advancements in frontier technologies including nanotechnology, biotechnology, ICT, artificial intelligence, renewable energy, environment, construction, etc. SERC also underscores the eagerness of Syrian expatriate researchers to offer the knowledge they have to help repair the severely damaged infrastructure and contribute to the urgent need for the rebuilding of Syria. In fact, SERC is an opportunity for both Syrian officials and researchers abroad. The former should provide whatever it takes to facilitate the return of those Syrian intellectuals back home, or at least guarantee their sustainable contribution to the advancement of science in Syria. The latter, including thousands of researchers and scientists who left Syria before and after the crisis, should continue to offer support regardless of their political, social, or economic stance… for both parties, it is an opportunity to help and to preserve pride and dignity!

Book Review: Empowering Knowledge and Innovation.. Challenges for the Arab Countries

Artificial Intelligence in Gastrointestinal Endoscopy

About The Journal

Journal:Syrian Journal for Science and Innovation
Abbreviation: SJSI
Publisher: Higher Commission for Scientific Research
Address of Publisher: Syria – Damascus – Seven Square
ISSN – Online: 2959-8591
Publishing Frequency: Quarterly
Launched Year: 2023
This journal is licensed under a: Creative Commons Attribution 4.0 International License.