Bigdata Engineer Resume

Plano, TX

SUMMARY

  • 9+ years of experience in the design and deployment of enterprise applications, web applications, client-server technologies, and web programming using Java and big data technologies.
  • Expertise in Hadoop architecture and ecosystem components such as HDFS, MapReduce, Pig, Hive, Sqoop, Flume, and Oozie.
  • Thorough understanding of Hadoop daemons such as JobTracker, TaskTracker, NameNode, and DataNode, and of the MRv1 and YARN architectures.
  • Experience in installing, configuring, managing, supporting, and monitoring Hadoop clusters using distributions such as Apache Hadoop, Cloudera, and Hortonworks, and cloud services such as AWS and GCP.
  • Experience in installing and configuring Hadoop stack elements: MapReduce, HDFS, Hive, Pig, Sqoop, Flume, Oozie, and Zookeeper.
  • Experience in data processing and analysis using Spark, HiveQL, and SQL.
  • Extensive experience in writing user-defined functions (UDFs) in Hive and Spark.
  • Worked with Apache Sqoop to import and export data between HDFS and RDBMS/NoSQL databases.
  • Worked with NoSQL databases such as HBase and Spark-Redis.
  • Experience in workflow scheduling and Job Designer with the help of Oozie and Airflow.
  • Experience in using Amazon Web Services (AWS) cloud services including EC2, S3, AWS Lambda, Athena, and EMR; used Redshift for migration.
  • Involved in developing Docker images and deploying Docker containers in Swarm.
  • Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.
  • Extensive knowledge of AWS services like EMR and EC2 for fast and efficient processing of large volumes of data, along with machine learning models.
  • Worked extensively with semi-structured data (fixed-length and delimited files) for data sanitization, report generation, and standardization.
  • Involved in troubleshooting and performance tuning of reports and resolving issues within Tableau Server and reports.
  • Extensive experience working with AWS cloud services and AWS SDKs to work with services like AWS API Gateway, Lambda, S3, IAM, and EC2.
  • Experienced in monitoring Hadoop cluster using Cloudera Manager and Web UI.
  • Excellent understanding of Zookeeper for monitoring and managing Hadoop jobs.
  • Experience with the NumPy, Matplotlib, pandas, Seaborn, and Plotly Python libraries. Worked on large datasets using PySpark, NumPy, and pandas.
  • Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, random forests, K-means, and KNN for data analysis.
  • Developed scripts and batch jobs to schedule various Hadoop programs. Used TensorFlow to train models on the data.
  • Used Git and Ant/Maven for project dependency management, build, and deployment.
  • Populated HDFS with vast amounts of data using Apache Kafka and Flume.
  • Experience in Kafka installation and integration with Spark Streaming.
  • Extensive experience with relational and non-relational databases, including Oracle, PL/SQL, SQL Server, MySQL, and DB2.
  • Good team player with strong interpersonal, organizational, and communication skills combined with self-motivation, initiative, and project management attributes.

TECHNICAL SKILLS

Hadoop Core Services: HDFS, MapReduce, Spark, Spark SQL, PySpark, YARN.

Hadoop Distribution: Cloudera, Hortonworks, Apache Hadoop.

Databases: HBase, Spark-Redis, Cassandra, Oracle, MySQL, PostgreSQL, Teradata.

Data Services: Hive, Pig, Impala, Sqoop, Flume, Kafka.

Scheduling Tools: Zookeeper, Oozie.

Monitoring Tools: Cloudera Manager

Cloud Computing Tools: AWS, GCP

Programming Languages: C, Java, Scala, Python, R, SQL, PL/SQL, Pig Latin, HiveQL, Unix, JavaScript, Shell Scripting.

Operating Systems: UNIX, Windows, LINUX.

Build Tools: Jenkins, Maven, Docker, ANT, Git.

Development Tools: Eclipse, NetBeans, Microsoft SQL Studio, Toad.

PROFESSIONAL EXPERIENCE

Confidential, Plano, TX

BigData Engineer

Responsibilities:

  • Developed simple and complex Spark jobs in Python for data analysis across different data formats.
  • Developed upgrade and downgrade scripts in SQL that filter out corrupted records and records with missing values, and identify unique records based on different criteria.
  • Developed Kafka producers and consumers, and Spark and Hadoop MapReduce jobs.
  • Involved in infrastructure development and operations. Designed and deployed applications using AWS services like EC2, S3, Glue, Lambda, EMR, VPC, RDS, Auto Scaling, CloudFormation, CloudWatch, Redshift, Athena, Kinesis Data Firehose, and Kinesis Data Streams.
  • Configured and launched various AWS EC2 instances, and configured AWS Route 53 to route traffic between different regions.
  • Worked on cloud deployments using Maven, Docker, and Jenkins.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Maintained Tableau functional reports based on user requirements.
  • Designed and developed Spark code using Scala, PySpark, and Spark SQL for high-speed data processing to meet critical business requirements.
  • Developed a pipeline for continuous data ingestion using Kafka and Spark Streaming.
  • Performed ETL operations using Python, Spark SQL, S3, and Redshift on terabytes of data to obtain customer insights.
  • Fetched live data from an Oracle database using Spark Streaming and Amazon Kinesis, using the feed from an API Gateway REST service.
  • Analyzed the SQL scripts and designed a solution to implement them using PySpark.
  • Automated the CI/CD pipeline using Jenkins, build-pipeline-plugin, Maven, and Git.
  • Implemented custom data types, InputFormat, RecordReader, OutputFormat, and RecordWriter for Spark job computation to handle custom business requirements.
  • Worked with Parquet files, CSV files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
  • Implemented daily cron jobs across AWS EC2 instances for automating parallel tasks for loading the data into SQL tables and Spark-Redis database.
  • Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
  • Used Airflow for scheduling and monitoring workflows and architecting complex data pipelines.
  • Responsible for performing extensive data validation using SQL.
  • Worked with Sqoop import and export functionalities to handle large data set transfers between Oracle databases and HDFS.
  • Involved in submitting and tracking spark jobs using Dkron.
  • Involved in creating Dkron workflows and coordinator jobs to kick off jobs based on time and data availability.
  • Developed Spark scripts to load data from Hive to Amazon RDS (Aurora) at a faster rate.
  • Involved in loading the created SQL table data into Spark-Redis for faster access to the large customer base without taking a performance hit.
  • Implemented Hive generic UDFs for business logic.
  • Coordinated with end users on the design and implementation of analytics solutions for user-based recommendations using Python, per project proposals.
  • Created Python/SQL scripts to transform Databricks notebooks from Redshift tables into Snowflake S3 buckets.
  • Worked with AWS services like S3, Glue, EMR, SNS, SQS, Lambda, EC2, RDS, and Athena to automate and maintain data pipelines for downstream customers.
  • Involved in converting Hive/SQL queries into Spark (RDDs, DataFrames, and Datasets) using Python and Scala.
  • Involved in creating microservices using Scala.
  • Involved in running Hive queries through Spark SQL integrated with the Spark environment.
  • Implemented test scripts to support test driven development and continuous integration.
  • Used the JUnit framework to perform unit and integration testing.
  • Configured build scripts for multi-module projects with Maven and Jenkins CI.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.

Environment: Spark, Spark SQL, PySpark, Spark Streaming, Tableau, Scala, Hadoop, Hive, Sqoop, Play framework, Data Pipeline, Apache Ranger, ETL, S3, EMR, EC2, Redshift, CloudWatch, SNS, SQS, Lambda, MapReduce, Kafka, Snowflake, Zeppelin, Kinesis Firehose, Kinesis Data Streams, Python, Athena, Docker, Jenkins, RDS, Rundeck, AWS Glue, Git.

Confidential, Plano, TX

Data Engineer

Responsibilities:

  • Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.
  • Used Spark for interactive queries and streaming data processing, integrated with popular NoSQL databases.
  • Worked with AWS cloud services (EC2, S3, EBS, ELB, CloudWatch, Elastic IP, RDS, SNS, SQS, Glacier, IAM, VPC, CloudFormation, Athena, Redshift, Route 53, CloudFront, CloudTrail).
  • Loaded data into S3 buckets using AWS Glue and PySpark.
  • Performed interactive analytics such as cleansing, validation, and quality checks on data stored in S3 buckets using AWS Athena.
  • Managed security groups on AWS, focusing on high-availability, fault tolerance, and auto-scaling.
  • Converted existing Terraform modules that had version conflicts to use CloudFormation templates during deployments; worked with Terraform to create stacks in AWS.
  • Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts.
  • Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Maintained Tableau functional reports based on user requirements.
  • Used PySpark, Spark MLlib to perform Classification, Regression and Clustering on data.
  • Developed Spark code using Scala and Spark SQL for faster processing of data.
  • Involved in migrating MapReduce jobs into Spark jobs, and used Spark SQL to load structured and semi-structured data into Spark clusters.
  • Imported real-time and batch data from various sources into S3 and used AWS Lambda for processing applications in Snowflake.
  • Developed a pipeline for continuous data ingestion using Kafka and Spark Streaming.
  • Wrote Sqoop scripts for importing large data sets from Teradata into HDFS.
  • Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
  • Performed Data Ingestion from multiple internal clients using Apache Kafka.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
  • Responsible for analyzing and cleansing raw data by performing Hive queries and running scripts for operations like duplicate check, null check etc. on data.
  • Developed pipelines for auditing the metrics of all applications using GCP Cloud Functions and Dataflow for a pilot project.
  • Developed an end-to-end pipeline that exports data from Parquet files in Cloud Storage to GCP Cloud SQL.
  • Maintained fully automated CI/CD pipelines for code deployment (GitLab/Jenkins/IBM UC Deploy).
  • Created Oozie workflows to run multiple Spark jobs.
  • Developed file cleaners using Python libraries.
  • Performed K-means clustering, Multivariate analysis and Support Vector Machines in Python and R.
  • Worked with NoSQL databases like Cassandra.
  • Explored Spark for improving the performance and optimization of existing Hadoop algorithms using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Used Terraform scripts to automate step execution in EMR to load data into ScyllaDB.
  • Denormalized data coming from Netezza as part of the transformation and loaded it into NoSQL databases and MySQL.

Environment: Spark, Spark SQL, Spark Streaming, Tableau, PySpark, Python, Scala, Cloudera Hadoop, NoSQL, S3, EC2, IAM, CloudWatch, EMR, Lambda, Glue, Redshift, Athena, Pipeline, ETL, Snowflake, MapReduce, Kafka, HDFS, MySQL, Hortonworks, Cloudera Manager, Hive, Pig, Sqoop, Oozie, Flume, Linux, AWS, GCP, Zookeeper, LDAP, Git.

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Worked with key stakeholders of different business groups to identify the core requirements for building the next-generation analytics solution, using Impala as the processing framework and Hadoop for storage on the current dealer data lake.
  • Involved in migrating MapReduce jobs into Spark jobs, and used Spark SQL to load structured and semi-structured data into Spark clusters.
  • Developed Kafka producers and consumers, and Spark and Hadoop MapReduce jobs.
  • Orchestrated hundreds of Sqoop scripts, Python scripts, and Hive queries using Oozie workflows and sub-workflows.
  • Analyzed HBase data in Hive by creating external partitioned and bucketed tables.
  • Involved in using HCatalog to access Hive table metadata from MapReduce and Pig code.
  • Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Wrote a generic, extensive data quality check framework for the application using Impala.
  • Designed and developed Spark code using Scala, PySpark, and Spark SQL for high-speed data processing to meet critical business requirements.
  • Tuned performance in Hive and Impala using methods including, but not limited to, dynamic partitioning, bucketing, indexing, file compression, vectorization, and cost-based optimization.
  • Implemented the Fair Scheduler on the JobTracker to allocate a fair share of resources to small jobs.
  • Implemented PySpark scripts to classify data into different categories based on record type. Assisted in monitoring the Spark cluster.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Loaded data and financial histories into HDFS. Used the Apache Hue web interface to monitor the Hadoop cluster and run jobs.
  • Ingested log data into an ETL pipeline that transforms the text-format data and loads it into HDFS.
  • Implemented automatic failover using Zookeeper and the Zookeeper failover controller.
  • Developed a data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data.
  • Participated in daily scrum meetings and iterative development.
  • Involved in troubleshooting and performance tuning of reports and resolving issues within Tableau Server and reports.
  • Used AWS cloud services to launch Linux and Windows machines, created security groups, and wrote basic PowerShell scripts to take backups and mount network shared drives.
  • Created S3 buckets and bucket policies, worked on IAM role-based policies, and customized the JSON templates.
  • Used Amazon IAM to grant fine-grained access to AWS resources. Also managed user roles and permissions for the AWS account through IAM.
  • Maintained build profiles in Team Foundation Server and Jenkins for the CI/CD pipeline.
  • Configured and implemented the Amazon EC2 instances for our application teams.
  • Extensively used CloudFormation templates for deploying infrastructure. Wrote CloudFormation scripts for data lake components that use various AWS services such as Data Pipeline, Lambda, Elastic Beanstalk, SQS, SNS, and RDS.
  • Configured Ansible to manage AWS environments and automate the build process for core AMIs used by all application deployments, including Auto Scaling and CloudFormation scripts.
  • Used Route 53 for DNS management, Amazon S3 to back up database instances and save data snapshots, and managed network allocation in VPC to create new public networks.
  • Created alarms in CloudWatch for monitoring server performance, CPU utilization, and disk usage.
  • Used Git for version control and Jira for project tracking.
  • Used Jira as a ticket tracking and workflow tool.
  • Executed and maintained internal and external SLAs developed with business stakeholders.

Environment: Hadoop, Spark, PySpark, Spark Streaming, Scala, MapReduce, EC2, Hive, HDFS, Pig, Pipeline, Sqoop, Flume, Kafka, HBase, Spark SQL, Zookeeper, AWS, MySQL, Impala, Python, S3, IAM, ETL, Lambda, Route 53, SNS, UNIX, Git.

Confidential, Houston, TX

BigData Engineer

Responsibilities:

  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data from relational databases into HDFS using Hive.
  • Created data models and database designs per technical requirements.
  • Envisioned and designed initial models and the end-to-end workflow for the Risk Application by interacting with clients and product managers.
  • Defined data extraction and data ingestion strategies from legacy applications into the big data platform as part of migration.
  • Developed Python scripts to import and export data from relational sources and handled incremental loading of customer and transaction data by date.
  • Migrated an existing Java application into microservices using Spring Boot and Spring Cloud.
  • Working knowledge of different IDEs like Eclipse and Spring Tool Suite.
  • Used Git and Ant/Maven for project dependency management, build, and deployment.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Imported data from different sources like HDFS/HBase into Spark RDDs.
  • Experienced with batch processing of data sources using Apache Spark.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Worked on partitioning Hive tables and running the scripts in parallel to reduce their run time.
  • Built statistical models on AWS EMR by uploading data to S3 and creating instances on EC2.
  • Used PySpark DataFrames to read text, CSV, and image data from HDFS, S3, and Hive.
  • Worked with data serialization formats for converting complex objects into byte sequences, using Avro, Parquet, JSON, CSV, and RC formats.
  • Responsible for analyzing and cleansing raw data by performing Hive queries and running scripts for operations like duplicate check, null check etc. on data.
  • Responsible for developing a data pipeline with AWS to extract data from weblogs and store it in HDFS.
  • Involved in administering, installing, upgrading, and managing distributions of Hadoop, Hive, and HBase.
  • Involved in troubleshooting and performance tuning of Hadoop clusters.
  • Used Jenkins as a Continuous Integration (CI) tool.
  • Created Hive tables, loaded data, and wrote Hive queries that run within MapReduce.
  • Implemented business logic by writing Hive UDFs in Java.
  • Developed shell scripts and some Perl scripts based on user requirements.

Environment: Hadoop, Python, PySpark, Spark Streaming, Data Pipeline, HDFS, Hive, SQL, Spark, Spark SQL, EMR, S3, EC2, Shell scripting, Avro, Cron Jobs, Git, Ant, Maven, MapReduce, Scala, AWS.

Confidential, Irving, TX

Hadoop Developer

Responsibilities:

  • Installed, configured, and maintained Apache Hadoop clusters for application development and major components of Hadoop Ecosystem: Hive, Pig, HBase, Sqoop, Flume, Oozie and Zookeeper.
  • Implemented a six-node CDH4 Hadoop cluster on CentOS.
  • Imported and exported data into HDFS and Hive from different RDBMSs using Sqoop.
  • Experienced in defining job flows to run multiple MapReduce and Pig jobs using Oozie.
  • Imported log files into HDFS using Flume and loaded them into Hive tables to query data.
  • Monitored running MapReduce programs on the cluster.
  • Responsible for loading data from UNIX file systems to HDFS.
  • Used HBase-Hive integration and wrote multiple Hive UDFs for complex queries.
  • Involved in writing APIs to read HBase tables, cleanse data, and write to another HBase table.
  • Created multiple Hive tables, implemented Partitioning, Dynamic Partitioning and Buckets in Hive for efficient data access.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
  • Experienced in running batch processes using Pig Scripts and developed Pig UDFs for data manipulation according to Business Requirements.
  • Experienced in writing programs using HBase Client API.
  • Involved in loading data into HBase using HBase Shell, HBase Client API, Pig and Sqoop.
  • Experienced in design, development, tuning, and maintenance of NoSQL databases.
  • Wrote MapReduce programs in Python with the Hadoop Streaming API.
  • Developed unit test cases for Hadoop MapReduce jobs with MRUnit.
  • Excellent experience in ETL analysis, designing, developing, testing, and implementing ETL processes, including performance tuning and database query optimization.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Automated build and deployment using Jenkins to reduce human error and speed up production processes.
  • Used Maven as the build tool and SVN for code management.
  • Worked on writing RESTful web services for the application.
  • Used Git as a version control tool.
  • Implemented testing scripts to support test driven development and continuous integration.

Environment: Hadoop, MapReduce, NoSQL, Python, HDFS, Sqoop, ETL, HBase, Hive, Impala, Pig, Java, SQL, Ganglia, Flume, Oozie, Unix, JavaScript, Git, Maven, Jenkins, Eclipse.
