Big Data Engineer/Spark Resume
St Louis, MO
SUMMARY
- 8+ years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions with HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Working experience with Linux distributions such as Red Hat and CentOS.
- Experience with ETL concepts using Informatica PowerCenter and Ab Initio.
- Good knowledge of AWS components such as EC2, S3, and EMR.
- Comprehensive knowledge of Software Development Life Cycle (SDLC).
- Exposure to Waterfall, Agile and Scrum models.
- Quick to master new technologies, with a keen awareness of industry developments and next-generation programming solutions.
- Experienced with job orchestration, scheduling, and monitoring tools such as Oozie and Airflow.
- Monitored Event Viewer, MS SQL error logs, and Log File Viewer for software- and hardware-related errors.
- Expertise in building PySpark and Scala applications for interactive analysis, batch processing & stream processing.
- Good experience using relational databases (Oracle, SQL Server, PostgreSQL), along with DevOps tooling and GCP.
TECHNICAL SKILLS
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, Airflow, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, ZooKeeper, Hue, Ambari Server
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, AJAX, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
Public Cloud: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, Ant, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza, GraphDB
Operating Systems: Windows (all versions), UNIX, Linux, macOS, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, St. Louis, MO
Big Data Engineer/Spark
Responsibilities:
- Evaluated client needs and translated business requirements into functional specifications, onboarding clients onto the Hadoop ecosystem.
- Extracted and updated data in HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join dataset scripts using Hive join operations.
- Created various Hive external and staging tables and joined them as per requirements. Implemented static partitioning, dynamic partitioning, and bucketing.
- Worked with various HDFS file formats such as Parquet and JSON for serializing and deserializing data.
- Developed end-to-end automation by automating data movement between servers and cloud solutions (GCP and Azure).
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Azure, PySpark, Impala, Tealeaf, pair RDDs, NiFi, DevOps, and Spark on YARN.
- Developed Python scripts for unit testing of data link software according to software requirements.
- Created data pipelines with Airflow to schedule PySpark jobs for incremental loads and used Flume for web server log data. Wrote Airflow scheduling scripts in Python (a minimal DAG sketch follows this responsibilities list).
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for large data volumes.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed a PySpark script to protect raw data by hashing client-specified columns (a minimal sketch also follows this list).
- Worked on data migration from legacy tools to Azure and GCP.
- Used Spark DataFrame operations to perform the required validations and analytics on Hive data.
- Experience in database design and development using SQL Azure, Microsoft SQL Server, and GraphDB.
- Experience migrating data from Excel, flat files, and Oracle to MS SQL Server using SSIS.
- Implemented efficient data access to GraphDB nodes and triples from Turtle files using SPARQL.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
- Developed Apache Spark applications for data processing from various streaming sources.
- Strong knowledge of the architecture and components of Tealeaf; efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading, and streamed data to Spark Streaming using Kafka.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, including creating DataFrames handled in Spark with Scala.
- Provided guidance to the development team working on PySpark as an ETL platform.
- Good experience with NoSQL databases and hands-on experience writing applications against HBase, Cassandra, and MongoDB.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Extensively worked on naming standards that incorporated enterprise data modeling.
- Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Designed MapReduce and YARN flows, wrote MapReduce scripts, and performed performance tuning and debugging.
- Designed and created ETL packages using SSIS and transferred data from an Oracle source to an MS SQL Server destination.
- Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
- Installed application on AWS EC2 instances and configured the storage on S3 buckets.
- Deployed and designed pipelines through Azure data factory and debugged the process for errors.
- Stored data in AWS S3 (similar to HDFS) and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
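A minimal sketch of the kind of Airflow scheduling script referenced above. The DAG id, schedule, and spark-submit path are illustrative assumptions, and the PySpark job is submitted through a BashOperator for simplicity:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",   # hypothetical owner
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="pyspark_incremental_load",   # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark incremental-load job; the script path and arguments are illustrative.
    run_incremental_load = BashOperator(
        task_id="run_incremental_load",
        bash_command="spark-submit /opt/jobs/incremental_load.py --run-date {{ ds }}",
    )
```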
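A minimal sketch of the column-hashing PySpark job referenced above, assuming SHA-256 via pyspark.sql.functions.sha2; the column names and paths are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hash-client-columns").getOrCreate()

# Client-specified columns to protect; these names are illustrative.
sensitive_columns = ["ssn", "email", "account_number"]

raw_df = spark.read.parquet("/data/raw/customers")  # hypothetical input path

hashed_df = raw_df
for col_name in sensitive_columns:
    # Replace each sensitive column with its SHA-256 digest.
    hashed_df = hashed_df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

hashed_df.write.mode("overwrite").parquet("/data/curated/customers_hashed")  # hypothetical output path
```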
Environment: Hadoop (HDFS, MapReduce), Scala, Databricks, YARN, IAM, PostgreSQL, Spark, Impala, Azure, Hive, Kafka, Airflow, MongoDB, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, GCP, Oracle, NiFi, Git, AWS services (Lambda, EMR, Auto Scaling). Data Modeling Tools: Erwin r9, GraphDB, Erwin r8, Erwin r7.1/7.2, Rational Rose 2000, ER Studio, and Oracle Designer. MS SQL 2008.
Confidential, St Louis, Missouri
Hadoop Developer
Responsibilities:
- Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS to processing and analyzing the data in HDFS.
- Configured Flume to extract the data from the web server output files to load into HDFS.
- Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile and network devices and pushed into HDFS.
- Understood basic business analysis concepts for logical data modeling, data flow processing, and database design.
- Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
- Worked on analyzing Hadoop Cluster and different big data analytic tools including Pig, Hive.
- Worked with Airflow as a scheduling and orchestration tool.
- Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
- Worked with various HDFS file formats such as Avro, SequenceFile, and JSON, and various compression formats such as Snappy and bzip2.
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka (a minimal sketch follows this responsibilities list).
- Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data using Spark SQL.
- Worked on migrating data from on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Migrated from Oozie to Apache Airflow and was involved in developing both Oozie and Airflow workflows.
- Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
- Explored Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
- Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (DW).
- Designed column families in Cassandra, ingested data from RDBMS sources, performed transformations, and exported the data to Cassandra.
- Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).
- Built data warehouse solutions using PolyBase/external tables on Azure Synapse/Azure SQL Data Warehouse (Azure DW), using Azure Data Lake as the source.
- Rewrote existing SSAS cubes for Azure Synapse/Azure SQL Data Warehouse (Azure DW).
- Designed and maintained Power BI reports built on top of Azure Synapse/Azure Data Warehouse, Azure Data Lake, and Azure SQL.
- Developed Oozie workflows to automate loading data into NiFi and pre-processing it with Pig.
- Worked with Apache NiFi to decompress and move JSON files from the local file system to HDFS.
- Experience moving raw data between different systems using Apache NiFi.
- Involved in loading data from the UNIX file system to HDFS using Shell Scripting.
- Used Elasticsearch for indexing/full text searching.
- Worked with MS SQL Server Integration Services (SSIS), T-SQL, stored procedures, and triggers.
- Coded and developed a custom Java-based Elasticsearch wrapper client using the Jest API.
- Hands-on experience using AWS services such as EC2, S3, Auto Scaling, and DynamoDB, along with MongoDB, NiFi, and Talend.
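A minimal sketch of the Kafka-to-Spark streaming pattern referenced above, written here with Structured Streaming for illustration; the broker list, topic, and output paths are assumptions, and the spark-sql-kafka connector is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-weblog-stream").getOrCreate()

# Read a stream of records from Kafka; the broker list and topic are illustrative.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "weblogs")
    .load()
)

# Kafka delivers key/value as binary, so cast the value to string before transforming.
parsed = events.select(F.col("value").cast("string").alias("raw_line"))

# Write the transformed stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "/data/streaming/weblogs")
    .option("checkpointLocation", "/checkpoints/weblogs")
    .start()
)
query.awaitTermination()
```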
Environment: Hadoop (HDFS, MapReduce), Databricks, Spark, Talend, Impala, Hive, PostgreSQL, Jenkins, NiFi, Scala, MongoDB, Cassandra, Python, Pig, Sqoop, DevOps, Hibernate, Spring, Oozie, GCP, AWS services (EC2, S3, Auto Scaling), Azure, Elasticsearch, DynamoDB, UNIX Shell Scripting, Tez.
Confidential, Coppell, Texas
Data Engineer
Responsibilities:
- Participated in data acquisition with the data engineering team to extract clinical and imaging data from several sources, such as flat files and other databases.
- Used Hive to implement a data warehouse and stored data in HDFS. Stored data in Hadoop clusters set up on AWS EMR.
- Performed data preparation using Pig Latin to get the data into the required format.
- Used Python pandas, NiFi, Jenkins, NLTK, and TextBlob to complete the ETL process for clinical data for future NLP analysis.
- Utilized the clinical data to generate features describing the different illnesses using LDA topic modeling.
- Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease (a minimal PCA sketch follows this responsibilities list).
- Processed the image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Created session beans and controller servlets for handling HTTP requests from Talend.
- Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Installed, designed, and managed MS SQL Server 2008 and 2008 R2.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
- Utilized Waterfall methodology for team and project management.
- Used Git for version control with Data Engineer team and Data Scientists colleagues.
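A minimal sketch of the PCA step referenced above, using scikit-learn; the feature matrix, scaling step, and component count are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: one row per patient, one column per extracted feature.
X = np.random.rand(500, 64)

# Standardize features so PCA is not dominated by large-scale columns.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

# explained_variance_ holds the eigenvalues of the covariance matrix;
# components_ holds the corresponding eigenvectors (principal directions).
eigenvalues = pca.explained_variance_
eigenvectors = pca.components_
print(eigenvalues[:3], pca.explained_variance_ratio_[:3])
```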
Environment: Ubuntu 16.04, Hadoop 2.0, Spark (PySpark, Spark SQL, Spark MLlib), NiFi, Jenkins, Talend, Pig 0.15, Python 3.x (NLTK, Pandas), MS SQL Server, Tableau 10.3, GitHub, AWS EMR/EC2/S3, and OpenCV.
Confidential
Data Analyst
Responsibilities:
- Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.
- Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes using Pig.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Performed simple statistical analysis for data profiling, such as cancel rate, variance, skew, and kurtosis of trades and runs for each stock every day, grouped by 1-minute, 5-minute, and 15-minute intervals.
- Used PySpark and pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse (a minimal sketch follows this responsibilities list).
- Explored Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Boosted the performance of regression models by applying polynomial transformations and feature selection, and used those methods to select stocks.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Utilized Agile and Scrum methodology for team and project management.
- Designed and implemented a database auditing solution for Oracle, MS SQL, Sybase, and DB2 on 10,000+ servers.
- Used Git for version control with colleagues.
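A minimal pandas sketch of the moving-average and RSI calculation referenced above; the 'close' column, the 14-period window, and the simple-moving-average RSI variant are assumptions:

```python
import pandas as pd

def add_sma_and_rsi(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    """Add simple-moving-average and RSI columns computed from a 'close' price column."""
    out = prices.copy()
    out["sma"] = out["close"].rolling(window).mean()

    # RSI compares the average gain to the average loss over the window.
    delta = out["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out

# Illustrative usage on a single stock's closing prices.
df = pd.DataFrame({"close": [100, 101, 99, 102, 103, 101, 104, 105,
                             103, 106, 107, 105, 108, 109, 110, 108]})
print(add_sma_and_rsi(df).tail())
```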
Environment: Spark (PySpark, Spark SQL, Spark MLlib), Oracle, MS SQL Server, Python 3.x (Scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, and Pig.